{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# About \n",
    "\n",
    "This notebook demonstrates classifiers, which are provided by __Reproducible experiment platform (REP)__ package. <br /> REP contains following classifiers\n",
    "* __scikit-learn__\n",
    "* __TMVA__ \n",
    "* __XGBoost__ \n",
    "Also classifiers from `hep_ml` (as any other `sklearn`-compatible classifiers may be used)\n",
    "\n",
    "### In this notebook we show the most simple way to\n",
    "* train classifier\n",
    "* build predictions \n",
    "* measure quality\n",
    "* combine metaclassifiers"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Loading data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### download particle identification Data Set from UCI"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "File `MiniBooNE_PID.txt' already there; not retrieving.\r\n"
     ]
    }
   ],
   "source": [
    "!cd toy_datasets; wget -O MiniBooNE_PID.txt -nc MiniBooNE_PID.txt https://archive.ics.uci.edu/ml/machine-learning-databases/00199/MiniBooNE_PID.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import numpy, pandas\n",
    "from rep.utils import train_test_split\n",
    "from sklearn.metrics import roc_auc_score\n",
    "\n",
    "data = pandas.read_csv('toy_datasets/MiniBooNE_PID.txt', sep='\\s*', skiprows=[0], header=None, engine='python')\n",
    "labels = pandas.read_csv('toy_datasets/MiniBooNE_PID.txt', sep=' ', nrows=1, header=None)\n",
    "labels = [1] * labels[1].values[0] + [0] * labels[2].values[0]\n",
    "data.columns = ['feature_{}'.format(key) for key in data.columns]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### First rows of our data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>feature_0</th>\n",
       "      <th>feature_1</th>\n",
       "      <th>feature_2</th>\n",
       "      <th>feature_3</th>\n",
       "      <th>feature_4</th>\n",
       "      <th>feature_5</th>\n",
       "      <th>feature_6</th>\n",
       "      <th>feature_7</th>\n",
       "      <th>feature_8</th>\n",
       "      <th>feature_9</th>\n",
       "      <th>...</th>\n",
       "      <th>feature_40</th>\n",
       "      <th>feature_41</th>\n",
       "      <th>feature_42</th>\n",
       "      <th>feature_43</th>\n",
       "      <th>feature_44</th>\n",
       "      <th>feature_45</th>\n",
       "      <th>feature_46</th>\n",
       "      <th>feature_47</th>\n",
       "      <th>feature_48</th>\n",
       "      <th>feature_49</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td> 2.59413</td>\n",
       "      <td> 0.468803</td>\n",
       "      <td>  20.6916</td>\n",
       "      <td> 0.322648</td>\n",
       "      <td> 0.009682</td>\n",
       "      <td> 0.374393</td>\n",
       "      <td> 0.803479</td>\n",
       "      <td> 0.896592</td>\n",
       "      <td> 3.59665</td>\n",
       "      <td> 0.249282</td>\n",
       "      <td>...</td>\n",
       "      <td> 101.174</td>\n",
       "      <td>-31.3730</td>\n",
       "      <td> 0.442259</td>\n",
       "      <td> 5.86453</td>\n",
       "      <td> 0.000000</td>\n",
       "      <td> 0.090519</td>\n",
       "      <td> 0.176909</td>\n",
       "      <td> 0.457585</td>\n",
       "      <td> 0.071769</td>\n",
       "      <td> 0.245996</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td> 3.86388</td>\n",
       "      <td> 0.645781</td>\n",
       "      <td>  18.1375</td>\n",
       "      <td> 0.233529</td>\n",
       "      <td> 0.030733</td>\n",
       "      <td> 0.361239</td>\n",
       "      <td> 1.069740</td>\n",
       "      <td> 0.878714</td>\n",
       "      <td> 3.59243</td>\n",
       "      <td> 0.200793</td>\n",
       "      <td>...</td>\n",
       "      <td> 186.516</td>\n",
       "      <td> 45.9597</td>\n",
       "      <td>-0.478507</td>\n",
       "      <td> 6.11126</td>\n",
       "      <td> 0.001182</td>\n",
       "      <td> 0.091800</td>\n",
       "      <td>-0.465572</td>\n",
       "      <td> 0.935523</td>\n",
       "      <td> 0.333613</td>\n",
       "      <td> 0.230621</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td> 3.38584</td>\n",
       "      <td> 1.197140</td>\n",
       "      <td>  36.0807</td>\n",
       "      <td> 0.200866</td>\n",
       "      <td> 0.017341</td>\n",
       "      <td> 0.260841</td>\n",
       "      <td> 1.108950</td>\n",
       "      <td> 0.884405</td>\n",
       "      <td> 3.43159</td>\n",
       "      <td> 0.177167</td>\n",
       "      <td>...</td>\n",
       "      <td> 129.931</td>\n",
       "      <td>-11.5608</td>\n",
       "      <td>-0.297008</td>\n",
       "      <td> 8.27204</td>\n",
       "      <td> 0.003854</td>\n",
       "      <td> 0.141721</td>\n",
       "      <td>-0.210559</td>\n",
       "      <td> 1.013450</td>\n",
       "      <td> 0.255512</td>\n",
       "      <td> 0.180901</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td> 4.28524</td>\n",
       "      <td> 0.510155</td>\n",
       "      <td> 674.2010</td>\n",
       "      <td> 0.281923</td>\n",
       "      <td> 0.009174</td>\n",
       "      <td> 0.000000</td>\n",
       "      <td> 0.998822</td>\n",
       "      <td> 0.823390</td>\n",
       "      <td> 3.16382</td>\n",
       "      <td> 0.171678</td>\n",
       "      <td>...</td>\n",
       "      <td> 163.978</td>\n",
       "      <td>-18.4586</td>\n",
       "      <td> 0.453886</td>\n",
       "      <td> 2.48112</td>\n",
       "      <td> 0.000000</td>\n",
       "      <td> 0.180938</td>\n",
       "      <td> 0.407968</td>\n",
       "      <td> 4.341270</td>\n",
       "      <td> 0.473081</td>\n",
       "      <td> 0.258990</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td> 5.93662</td>\n",
       "      <td> 0.832993</td>\n",
       "      <td>  59.8796</td>\n",
       "      <td> 0.232853</td>\n",
       "      <td> 0.025066</td>\n",
       "      <td> 0.233556</td>\n",
       "      <td> 1.370040</td>\n",
       "      <td> 0.787424</td>\n",
       "      <td> 3.66546</td>\n",
       "      <td> 0.174862</td>\n",
       "      <td>...</td>\n",
       "      <td> 229.555</td>\n",
       "      <td> 42.9600</td>\n",
       "      <td>-0.975752</td>\n",
       "      <td> 2.66109</td>\n",
       "      <td> 0.000000</td>\n",
       "      <td> 0.170836</td>\n",
       "      <td>-0.814403</td>\n",
       "      <td> 4.679490</td>\n",
       "      <td> 1.924990</td>\n",
       "      <td> 0.253893</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 50 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   feature_0  feature_1  feature_2  feature_3  feature_4  feature_5  \\\n",
       "0    2.59413   0.468803    20.6916   0.322648   0.009682   0.374393   \n",
       "1    3.86388   0.645781    18.1375   0.233529   0.030733   0.361239   \n",
       "2    3.38584   1.197140    36.0807   0.200866   0.017341   0.260841   \n",
       "3    4.28524   0.510155   674.2010   0.281923   0.009174   0.000000   \n",
       "4    5.93662   0.832993    59.8796   0.232853   0.025066   0.233556   \n",
       "\n",
       "   feature_6  feature_7  feature_8  feature_9    ...      feature_40  \\\n",
       "0   0.803479   0.896592    3.59665   0.249282    ...         101.174   \n",
       "1   1.069740   0.878714    3.59243   0.200793    ...         186.516   \n",
       "2   1.108950   0.884405    3.43159   0.177167    ...         129.931   \n",
       "3   0.998822   0.823390    3.16382   0.171678    ...         163.978   \n",
       "4   1.370040   0.787424    3.66546   0.174862    ...         229.555   \n",
       "\n",
       "   feature_41  feature_42  feature_43  feature_44  feature_45  feature_46  \\\n",
       "0    -31.3730    0.442259     5.86453    0.000000    0.090519    0.176909   \n",
       "1     45.9597   -0.478507     6.11126    0.001182    0.091800   -0.465572   \n",
       "2    -11.5608   -0.297008     8.27204    0.003854    0.141721   -0.210559   \n",
       "3    -18.4586    0.453886     2.48112    0.000000    0.180938    0.407968   \n",
       "4     42.9600   -0.975752     2.66109    0.000000    0.170836   -0.814403   \n",
       "\n",
       "   feature_47  feature_48  feature_49  \n",
       "0    0.457585    0.071769    0.245996  \n",
       "1    0.935523    0.333613    0.230621  \n",
       "2    1.013450    0.255512    0.180901  \n",
       "3    4.341270    0.473081    0.258990  \n",
       "4    4.679490    1.924990    0.253893  \n",
       "\n",
       "[5 rows x 50 columns]"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data[:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Splitting into train and test"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Get train and test data\n",
    "train_data, test_data, train_labels, test_labels = train_test_split(data, labels, train_size=0.5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Classifiers\n",
    "\n",
    "All classifiers inherit from __sklearn.BaseEstimator__ and have the following methods:\n",
    "    \n",
    "* `classifier.fit(X, y, sample_weight=None)` - train classifier\n",
    "        \n",
    "* `classifier.predict_proba(X)` - return probabilities vector for all classes\n",
    "\n",
    "* `classifier.predict(X)` - return predicted labels\n",
    "\n",
    "* `classifier.staged_predict_proba(X)` - return probabilities after each iteration (not supported by TMVA)\n",
    "\n",
    "* `classifier.get_feature_importances()`\n",
    "\n",
    "\n",
    "Here we use `X` to denote matrix with data of shape `[n_samples, n_features]`, `y` is vector with labels (0 or 1) of shape [n_samples], <br /> `sample_weight` is vector with weights.\n",
    "\n",
    "\n",
    "## Difference from default scikit-learn interface\n",
    "X should be* `pandas.DataFrame`, `not numpy.array`. <br />\n",
    "Provided this, you'll be able to choose features used in training by setting e.g. `features=['FlightTime', 'p']` in constructor.\n",
    "\n",
    "\\* it works fine with `numpy.array` as well, but in this case all the features will be used."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Variables used in training"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "variables = list(data.columns[:26])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Sklearn\n",
    "wrapper for scikit-learn classifiers. In this example we use GradientBoosting with default settings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "training complete\n"
     ]
    }
   ],
   "source": [
    "from rep.estimators import SklearnClassifier\n",
    "from sklearn.ensemble import GradientBoostingClassifier\n",
    "# Using gradient boosting with default settings\n",
    "sk = SklearnClassifier(GradientBoostingClassifier(), features=variables)\n",
    "# Training classifier\n",
    "sk.fit(train_data, train_labels)\n",
    "print('training complete')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Predicting probabilities, measuring the quality"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[ 0.5944498   0.4055502 ]\n",
      " [ 0.81916607  0.18083393]\n",
      " [ 0.98075742  0.01924258]\n",
      " ..., \n",
      " [ 0.99441842  0.00558158]\n",
      " [ 0.98991082  0.01008918]\n",
      " [ 0.95889323  0.04110677]]\n"
     ]
    }
   ],
   "source": [
    "# predict probabilities for each class\n",
    "prob = sk.predict_proba(test_data)\n",
    "print prob"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "ROC AUC 0.974129793073\n"
     ]
    }
   ],
   "source": [
    "print 'ROC AUC', roc_auc_score(test_labels, prob[:, 1])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Predictions of classes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([0, 0, 0, ..., 0, 0, 0])"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sk.predict(test_data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>effect</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>feature_0</th>\n",
       "      <td> 0.111046</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>feature_1</th>\n",
       "      <td> 0.061314</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>feature_2</th>\n",
       "      <td> 0.073994</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>feature_3</th>\n",
       "      <td> 0.048287</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>feature_4</th>\n",
       "      <td> 0.017852</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>feature_5</th>\n",
       "      <td> 0.037825</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>feature_6</th>\n",
       "      <td> 0.017941</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>feature_7</th>\n",
       "      <td> 0.003212</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>feature_8</th>\n",
       "      <td> 0.017700</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>feature_9</th>\n",
       "      <td> 0.007857</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>feature_10</th>\n",
       "      <td> 0.009817</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>feature_11</th>\n",
       "      <td> 0.025641</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>feature_12</th>\n",
       "      <td> 0.137319</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>feature_13</th>\n",
       "      <td> 0.019235</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>feature_14</th>\n",
       "      <td> 0.001395</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>feature_15</th>\n",
       "      <td> 0.040681</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>feature_16</th>\n",
       "      <td> 0.150452</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>feature_17</th>\n",
       "      <td> 0.043220</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>feature_18</th>\n",
       "      <td> 0.001098</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>feature_19</th>\n",
       "      <td> 0.041260</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>feature_20</th>\n",
       "      <td> 0.035487</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>feature_21</th>\n",
       "      <td> 0.003610</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>feature_22</th>\n",
       "      <td> 0.026011</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>feature_23</th>\n",
       "      <td> 0.022763</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>feature_24</th>\n",
       "      <td> 0.015564</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>feature_25</th>\n",
       "      <td> 0.029417</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              effect\n",
       "feature_0   0.111046\n",
       "feature_1   0.061314\n",
       "feature_2   0.073994\n",
       "feature_3   0.048287\n",
       "feature_4   0.017852\n",
       "feature_5   0.037825\n",
       "feature_6   0.017941\n",
       "feature_7   0.003212\n",
       "feature_8   0.017700\n",
       "feature_9   0.007857\n",
       "feature_10  0.009817\n",
       "feature_11  0.025641\n",
       "feature_12  0.137319\n",
       "feature_13  0.019235\n",
       "feature_14  0.001395\n",
       "feature_15  0.040681\n",
       "feature_16  0.150452\n",
       "feature_17  0.043220\n",
       "feature_18  0.001098\n",
       "feature_19  0.041260\n",
       "feature_20  0.035487\n",
       "feature_21  0.003610\n",
       "feature_22  0.026011\n",
       "feature_23  0.022763\n",
       "feature_24  0.015564\n",
       "feature_25  0.029417"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sk.get_feature_importances()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## TMVA"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "    TMVAClassifier wraps classifiers from TMVA (CERN library for machine learning)\n",
      "\n",
      "    Parameters:\n",
      "    -----------\n",
      "    :param str method: algorithm method (default='kBDT')\n",
      "    :param features: features used in training\n",
      "    :type features: list[str] or None\n",
      "    :param str factory_options: options, for example::\n",
      "\n",
      "        \"!V:!Silent:Color:Transformations=I;D;P;G,D\"\n",
      "\n",
      "    :param str sigmoid_function: function which is used to convert TMVA output to probabilities;\n",
      "\n",
      "        * *identity* (use for svm, mlp) --- the same output, use this for methods returning class probabilities\n",
      "\n",
      "        * *sigmoid* --- sigmoid transformation, use it if output varies in range [-infinity, +infinity]\n",
      "\n",
      "        * *bdt* (for bdt algorithms output varies in range [-1, 1])\n",
      "\n",
      "        * *sig_eff=0.4* --- for rectangular cut optimization methods,\n",
      "        for instance, here 0.4 will be used as signal efficiency to evaluate MVA,\n",
      "        (put any float number from [0, 1])\n",
      "\n",
      "    :param dict method_parameters: estimator options, example: NTrees=100, BoostType='Grad'\n",
      "\n",
      "    .. warning::\n",
      "        TMVA doesn't support *staged_predict_proba()* and *feature_importances__*\n",
      "\n",
      "    .. warning::\n",
      "        TMVA doesn't support multiclassification, only two-class classification\n",
      "\n",
      "    `TMVA guide <http://mirror.yandex.ru/gentoo-distfiles/distfiles/TMVAUsersGuide-v4.03.pdf>`_\n",
      "    \n"
     ]
    }
   ],
   "source": [
    "from rep.estimators import TMVAClassifier\n",
    "print TMVAClassifier.__doc__"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "training complete\n"
     ]
    }
   ],
   "source": [
    "tmva = TMVAClassifier(method='kBDT', NTrees=50, Shrinkage=0.05, features=variables)\n",
    "tmva.fit(train_data, train_labels)\n",
    "print('training complete')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Predict probabilities and estimate quality"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[ 0.50350769  0.49649231]\n",
      " [ 0.63357498  0.36642502]\n",
      " [ 0.67613779  0.32386221]\n",
      " ..., \n",
      " [ 0.85069637  0.14930363]\n",
      " [ 0.89332189  0.10667811]\n",
      " [ 0.59580044  0.40419956]]\n"
     ]
    }
   ],
   "source": [
    "# predict probabilities for each class\n",
    "prob = tmva.predict_proba(test_data)\n",
    "print prob"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "ROC AUC 0.96226347017\n"
     ]
    }
   ],
   "source": [
    "print 'ROC AUC', roc_auc_score(test_labels, prob[:, 1])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([0, 0, 0, ..., 0, 0, 0])"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# predict labels\n",
    "tmva.predict(test_data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## XGBoost"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "    Implements classification (and multiclassification) from XGBoost library.\n",
      "\n",
      "    Parameters:\n",
      "    -----------\n",
      "    :param features: list of features to train model\n",
      "    :type features: None or list(str)\n",
      "    :param int n_estimators: the number of trees built.\n",
      "    :param int nthreads: number of parallel threads used to run xgboost.\n",
      "    :param num_feature: feature dimension used in boosting, set to maximum dimension of the feature\n",
      "        (set automatically by xgboost, no need to be set by user).\n",
      "    :type num_feature: None or int\n",
      "    :param float gamma: minimum loss reduction required to make a further partition on a leaf node of the tree.\n",
      "        The larger, the more conservative the algorithm will be.\n",
      "    :type gamma: None or float\n",
      "    :param float eta: step size shrinkage used in update to prevent overfitting.\n",
      "        After each boosting step, we can directly get the weights of new features\n",
      "        and eta actually shrinkage the feature weights to make the boosting process more conservative.\n",
      "    :param int max_depth: maximum depth of a tree.\n",
      "    :param float scale_pos_weight: ration of weights of the class 1 to the weights of the class 0.\n",
      "    :param float min_child_weight: minimum sum of instance weight(hessian) needed in a child.\n",
      "        If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight,\n",
      "        then the building process will give up further partitioning.\n",
      "\n",
      "        .. note:: weights are normalized so that mean=1 before fitting. Roughly min_child_weight is equal to the number of events.\n",
      "    :param float subsample: subsample ratio of the training instance.\n",
      "        Setting it to 0.5 means that XGBoost randomly collected half of the data instances to grow trees\n",
      "        and this will prevent overfitting.\n",
      "    :param float colsample: subsample ratio of columns when constructing each tree.\n",
      "    :param float base_score: the initial prediction score of all instances, global bias.\n",
      "    :param int random_state: random number seed.\n",
      "    :param boot verbose: if 1, will print messages during training\n",
      "    :param float missing: the number considered by xgboost as missing value.\n",
      "    \n"
     ]
    }
   ],
   "source": [
    "from rep.estimators import XGBoostClassifier\n",
    "print XGBoostClassifier.__doc__"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "training complete\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/antares/code/xgboost/wrapper/xgboost.py:80: FutureWarning: comparison to `None` will result in an elementwise object comparison in the future.\n",
      "  if label != None:\n",
      "/Users/antares/code/xgboost/wrapper/xgboost.py:82: FutureWarning: comparison to `None` will result in an elementwise object comparison in the future.\n",
      "  if weight !=None:\n"
     ]
    }
   ],
   "source": [
    "# XGBoost with default parameters\n",
    "xgb = XGBoostClassifier(features=variables)\n",
    "xgb.fit(train_data, train_labels, sample_weight=numpy.ones(len(train_labels)))\n",
    "print('training complete')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Predict probabilities and estimate quality"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "ROC AUC: 0.980127833042\n"
     ]
    }
   ],
   "source": [
    "prob = xgb.predict_proba(test_data)\n",
    "print 'ROC AUC:', roc_auc_score(test_labels, prob[:, 1])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Predict labels"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([1, 0, 0, ..., 0, 0, 0])"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "xgb.predict(test_data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>effect</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>feature_0</th>\n",
       "      <td> 526</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>feature_1</th>\n",
       "      <td> 400</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>feature_2</th>\n",
       "      <td> 512</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>feature_3</th>\n",
       "      <td> 412</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>feature_4</th>\n",
       "      <td> 356</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>feature_5</th>\n",
       "      <td> 478</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>feature_6</th>\n",
       "      <td> 344</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>feature_7</th>\n",
       "      <td> 350</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>feature_8</th>\n",
       "      <td> 376</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>feature_9</th>\n",
       "      <td> 384</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>feature_10</th>\n",
       "      <td> 364</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>feature_11</th>\n",
       "      <td> 450</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>feature_12</th>\n",
       "      <td> 644</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>feature_13</th>\n",
       "      <td> 414</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>feature_14</th>\n",
       "      <td> 276</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>feature_15</th>\n",
       "      <td> 332</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>feature_16</th>\n",
       "      <td> 338</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>feature_17</th>\n",
       "      <td> 448</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>feature_18</th>\n",
       "      <td> 294</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>feature_19</th>\n",
       "      <td> 404</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>feature_20</th>\n",
       "      <td> 346</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>feature_21</th>\n",
       "      <td> 300</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>feature_22</th>\n",
       "      <td> 266</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>feature_23</th>\n",
       "      <td> 476</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>feature_24</th>\n",
       "      <td> 330</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>feature_25</th>\n",
       "      <td> 398</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            effect\n",
       "feature_0      526\n",
       "feature_1      400\n",
       "feature_2      512\n",
       "feature_3      412\n",
       "feature_4      356\n",
       "feature_5      478\n",
       "feature_6      344\n",
       "feature_7      350\n",
       "feature_8      376\n",
       "feature_9      384\n",
       "feature_10     364\n",
       "feature_11     450\n",
       "feature_12     644\n",
       "feature_13     414\n",
       "feature_14     276\n",
       "feature_15     332\n",
       "feature_16     338\n",
       "feature_17     448\n",
       "feature_18     294\n",
       "feature_19     404\n",
       "feature_20     346\n",
       "feature_21     300\n",
       "feature_22     266\n",
       "feature_23     476\n",
       "feature_24     330\n",
       "feature_25     398"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "xgb.get_feature_importances()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Advantages of common interface"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As one can see above, all the classifiers implement the same interface, \n",
    "this simplifies work, simplifies comparison of different classifiers, \n",
    "but this is not the only profit. \n",
    "\n",
    "`Sklearn` provides different tools to combine different classifiers and transformers. \n",
    "One of this tools is `AdaBoost`, which is abstract metaclassifier built on the top of some other classifier (usually, decision dree)\n",
    "\n",
    "Let's show that now you can run AdaBoost over classifiers from other libraries! <br />\n",
    "_(isn't boosting over neural network what you were dreaming of all your life?)_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## AdaBoost over TMVA classifier"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "training complete\n"
     ]
    }
   ],
   "source": [
    "from sklearn.ensemble import AdaBoostClassifier\n",
    "\n",
    "# Construct AdaBoost with TMVA as base estimator\n",
    "base_tmva = TMVAClassifier(method='kBDT', NTrees=15, Shrinkage=0.05)\n",
    "ada_tmva  = SklearnClassifier(AdaBoostClassifier(base_estimator=base_tmva, n_estimators=5), features=variables)\n",
    "ada_tmva.fit(train_data, train_labels)\n",
    "print('training complete')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "AUC 0.964513946309\n"
     ]
    }
   ],
   "source": [
    "prob = ada_tmva.predict_proba(test_data)\n",
    "print 'AUC', roc_auc_score(test_labels, prob[:, 1])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## AdaBoost over XGBoost"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "training complete!\n"
     ]
    }
   ],
   "source": [
    "# Construct AdaBoost with xgboost base estimator\n",
    "base_xgb = XGBoostClassifier(n_estimators=50)\n",
    "# ada_xgb = SklearnClassifier(AdaBoostClassifier(base_estimator=base_xgb, n_estimators=1), features=variables)\n",
    "ada_xgb = AdaBoostClassifier(base_estimator=base_xgb, n_estimators=1)\n",
    "ada_xgb.fit(train_data[variables], train_labels)\n",
    "print('training complete!')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "AUC 0.979708882619\n"
     ]
    }
   ],
   "source": [
    "# predict probabilities for each class\n",
    "prob = ada_xgb.predict_proba(test_data[variables])\n",
    "print 'AUC', roc_auc_score(test_labels, prob[:, 1])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "AUC 0.99289214177\n"
     ]
    }
   ],
   "source": [
    "# predict probabilities for each class\n",
    "prob = ada_xgb.predict_proba(train_data[variables])\n",
    "print 'AUC', roc_auc_score(train_labels, prob[:, 1])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Other advantages of common interface\n",
    "There are many things you can do with classifiers now: \n",
    "* cloning\n",
    "* getting / setting parameters as dictionaries \n",
    "* do automatic hyperparameter optimization \n",
    "* build pipelines (`sklearn.pipeline`)\n",
    "* use hierarchical training, training on subsets\n",
    "* passing over internet / train classifiers on other machines\n",
    "\n",
    "And you can replace classifiers at any moment."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exercises\n",
    "\n",
    "### Exercise 1. Play with parameters in each type of classifiers\n",
    "\n",
    "### Exercise 2. Add weight column and train models with weights in training"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}