{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# About \n",
    "\n",
    "This notebook demonstrates neural networks (NN) classifiers, which are provided by __Reproducible experiment platform (REP)__ package. <br /> REP contains wrappers for following NN libraries:\n",
    "* __theanets__\n",
    "* __neurolab__ \n",
    "* __pybrain__ \n",
    "\n",
    "\n",
    "### In this notebook we show: \n",
    "* train classifier\n",
    "* get predictions \n",
    "* measure quality\n",
    "* pretraining and partial fitting\n",
    "* combine classifiers using meta-algorithms\n",
    "\n",
    "Most of this is done in the same way as for other classifiers (see notebook [01-howto-Classifiers.ipynb](https://github.com/yandex/rep/blob/master/howto/01-howto-Classifiers.ipynb)). \n",
    "\n",
    "Parameters selected here are specially taken to make training very fast, those are very non-optimal."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Loading data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### download particle identification data set from UCI"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "File `MiniBooNE_PID.txt' already there; not retrieving.\r\n"
     ]
    }
   ],
   "source": [
    "!cd toy_datasets; wget -O MiniBooNE_PID.txt -nc --no-check-certificate https://archive.ics.uci.edu/ml/machine-learning-databases/00199/MiniBooNE_PID.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import numpy, pandas\n",
    "from rep.utils import train_test_split\n",
    "from sklearn.metrics import roc_auc_score\n",
    "\n",
    "data = pandas.read_csv('toy_datasets/MiniBooNE_PID.txt', sep='\\s*', skiprows=[0], header=None, engine='python')\n",
    "labels = pandas.read_csv('toy_datasets/MiniBooNE_PID.txt', sep=' ', nrows=1, header=None)\n",
    "labels = [1] * labels[1].values[0] + [0] * labels[2].values[0]\n",
    "data.columns = ['feature_{}'.format(key) for key in data.columns]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "130064"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### First rows of  data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>feature_0</th>\n",
       "      <th>feature_1</th>\n",
       "      <th>feature_2</th>\n",
       "      <th>feature_3</th>\n",
       "      <th>feature_4</th>\n",
       "      <th>feature_5</th>\n",
       "      <th>feature_6</th>\n",
       "      <th>feature_7</th>\n",
       "      <th>feature_8</th>\n",
       "      <th>feature_9</th>\n",
       "      <th>...</th>\n",
       "      <th>feature_40</th>\n",
       "      <th>feature_41</th>\n",
       "      <th>feature_42</th>\n",
       "      <th>feature_43</th>\n",
       "      <th>feature_44</th>\n",
       "      <th>feature_45</th>\n",
       "      <th>feature_46</th>\n",
       "      <th>feature_47</th>\n",
       "      <th>feature_48</th>\n",
       "      <th>feature_49</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2.59413</td>\n",
       "      <td>0.468803</td>\n",
       "      <td>20.6916</td>\n",
       "      <td>0.322648</td>\n",
       "      <td>0.009682</td>\n",
       "      <td>0.374393</td>\n",
       "      <td>0.803479</td>\n",
       "      <td>0.896592</td>\n",
       "      <td>3.59665</td>\n",
       "      <td>0.249282</td>\n",
       "      <td>...</td>\n",
       "      <td>101.174</td>\n",
       "      <td>-31.3730</td>\n",
       "      <td>0.442259</td>\n",
       "      <td>5.86453</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.090519</td>\n",
       "      <td>0.176909</td>\n",
       "      <td>0.457585</td>\n",
       "      <td>0.071769</td>\n",
       "      <td>0.245996</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>3.86388</td>\n",
       "      <td>0.645781</td>\n",
       "      <td>18.1375</td>\n",
       "      <td>0.233529</td>\n",
       "      <td>0.030733</td>\n",
       "      <td>0.361239</td>\n",
       "      <td>1.069740</td>\n",
       "      <td>0.878714</td>\n",
       "      <td>3.59243</td>\n",
       "      <td>0.200793</td>\n",
       "      <td>...</td>\n",
       "      <td>186.516</td>\n",
       "      <td>45.9597</td>\n",
       "      <td>-0.478507</td>\n",
       "      <td>6.11126</td>\n",
       "      <td>0.001182</td>\n",
       "      <td>0.091800</td>\n",
       "      <td>-0.465572</td>\n",
       "      <td>0.935523</td>\n",
       "      <td>0.333613</td>\n",
       "      <td>0.230621</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3.38584</td>\n",
       "      <td>1.197140</td>\n",
       "      <td>36.0807</td>\n",
       "      <td>0.200866</td>\n",
       "      <td>0.017341</td>\n",
       "      <td>0.260841</td>\n",
       "      <td>1.108950</td>\n",
       "      <td>0.884405</td>\n",
       "      <td>3.43159</td>\n",
       "      <td>0.177167</td>\n",
       "      <td>...</td>\n",
       "      <td>129.931</td>\n",
       "      <td>-11.5608</td>\n",
       "      <td>-0.297008</td>\n",
       "      <td>8.27204</td>\n",
       "      <td>0.003854</td>\n",
       "      <td>0.141721</td>\n",
       "      <td>-0.210559</td>\n",
       "      <td>1.013450</td>\n",
       "      <td>0.255512</td>\n",
       "      <td>0.180901</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4.28524</td>\n",
       "      <td>0.510155</td>\n",
       "      <td>674.2010</td>\n",
       "      <td>0.281923</td>\n",
       "      <td>0.009174</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.998822</td>\n",
       "      <td>0.823390</td>\n",
       "      <td>3.16382</td>\n",
       "      <td>0.171678</td>\n",
       "      <td>...</td>\n",
       "      <td>163.978</td>\n",
       "      <td>-18.4586</td>\n",
       "      <td>0.453886</td>\n",
       "      <td>2.48112</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.180938</td>\n",
       "      <td>0.407968</td>\n",
       "      <td>4.341270</td>\n",
       "      <td>0.473081</td>\n",
       "      <td>0.258990</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5.93662</td>\n",
       "      <td>0.832993</td>\n",
       "      <td>59.8796</td>\n",
       "      <td>0.232853</td>\n",
       "      <td>0.025066</td>\n",
       "      <td>0.233556</td>\n",
       "      <td>1.370040</td>\n",
       "      <td>0.787424</td>\n",
       "      <td>3.66546</td>\n",
       "      <td>0.174862</td>\n",
       "      <td>...</td>\n",
       "      <td>229.555</td>\n",
       "      <td>42.9600</td>\n",
       "      <td>-0.975752</td>\n",
       "      <td>2.66109</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.170836</td>\n",
       "      <td>-0.814403</td>\n",
       "      <td>4.679490</td>\n",
       "      <td>1.924990</td>\n",
       "      <td>0.253893</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 50 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   feature_0  feature_1  feature_2  feature_3  feature_4  feature_5  \\\n",
       "0    2.59413   0.468803    20.6916   0.322648   0.009682   0.374393   \n",
       "1    3.86388   0.645781    18.1375   0.233529   0.030733   0.361239   \n",
       "2    3.38584   1.197140    36.0807   0.200866   0.017341   0.260841   \n",
       "3    4.28524   0.510155   674.2010   0.281923   0.009174   0.000000   \n",
       "4    5.93662   0.832993    59.8796   0.232853   0.025066   0.233556   \n",
       "\n",
       "   feature_6  feature_7  feature_8  feature_9     ...      feature_40  \\\n",
       "0   0.803479   0.896592    3.59665   0.249282     ...         101.174   \n",
       "1   1.069740   0.878714    3.59243   0.200793     ...         186.516   \n",
       "2   1.108950   0.884405    3.43159   0.177167     ...         129.931   \n",
       "3   0.998822   0.823390    3.16382   0.171678     ...         163.978   \n",
       "4   1.370040   0.787424    3.66546   0.174862     ...         229.555   \n",
       "\n",
       "   feature_41  feature_42  feature_43  feature_44  feature_45  feature_46  \\\n",
       "0    -31.3730    0.442259     5.86453    0.000000    0.090519    0.176909   \n",
       "1     45.9597   -0.478507     6.11126    0.001182    0.091800   -0.465572   \n",
       "2    -11.5608   -0.297008     8.27204    0.003854    0.141721   -0.210559   \n",
       "3    -18.4586    0.453886     2.48112    0.000000    0.180938    0.407968   \n",
       "4     42.9600   -0.975752     2.66109    0.000000    0.170836   -0.814403   \n",
       "\n",
       "   feature_47  feature_48  feature_49  \n",
       "0    0.457585    0.071769    0.245996  \n",
       "1    0.935523    0.333613    0.230621  \n",
       "2    1.013450    0.255512    0.180901  \n",
       "3    4.341270    0.473081    0.258990  \n",
       "4    4.679490    1.924990    0.253893  \n",
       "\n",
       "[5 rows x 50 columns]"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data[:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Splitting into train and test"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Get train and test data\n",
    "train_data, test_data, train_labels, test_labels = train_test_split(data, labels, train_size=0.25)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Neural nets\n",
    "\n",
    "All nets inherit from __sklearn.BaseEstimator__ and have the same interface as another wrappers in REP (details see in **01-howto-Classifiers**)\n",
    "\n",
    "Neurla network libraries libraries **support**:\n",
    "\n",
    "* classification\n",
    "* multi-classification\n",
    "* regression\n",
    "* multi-target regresssion\n",
    "* additional fitting (using `partial_fit` method)\n",
    "\n",
    "and **don't support**:\n",
    "\n",
    "* staged prediction methods\n",
    "* weights for data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Variables used in training"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "variables = list(data.columns[:15])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Theanets"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Classifier from Theanets library. \n",
      "\n",
      "    Parameters:\n",
      "    -----------\n",
      "    :param features: list of features to train model\n",
      "    :type features: None or list(str)\n",
      "    :param layers: a sequence of values specifying the **hidden** layer configuration for the network.\n",
      "        For more information please see 'Specifying layers' in theanets documentation:\n",
      "        http://theanets.readthedocs.org/en/latest/creating.html#creating-specifying-layers\n",
      "        Note that theanets \"layers\" parameter included input and output layers in the sequence as well.\n",
      "    :type layers: sequence of int, tuple, dict\n",
      "    :param int input_layer: size of the input layer. If equals -1, the size is taken from the training dataset\n",
      "    :param int output_layer: size of the output layer. If equals -1, the size is taken from the training dataset\n",
      "    :param str hidden_activation: the name of an activation function to use on hidden network layers by default\n",
      "    :param str output_activation: the name of an activation function to use on the output layer by default\n",
      "    :param float input_noise: standard deviation of desired noise to inject into input\n",
      "    :param float hidden_noise: standard deviation of desired noise to inject into hidden unit activation output\n",
      "    :param input_dropouts: proportion of input units to randomly set to 0\n",
      "    :type input_dropouts: float in [0, 1]\n",
      "    :param hidden_dropouts: proportion of hidden unit activations to randomly set to 0\n",
      "    :type hidden_dropouts: float in [0, 1]\n",
      "    :param decode_from: any of the hidden layers can be tapped at the output. Just specify a value greater than\n",
      "        1 to tap the last N hidden layers. The default is 1, which decodes from just the last layer\n",
      "    :type decode_from: positive int\n",
      "    :param scaler: scaler used to transform data. If False, scaling will not be used\n",
      "    :type scaler: str or sklearn-like transformer or False (do not scale features)\n",
      "    :param trainers: parameters to specify training algorithm(s)\n",
      "        example: [{'optimize': sgd, 'momentum': 0.2}, {'optimize': 'nag'}]\n",
      "    :type trainers: list[dict] or None\n",
      "    :param int random_state: random seed\n",
      "\n",
      "\n",
      "    For more information on available trainers and their parameters, see this page\n",
      "    http://theanets.readthedocs.org/en/latest/training.html\n",
      "    \n"
     ]
    }
   ],
   "source": [
    "from rep.estimators import TheanetsClassifier\n",
    "print TheanetsClassifier.__doc__"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Simple training"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "tn = TheanetsClassifier(features=variables, layers=[7], \n",
    "                        trainers=[{'optimize': 'nag', 'learning_rate': 0.1, 'min_improvement': 0.1}])\n",
    "\n",
    "tn.fit(train_data, train_labels)\n",
    "pass"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Predicting probabilities, measuring the quality"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[ 0.26320391  0.73679609]\n",
      " [ 0.81044349  0.18955651]\n",
      " [ 0.40544071  0.59455929]\n",
      " ..., \n",
      " [ 0.90087309  0.09912691]\n",
      " [ 0.86900052  0.13099948]\n",
      " [ 0.90821799  0.09178201]]\n"
     ]
    }
   ],
   "source": [
    "prob = tn.predict_proba(test_data)\n",
    "print prob"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "ROC AUC 0.843440299528\n"
     ]
    }
   ],
   "source": [
    "print 'ROC AUC', roc_auc_score(test_labels, prob[:, 1])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Theanets multistage training \n",
    "\n",
    "In some cases we need to continue training: i.e., we have new data or current trainer is not efficient anymore.\n",
    "\n",
    "For this purpose there is `partial_fit` method, where you can continue training using different trainer or different data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "training complete\n"
     ]
    }
   ],
   "source": [
    "tn = TheanetsClassifier(features=variables, layers=[10, 10], \n",
    "                        trainers=[{'algo': 'rprop', 'min_improvement': 0.1}])\n",
    "\n",
    "tn.fit(train_data, train_labels)\n",
    "print('training complete')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "####  Second stage of fitting"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "training complete\n"
     ]
    }
   ],
   "source": [
    "tn.partial_fit(train_data, train_labels, **{'algo': 'adagrad', 'min_improvement': 0.1})\n",
    "print('training complete')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[ 0.24486897  0.75513103]\n",
      " [ 0.78883091  0.21116909]\n",
      " [ 0.47429026  0.52570974]\n",
      " ..., \n",
      " [ 0.90560846  0.09439154]\n",
      " [ 0.88662219  0.11337781]\n",
      " [ 0.9052761   0.0947239 ]]\n"
     ]
    }
   ],
   "source": [
    "# predict probabilities for each class\n",
    "prob = tn.predict_proba(test_data)\n",
    "print prob"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "ROC AUC 0.844713853906\n"
     ]
    }
   ],
   "source": [
    "print 'ROC AUC', roc_auc_score(test_labels, prob[:, 1])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Predictions of classes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([1, 0, 1, ..., 0, 0, 0])"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tn.predict(test_data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Neurolab"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Classifier from neurolab library. \n",
      "\n",
      "    Parameters:\n",
      "    -----------\n",
      "    :param features: features used in training\n",
      "    :type features: list[str] or None\n",
      "    :param list[int] layers: sequence, number of units inside each **hidden** layer.\n",
      "    :param string net_type: type of network\n",
      "        One of 'feed-forward', 'single-layer', 'competing-layer', 'learning-vector',\n",
      "        'elman-recurrent', 'hopfield-recurrent', 'hemming-recurrent'\n",
      "    :param initf: layer initializers\n",
      "    :type initf: anything implementing call(layer), e.g. nl.init.* or list[nl.init.*] of shape [n_layers]\n",
      "    :param trainf: net train function, default value depends on type of network\n",
      "    :param scaler: transformer to apply to the input objects\n",
      "    :type scaler: str or sklearn-like transformer or False (do not scale features)\n",
      "    :param random_state: ignored, added for uniformity.\n",
      "    :param dict kwargs: additional arguments to net __init__, varies with different net_types\n",
      "\n",
      "    .. seealso:: https://pythonhosted.org/neurolab/lib.html for supported train functions and their parameters.\n",
      "    \n"
     ]
    }
   ],
   "source": [
    "from rep.estimators import NeurolabClassifier\n",
    "print NeurolabClassifier.__doc__"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Let's train network using Rprop algorithm"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The maximum number of train epochs is reached\n",
      "training complete\n"
     ]
    }
   ],
   "source": [
    "import neurolab\n",
    "nl = NeurolabClassifier(features=variables, layers=[10], epochs=5, trainf=neurolab.train.train_rprop)\n",
    "nl.fit(train_data, train_labels)\n",
    "print('training complete')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### After training neural network you still can improve it by using partial fit on other data:\n",
    "```\n",
    "nl.partial_fit(new_train_data, new_train_labels)\n",
    "```\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Predict probabilities and estimate quality"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[ 0.72909063  0.27090937]\n",
      " [ 0.73301084  0.26698916]\n",
      " [ 0.7261278   0.2738722 ]\n",
      " ..., \n",
      " [ 0.72833376  0.27166624]\n",
      " [ 0.72881829  0.27118171]\n",
      " [ 0.72708209  0.27291791]]\n"
     ]
    }
   ],
   "source": [
    "# predict probabilities for each class\n",
    "prob = nl.predict_proba(test_data)\n",
    "print prob"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "ROC AUC 0.471191281317\n"
     ]
    }
   ],
   "source": [
    "print 'ROC AUC', roc_auc_score(test_labels, prob[:, 1])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([0, 0, 0, ..., 0, 0, 0])"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# predict labels\n",
    "nl.predict(test_data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Pybrain"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Implements classification from PyBrain library \n",
      "\n",
      "    Parameters:\n",
      "    -----------\n",
      "    :param features: features used in training.\n",
      "    :type features: list[str] or None\n",
      "    :param scaler: transformer to apply to the input objects\n",
      "    :type scaler: str or sklearn-like transformer or False (do not scale features)\n",
      "    :param bool use_rprop: flag to indicate whether we should use Rprop or SGD trainer\n",
      "    :param bool verbose: print train/validation errors.\n",
      "    :param random_state: ignored parameter, pybrain training isn't reproducible\n",
      "\n",
      "    **Net parameters:**\n",
      "\n",
      "    :param list[int] layers: indicate how many neurons in each hidden(!) layer; default is 1 hidden layer with 10 neurons\n",
      "    :param list[str] hiddenclass: classes of the hidden layers; default is 'SigmoidLayer'\n",
      "    :param dict params: other net parameters:\n",
      "        bias and outputbias (boolean) flags to indicate whether the network should have the corresponding biases,\n",
      "        both default to True;\n",
      "        peepholes (boolean);\n",
      "        recurrent (boolean) if the `recurrent` flag is set, a :class:`RecurrentNetwork` will be created,\n",
      "        otherwise a :class:`FeedForwardNetwork`\n",
      "\n",
      "    **Gradient descent trainer parameters:**\n",
      "\n",
      "    :param float learningrate: gives the ratio of which parameters are changed into the direction of the gradient\n",
      "    :param float lrdecay: the learning rate decreases by lrdecay, which is used to multiply the learning rate after each training step\n",
      "    :param float momentum: the ratio by which the gradient of the last timestep is used\n",
      "    :param boolean batchlearning: if set, the parameters are updated only at the end of each epoch. Default is False\n",
      "    :param float weightdecay: corresponds to the weightdecay rate, where 0 is no weight decay at all\n",
      "\n",
      "    **Rprop trainer parameters:**\n",
      "\n",
      "    :param float etaminus: factor by which step width is decreased when overstepping (0.5)\n",
      "    :param float etaplus: factor by which step width is increased when following gradient (1.2)\n",
      "    :param float delta: step width for each weight\n",
      "    :param float deltamin: minimum step width (1e-6)\n",
      "    :param float deltamax: maximum step width (5.0)\n",
      "    :param float delta0: initial step width (0.1)\n",
      "\n",
      "    **Training termination parameters**\n",
      "\n",
      "    :param int epochs: number of iterations of training; if < 0 then classifier trains until convergence\n",
      "    :param int max_epochs: if is given, at most that many epochs are trained\n",
      "    :param int continue_epochs: each time validation error decreases, try for continue_epochs epochs to find a better one\n",
      "    :param float validation_proportion: the ratio of the dataset that is used for the validation dataset\n",
      "\n",
      "    .. note::\n",
      "\n",
      "        Details about parameters: http://pybrain.org/docs/\n",
      "    \n"
     ]
    }
   ],
   "source": [
    "from rep.estimators import PyBrainClassifier\n",
    "print PyBrainClassifier.__doc__"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "training complete\n"
     ]
    }
   ],
   "source": [
    "pb = PyBrainClassifier(features=variables, layers=[5], epochs=2, hiddenclass=['TanhLayer'])\n",
    "pb.fit(train_data, train_labels)\n",
    "print('training complete')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Predict probabilities and estimate quality\n",
    "again, we could proceed with training and use new dataset\n",
    "```\n",
    "nl.partial_fit(new_train_data, new_train_labels)\n",
    "```\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "ROC AUC: 0.856107824713\n"
     ]
    }
   ],
   "source": [
    "prob = pb.predict_proba(test_data)\n",
    "print 'ROC AUC:', roc_auc_score(test_labels, prob[:, 1])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Predict labels"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([1, 0, 1, ..., 0, 0, 0])"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pb.predict(test_data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Scaling of features\n",
    "initial prescaling of features is frequently crucial to get some appropriate results using neural networks.\n",
    "\n",
    "By default, all the networks use `StandardScaler` from `sklearn`, but you can use any other transformer, say MinMax or self-written by passing appropriate value as scaler. All the networks have same support of `scaler` parameter"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "NeurolabClassifier(initf=<function init_rand at 0x112431f50>, layers=[10],\n",
       "          net_type='feed-forward', random_state=None, scaler=False,\n",
       "          trainf=None)"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.preprocessing import MinMaxScaler\n",
    "# will use StandardScaler\n",
    "NeurolabClassifier(scaler='standard')\n",
    "# will use MinMaxScaler\n",
    "NeurolabClassifier(scaler=MinMaxScaler())\n",
    "# will not use any pretransformation of features\n",
    "NeurolabClassifier(scaler=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Advantages of common interface\n",
    "\n",
    "Let's build an ensemble of neural networks. This will be done by bagging meta-algorithm"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Bagging over Theanets classifier\n",
    "\n",
    "A well-known fact is that the classification quality of single neural network can be significantly improved by ensembling.\n",
    "\n",
    "In simplest case, we average predictions of several neural networks. Bagging trains several classifiers on random subsets of training data, and thus achieves higher quality and more stable predictions.\n",
    "\n",
    "You can try the same trick with any other network, not only Theanets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# uncomment the code below to try, this may take much time\n",
    "\n",
    "# from sklearn.ensemble import BaggingClassifier\n",
    "# base_tn = TheanetsClassifier(layers=[10, 7], trainers=[{'algo': 'adadelta'}])\n",
    "# bagging_tn = BaggingClassifier(base_estimator=base_tn, n_estimators=10)\n",
    "# bagging_tn.fit(train_data[variables], train_labels)\n",
    "# prob = bagging_tn.predict_proba(test_data[variables])\n",
    "# print 'AUC', roc_auc_score(test_labels, prob[:, 1])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Other advantages of common interface\n",
    "There are many things you can do with neural networks now: \n",
    "* cloning\n",
    "* getting / setting parameters as dictionaries \n",
    "* use `grid_search`, play with sizes of hidden layers and other parameters\n",
    "* build pipelines (`sklearn.pipeline`)\n",
    "* use hierarchical training, training on subsets\n",
    "* passing over internet / train classifiers on other machines / distributed learning of ensemles\n",
    "\n",
    "\n",
    "And you can replace classifiers at any moment."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## See also \n",
    "Sklearn-compatible libraries you can use within REP:\n",
    "\n",
    "1. [hep_ml.nnet](https://arogozhnikov.github.io/hep_ml/nnet.html) are sklearn-compatible. \n",
    "2. [nolearn](https://github.com/dnouri/nolearn) wrappers are expected to be sklearn-compatible"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}