{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[Sebastian Raschka](http://sebastianraschka.com), 2015\n",
    "\n",
    "https://github.com/rasbt/python-machine-learning-book"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Python Machine Learning - Code Examples"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Bonus Material - A Basic Pipeline and Grid Search Setup"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that the optional watermark extension is a small IPython notebook plugin that I developed to make the code reproducible. You can just skip the following line(s)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Sebastian Raschka \n",
      "Last updated: 01/20/2016 \n",
      "\n",
      "CPython 3.5.1\n",
      "IPython 4.0.1\n",
      "\n",
      "numpy 1.10.1\n",
      "pandas 0.17.1\n",
      "matplotlib 1.5.0\n",
      "scikit-learn 0.17\n"
     ]
    }
   ],
   "source": [
    "%load_ext watermark\n",
    "%watermark -a 'Sebastian Raschka' -u -d -v -p numpy,pandas,matplotlib,scikit-learn"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Fitting 5 folds for each of 8 candidates, totalling 40 fits\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=-1)]: Done  40 out of  40 | elapsed:    0.2s finished\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "GridSearchCV(cv=5, error_score='raise',\n",
       "       estimator=Pipeline(steps=[('std', StandardScaler(copy=True, with_mean=True, with_std=True)), ('svc', SVC(C=10.0, cache_size=200, class_weight=None, coef0=0.0,\n",
       "  decision_function_shape='ovr', degree=3, gamma=0.1, kernel='rbf',\n",
       "  max_iter=-1, probability=False, random_state=None, shrinking=True,\n",
       "  tol=0.001, verbose=False))]),\n",
       "       fit_params={}, iid=True, n_jobs=-1,\n",
       "       param_grid=[{'svc__kernel': ['rbf'], 'svc__C': [1, 10, 100, 1000], 'svc__gamma': [0.001, 0.0001]}],\n",
       "       pre_dispatch='2*n_jobs', refit=True, scoring='accuracy', verbose=1)"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.grid_search import GridSearchCV\n",
    "from sklearn.pipeline import Pipeline\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.svm import SVC\n",
    "from sklearn.datasets import load_iris\n",
    "from sklearn.cross_validation import train_test_split\n",
    "\n",
    "\n",
    "# load and split data\n",
    "iris = load_iris()\n",
    "X, y = iris.data, iris.target\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)\n",
    "\n",
    "# pipeline setup\n",
    "cls = SVC(C=10.0, \n",
    "          kernel='rbf', \n",
    "          gamma=0.1, \n",
    "          decision_function_shape='ovr')\n",
    "\n",
    "kernel_svm = Pipeline([('std', StandardScaler()), \n",
    "                       ('svc', cls)])\n",
    "\n",
    "# gridsearch setup\n",
    "param_grid = [\n",
    "  {'svc__C': [1, 10, 100, 1000], \n",
    "   'svc__gamma': [0.001, 0.0001], \n",
    "   'svc__kernel': ['rbf']},\n",
    " ]\n",
    "\n",
    "gs = GridSearchCV(estimator=kernel_svm, \n",
    "                  param_grid=param_grid, \n",
    "                  scoring='accuracy', \n",
    "                  n_jobs=-1, \n",
    "                  cv=5, \n",
    "                  verbose=1, \n",
    "                  refit=True,\n",
    "                  pre_dispatch='2*n_jobs')\n",
    "\n",
    "# run gridearch\n",
    "gs.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Best GS Score 0.96\n",
      "best GS Params {'svc__kernel': 'rbf', 'svc__C': 100, 'svc__gamma': 0.001}\n",
      "\n",
      "Train Accuracy: 0.97\n",
      "\n",
      "Test Accuracy: 0.97\n"
     ]
    }
   ],
   "source": [
    "print('Best GS Score %.2f' % gs.best_score_)\n",
    "print('best GS Params %s' % gs.best_params_)\n",
    "\n",
    "\n",
    "# prediction on the training set\n",
    "y_pred = gs.predict(X_train)\n",
    "train_acc = (y_train == y_pred).sum()/len(y_train)\n",
    "print('\\nTrain Accuracy: %.2f' % (train_acc))\n",
    "\n",
    "# evaluation on the test set\n",
    "y_pred = gs.predict(X_test)\n",
    "test_acc = (y_test == y_pred).sum()/len(y_test)\n",
    "print('\\nTest Accuracy: %.2f' % (test_acc))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "### A Note about `GridSearchCV`'s `best_score_` attribute"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Please note that `gs.best_score_` is the average k-fold cross-validation score. I.e., if we have a `GridSearchCV` object with 5-fold cross-validation (like the one above), the `best_score_` attribute returns the average score over the 5-folds of the best model. To illustrate this with an example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([ 0.6,  0.4,  0.6,  0.2,  0.6])"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.cross_validation import StratifiedKFold, cross_val_score\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "import numpy as np\n",
    "\n",
    "np.random.seed(0)\n",
    "np.set_printoptions(precision=6)\n",
    "y = [np.random.randint(3) for i in range(25)]\n",
    "X = (y + np.random.randn(25)).reshape(-1, 1)\n",
    "\n",
    "cv5_idx = list(StratifiedKFold(y, n_folds=5, shuffle=False, random_state=0))\n",
    "cross_val_score(LogisticRegression(random_state=123), X, y, cv=cv5_idx)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By executing the code above, we created a simple data set of random integers that shall represent our class labels. Next, we fed the indices of 5 cross-validation folds (`cv3_idx`) to the `cross_val_score` scorer, which returned 5 accuracy scores -- these are the 5 accuracy values for the 5 test folds.  \n",
    "\n",
    "Next, let us use the `GridSearchCV` object and feed it the same 5 cross-validation sets (via the pre-generated `cv3_idx` indices):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Fitting 5 folds for each of 1 candidates, totalling 5 fits\n",
      "[CV]  ................................................................\n",
      "[CV] ....................................... , score=0.600000 -   0.0s\n",
      "[CV]  ................................................................\n",
      "[CV] ....................................... , score=0.400000 -   0.0s\n",
      "[CV]  ................................................................\n",
      "[CV] ....................................... , score=0.600000 -   0.0s\n",
      "[CV]  ................................................................\n",
      "[CV] ....................................... , score=0.200000 -   0.0s\n",
      "[CV]  ................................................................\n",
      "[CV] ....................................... , score=0.600000 -   0.0s\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.0s finished\n"
     ]
    }
   ],
   "source": [
    "from sklearn.grid_search import GridSearchCV\n",
    "gs = GridSearchCV(LogisticRegression(), {}, cv=cv5_idx, verbose=3).fit(X, y) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we can see, the scores for the 5 folds are exactly the same as the ones from `cross_val_score` earlier.  \n",
    "Now, the best_score_ attribute of the `GridSearchCV` object, which becomes available after `fit`ting, returns the average accuracy score of the best model:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.47999999999999998"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gs.best_score_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we can see, the result above is consistent with the average score computed the `cross_val_score`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.47999999999999998"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cross_val_score(LogisticRegression(), X, y, cv=cv5_idx).mean()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}