{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "[Sebastian Raschka](http://sebastianraschka.com), 2015\n", "\n", "https://github.com/rasbt/python-machine-learning-book" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Python Machine Learning - Code Examples" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Bonus Material - An Extended Nested Cross-Validation Example" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For an explanation of nested cross-validation, please see:\n", " \n", "- Chapter 6, section \"Algorithm-selection-with-nested-cross-validation\" (open the code example via [nbviewer](http://nbviewer.ipython.org/github/rasbt/python-machine-learning-book/blob/master/code/ch06/ch06.ipynb#Algorithm-selection-with-nested-cross-validation))\n", "- FAQ, section: [How do I evaluate a model?](https://github.com/rasbt/python-machine-learning-book/blob/master/faq/evaluate-a-model.md)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the optional watermark extension is a small IPython notebook plugin that I developed to make the code reproducible. You can just skip the following line(s)." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Sebastian Raschka \n", "Last updated: 11/30/2015 \n", "\n", "CPython 3.5.0\n", "IPython 4.0.0\n", "\n", "numpy 1.10.1\n", "pandas 0.17.1\n", "matplotlib 1.5.0\n", "scikit-learn 0.17\n" ] } ], "source": [ "%load_ext watermark\n", "%watermark -a 'Sebastian Raschka' -u -d -v -p numpy,pandas,matplotlib,scikit-learn" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Dataset and Estimator Setup" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.grid_search import GridSearchCV\n", "from sklearn.pipeline import Pipeline\n", "from sklearn.preprocessing import StandardScaler\n", "from sklearn.svm import SVC\n", "from sklearn.datasets import load_iris\n", "from sklearn.cross_validation import train_test_split\n", "\n", "\n", "# load and split data\n", "iris = load_iris()\n", "X, y = iris.data, iris.target\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)\n", "\n", "# pipeline setup\n", "cls = SVC(C=10.0, kernel='rbf', gamma=0.1, decision_function_shape='ovr')\n", "kernel_svm = Pipeline([('std', StandardScaler()), \n", " ('svc', cls)])\n", "\n", "# gridsearch setup\n", "param_grid = [\n", " {'svc__C': [1, 10, 100, 1000], \n", " 'svc__gamma': [0.001, 0.0001], \n", " 'svc__kernel': ['rbf']},\n", " ]\n", "\n", "\n", "# setup multiple GridSearchCV objects, 1 for each algorithm\n", "\n", "gs_svm = GridSearchCV(estimator=kernel_svm, \n", " param_grid=param_grid, \n", " scoring='accuracy', \n", " n_jobs=-1, \n", " cv=5, \n", " verbose=0, \n", " refit=True,\n", " 
pre_dispatch='2*n_jobs')\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## A. Nested Cross-Validation - Quick Version" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, the `cross_val_score` function runs the 5 outer loops, and the `GridSearchCV` object (`gs_svm`) performs the hyperparameter optimization during the 5 inner loops." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Average Accuracy 0.95 +/- 0.06\n" ] } ], "source": [ "import numpy as np \n", "\n", "from sklearn.cross_validation import cross_val_score\n", "scores = cross_val_score(gs_svm, X_train, y_train, scoring='accuracy', cv=5)\n", "print('\\nAverage Accuracy %.2f +/- %.2f' % (np.mean(scores), np.std(scores)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## B. Nested Cross-Validation - Manual Approach, Printing the Model Parameters" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.cross_validation import StratifiedKFold\n", "from sklearn.metrics import accuracy_score\n", "import numpy as np\n", "\n", "params = []\n", "scores = []\n", "\n", "skfold = StratifiedKFold(y=y_train, n_folds=5, shuffle=False, random_state=1)\n", "for train_idx, test_idx in skfold:\n", " gs_svm.fit(X_train[train_idx], y_train[train_idx])\n", " y_pred = gs_svm.predict(X_train[test_idx])\n", " acc = accuracy_score(y_true=y_train[test_idx], y_pred=y_pred)\n", " params.append(gs_svm.best_params_)\n", " scores.append(acc)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "SVM models:\n", "1. Acc: 0.96 Params: {'svc__C': 100, 'svc__gamma': 0.001, 'svc__kernel': 'rbf'}\n", "2. Acc: 1.00 Params: {'svc__C': 100, 'svc__gamma': 0.001, 'svc__kernel': 'rbf'}\n", "3. 
Acc: 0.83 Params: {'svc__C': 1000, 'svc__gamma': 0.001, 'svc__kernel': 'rbf'}\n", "4. Acc: 1.00 Params: {'svc__C': 100, 'svc__gamma': 0.001, 'svc__kernel': 'rbf'}\n", "5. Acc: 0.96 Params: {'svc__C': 100, 'svc__gamma': 0.001, 'svc__kernel': 'rbf'}\n", "\n", "Average Accuracy 0.95 +/- 0.06\n" ] } ], "source": [ "print('SVM models:')\n", "for idx, m in enumerate(zip(params, scores)):\n", " print('%s. Acc: %.2f Params: %s' % (idx+1, m[1], m[0]))\n", "print('\\nAverage Accuracy %.2f +/- %.2f' % (np.mean(scores), np.std(scores)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Regular K-fold CV to Optimize the Model on the Complete Training Set" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Repeat the nested cross-validation for different algorithms. Then, pick the \"best\" algorithm (not the best model!). Next, use the complete training set to tune the best algorithm via grid search:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Best parameters {'svc__C': 100, 'svc__gamma': 0.001, 'svc__kernel': 'rbf'}\n" ] } ], "source": [ "gs_svm.fit(X_train, y_train)\n", "print('Best parameters %s' % gs_svm.best_params_)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Training accuracy: 0.97\n", "Test accuracy: 0.97\n", "Parameters: {'svc__C': 100, 'svc__gamma': 0.001, 'svc__kernel': 'rbf'}\n" ] } ], "source": [ "train_acc = accuracy_score(y_true=y_train, y_pred=gs_svm.predict(X_train))\n", "test_acc = accuracy_score(y_true=y_test, y_pred=gs_svm.predict(X_test))\n", "print('Training accuracy: %.2f' % train_acc)\n", "print('Test accuracy: %.2f' % test_acc)\n", "print('Parameters: %s' % gs_svm.best_params_)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { 
"codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.0" } }, "nbformat": 4, "nbformat_minor": 0 }