{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "[Sebastian Raschka](http://sebastianraschka.com), 2015\n", "\n", "https://github.com/rasbt/python-machine-learning-book" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Python Machine Learning - Code Examples" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Bonus Material - An Extended Nested Cross-Validation Example" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For an explanation of nested cross-validation, please see:\n", " \n", "- Chapter 6, section \"Algorithm-selection-with-nested-cross-validation\" (open the code example via [nbviewer](http://nbviewer.ipython.org/github/rasbt/python-machine-learning-book/blob/master/code/ch06/ch06.ipynb#Algorithm-selection-with-nested-cross-validation))\n", "- FAQ, section: [How do I evaluate a model?](https://github.com/rasbt/python-machine-learning-book/blob/master/faq/evaluate-a-model.md)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the optional watermark extension is a small IPython notebook plugin that I developed to make the code reproducible. You can just skip the following line(s)." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Sebastian Raschka \n", "Last updated: 11/30/2015 \n", "\n", "CPython 3.5.0\n", "IPython 4.0.0\n", "\n", "numpy 1.10.1\n", "pandas 0.17.1\n", "matplotlib 1.5.0\n", "scikit-learn 0.17\n" ] } ], "source": [ "%load_ext watermark\n", "%watermark -a 'Sebastian Raschka' -u -d -v -p numpy,pandas,matplotlib,scikit-learn" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Dataset and Estimator Setup" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.grid_search import GridSearchCV\n", "from sklearn.pipeline import Pipeline\n", "from sklearn.preprocessing import StandardScaler\n", "from sklearn.svm import SVC\n", "from sklearn.datasets import load_iris\n", "from sklearn.cross_validation import train_test_split\n", "\n", "\n", "# load and split data\n", "iris = load_iris()\n", "X, y = iris.data, iris.target\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)\n", "\n", "# pipeline setup\n", "cls = SVC(C=10.0, kernel='rbf', gamma=0.1, decision_function_shape='ovr')\n", "kernel_svm = Pipeline([('std', StandardScaler()), \n", " ('svc', cls)])\n", "\n", "# gridsearch setup\n", "param_grid = [\n", " {'svc__C': [1, 10, 100, 1000], \n", " 'svc__gamma': [0.001, 0.0001], \n", " 'svc__kernel': ['rbf']},\n", " ]\n", "\n", "\n", "# setup multiple GridSearchCV objects, 1 for each algorithm\n", "\n", "gs_svm = GridSearchCV(estimator=kernel_svm, \n", " param_grid=param_grid, \n", " scoring='accuracy', \n", " n_jobs=-1, \n", " cv=5, \n", " verbose=0, \n", " refit=True,\n", " 
pre_dispatch='2*n_jobs')\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## A. Nested Cross-Validation - Quick Version" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, the `cross_val_score` function runs the 5 outer loops, and the `GridSearchCV` object (`gs_svm`) performs the hyperparameter optimization during the 5 inner loops." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Average Accuracy 0.95 +/- 0.06\n" ] } ], "source": [ "import numpy as np \n", "\n", "from sklearn.cross_validation import cross_val_score\n", "scores = cross_val_score(gs_svm, X_train, y_train, scoring='accuracy', cv=5)\n", "print('\\nAverage Accuracy %.2f +/- %.2f' % (np.mean(scores), np.std(scores)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## B. Nested Cross-Validation - Manual Approach, Printing the Model Parameters" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.cross_validation import StratifiedKFold\n", "from sklearn.metrics import accuracy_score\n", "import numpy as np\n", "\n", "params = []\n", "scores = []\n", "\n", "skfold = StratifiedKFold(y=y_train, n_folds=5, shuffle=False, random_state=1)\n", "for train_idx, test_idx in skfold:\n", " gs_svm.fit(X_train[train_idx], y_train[train_idx])\n", " y_pred = gs_svm.predict(X_train[test_idx])\n", " acc = accuracy_score(y_true=y_train[test_idx], y_pred=y_pred)\n", " params.append(gs_svm.best_params_)\n", " scores.append(acc)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "SVM models:\n", "1. Acc: 0.96 Params: {'svc__C': 100, 'svc__gamma': 0.001, 'svc__kernel': 'rbf'}\n", "2. Acc: 1.00 Params: {'svc__C': 100, 'svc__gamma': 0.001, 'svc__kernel': 'rbf'}\n", "3. 
Acc: 0.83 Params: {'svc__C': 1000, 'svc__gamma': 0.001, 'svc__kernel': 'rbf'}\n", "4. Acc: 1.00 Params: {'svc__C': 100, 'svc__gamma': 0.001, 'svc__kernel': 'rbf'}\n", "5. Acc: 0.96 Params: {'svc__C': 100, 'svc__gamma': 0.001, 'svc__kernel': 'rbf'}\n", "\n", "Average Accuracy 0.95 +/- 0.06\n" ] } ], "source": [ "print('SVM models:')\n", "for idx, m in enumerate(zip(params, scores)):\n", " print('%s. Acc: %.2f Params: %s' % (idx+1, m[1], m[0]))\n", "print('\\nAverage Accuracy %.2f +/- %.2f' % (np.mean(scores), np.std(scores)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Regular K-fold CV to Optimize the Model on the Complete Training Set" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Repeat the nested cross-validation for different algorithms. Then, pick the \"best\" algorithm (not the best model!). Next, use the complete training set to tune the best algorithm via grid search:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Best parameters {'svc__C': 100, 'svc__gamma': 0.001, 'svc__kernel': 'rbf'}\n" ] } ], "source": [ "gs_svm.fit(X_train, y_train)\n", "print('Best parameters %s' % gs_svm.best_params_)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Training accuracy: 0.97\n", "Test accuracy: 0.97\n", "Parameters: {'svc__C': 100, 'svc__gamma': 0.001, 'svc__kernel': 'rbf'}\n" ] } ], "source": [ "train_acc = accuracy_score(y_true=y_train, y_pred=gs_svm.predict(X_train))\n", "test_acc = accuracy_score(y_true=y_test, y_pred=gs_svm.predict(X_test))\n", "print('Training accuracy: %.2f' % train_acc)\n", "print('Test accuracy: %.2f' % test_acc)\n", "print('Parameters: %s' % gs_svm.best_params_)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { 
"codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.0" } }, "nbformat": 4, "nbformat_minor": 0 }