{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "[Sebastian Raschka](http://sebastianraschka.com), 2015\n", "\n", "https://github.com/rasbt/python-machine-learning-book" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Python Machine Learning - Code Examples" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Bonus Material - A Basic Pipeline and Grid Search Setup" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the optional watermark extension is a small IPython notebook plugin that I developed to make the code reproducible.

from sklearn.grid_search import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.cross_validation import train_test_split


# load and split data
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

# pipeline setup
cls = SVC(C=10.0, 
          kernel='rbf', 
          gamma=0.1, 
          decision_function_shape='ovr')

kernel_svm = Pipeline([('std', StandardScaler()), 
                        ('svc', cls)])

# grid search setup
param_grid = [
              {'svc__C': [1, 10, 100, 1000], 
               'svc__gamma': [0.001, 0.0001], 
               'svc__kernel': ['rbf']},
             ]

gs = GridSearchCV(estimator=kernel_svm, 
                  param_grid=param_grid, 
                  scoring='accuracy', 
                  n_jobs=-1, 
                  cv=5, 
                  verbose=1, 
                  refit=True,
                  pre_dispatch='2*n_jobs')

# run grid search
gs.fit(X_train, y_train)

print('Best GS Score: %.2f' % gs.best_score_)
print('Best GS Params: %s' % gs.best_params_)


# prediction on the training set
y_pred = gs.predict(X_train)
train_acc = (y_train == y_pred).sum()/len(y_train)
print('\nTrain Accuracy: %.2f' % (train_acc))

# evaluation on the test set
y_pred = gs.predict(X_test)
test_acc = (y_test == y_pred).sum()/len(y_test)
print('\nTest Accuracy: %.2f' % (test_acc))
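
As a rough sketch of how much work the grid search above performs, the snippet below enumerates the parameter combinations using only the standard library. The helper `expand_grid` and the variable `n_folds` are illustrative names of my own choosing and not part of scikit-learn; the grid and the fold count are copied from the setup above.

```python
from itertools import product

# Same grid as in the setup above, repeated so the snippet is self-contained.
param_grid = [
    {'svc__C': [1, 10, 100, 1000],
     'svc__gamma': [0.001, 0.0001],
     'svc__kernel': ['rbf']},
]
n_folds = 5  # assumption: matches cv=5 in the GridSearchCV call above


def expand_grid(grid):
    """Illustrative helper: yield one dict per parameter combination."""
    for sub_grid in grid:
        names = sorted(sub_grid)
        for values in product(*(sub_grid[name] for name in names)):
            yield dict(zip(names, values))


candidates = list(expand_grid(param_grid))
n_fits = len(candidates) * n_folds
print('%d candidates x %d folds = %d fits' % (len(candidates), n_folds, n_fits))

# The 'svc__' prefix routes a parameter to the pipeline step named 'svc'.
step, param = 'svc__C'.split('__', 1)
print('step=%s, parameter=%s' % (step, param))
```

The grid has 4 x 2 x 1 = 8 candidates, so 5-fold cross-validation trains 40 models. Because `refit=True`, the best candidate is then fit once more on the complete training set.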

### A Note about `GridSearchCV`'s `best_score_` attribute

Please note that `gs.best_score_` is the average k-fold cross-validation score. I.e., if we have a `GridSearchCV` object with 5-fold cross-validation (like the one above), the `best_score_` attribute returns the best model's average score over the 5 folds. To illustrate this with an example:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([ 0.6, 0.4, 0.6, 0.2, 0.6])" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.cross_validation import StratifiedKFold, cross_val_score\n", "from sklearn.linear_model import LogisticRegression\n", "import numpy as np\n", "\n", "np.random.seed(0)\n", "np.set_printoptions(precision=6)\n", "y = [np.random.randint(3) for i in range(25)]\n", "X = (y + np.random.randn(25)).reshape(-1, 1)\n", "\n", "cv5_idx = list(StratifiedKFold(y, n_folds=5, shuffle=False, random_state=0))\n", "cross_val_score(LogisticRegression(random_state=123), X, y, cv=cv5_idx)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By executing the code above, we created a simple data set: 25 random integers that represent our class labels, and a single feature that equals the label plus Gaussian noise. Next, we fed the indices of 5 cross-validation folds (`cv5_idx`) to the `cross_val_score` scorer, which returned 5 accuracy scores -- these are the 5 accuracy values for the 5 test folds. \n", "\n", "Next, let us use the `GridSearchCV` object and feed it the same 5 cross-validation sets (via the pre-generated `cv5_idx` indices):" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Fitting 5 folds for each of 1 candidates, totalling 5 fits\n", "[CV] ................................................................\n", "[CV] ....................................... , score=0.600000 - 0.0s\n", "[CV] ................................................................\n", "[CV] ....................................... , score=0.400000 - 0.0s\n", "[CV] ................................................................\n", "[CV] ....................................... , score=0.600000 - 0.0s\n", "[CV] ................................................................\n", "[CV] ....................................... , score=0.200000 - 0.0s\n", "[CV] ................................................................\n", "[CV] ....................................... , score=0.600000 - 0.0s\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 0.0s finished\n" ] } ], "source": [ "from sklearn.grid_search import GridSearchCV\n", "gs = GridSearchCV(LogisticRegression(), {}, cv=cv5_idx, verbose=3).fit(X, y) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see, the scores for the 5 folds are exactly the same as the ones from `cross_val_score` earlier. \n", "Now, the `best_score_` attribute of the `GridSearchCV` object, which becomes available after `fit`ting, returns the average accuracy score of the best model:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.47999999999999998" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gs.best_score_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see, the result above is consistent with the average score computed by `cross_val_score`."
] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.47999999999999998" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cross_val_score(LogisticRegression(), X, y, cv=cv5_idx).mean()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.1" } }, "nbformat": 4, "nbformat_minor": 0 }