{ "metadata": { "name": "", "signature": "sha256:a36ead627e13e173214d136b8a6e1f1c1520da396fbd1d41d81610bcbe817aa5" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "code", "collapsed": false, "input": [ "%load_ext load_style\n", "%load_style talk.css" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Classification with scikit-learn" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This session gives an example of **supervised** learning: more specifically, we're going to see how to solve a **classification** problem in scikit-learn, with a focus on how to evaluate the performance of a model.\n", "\n", "We're going to use a dataset that comes with scikit-learn. It consists of hand-written digits, represented as normalized 8 x 8 pixel images, each with its associated label (the correct digit).\n", "\n", "This example is treated more comprehensively by [Olivier Grisel](http://ogrisel.com/) (see his notebooks [here](https://github.com/ogrisel/parallel_ml_tutorial))." ] }, { "cell_type": "code", "collapsed": false, "input": [ "import numpy as np\n", "import pandas as pd\n", "from matplotlib import pyplot as plt\n", "from IPython.display import Image, HTML\n", "%matplotlib inline" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.datasets import load_digits\n", "digits = load_digits()\n", "print(digits.DESCR)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# X holds one flattened 8 x 8 image per row, y the digit each image represents\n", "X, y = digits.data, digits.target\n", "\n", "print(\"data shape: %r, target shape: %r\" % (X.shape, y.shape))\n", "print(\"labels: %r\" % list(np.unique(y)))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "def plot_gallery(data, labels, shape, 
interpolation='nearest'):\n", "    # draws one row of 5 panels, so `data` is expected to hold 5 flattened images\n", "    f, ax = plt.subplots(1, 5, figsize=(16, 5))\n", "    for i in range(data.shape[0]):\n", "        ax[i].imshow(data[i].reshape(shape), interpolation=interpolation, cmap=plt.cm.gray_r)\n", "        ax[i].set_title(labels[i])\n", "        ax[i].set_xticks(()), ax[i].set_yticks(())" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "subsample = np.random.permutation(X.shape[0])[:5]\n", "images = X[subsample]\n", "labels = ['True label: %d' % l for l in y[subsample]]\n", "plot_gallery(images, labels, shape=(8, 8))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "example of hand-written digit classification with Support Vector Machines (SVM)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We import the `SVC` class (Support Vector **Classifier**) from scikit-learn's `svm` module" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.svm import SVC" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "instantiation" ] }, { "cell_type": "code", "collapsed": false, "input": [ "svc = SVC()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "fitting" ] }, { "cell_type": "code", "collapsed": false, "input": [ "svc.fit(X, y)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "scoring" ] }, { "cell_type": "code", "collapsed": false, "input": [ "svc.score(X, y)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "prediction" ] }, { "cell_type": "code", "collapsed": false, "input": [ "y_hat = svc.predict(X)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ 
"np.all(y_hat == y)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Have we got a perfect model?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we are making an important methodological mistake: we are using **all the instances available** to **train** the model, and using the **same** instances to **evaluate** the model in terms of accuracy. It tells us (almost) nothing about the actual performance of the model in production, just how well it can reproduce the data it has been exposed to ...\n", "\n", "A way to work around that is to train the model over a **subset** of the available instances (the **training set**), calculate the train score, and test the model (i.e. calculate the test score) over the remaining instances (the **test set**).\n", "\n", "**Cross-validation** consists in repeating this operation several times using successive splits of the original dataset into training and test sets, and calculating a summary statistic of the train and test scores over the iterations (usually the average).\n", "\n", "\n", "Several splitting strategies can be used: \n", "\n", "+ **Random split**: a given percentage of the data is selected at random as the test set. The draw is repeated independently at each iteration, so the test sets of different iterations may overlap\n", "+ **K-folds**: the dataset is divided into K disjoint folds that together cover all the instances; each fold is used in turn as the test set, while the K-1 other folds are used as the training set\n", "+ **Stratified K-folds**: for classification mainly. The folds are constructed so that the class distribution is approximately the same in each fold (i.e. the relative frequency of each class is preserved)\n", "+ **Leave One Out**: like K-fold with K = N (the number of instances). 
One instance is left out and the model is built on the N-1 remaining instances; this procedure is repeated until every instance has been left out once.\n" ] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "cross-validation in scikit-learn" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.cross_validation import train_test_split\n", "\n", "# hold out 25% of the instances as a test set; random_state makes the split reproducible\n", "X_train, X_test, y_train, y_test = train_test_split(X, y,\n", "                                                    test_size=0.25, random_state=1)\n", "\n", "print(\"train data shape: %r, train target shape: %r\"\n", "      % (X_train.shape, y_train.shape))\n", "print(\"test data shape: %r, test target shape: %r\"\n", "      % (X_test.shape, y_test.shape))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "svc = SVC().fit(X_train, y_train)\n", "train_score = svc.score(X_train, y_train)\n", "train_score" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "test_score = svc.score(X_test, y_test)\n", "test_score" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "OK, that seems more like a 'normal' result ...\n", "\n", "- if the **test** score is **markedly lower than** the **train** score, the model is **overfitting**\n", "\n", "- if the **train score is itself far from 100%** accuracy, the model is **underfitting**\n", "\n", "Ideally **we want to neither overfit nor underfit**: `test_score ~= train_score ~= 1.0`. \n", "\n", "When setting up a Support Vector Machine classifier, one needs to choose 2 parameters (hyper-parameters) which are NOT tuned at the fitting stage (they are NOT learned). For the default RBF kernel these are **C** and **$\\gamma$** (see the [relevant section](http://en.wikipedia.org/wiki/Support_vector_machine#Parameter_selection) in the Wikipedia article). What we did before was to instantiate the SVC class without specifying these parameters, which means that the defaults were used. 
Let's try something else. " ] }, { "cell_type": "code", "collapsed": false, "input": [ "svc_2 = SVC(C=100, gamma=0.001).fit(X_train, y_train)\n", "svc_2" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "svc_2.score(X_train, y_train)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "svc_2.score(X_test, y_test)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# the same accuracy computed by hand: fraction of test instances predicted correctly\n", "sum(svc_2.predict(X_test) == y_test) / float(len(y_test))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This could be luck, since we only used one train / test split here. Now we're going to use **cross-validation** to repeat the train / test split several times, so as to get a more reliable estimate of the real test score by averaging the values found over the individual runs.\n", "\n", "scikit-learn provides a very convenient interface to do that: `sklearn.cross_validation`" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn import cross_validation" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# type `cross_validation.` followed by <TAB> to browse the module interactively,\n", "# or list its public names:\n", "[name for name in dir(cross_validation) if not name.startswith('_')]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "cross_validation.ShuffleSplit?" 
], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "cv = cross_validation.ShuffleSplit(len(X), n_iter=3, test_size=0.2,\n", "                                   random_state=0)\n", "\n", "# each iteration yields the row indices of one random train / test split\n", "for cv_index, (train, test) in enumerate(cv):\n", "    print(\"# Cross Validation Iteration #%d\" % cv_index)\n", "    print(\"train indices: {0}...\".format(train[:10]))\n", "    print(\"test indices: {0}...\".format(test[:10]))\n", "\n", "    svc = SVC(C=100, gamma=0.001).fit(X[train], y[train])\n", "    print(\"train score: {0:.3f}, test score: {1:.3f}\\n\".format(\n", "        svc.score(X[train], y[train]), svc.score(X[test], y[test])))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There's a wrapper for estimating cross-validated scores directly: you just have to pass it a cross-validation iterator such as the one instantiated before" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.cross_validation import cross_val_score\n", "\n", "svc = SVC(C=100, gamma=0.001)\n", "\n", "cv = cross_validation.ShuffleSplit(len(X), n_iter=10, test_size=0.2,\n", "                                   random_state=0)\n", "\n", "test_scores = cross_val_score(svc, X, y, cv=cv, n_jobs=4) # n_jobs = 4 if you have a quad-core machine ...\n", "test_scores" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Cross-validation can be used to estimate the best hyperparameters for a model\n", "\n", "Let's see what happens when we fix C but vary $\\gamma$" ] }, { "cell_type": "code", "collapsed": false, "input": [ "n_iter = 5 # the number of iterations should be more than that ... \n", "\n", "gammas = np.logspace(-7, -1, 10) # should be more fine grained ... 
\n", "\n", "cv = cross_validation.ShuffleSplit(len(X), n_iter=n_iter, test_size=0.2)\n", "\n", "# one row per gamma value, one column per cross-validation iteration\n", "train_scores = np.zeros((len(gammas), n_iter))\n", "test_scores = np.zeros((len(gammas), n_iter))\n", "\n", "C = 1  # C is held fixed while gamma varies\n", "\n", "for i, gamma in enumerate(gammas):\n", "    for j, (train, test) in enumerate(cv):\n", "        clf = SVC(C=C, gamma=gamma).fit(X[train], y[train])\n", "        train_scores[i, j] = clf.score(X[train], y[train])\n", "        test_scores[i, j] = clf.score(X[test], y[test])" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "f, ax = plt.subplots(figsize=(12, 8))\n", "#for i in range(n_iter):\n", "#    ax.semilogx(gammas, train_scores[:, i], alpha=0.2, lw=2, c='b')\n", "#    ax.semilogx(gammas, test_scores[:, i], alpha=0.2, lw=2, c='g')\n", "\n", "# mean score over the cross-validation iterations for each gamma\n", "ax.semilogx(gammas, test_scores.mean(1), lw=4, c='g', label='test score')\n", "ax.semilogx(gammas, train_scores.mean(1), lw=4, c='b', label='train score')\n", "\n", "# shaded bands show the min / max score over the iterations\n", "ax.fill_between(gammas, train_scores.min(1), train_scores.max(1), color='b', alpha=0.2)\n", "ax.fill_between(gammas, test_scores.min(1), test_scores.max(1), color='g', alpha=0.2)\n", "\n", "ax.set_ylabel(r\"score for SVC(C=%4.2f, varying $\\gamma$)\" % C, fontsize=16)\n", "ax.set_xlabel(r\"$\\gamma$\", fontsize=16)\n", "\n", "# annotate the gamma with the highest mean test score\n", "best_gamma = gammas[np.argmax(test_scores.mean(1))]\n", "best_score = test_scores.mean(1).max()\n", "ax.text(best_gamma, best_score + 0.05, r\"$\\gamma$ = %6.4f | score=%6.4f\" % (best_gamma, best_score),\n", "        fontsize=15, bbox=dict(facecolor='w', alpha=0.5))\n", "ax.tick_params(axis='both', labelsize=16)\n", "ax.legend(fontsize=16, loc=0)\n", "ax.set_ylim(0, 1.1)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Grid-search" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can search the hyperparameter space and find the best 
hyperparameters using grid search in scikit-learn" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.grid_search import GridSearchCV" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# 4 values of C and 5 values of gamma, evenly spaced on a log scale\n", "svc_params = {\n", "    'C': np.logspace(-1, 2, 4),\n", "    'gamma': np.logspace(-4, 0, 5),\n", "}" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# every (C, gamma) combination is scored with 3-fold cross-validation\n", "gs_svc = GridSearchCV(SVC(), svc_params, cv=3, n_jobs=4)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "gs_svc.fit(X, y)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "gs_svc.best_params_, gs_svc.best_score_" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Exercise: predicting the quality of a wine given a set of physicochemical measurements" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Two datasets were created, using red and white wine samples.\n", "The inputs include objective tests (e.g. pH values) and the output is based on sensory data\n", "(median of at least 3 evaluations made by wine experts). Each expert graded the wine quality \n", "between 0 (very bad) and 10 (very excellent).\n", "\n", "P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. \n", "Modeling wine preferences by data mining from physicochemical properties.\n", "In Decision Support Systems, Elsevier, 47(4):547-553. 
ISSN: 0167-9236.\n", "\n", "This dataset is available from the UC Irvine Machine Learning Repository: [http://archive.ics.uci.edu/ml/datasets/Wine+Quality](http://archive.ics.uci.edu/ml/datasets/Wine+Quality)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can try several **classification** approaches for the quality (treating the integer grades in `quality` as discrete classes), or you can try **regression** approaches (using either statsmodels or sklearn),\n", "e.g. predicting the alcohol content given the other measurements (or a subset thereof)." ] }, { "cell_type": "code", "collapsed": false, "input": [ "wine = pd.read_csv('./data/winequality-red.csv', sep=';')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "wine.head()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below is an example of classification (using the same SVC classifier). \n", "\n", "You still need to add the cross-validation step." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# remove the target column from the DataFrame and keep it separately\n", "quality = wine.pop('quality')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "y = quality.values" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "X = wine.values" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.preprocessing import StandardScaler" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "scaler = StandardScaler()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "scaler.fit(X)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "Xscaled = scaler.transform(X)" ], "language": "python", "metadata": {}, 
"outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.svm import SVC" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "svc = SVC()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "svc.fit(Xscaled, y)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "y_hat = svc.predict(Xscaled)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "y_hat" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "y" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# score on the scaled features, since that is what the model was fitted on\n", "svc.score(Xscaled, y)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.metrics import confusion_matrix" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# rows are the true grades, columns the predicted grades\n", "confusion_matrix(y, y_hat)" ], "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }