{ "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.9" }, "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Model Selection and Assessment" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Outline of the session:\n", "\n", "- Model performance evaluation and **detection of overfitting with Cross-Validation**\n", "- **Hyper parameter tuning** and model selection with Grid Search\n", "- Error analysis with **learning curves** and the **Bias-Variance trade-off**\n", "- Overfitting via Model Selection and the **Development / Evaluation set split**" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "\n", "# Some nice default configuration for plots\n", "plt.rcParams['figure.figsize'] = 10, 7.5\n", "plt.rcParams['axes.grid'] = True\n", "plt.gray()" ], "language": "python", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "The Hand Written Digits Dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's load a simple dataset of 8x8 gray level images of handwritten digits (bundled in the sklearn source code):" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.datasets import load_digits\n", "digits = load_digits()\n", "list(digits.keys())" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "print(digits.DESCR)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "X, y = digits.data, digits.target\n", "\n", "print(\"data shape: %r, target shape: %r\" % (X.shape, y.shape))\n", "print(\"classes: %r\" % list(np.unique(y)))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "n_samples, n_features = X.shape\n", "print(\"n_samples=%d\" % n_samples)\n", "print(\"n_features=%d\" % n_features)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "def plot_gallery(data, labels, shape, interpolation='nearest'):\n", " for i in range(data.shape[0]):\n", " plt.subplot(1, data.shape[0], (i + 1))\n", " plt.imshow(data[i].reshape(shape), interpolation=interpolation)\n", " plt.title(labels[i])\n", " plt.xticks(()), plt.yticks(())" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "subsample = np.random.permutation(X.shape[0])[:5]\n", "images = X[subsample]\n", "labels = ['True class: %d' % l for l in y[subsample]]\n", "plot_gallery(images, labels, shape=(8, 8))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's visualize the dataset on a 2D plane using a projection on the first 2 axis extracted by Principal Component Analysis:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.decomposition import RandomizedPCA\n", "\n", "pca = 
RandomizedPCA(n_components=2)\n", "X_pca = pca.fit_transform(X)\n", "\n", "X_pca.shape" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "from itertools import cycle\n", "\n", "colors = ['b', 'g', 'r', 'c', 'm', 'y', 'k']\n", "markers = ['+', 'o', '^', 'v', '<', '>', 'D', 'h', 's']\n", "for i, c, m in zip(np.unique(y), cycle(colors), cycle(markers)):\n", " plt.scatter(X_pca[y == i, 0], X_pca[y == i, 1],\n", " c=c, marker=m, label=i, alpha=0.5)\n", " \n", "_ = plt.legend(loc='best')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can observe that even in 2D, the groups of digits are quite well separated, especially the digit \"0\" that is very different from any other (the closest being \"6\" as it often shares most of the left hand side pixels). We can also observe that, at least in 2D, there is quite a bit of overlap between the \"1\", \"2\" and \"7\" digits.\n", "\n", "To better understand the meaning of the \"x\" and \"y\" axes of this plot, it is also useful to visualize the values of the first two principal components that are used to compute this projection:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "labels = ['Component #%d' % i for i in range(len(pca.components_))]\n", "plot_gallery(pca.components_, labels, shape=(8, 8))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As this dataset is small, both in terms of number of samples (1797) and features (64), we can compute the full (untruncated), exact PCA and have a look at the percentage of variance explained by each component of the PCA model:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.decomposition import PCA\n", "\n", "pca_big = PCA().fit(X, y)\n", "plt.title(\"Explained Variance\")\n", "plt.ylabel(\"Percentage of explained variance\")\n", "plt.xlabel(\"PCA Components\")\n", "plt.plot(pca_big.explained_variance_ratio_);" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It might be easier to interpret by plotting the cumulative variance explained by all previous components, using the `numpy.cumsum` function:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "plt.title(\"Cumulative Explained Variance\")\n", "plt.ylabel(\"Percentage of explained variance\")\n", "plt.xlabel(\"PCA Components\")\n", "plt.plot(np.cumsum(pca_big.explained_variance_ratio_));" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Overfitting" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Overfitting is the problem of learning the training data by heart and being unable to generalize by making correct predictions on data samples not seen during training." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To illustrate this, let's train a Support Vector Machine naively on the digits dataset:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.svm import SVC\n", "SVC().fit(X, y).score(X, y)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Did we really learn a perfect model that can recognize the correct digit class 100% of the time? 
**Without new data it's impossible to tell.**\n", "\n", "Let's start again and split the dataset into two random, non-overlapping subsets:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.cross_validation import train_test_split\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(\n", " X, y, test_size=0.25, random_state=0)\n", "\n", "print(\"train data shape: %r, train target shape: %r\"\n", " % (X_train.shape, y_train.shape))\n", "print(\"test data shape: %r, test target shape: %r\"\n", " % (X_test.shape, y_test.shape))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's retrain a new model on the first subset, called the **training set**:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "svc = SVC(kernel='rbf').fit(X_train, y_train)\n", "train_score = svc.score(X_train, y_train) \n", "train_score" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now compute the performance of the model on new, held out data from the **test set**:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "test_score = svc.score(X_test, y_test)\n", "test_score" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This score is clearly not as good as expected! The model cannot generalize so well to new, unseen data.\n", "\n", "- Whenever the **test** data score is **not as good as** the **train** score, the model is **overfitting**\n", "\n", "- Whenever the **train score is not close to 100%** accuracy, the model is **underfitting**\n", "\n", "Ideally **we want to neither overfit nor underfit**: `test_score ~= train_score ~= 1.0`. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The previous example failed to generalize well to test data because we naively used the default parameters of the `SVC` class:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "svc" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's try again with another parameterization:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "svc_2 = SVC(kernel='rbf', C=100, gamma=0.001).fit(X_train, y_train)\n", "svc_2" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "svc_2.score(X_train, y_train)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "svc_2.score(X_test, y_test)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this case the model is almost perfectly able to generalize, at least according to our random train / test split." 
] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Cross Validation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Cross Validation is a procedure to repeat the train / test split several times so as to get a more accurate estimate of the real test score, by averaging the scores found on the individual runs.\n", "\n", "The `sklearn.cross_validation` package provides many strategies to compute such splits, using classes that implement the Python iterator API:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.cross_validation import ShuffleSplit\n", "\n", "cv = ShuffleSplit(n_samples, n_iter=3, test_size=0.1,\n", " random_state=0)\n", "\n", "for cv_index, (train, test) in enumerate(cv):\n", " print(\"# Cross Validation Iteration #%d\" % cv_index)\n", " print(\"train indices: {0}...\".format(train[:10]))\n", " print(\"test indices: {0}...\".format(test[:10]))\n", " \n", " svc = SVC(kernel=\"rbf\", C=1, gamma=0.001).fit(X[train], y[train])\n", " print(\"train score: {0:.3f}, test score: {1:.3f}\\n\".format(\n", " svc.score(X[train], y[train]), svc.score(X[test], y[test])))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Instead of doing the above manually, `sklearn.cross_validation` provides a little utility function to compute the cross validated test scores automatically:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from scipy.stats import sem\n", "\n", "def mean_score(scores):\n", " \"\"\"Format the empirical mean score +/- 2 standard errors of the mean.\"\"\"\n", " return (\"Mean score: {0:.3f} (+/-{1:.3f})\").format(\n", " np.mean(scores), 2 * sem(scores))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.cross_validation import cross_val_score\n", "\n", "svc = SVC(kernel=\"rbf\", C=1, gamma=0.001)\n", "cv = ShuffleSplit(n_samples, n_iter=10, test_size=0.1,\n", " random_state=0)\n", "\n", "test_scores = cross_val_score(svc, X, y, cv=cv, n_jobs=2)\n", "test_scores" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "print(mean_score(test_scores))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Exercise:** \n", "\n", "- Perform 50 iterations of cross validation with folds of 500 training samples and 500 test samples randomly sampled from `X` and `y` (use `sklearn.cross_validation.ShuffleSplit`).\n", "- Try with `SVC(C=1, gamma=0.01)`\n", "- Plot the distribution of the test error using a histogram with 50 bins.\n", "- Try to increase the training size.\n", "- Retry with `SVC(C=10, gamma=0.005)`, then `SVC(C=10, gamma=0.001)` with 500 samples.\n", "\n", "- Optional: use a smoothed kernel density estimation `scipy.stats.kde.gaussian_kde` instead of a histogram to visualize the test error distribution.\n", "\n", "Hints, type:\n", "\n", " from sklearn.cross_validation import ShuffleSplit\n", " ShuffleSplit? # to read the docstring of the shuffle split\n", " plt.hist? 
# to read the docstring of the histogram plot\n" ] }, { "cell_type": "code", "collapsed": true, "input": [], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "%load solutions/05A_large_cross_validation.py" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": true, "input": [], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "%load solutions/05B_cross_validation_score_histogram.py" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Model Selection with Grid Search" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Cross Validation makes it possible to evaluate the performance of a model class and its hyper parameters on the task at hand.\n", "\n", "A natural extension is thus to run CV several times for various values of the parameters so as to find the best ones. For instance, let's fix the SVC parameter to `C=10` and compute the cross validated test score for various values of `gamma`:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "n_gammas = 10\n", "n_iter = 5\n", "cv = ShuffleSplit(n_samples, n_iter=n_iter, train_size=500, test_size=500,\n", " random_state=0)\n", "\n", "train_scores = np.zeros((n_gammas, n_iter))\n", "test_scores = np.zeros((n_gammas, n_iter))\n", "gammas = np.logspace(-7, -1, n_gammas)\n", "\n", "for i, gamma in enumerate(gammas):\n", " for j, (train, test) in enumerate(cv):\n", " clf = SVC(C=10, gamma=gamma).fit(X[train], y[train])\n", " train_scores[i, j] = clf.score(X[train], y[train])\n", " test_scores[i, j] = clf.score(X[test], y[test])" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "def plot_validation_curves(param_values, train_scores, test_scores):\n", " for i in range(train_scores.shape[1]):\n", " plt.semilogx(param_values, train_scores[:, i], alpha=0.4, lw=2, c='b')\n", " plt.semilogx(param_values, test_scores[:, i], alpha=0.4, lw=2, c='g')\n", "\n", "plot_validation_curves(gammas, train_scores, test_scores)\n", "plt.ylabel(\"score for SVC(C=10, gamma=gamma)\")\n", "plt.xlabel(\"gamma\")\n", "plt.text(1e-6, 0.5, \"Underfitting\", fontsize=16, ha='center', va='bottom')\n", "plt.text(1e-4, 0.5, \"Good\", fontsize=16, ha='center', va='bottom')\n", "plt.text(1e-2, 0.5, \"Overfitting\", fontsize=16, ha='center', va='bottom')\n", "plt.title('Validation curves for the gamma parameter');" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that, **for this model class, on this unscaled dataset**: when `C=10`, **there is a sweet spot region for gamma around $10^{-4}$ to $10^{-3}$**. Both the train and test scores are high (low errors).\n", "\n", "- If **gamma is too low, the train score is low** (and thus the test score too, as it generally cannot be better than the train score): the model is not expressive enough to represent the data: the model is in an **underfitting regime**.\n", " \n", "- If **gamma is too high**, the train score is high but there is a large discrepancy between the test and train scores. 
The model is learning the training data and its noise by heart and fails to generalize to new unseen data: the model is in an **overfitting regime**.\n", "\n", "Note: scikit-learn provides tools to compute such curves easily; we can do the same kind of analysis to identify good values for `C` when `gamma` is fixed to $10^{-3}$:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.learning_curve import validation_curve\n", "\n", "n_Cs = 10\n", "Cs = np.logspace(-5, 5, n_Cs)\n", "\n", "train_scores, test_scores = validation_curve(\n", " SVC(gamma=1e-3), X, y, 'C', Cs, cv=cv)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "plot_validation_curves(Cs, train_scores, test_scores)\n", "plt.ylabel(\"score for SVC(C=C, gamma=1e-3)\")\n", "plt.xlabel(\"C\")\n", "plt.text(1e-3, 0.5, \"Underfitting\", fontsize=16, ha='center', va='bottom')\n", "plt.text(1e3, 0.5, \"Overfitting a little\", fontsize=16, ha='center', va='bottom')\n", "plt.title('Validation curves for the C parameter');" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Doing this procedure several times, once for each parameter combination, is tedious. Fortunately it is possible to automate it by computing the test score for all possible combinations of parameters using the `GridSearchCV` helper." ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.grid_search import GridSearchCV\n", "#help(GridSearchCV)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "from pprint import pprint\n", "svc_params = {\n", " 'C': np.logspace(-1, 2, 4),\n", " 'gamma': np.logspace(-4, 0, 5),\n", "}\n", "pprint(svc_params)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As Grid Search is a costly procedure, let's do some experiments with a smaller dataset:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "n_subsamples = 500\n", "X_small_train, y_small_train = X_train[:n_subsamples], y_train[:n_subsamples]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "gs_svc = GridSearchCV(SVC(), svc_params, cv=3, n_jobs=-1)\n", "\n", "%time _ = gs_svc.fit(X_small_train, y_small_train)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "gs_svc.best_params_, gs_svc.best_score_" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "gs_svc.grid_scores_" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "first_score = gs_svc.grid_scores_[0]\n", "first_score" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "dict(vars(first_score))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's define a couple of helper functions to help us introspect the details of the grid search outcome:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def display_scores(params, scores, append_star=False):\n", " \"\"\"Format the mean score +/- std error for params\"\"\"\n", " params = \", \".join(\"{0}={1}\".format(k, v)\n", " for k, v in params.items())\n", " line = \"{0}:\\t{1:.3f} (+/-{2:.3f})\".format(\n", " params, 
np.mean(scores), sem(scores))\n", " if append_star:\n", " line += \" *\"\n", " return line\n", "\n", "def display_grid_scores(grid_scores, top=None):\n", " \"\"\"Helper function to format a report on a grid of scores\"\"\"\n", " \n", " grid_scores = sorted(grid_scores, key=lambda x: x[1], reverse=True)\n", " if top is not None:\n", " grid_scores = grid_scores[:top]\n", " \n", " # Compute a threshold for starring models with overlapping\n", " # stderr:\n", " _, best_mean, best_scores = grid_scores[0]\n", " threshold = best_mean - 2 * sem(best_scores)\n", " \n", " for params, mean_score, scores in grid_scores:\n", " append_star = mean_score + 2 * sem(scores) > threshold\n", " print(display_scores(params, scores, append_star=append_star))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "display_grid_scores(gs_svc.grid_scores_, top=20)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One can see that Support Vector Machines with RBF kernel are very sensitive to the `gamma` parameter (the bandwidth of the kernel) and, to a lesser extent, to the `C` parameter as well. If those parameters are not grid searched, the predictive accuracy of the support vector machine is almost no better than random guessing!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By default, the `GridSearchCV` class refits a final model on the complete training set with the best parameters found during the grid search:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "gs_svc.score(X_test, y_test)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Evaluating this final model on the real test set will often yield a better score because of the larger training set, especially when the training set is small and the number of cross validation folds is small (`cv=3` here)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Exercise**:\n", "\n", "1. Find a set of parameters for an `sklearn.tree.DecisionTreeClassifier` on the `X_small_train` / `y_small_train` digits dataset to reach at least 75% accuracy on the sample dataset (500 training samples)\n", "2. In particular you can grid search good values for `criterion`, `min_samples_split` and `max_depth`\n", "3. Which parameter(s) seem to be the most important to tune?\n", "4. Retry with `sklearn.ensemble.ExtraTreesClassifier(n_estimators=30)`, which is a randomized ensemble of decision trees. Do the parameters that make the single trees work best also make the ensemble model work best?\n", "\n", "Hints:\n", "\n", "- If the outcome of the grid search is too unstable (overlapping std errors), increase the number of CV folds with the `cv` constructor parameter. The default value is `cv=3`. Increasing it to `cv=5` or `cv=10` often yields more stable results, but at the price of longer evaluation times.\n", "- Start with a small grid, e.g. 2 values for `criterion` and 3 for `min_samples_split` only, to avoid having to wait for too long at first.\n", "\n", "Type:\n", "\n", " from sklearn.tree import DecisionTreeClassifier\n", " DecisionTreeClassifier? # to read the docstring and know the list of important parameters\n", " print(DecisionTreeClassifier()) # to show the list of default values\n", "\n", " from sklearn.ensemble import ExtraTreesClassifier\n", " ExtraTreesClassifier? 
\n", " print(ExtraTreesClassifier())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Solution**:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.tree import DecisionTreeClassifier\n", "DecisionTreeClassifier()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "tree = DecisionTreeClassifier()\n", "\n", "tree_params = {\n", " 'criterion': ['gini', 'entropy'],\n", " 'min_samples_split': [2, 10, 20],\n", " 'max_depth': [5, 7, None],\n", "}\n", "\n", "cv = ShuffleSplit(n_subsamples, n_iter=50, test_size=0.1)\n", "gs_tree = GridSearchCV(tree, tree_params, n_jobs=-1, cv=cv)\n", "\n", "%time gs_tree.fit(X_small_train, y_small_train)\n", "display_grid_scores(gs_tree.grid_scores_)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As the dataset is quite small and decision trees are prone to overfitting, we need to cross validate many times (e.g. `n_iter=50`) to get the standard error of the mean test score below `0.010`.\n", "\n", "At that level of precision one can observe that the `entropy` split criterion yields slightly better predictions than `gini`. One can also observe that traditional regularization strategies (limiting the depth of the tree or requiring a minimum number of samples for a node to split) do not work well on this problem.\n", "\n", "Indeed, the unregularized decision tree (`max_depth=None` and `min_samples_split=2`) is among the top performers while it is clearly overfitting:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "unreg_tree = DecisionTreeClassifier(criterion='entropy', max_depth=None,\n", " min_samples_split=2)\n", "unreg_tree.fit(X_small_train, y_small_train)\n", "print(\"Train score: %0.3f\" % unreg_tree.score(X_small_train, y_small_train))\n", "print(\"Test score: %0.3f\" % unreg_tree.score(X_test, y_test))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Limiting the depth to 7 or setting the minimum number of samples per split to 20 does not help much either: this regularization adds as much bias (hence training error) as it removes variance (as measured by the gap between training and test scores), so it does not solve the overfitting issue efficiently. For instance:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "reg_tree = DecisionTreeClassifier(criterion='entropy', max_depth=7,\n", " min_samples_split=10)\n", "reg_tree.fit(X_small_train, y_small_train)\n", "print(\"Train score: %0.3f\" % reg_tree.score(X_small_train, y_small_train))\n", "print(\"Test score: %0.3f\" % reg_tree.score(X_test, y_test))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From the grid search results one can also observe that regularizing too much is clearly detrimental: the models with a depth limited to 5 are clearly inferior to those limited to 7 or not depth limited at all (on this dataset)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To combat the overfitting of decision trees, it is preferable to use an ensemble approach that randomizes the learning even further and then averages the predictions, as we will see with the `ExtraTreesClassifier` model class:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.ensemble import ExtraTreesClassifier\n", "print(ExtraTreesClassifier())\n", "#ExtraTreesClassifier?" 
], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "trees = ExtraTreesClassifier(n_estimators=30)\n", "\n", "cv = ShuffleSplit(n_subsamples, n_iter=5, test_size=0.1)\n", "gs_trees = GridSearchCV(trees, tree_params, n_jobs=-1, cv=cv)\n", "\n", "%time gs_trees.fit(X_small_train, y_small_train)\n", "display_grid_scores(gs_trees.grid_scores_)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A couple of remarks:\n", "\n", " - `ExtraTreesClassifier` achieves much better generalization than individual decision trees (0.97 vs 0.80), even on such a small dataset, so it is indeed able to solve the overfitting issue of individual decision trees.\n", "\n", " - `ExtraTreesClassifier` takes much longer to train than an individual tree, but the fact that the predictions are averaged makes it unnecessary to cross validate as many times to reach a stderr on the order of `0.010`.\n", "\n", " - `ExtraTreesClassifier` is very robust to the choice of the parameters: most grid search points achieve a good prediction (even when highly regularized), although too much regularization is harmful. We can also note that the split criterion is no longer relevant." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, one can also observe that despite the high level of randomization of the individual trees, an ensemble model composed of unregularized trees is not underfitting:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "unreg_trees = ExtraTreesClassifier(n_estimators=50, max_depth=None, min_samples_split=2)\n", "unreg_trees.fit(X_small_train, y_small_train)\n", "print(\"Train score: %0.3f\" % unreg_trees.score(X_small_train, y_small_train))\n", "print(\"Test score: %0.3f\" % unreg_trees.score(X_test, y_test))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "More interestingly, an ensemble model composed of regularized trees underfits much less than an individual regularized tree:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "reg_trees = ExtraTreesClassifier(n_estimators=50, max_depth=7, min_samples_split=10)\n", "reg_trees.fit(X_small_train, y_small_train)\n", "print(\"Train score: %0.3f\" % reg_trees.score(X_small_train, y_small_train))\n", "print(\"Test score: %0.3f\" % reg_trees.score(X_test, y_test))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Plotting Learning Curves for Bias-Variance Analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to better understand the behavior of a model (model class + constructor parameters), it is possible to run several cross validation steps for various random sub-samples of the training set and then plot the mean training and test errors.\n", "\n", "These plots are called the **learning curves**.\n", "\n", "sklearn provides a turn-key utility to compute such learning curves (shown at the end of this section), but it is also not very complicated to compute them manually by leveraging the `ShuffleSplit` class. 
First, let's define a range of dataset sizes for subsampling the training set:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "train_sizes = np.logspace(2, 3, 5).astype(np.int)\n", "train_sizes" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For each training set size we will compute `n_iter` cross validation iterations. Let's pre-allocate the arrays to store the results:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "n_iter = 20\n", "train_scores = np.zeros((train_sizes.shape[0], n_iter), dtype=np.float)\n", "test_scores = np.zeros((train_sizes.shape[0], n_iter), dtype=np.float)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now loop over training set sizes and CV iterations:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "svc = SVC(C=1, gamma=0.0005)\n", "\n", "for i, train_size in enumerate(train_sizes):\n", " cv = ShuffleSplit(n_samples, n_iter=n_iter, train_size=train_size)\n", " for j, (train, test) in enumerate(cv):\n", " svc.fit(X[train], y[train])\n", " train_scores[i, j] = svc.score(X[train], y[train])\n", " test_scores[i, j] = svc.score(X[test], y[test])" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now plot the mean scores with error bars that reflect the standard errors of the means:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "mean_train = np.mean(train_scores, axis=1)\n", "confidence = sem(train_scores, axis=1) * 2\n", "\n", "plt.fill_between(train_sizes,\n", " mean_train - confidence,\n", " mean_train + confidence,\n", " color = 'b', alpha = .2)\n", "plt.plot(train_sizes, mean_train, 'o-k', c='b', label='Train score')\n", "\n", "mean_test = np.mean(test_scores, axis=1)\n", "confidence = sem(test_scores, axis=1) * 2\n", "\n", "plt.fill_between(train_sizes,\n", " mean_test - confidence,\n", " mean_test + confidence,\n", " color = 'g', alpha = .2)\n", "plt.plot(train_sizes, mean_test, 'o-k', c='g', label='Test score')\n", "\n", "plt.xlabel('Training set size')\n", "plt.ylabel('Score')\n", "plt.xlim(0, X_train.shape[0])\n", "plt.ylim((None, 1.01)) # The best possible score is 1.0\n", "plt.legend(loc='best')\n", "\n", "plt.text(250, 0.9, \"Overfitting a lot\", fontsize=16, ha='center', va='bottom')\n", "plt.text(800, 0.9, \"Overfitting a little\", fontsize=16, ha='center', va='bottom')\n", "plt.title('Mean train and test scores +/- 2 standard errors');" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note: learning curves can also be computed with their own utility function, `sklearn.learning_curve.learning_curve`:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.learning_curve import learning_curve" ], "language": "python", "metadata": {}, "outputs": [] }, 
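{ "cell_type": "markdown", "metadata": {}, "source": [ "As a minimal sketch of this helper (an addition to the original material), the call below reuses the `train_sizes`, `n_iter` and `n_samples` variables defined above. The helper draws its own random splits, so the resulting mean scores should be close to, but not identical with, those of the manual loop above:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# learning_curve refits the model for each training set size and each CV split\n", "# and returns score arrays of shape (n_sizes, n_cv_iterations):\n", "lc_sizes, lc_train_scores, lc_test_scores = learning_curve(\n", "    SVC(C=1, gamma=0.0005), X, y, train_sizes=train_sizes,\n", "    cv=ShuffleSplit(n_samples, n_iter=n_iter, test_size=0.2, random_state=0))\n", "\n", "print('train sizes: %r' % lc_sizes.tolist())\n", "print('mean train scores: %r' % np.round(lc_train_scores.mean(axis=1), 3).tolist())\n", "print('mean test scores: %r' % np.round(lc_test_scores.mean(axis=1), 3).tolist())" ], "language": "python", "metadata": {}, "outputs": [] }, 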
{ "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Interpreting Learning Curves" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- If the **training set error is high** (e.g. more than 5% misclassification) at the end of the learning curve, the model suffers from **high bias** and is said to **underfit** the training set.\n", "\n", "- If the **testing set error is significantly larger than the training set error**, the model suffers from **high variance** and is said to **overfit** the training set.\n", "\n", "Another possible source of high training and testing errors is label noise: the data is too noisy and there is too little signal to learn from." ] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "What to do against overfitting?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Try to get rid of noisy features using **feature selection** methods (or better, let the model do it if its regularization is able to, as for instance with l1 penalized linear models)\n", "- Try to tune parameters to add **more regularization**:\n", " - Smaller values of `C` for SVM\n", " - Larger values of `alpha` for penalized linear models\n", " - Restrict to shallower trees (decision stumps) and larger minimum numbers of samples per leaf for tree-based models\n", "- Try **simpler model families** such as penalized linear models (e.g. Linear SVM, Logistic Regression, Naive Bayes)\n", "- Try ensemble strategies that **average several independently trained models** (e.g. bagging or blending ensembles)\n", "- Collect more **labeled samples** if the learning curve of the test score has a non-zero slope on the right hand side." ] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "What to do against underfitting?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Give **more freedom** to the model by relaxing some parameters that act as regularizers:\n", " - Larger values of `C` for SVM\n", " - Smaller values of `alpha` for penalized linear models\n", " - Allow deeper trees and smaller minimum numbers of samples per leaf for tree-based models\n", "- Try **more complex / expressive model families**:\n", " - Non-linear kernel SVMs,\n", " - Ensembles of Decision Trees...\n", "- **Construct new features**:\n", " - bi-gram frequencies for text classification\n", " - feature cross-products (possibly using the hashing trick)\n", " - unsupervised feature extraction (e.g. triangle k-means, auto-encoders...)\n", " - non-linear kernel approximations + linear SVM instead of simple linear SVM\n", "\n", "The short check below illustrates the effect of the `C` regularization knob for the SVC on this dataset." ] }, 
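{ "cell_type": "markdown", "metadata": {}, "source": [ "To make the `C` knob concrete, here is a quick illustrative check (an addition to the original material, reusing the earlier train / test split): with the RBF kernel, a very small `C` over-regularizes the SVC and underfits, while larger values give the model progressively more freedom:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Illustrative only: sweep the C regularization knob of the RBF SVC and\n", "# watch the gap between train and test scores change.\n", "for C in [0.01, 1, 100, 10000]:\n", "    model = SVC(kernel='rbf', C=C, gamma=0.001).fit(X_train, y_train)\n", "    print('C=%g: train score: %0.3f, test score: %0.3f'\n", "          % (C, model.score(X_train, y_train), model.score(X_test, y_test)))" ], "language": "python", "metadata": {}, "outputs": [] }, 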
{ "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Final Model Assessment" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Grid search parameter tuning can itself be considered a (meta-)learning algorithm. Hence there is a risk of not taking into account the **overfitting of the grid search procedure** itself.\n", "\n", "To quantify and mitigate this risk we can nest the train / test split concept one level up:\n", " \n", "Make a top level \"Development / Evaluation\" split:\n", " \n", "- Development set: used for grid search and training of the model with the optimal parameter set\n", "- Held-out evaluation set: used **only** for estimating the predictive performance of the resulting model\n", "\n", "For datasets sampled over time, it is **highly recommended to use a temporal split** for the Development / Evaluation split: for instance, if you have collected data over the 2008-2013 period, you can:\n", " \n", "- use 2008-2011 for development (grid search the optimal parameters and model class),\n", "- use 2012-2013 for evaluation (compute the test score of the best model parameters)." ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "One Final Note About Kernel SVM Parameter Tuning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this session we applied the SVC model with RBF kernel on unnormalized features: this is bad! If we had used a normalizer, the default parameters for `C` and `gamma` of SVC would directly have led to close to optimal performance:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.preprocessing import StandardScaler\n", "\n", "scaler = StandardScaler()\n", "X_train_scaled = scaler.fit_transform(X_train)\n", "X_test_scaled = scaler.transform(X_test)\n", "\n", "clf = SVC().fit(X_train_scaled, y_train) # Look Ma'! Default params!\n", "print(\"Train score: {0:.3f}\".format(clf.score(X_train_scaled, y_train)))\n", "print(\"Test score: {0:.3f}\".format(clf.score(X_test_scaled, y_test)))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is because, once normalized, the digits dataset is very regular and fits the assumptions of the default parameters of the `SVC` class very well. This is rarely the case though, and it is usually necessary to grid search the parameters.\n", "\n", "Nonetheless, **scaling should be a mandatory preprocessing step when using SVC, especially with an RBF kernel**." ] }, 
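{ "cell_type": "markdown", "metadata": {}, "source": [ "As a final sketch (an addition to the original material, assuming the `Pipeline` class from `sklearn.pipeline`), the scaler and the SVC can be chained so that the grid search re-estimates the scaling on each training fold; the parameter grid below is illustrative only:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.pipeline import Pipeline\n", "\n", "# Chain scaling and classification so that cross validation never uses\n", "# scaling statistics computed on the held out fold:\n", "scaled_svc = Pipeline([\n", "    ('scaler', StandardScaler()),\n", "    ('svc', SVC()),\n", "])\n", "\n", "# Parameters of pipeline steps are addressed as <step name>__<param name>:\n", "scaled_params = {'svc__C': [1, 10, 100], 'svc__gamma': [0.001, 0.01, 0.1]}\n", "gs_scaled = GridSearchCV(scaled_svc, scaled_params, cv=5, n_jobs=-1)\n", "gs_scaled.fit(X_train, y_train)\n", "\n", "print(gs_scaled.best_params_)\n", "print('Test score: {0:.3f}'.format(gs_scaled.score(X_test, y_test)))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }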