{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# The scikit-learn interface" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import addutils.toc ; addutils.toc.js(ipy_notebook=True)" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import scipy.io\n", "import numpy as np\n", "import pandas as pd\n", "from addutils import css_notebook\n", "css_notebook()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Defining the estimator object" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In scikit-learn, almost all operations are done through an **estimator object**. For example, a linear regression estimator can be instantiated as follows:" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=True)\n" ] } ], "source": [ "from sklearn import linear_model\n", "model = linear_model.LinearRegression(fit_intercept=True, normalize=True)\n", "print(model)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In brackets are displayed the current values for the “hyperparameters” of the estimator. To learn about the specific “hyperparameters” check the documentation:" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Try: model?" 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Hyperparameters can be changed after the model has been created:" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "LinearRegression(copy_X=True, fit_intercept=False, n_jobs=1, normalize=True)\n" ] } ], "source": [ "model.fit_intercept = False\n", "print(model)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Given a scikit-learn *estimator* object named `model`, the following methods are available:\n", "\n", "* *Available in all estimators*\n", "    * `model.fit()`: fit the model to training data.\n", "        * For supervised learning, this accepts two arguments: `model.fit(X, y)`.\n", "        * For unsupervised learning, this accepts a single argument: `model.fit(X)`.\n", "* *Available in supervised estimators*\n", "    * `model.predict()`: predict the label of a new set of data. This accepts one argument: `model.predict(X_new)`.\n", "    * `model.predict_proba()`: for classifiers that support it, return the probability of each categorical label. The label itself is returned by `model.predict()`.\n", "    * `model.score()`: return a goodness-of-fit score, where a larger score indicates a better fit. For classifiers the default is the mean accuracy (between 0 and 1); for regressors it is the R² coefficient, which is at most 1 and can be negative for a poor fit.\n", "* *Available in unsupervised estimators*\n", "    * `model.transform()`: transform new data into the new basis. This accepts one argument `X_new` and returns the new representation of the data.\n", "    * `model.fit_transform()`: some estimators implement this method, which performs a fit and a transform on the same input data more efficiently than two separate calls." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1 Simple estimator example: fit a linear regression model" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ " \n", "\n", "\n", " \n", "\n", "
\n", " \n", " BokehJS successfully loaded.\n", "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import bokeh.plotting as bk\n", "bk.output_notebook()" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)\n", "Model coefficient: 275.99550, and intercept: -145.41186\n", "Mean Squared Error: 23.04\n" ] }, { "data": { "text/html": [ "\n", "
\n", "\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from sklearn import datasets, preprocessing, metrics\n", "X, y = datasets.samples_generator.make_regression(n_samples=70,\n", " n_features=1, n_informative=1,\n", " random_state=0, noise=5)\n", "scaler = preprocessing.MinMaxScaler()\n", "X_sc = scaler.fit_transform(X)\n", "\n", "lin = linear_model.LinearRegression(fit_intercept=True)\n", "lin.fit(X_sc, y)\n", "\n", "print(lin)\n", "print(\"Model coefficient: %.5f, and intercept: %.5f\" % (lin.coef_, lin.intercept_))\n", "err = metrics.mean_squared_error(lin.predict(X_sc), y)\n", "print(\"Mean Squared Error: %.2f\" % err)\n", "\n", "# Plot the data and the model prediction\n", "X_p = np.linspace(0, 1, 2)[:, np.newaxis]\n", "y_p = lin.predict(X_p)\n", "\n", "fig = bk.figure(title='Simple Regression', \n", " x_axis_label='X scaled',\n", " y_axis_label='y',\n", " plot_width=600, plot_height=300)\n", "fig.circle(X_sc.squeeze(), y, line_color='darkgreen', size=10,\n", " fill_color='green', fill_alpha=0.5, legend='Measured Data')\n", "fig.line(X_p.ravel(), y_p, line_color='blue', legend='Predicted Values')\n", "fig.legend.orientation = 'bottom_right'\n", "bk.show(fig)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2 Separate Training and Validation Sets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Fitting a model and testing it on the same data is a methodological mistake:** a model could have a perfect score but would fail to predict anything useful on yet-unseen data. This situation is called overfitting. 
To avoid it, it is common practice to hold out part of the available data as a **Validation Set** `X_valid, y_valid`.\n", "\n", "In scikit-learn a random split into **Training** and **Validation** sets can be quickly computed with the `train_test_split` helper function:" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Mean Squared Error: 23.34\n", "Model coefficient: 273.00288, and intercept: -143.70277\n" ] }, { "data": { "text/html": [ "\n", "
\n", "\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from sklearn import cross_validation\n", "X_tr, X_valid, y_tr, y_valid = cross_validation.train_test_split(X_sc, y,\n", " test_size=0.30, random_state=0)\n", "lin.fit(X_tr, y_tr)\n", "MSE = metrics.mean_squared_error(lin.predict(X_valid), y_valid)\n", "print(\"Mean Squared Error: %.2f\" % MSE)\n", "print(\"Model coefficient: %.5f, and intercept: %.5f\" % (lin.coef_, lin.intercept_))\n", "\n", "X_predicted = np.linspace(0, 1, 2)[:, np.newaxis]\n", "y_predicted = lin.predict(X_predicted)\n", "fig = bk.figure(title='Training and Validation Sets', \n", " plot_width=700, plot_height=400)\n", "fig.circle(X_tr.squeeze(), y_tr,\n", " size=5, line_alpha=0.85, fill_color='green', line_color='darkgreen',\n", " legend='Training Set')\n", "fig.circle(X_valid.squeeze(), y_valid, \n", " size=5, line_alpha=0.85, fill_color='orangered', legend='Validation Set')\n", "fig.line(X_predicted.ravel(), y_predicted, line_color='blue', legend='Linear Regression')\n", "fig.legend.orientation = 'bottom_right'\n", "bk.show(fig)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.1 Example: Do a Regression Analysis on MATLAB® data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data are generated with the following MATLAB code:\n", "\n", " % Generate Regression Test Data 02 ----------------------------------------\n", " samples = 100; features = 5;\n", " \n", " X = rand(samples, features)\n", " for i = 1:samples\n", " X(i,:) = (X(i,:)*i)+i+0.1;\n", " end\n", " % Calculate y as a linear combination of features with coeff. 1.5, 2.5, ...\n", " % and add some noise\n", " noise = 0.1;\n", " lin_comb = (1:features) + 0.5\n", " y = (X+rand(samples, features)*noise)*lin_comb'\n", " % Define feature names as 'F001', ... 
, 'Fnnn' up to 9999 features\n", " feat_names = sprintf('F%04i',1);\n", " for i = 2:features\n", " feat_names = strvcat(feat_names, sprintf('F%04i',i));\n", " end\n", " \n", " save ('matlab_test_data_02', 'X','y', 'feat_names')" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn import preprocessing, cross_validation, linear_model\n", "mat = scipy.io.loadmat('example_data/matlab_test_data_02.mat')\n", "col = [s.strip() for s in list(mat['feat_names'])]" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "collapsed": false }, "outputs": [], "source": [ "Xt,Xv,yt,yv = cross_validation.train_test_split(mat['X'], mat['y'],test_size=0.30,\n", " random_state=0)\n", "Xt = pd.DataFrame(Xt, columns=col)\n", "Xv = pd.DataFrame(Xv, columns=col)\n", "yt = pd.DataFrame(yt, columns=['measured'])\n", "yv = pd.DataFrame(yv, columns=['measured'])" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
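<table><tr><th></th><th>F0001</th><th>F0002</th><th>F0003</th><th>F0004</th><th>F0005</th><th>F0006</th></tr>
<tr><th>0</th><td>131.391611</td><td>178.588071</td><td>191.836028</td><td>206.928031</td><td>188.601869</td><td>195.879822</td></tr>
<tr><th>1</th><td>137.771630</td><td>91.462643</td><td>76.438656</td><td>104.585583</td><td>123.265090</td><td>73.587835</td></tr>
<tr><th>2</th><td>789.737369</td><td>671.934539</td><td>835.830887</td><td>515.921167</td><td>901.167154</td><td>813.659391</td></tr></table>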
\n", "
" ], "text/plain": [ " F0001 F0002 F0003 F0004 F0005 F0006\n", "0 131.391611 178.588071 191.836028 206.928031 188.601869 195.879822\n", "1 137.771630 91.462643 76.438656 104.585583 123.265090 73.587835\n", "2 789.737369 671.934539 835.830887 515.921167 901.167154 813.659391" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Xt.head(3)" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# The scaler is fitted only on the training data\n", "Xts = Xt.copy()\n", "Xvs = Xv.copy()\n", "scaler = preprocessing.StandardScaler(copy=False, with_mean=True, with_std=True).fit(Xt)\n", "# copy=False: the copies are scaled in place, the originals stay untouched\n", "scaler.transform(Xts)\n", "scaler.transform(Xvs);" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
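<table><tr><th></th><th>F0001</th><th>F0002</th><th>F0003</th><th>F0004</th><th>F0005</th><th>F0006</th></tr>
<tr><th>0</th><td>-1.294851</td><td>-1.210162</td><td>-1.171325</td><td>-1.148142</td><td>-1.198971</td><td>-1.160278</td></tr>
<tr><th>1</th><td>-1.281377</td><td>-1.399557</td><td>-1.416333</td><td>-1.371187</td><td>-1.344097</td><td>-1.424161</td></tr>
<tr><th>2</th><td>0.095500</td><td>-0.137716</td><td>0.195982</td><td>-0.474723</td><td>0.383782</td><td>0.172776</td></tr></table>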
\n", "
" ], "text/plain": [ " F0001 F0002 F0003 F0004 F0005 F0006\n", "0 -1.294851 -1.210162 -1.171325 -1.148142 -1.198971 -1.160278\n", "1 -1.281377 -1.399557 -1.416333 -1.371187 -1.344097 -1.424161\n", "2 0.095500 -0.137716 0.195982 -0.474723 0.383782 0.172776" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Xts.head(3)" ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[ 708.82319504 1153.80993668 1647.24008556 2066.72663569\n", " 2474.24236679 3011.70541794]]\n", "[ 17692.80536173]\n" ] } ], "source": [ "model = linear_model.LinearRegression(fit_intercept=True, normalize=True)\n", "model.fit(Xts, yt)\n", "# NOTICE: coefficients are much different from 1.5, 2.5, ... when fitting on scaled data\n", "print(model.coef_)\n", "print(model.intercept_)" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Plot the data and the model prediction\n", "yp = model.predict(Xvs)\n", "fig = bk.figure(title='Measured Values VS Predicted Values',\n", " x_axis_label='y val', y_axis_label='y pred',\n", " plot_width=600, plot_height=400)\n", "fig.circle(yv['measured'], yp[:,0], size=8,\n", " fill_color='blue', fill_alpha=0.5, line_color='black', alpha=0.2)\n", "bk.show(fig)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.2 Example: Training and Validation Sets on a Classification Problem" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The K-nearest-neighbors classifier is an instance-based classifier that predicts the label of an unknown point based on the labels of the *K* nearest points in the feature space:" ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[178 0 0 0 0 0 0 0 0 0]\n", " [ 0 182 0 0 0 0 0 0 0 0]\n", " [ 0 0 177 0 0 0 0 0 0 0]\n", " [ 0 0 0 183 0 0 0 0 0 0]\n", " [ 0 0 0 0 181 0 0 0 0 0]\n", " [ 0 0 0 0 0 182 0 0 0 0]\n", " [ 0 0 0 0 0 0 181 0 0 0]\n", " [ 0 0 0 0 0 0 0 179 0 0]\n", " [ 0 0 0 0 0 0 0 0 174 0]\n", " [ 0 0 0 0 0 0 0 0 0 180]]\n", "\n", "F1 Score: 1.0\n" ] } ], "source": [ "from sklearn import neighbors\n", "digits = datasets.load_digits()\n", "X, y = digits.data, digits.target\n", "\n", "clf = neighbors.KNeighborsClassifier(n_neighbors=1)\n", "clf.fit(X, y)\n", "y_pred = clf.predict(X)\n", "# Metric functions expect the true labels first, then the predictions\n", "print(metrics.confusion_matrix(y, y_pred))\n", "print('\\nF1 Score: ', metrics.f1_score(y, y_pred, average='weighted'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Apparently, we've found a perfect classifier! But this is misleading for the reasons we saw before: the classifier essentially \"memorizes\" all the samples it has already seen. 
To really test how well this algorithm does, we need to try some samples it *hasn't* yet seen.\n", "\n", "Here we split the original data into **Training** and **Validation** sets and run the previous algorithm again. In this case we still have a good classifier, but the score is slightly lower:" ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[60 0 0 0 0 0 0 0 0 0]\n", " [ 0 73 0 0 0 0 0 0 0 0]\n", " [ 0 0 71 0 0 0 0 0 0 0]\n", " [ 0 0 0 70 0 0 0 0 0 0]\n", " [ 0 0 0 0 63 0 0 0 0 0]\n", " [ 0 0 0 0 0 87 1 0 0 1]\n", " [ 0 0 0 0 0 0 76 0 0 0]\n", " [ 0 0 0 0 0 0 0 65 0 0]\n", " [ 0 2 0 1 0 0 0 0 74 1]\n", " [ 0 0 0 2 0 1 0 0 0 71]]\n", "\n", "F1 Score: 0.987441080752\n" ] } ], "source": [ "X_tr, X_valid, y_tr, y_valid = cross_validation.train_test_split(X, \n", " y,\n", " test_size=0.40, \n", " random_state=0)\n", "clf = neighbors.KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)\n", "y_pred = clf.predict(X_valid)\n", "print(metrics.confusion_matrix(y_valid, y_pred))\n", "print('\\nF1 Score: ', metrics.f1_score(y_valid, y_pred, average='weighted'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3 Cross Validation (CV)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By partitioning the available data into **Training** and **Validation** sets, we reduce the number of samples which can be used for learning the model.\n", "\n", "By using cross-validation **(CV)** the *validation set is defined automatically by the CV algorithm.* The training set is split into k smaller sets (folds) and, for each fold in turn:\n", "\n", "* A model is trained using the other k-1 folds as training data\n", "* The resulting model is validated on the remaining fold\n", "\n", "In this example we first calculate the Mean Squared Error on the Training Set (which is a methodological error) and then we calculate the same MSE again using a KFold CV. 
The MSE estimated with CV is much larger in magnitude (it is printed with a negative sign because this scorer follows a greater-is-better convention and returns the negated MSE):" ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MSE Calculated on the Training Set: 19.5876545263\n", "MSE Calculated with Cross Validation: -49.85 (+/- 80.06)\n" ] } ], "source": [ "X, y = datasets.samples_generator.make_regression(n_samples=7,\n", " n_features=1, n_informative=1,\n", " random_state=0, noise=5)\n", "lin = linear_model.LinearRegression(fit_intercept=True)\n", "lin.fit(X, y)\n", "MSE = metrics.mean_squared_error(lin.predict(X), y)\n", "print('MSE Calculated on the Training Set: ', MSE)\n", "MSE_CV = cross_validation.cross_val_score(lin, X, y, cv=5, scoring='mean_squared_error')\n", "print(\"MSE Calculated with Cross Validation: %0.2f (+/- %0.2f)\"\n", " % (MSE_CV.mean(), MSE_CV.std() * 2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When the `cv` argument is an integer like `cv=10`, `cross_val_score` uses the `KFold` or `StratifiedKFold` strategies. 
It is also possible to use other cross-validation strategies by passing a cross-validation iterator instead. For instance, we can use `cross_validation.ShuffleSplit`, where `n_iter` is the number of re-shuffling and splitting operations:" ] }, { "cell_type": "code", "execution_count": 52, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MSE Calculated with Cross Validation: -33.84 (+/- 36.87)\n" ] } ], "source": [ "n_samples = X.shape[0]\n", "cv_iter = cross_validation.ShuffleSplit(n_samples, n_iter=6, test_size=0.2, random_state=0)\n", "MSE_CV = cross_validation.cross_val_score(lin, X, y, cv=cv_iter, scoring='mean_squared_error')\n", "print(\"MSE Calculated with Cross Validation: %0.2f (+/- %0.2f)\"\n", " % (MSE_CV.mean(), MSE_CV.std() * 2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.1 Cross Validation: test many estimators on the same dataset:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this example *we use the \"diabetes\" dataset* to fit three different linear models. For each model the default CV score (the R² coefficient for regressors) is calculated and displayed:" ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "LinearRegression : 0.488702767247\n", "Ridge : 0.409427438303\n", "Lasso : 0.353800083299\n" ] } ], "source": [ "data = datasets.load_diabetes()\n", "X, y = data.data, data.target\n", "\n", "for Model in [linear_model.LinearRegression,\n", "              linear_model.Ridge,\n", "              linear_model.Lasso]:\n", "    model = Model()\n", "    print(Model.__name__, ': ', cross_validation.cross_val_score(model, X, y).mean())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.2 Cross Validation: test many hyperparameters and estimators on the same dataset:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`Lasso` and `Ridge` accept a regularization parameter `alpha`. 
Here we plot the CV score for different values of `alpha`:" ] }, { "cell_type": "code", "execution_count": 54, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from bokeh.models.axes import LogAxis\n", "\n", "alphas = np.logspace(-4, -1, 200)\n", "fig = bk.figure(title='Alpha sensitiveness on different models',\n", " plot_width=800, plot_height=400,\n", " x_axis_type='log', \n", " x_range=(min(alphas), max(alphas)))\n", "\n", "for Model, color in [(linear_model.Ridge, 'blue'), (linear_model.Lasso, 'green')]:\n", " scores = [cross_validation.cross_val_score(Model(alpha=a), X, y, cv=5).mean()\n", " for a in alphas]\n", " fig.line(alphas, scores, line_color=color, legend=Model.__name__)\n", "\n", "fig.legend.orientation = 'bottom_left'\n", "bk.show(fig)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.3 Model specific Cross Validation:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some models can fit data for a range of parameters almost as efficiently as fitting for a single parameter. The most common parameter amenable to this strategy is the strength of the regularizer. 
In this case we say that we compute the regularization path of the estimator.\n", "\n", "This is a particular case of what we are going to see in the next section: **Grid Search**.\n", "\n", "Model-specific cross-validation is supported by the following models:\n", "\n", "* **RidgeCV**\n", "* **RidgeClassifierCV**\n", "* **LarsCV**\n", "* **LassoLarsCV**\n", "* **LassoCV**\n", "* **ElasticNetCV**" ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "RidgeCV : 0.000373993730248\n", "LassoCV : 0.00382749447852\n" ] } ], "source": [ "for Model in [linear_model.RidgeCV, linear_model.LassoCV]:\n", "    model = Model(alphas=alphas, cv=5).fit(X, y)\n", "    print(Model.__name__, ': ', model.alpha_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.4 Cross-validation iterators" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* **K-fold** - divides all the samples into *K* groups (folds); if *K=n* it is equivalent to Leave-One-Out.\n", "* **Stratified K-fold** - each fold contains approximately the same percentage of each target class as the whole set\n", "* **Leave-One-Out (LOO)** - each training set contains all the samples except one\n", "* **Leave-P-Out (LPO)** - creates all possible training sets by removing *p* samples from the whole set\n", "* **Leave-One-Label-Out (LOLO)** - holds out the samples according to a third-party provided label\n", "* **ShuffleSplit** - shuffles, then splits the data into train and test sets\n", "* **StratifiedShuffleSplit** - like ShuffleSplit, but preserving the percentage of each class in every split\n", "* **Bootstrap** - generates independent splits and then resamples each split with replacement" ] }, { "cell_type": "code", "execution_count": 56, "metadata": { "collapsed": false }, "outputs": [], "source": [ "X = np.arange(20).reshape(10,2)\n", "y = np.arange(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**KFold** splits the dataset into k consecutive folds (without shuffling):" ] }, { "cell_type": 
"code", "execution_count": 57, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2 3 4 5 6 7 8 9] [0 1]\n", "[0 1 4 5 6 7 8 9] [2 3]\n", "[0 1 2 3 6 7 8 9] [4 5]\n", "[0 1 2 3 4 5 8 9] [6 7]\n", "[0 1 2 3 4 5 6 7] [8 9]\n" ] } ], "source": [ "kf = cross_validation.KFold(len(y), n_folds=5)\n", "for train, test in kf:\n", "    print(train, test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**ShuffleSplit** does not guarantee that the test folds are disjoint, so a sample can appear in more than one test set:" ] }, { "cell_type": "code", "execution_count": 58, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0 4 7 5 3 1 2 8] [9 6]\n", "[5 2 7 9 3 8 0 1] [4 6]\n", "[2 1 9 7 8 4 6 0] [5 3]\n", "[3 2 9 6 0 5 8 1] [7 4]\n", "[8 7 5 3 6 0 1 9] [4 2]\n" ] } ], "source": [ "ss = cross_validation.ShuffleSplit(len(y), n_iter=5, test_size=0.2)\n", "for train, test in ss:\n", "    print(train, test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Leave-One-Label-Out** example (each sample is given its own label here, so the splits coincide with Leave-One-Out):" ] }, { "cell_type": "code", "execution_count": 59, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1 2 3 4 5 6 7 8 9] [0]\n", "[0 2 3 4 5 6 7 8 9] [1]\n", "[0 1 3 4 5 6 7 8 9] [2]\n", "[0 1 2 4 5 6 7 8 9] [3]\n", "[0 1 2 3 5 6 7 8 9] [4]\n", "[0 1 2 3 4 6 7 8 9] [5]\n", "[0 1 2 3 4 5 7 8 9] [6]\n", "[0 1 2 3 4 5 6 8 9] [7]\n", "[0 1 2 3 4 5 6 7 9] [8]\n", "[0 1 2 3 4 5 6 7 8] [9]\n" ] } ], "source": [ "lolo = cross_validation.LeaveOneLabelOut(range(len(y)))\n", "for train, test in lolo:\n", "    print(train, test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4 Grid Search: Searching for estimator hyperparameters" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`GridSearchCV` can optimize any parameter provided by `estimator.get_params()` by exhaustively fitting the model with every parameter combination in the grid.\n", "\n", "When evaluating different settings 
(“hyperparameters”) for estimators, there is a risk of overfitting to the data used to score them, so the final performance should be estimated on data that played no part in the search.\n", "\n", "`GridSearchCV` optimizes the hyperparameters on the **TRAINING set** (the **VALIDATION sets** are generated internally by CV). Once the optimization is done, the performance of the estimator can be evaluated on a third set called the **TEST set**:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.1 Exhaustive Grid Search" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We pass `GridSearchCV` a list of dictionaries, each mapping a hyperparameter name to the list of values to be searched. In this case we set `verbose=2` to print a log of every fit performed by the search:" ] }, { "cell_type": "code", "execution_count": 60, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Fitting 2 folds for each of 4 candidates, totalling 8 fits\n", "[CV] C=1, kernel=linear ..............................................\n", "[CV] C=1, kernel=linear ..............................................\n", "[CV] C=10, kernel=linear .............................................\n", "[CV] C=10, kernel=linear .............................................\n", "[CV] ..................................... C=1, kernel=linear - 0.0s\n", "[CV] ..................................... C=1, kernel=linear - 0.0s\n", "[CV] .................................... C=10, kernel=linear - 0.0s\n", "[CV] .................................... C=10, kernel=linear - 0.0s\n", "[CV] gamma=0.1, C=2, kernel=rbf ......................................\n", "[CV] gamma=0.1, C=2, kernel=rbf ......................................\n", "[CV] gamma=0.01, C=2, kernel=rbf .....................................\n", "[CV] gamma=0.01, C=2, kernel=rbf .....................................\n", "[CV] ............................. gamma=0.1, C=2, kernel=rbf - 0.0s\n", "[CV] ............................. gamma=0.1, C=2, kernel=rbf - 0.0s\n", "[CV] ............................ 
gamma=0.01, C=2, kernel=rbf - 0.0s\n", "[CV] ............................ gamma=0.01, C=2, kernel=rbf - 0.0s\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=-1)]: Done 1 jobs | elapsed: 0.0s\n", "[Parallel(n_jobs=-1)]: Done 2 out of 8 | elapsed: 0.0s remaining: 0.1s\n", "[Parallel(n_jobs=-1)]: Done 7 out of 8 | elapsed: 0.0s remaining: 0.0s\n", "[Parallel(n_jobs=-1)]: Done 8 out of 8 | elapsed: 0.0s finished\n" ] }, { "data": { "text/plain": [ "GridSearchCV(cv=2, error_score='raise',\n", " estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,\n", " kernel='rbf', max_iter=-1, probability=False, random_state=None,\n", " shrinking=True, tol=0.001, verbose=False),\n", " fit_params={}, iid=True, loss_func=None, n_jobs=-1,\n", " param_grid=[{'C': [1, 10], 'kernel': ['linear']}, {'C': [2], 'kernel': ['rbf'], 'gamma': [0.1, 0.01]}],\n", " pre_dispatch='2*n_jobs', refit=True, score_func=None, scoring=None,\n", " verbose=2)" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn import svm, grid_search\n", "# Standardize the data\n", "iris = datasets.load_iris()\n", "scaler = preprocessing.StandardScaler(copy=True, with_mean=True, with_std=True)\n", "X = scaler.fit_transform(iris.data)\n", "# Keep out the Test Set\n", "X_tr, X_test, y_tr, y_test = cross_validation.train_test_split(X, iris.target,\n", " test_size=0.30, random_state=0)\n", "# Define the Search Grid\n", "param_grid = [{'C': [1, 10], 'kernel': ['linear']},\n", " {'C': [2], 'gamma': [0.1, 0.01], 'kernel': ['rbf']}]\n", "# GridSearchCV\n", "clf = grid_search.GridSearchCV(svm.SVC(), param_grid, cv=2, n_jobs=-1, verbose=2)\n", "clf.fit(X_tr, y_tr)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**TODO -** Create a Surf Plot with the accuracy values found on an hyperparameter grid using the .grid_scores_ data" ] }, { "cell_type": "code", "execution_count": 61, "metadata": { "collapsed": 
false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "{'C': 1, 'kernel': 'linear'}\n", "1\n", "\n", "{'C': 10, 'kernel': 'linear'}\n", "10\n", "\n", "{'gamma': 0.1, 'C': 2, 'kernel': 'rbf'}\n", "2\n", "\n", "{'gamma': 0.01, 'C': 2, 'kernel': 'rbf'}\n", "2\n" ] } ], "source": [ "for item in clf.grid_scores_:\n", " print(item.index)\n", " print(item.parameters)\n", " print(item.parameters['C'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then we can check the accuracy on the **Test Set**:" ] }, { "cell_type": "code", "execution_count": 62, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Best estimator: SVC(C=1, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,\n", " kernel='linear', max_iter=-1, probability=False, random_state=None,\n", " shrinking=True, tol=0.001, verbose=False)\n", "Accuracy: 0.91 (+/- 0.22)\n" ] } ], "source": [ "best = clf.best_estimator_\n", "print('Best estimator: ', best)\n", "scores = cross_validation.cross_val_score(best, X_test, y_test, cv=10)\n", "print(\"Accuracy: %0.2f (+/- %0.2f)\" % (scores.mean(), scores.std() * 2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Different `score` metrics can be used while doing `GridSearchCV`:" ] }, { "cell_type": "code", "execution_count": 63, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "PRECISION: fraction of retrieved instances that are relevant\n", "RECALL: fraction of relevant instances that are retrieved\n", "\n", "Tuning hyper-parameters for: precision_weighted\n", "\n", "Best parameters set found on development (training) set:\n", "SVC(C=0.5, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,\n", " kernel='linear', max_iter=-1, probability=False, random_state=None,\n", " shrinking=True, tol=0.001, verbose=False)\n", "\n", "Grid Scores on development set:\n", "mean: 0.92000, std: 
0.08354, params: {'C': 0.125, 'kernel': 'linear'}\n", "mean: 0.95333, std: 0.06504, params: {'C': 0.25, 'kernel': 'linear'}\n", "mean: 0.98667, std: 0.02667, params: {'C': 0.5, 'kernel': 'linear'}\n", "mean: 0.97417, std: 0.03480, params: {'C': 1.0, 'kernel': 'linear'}\n", "mean: 0.94083, std: 0.06132, params: {'C': 1, 'kernel': 'rbf', 'gamma': 0.1}\n", "mean: 0.97417, std: 0.03480, params: {'C': 10, 'kernel': 'rbf', 'gamma': 0.1}\n", "\n", "Detailed classification report (on Testing Set)\n", " precision recall f1-score support\n", "\n", " 0 1.00 1.00 1.00 26\n", " 1 0.89 1.00 0.94 33\n", " 2 1.00 0.87 0.93 31\n", "\n", "avg / total 0.96 0.96 0.96 90\n", "\n", "-----------------------------------------------------\n", "\n", "Tuning hyper-parameters for: recall_weighted\n", "\n", "Best parameters set found on development (training) set:\n", "SVC(C=0.5, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,\n", " kernel='linear', max_iter=-1, probability=False, random_state=None,\n", " shrinking=True, tol=0.001, verbose=False)\n", "\n", "Grid Scores on development set:\n", "mean: 0.91667, std: 0.08287, params: {'C': 0.125, 'kernel': 'linear'}\n", "mean: 0.95000, std: 0.06607, params: {'C': 0.25, 'kernel': 'linear'}\n", "mean: 0.98333, std: 0.03333, params: {'C': 0.5, 'kernel': 'linear'}\n", "mean: 0.96667, std: 0.04522, params: {'C': 1.0, 'kernel': 'linear'}\n", "mean: 0.93333, std: 0.06338, params: {'C': 1, 'kernel': 'rbf', 'gamma': 0.1}\n", "mean: 0.96667, std: 0.04522, params: {'C': 10, 'kernel': 'rbf', 'gamma': 0.1}\n", "\n", "Detailed classification report (on Testing Set)\n", " precision recall f1-score support\n", "\n", " 0 1.00 1.00 1.00 26\n", " 1 0.89 1.00 0.94 33\n", " 2 1.00 0.87 0.93 31\n", "\n", "avg / total 0.96 0.96 0.96 90\n", "\n", "-----------------------------------------------------\n" ] } ], "source": [ "X_tr, X_test, y_tr, y_test = cross_validation.train_test_split(X, \n", " iris.target,\n", " test_size=0.60, \n", " 
random_state=0)\n", "scores = ['precision_weighted', 'recall_weighted']\n", "param_grid = [{'C': np.logspace(-3,0,4, base=2), 'kernel': ['linear']},\n", "              {'C': [1, 10], 'gamma': [0.1], 'kernel': ['rbf']}]\n", "print('PRECISION: fraction of retrieved instances that are relevant')\n", "print('RECALL: fraction of relevant instances that are retrieved')\n", "for score in scores:\n", "    print('\\nTuning hyper-parameters for: %s' % score)\n", "    clf = grid_search.GridSearchCV(svm.SVC(), param_grid, cv=5, scoring=score)\n", "    clf.fit(X_tr, y_tr)\n", "    print('\\nBest parameters set found on development (training) set:')\n", "    print(clf.best_estimator_)\n", "    print('\\nGrid Scores on development set:')\n", "    for item in clf.grid_scores_:\n", "        print(item)\n", "    print('\\nDetailed classification report (on Testing Set)')\n", "    y_true, y_pred = y_test, clf.predict(X_test)\n", "    print(metrics.classification_report(y_true, y_pred))\n", "    print('-----------------------------------------------------')\n", "    " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.2 Randomized Parameter Optimization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`RandomizedSearchCV` implements a randomized search over parameters, where each setting is sampled from a distribution over possible parameter values. This has two main benefits:\n", "\n", "* A budget can be chosen independently of the number of parameters and possible values.\n", "* Adding parameters that do not influence the performance does not decrease efficiency.\n", "\n", "How the parameters should be sampled is specified with a dictionary. The number of sampling iterations is specified using the `n_iter` parameter. For each parameter, either a distribution over possible values or a list of discrete choices (which will be sampled uniformly) can be specified. 
Here is a small sample of the many statistical distributions available in `scipy.stats`:" ] }, { "cell_type": "code", "execution_count": 64, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import scipy.stats\n", "from bokeh.charts import Histogram\n", "\n", "w, h = 350, 300\n", "\n", "dists = [\n", "    scipy.stats.expon(scale=100), # frozen exponential distribution\n", "    scipy.stats.norm(loc=40, scale=10), # frozen normal distribution\n", "    scipy.stats.uniform(loc=20, scale=100), # frozen uniform distribution on [20, 120]\n", "    scipy.stats.beta(2,2,loc=20, scale=100) # frozen beta distribution on [20, 120]\n", "]\n", "\n", "# Draw 10000 samples from each distribution (the sample size must be an integer)\n", "figs = [ Histogram(dist.rvs(10000), bins=50, \n", "                   width=w, height=h, \n", "                   xlabel=None, ylabel=None)\n", "         for dist in dists ]\n", "\n", "bk.show(bk.gridplot([[figs[0], figs[1]],\n", "                     [figs[2], figs[3]] \n", "                    ]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "Visit [www.add-for.com]() for more tutorials and updates.\n", "\n", "This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.4.3" } }, "nbformat": 4, "nbformat_minor": 0 }