{ "metadata": { "name": "", "signature": "sha256:651264e17dd163dff203c8bd4c60da5cad3bc2aca2945797ed2b57b6a59ae4e3" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Cross-validation for parameter tuning, model selection, and feature selection\n", "*From the video series: [Introduction to machine learning with scikit-learn](https://github.com/justmarkham/scikit-learn-videos)*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Agenda\n", "\n", "- What is the drawback of using the **train/test split** procedure for model evaluation?\n", "- How does **K-fold cross-validation** overcome this limitation?\n", "- How can cross-validation be used for selecting **tuning parameters**, choosing between **models**, and selecting **features**?\n", "- What are some possible **improvements** to cross-validation?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Review of model evaluation procedures" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Motivation:** Need a way to choose between machine learning models\n", "\n", "- Goal is to estimate likely performance of a model on **out-of-sample data**\n", "\n", "**Initial idea:** Train and test on the same data\n", "\n", "- But, maximizing **training accuracy** rewards overly complex models which **overfit** the training data\n", "\n", "**Alternative idea:** Train/test split\n", "\n", "- Split the dataset into two pieces, so that the model can be trained and tested on **different data**\n", "- **Testing accuracy** is a better estimate than training accuracy of out-of-sample performance\n", "- But, it provides a **high variance** estimate since changing which observations happen to be in the testing set can significantly change testing accuracy" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.datasets import load_iris\n", "from sklearn.cross_validation import train_test_split\n", "from sklearn.neighbors import KNeighborsClassifier\n", "from sklearn import metrics" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "code", "collapsed": false, "input": [ "# read in the iris data\n", "iris = load_iris()\n", "\n", "# create X (features) and y (response)\n", "X = iris.data\n", "y = iris.target" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 2 }, { "cell_type": "code", "collapsed": false, "input": [ "# use train/test split with different random_state values\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=4)\n", "\n", "# check classification accuracy of KNN with K=5\n", "knn = KNeighborsClassifier(n_neighbors=5)\n", "knn.fit(X_train, y_train)\n", "y_pred = knn.predict(X_test)\n", "print metrics.accuracy_score(y_test, y_pred)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "0.973684210526\n" ] } ], "prompt_number": 3 }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question:** What if we created a bunch of train/test splits, calculated the testing accuracy for each, and averaged the results together?\n", "\n", "**Answer:** That's the essense of cross-validation!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Steps for K-fold cross-validation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. Split the dataset into K **equal** partitions (or \"folds\").\n", "2. Use fold 1 as the **testing set** and the union of the other folds as the **training set**.\n", "3. Calculate **testing accuracy**.\n", "4. Repeat steps 2 and 3 K times, using a **different fold** as the testing set each time.\n", "5. Use the **average testing accuracy** as the estimate of out-of-sample accuracy." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Diagram of **5-fold cross-validation:**\n", "\n", "![5-fold cross-validation](images/cross_validation_diagram.png)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# simulate splitting a dataset of 25 observations into 5 folds\n", "from sklearn.cross_validation import KFold\n", "kf = KFold(25, n_folds=5, shuffle=False)\n", "\n", "# print the contents of each training and testing set\n", "print '{} {:^61} {}'.format('Iteration', 'Training set observations', 'Testing set observations')\n", "for iteration, data in enumerate(kf, start=1):\n", " print '{:^9} {} {:^25}'.format(iteration, data[0], data[1])" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Iteration Training set observations Testing set observations\n", " 1 [ 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24] [0 1 2 3 4] \n", " 2 [ 0 1 2 3 4 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24] [5 6 7 8 9] \n", " 3 [ 0 1 2 3 4 5 6 7 8 9 15 16 17 18 19 20 21 22 23 24] [10 11 12 13 14] \n", " 4 [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 20 21 22 23 24] [15 16 17 18 19] \n", " 5 [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19] [20 21 22 23 24] \n" ] } ], "prompt_number": 4 }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Dataset contains **25 observations** (numbered 0 through 24)\n", "- 5-fold cross-validation, thus it runs for **5 iterations**\n", "- For each iteration, every observation is either in the training set or the testing set, **but not both**\n", "- Every observation is in the testing set **exactly once**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Comparing cross-validation to train/test split" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Advantages of **cross-validation:**\n", "\n", "- More accurate estimate of out-of-sample accuracy\n", "- More \"efficient\" use of data (every observation is used for both training and testing)\n", "\n", "Advantages of **train/test split:**\n", "\n", "- Runs K times faster than K-fold cross-validation\n", "- Simpler to examine the detailed results of the testing process" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cross-validation recommendations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. K can be any number, but **K=10** is generally recommended\n", "2. For classification problems, **stratified sampling** is recommended for creating the folds\n", " - Each response class should be represented with equal proportions in each of the K folds\n", " - scikit-learn's `cross_val_score` function does this by default" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cross-validation example: parameter tuning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Goal:** Select the best tuning parameters (aka \"hyperparameters\") for KNN on the iris dataset" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.cross_validation import cross_val_score" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 5 }, { "cell_type": "code", "collapsed": false, "input": [ "# 10-fold cross-validation with K=5 for KNN (the n_neighbors parameter)\n", "knn = KNeighborsClassifier(n_neighbors=5)\n", "scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')\n", "print scores" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[ 1. 0.93333333 1. 1. 0.86666667 0.93333333\n", " 0.93333333 1. 1. 1. ]\n" ] } ], "prompt_number": 6 }, { "cell_type": "code", "collapsed": false, "input": [ "# use average accuracy as an estimate of out-of-sample accuracy\n", "print scores.mean()" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "0.966666666667\n" ] } ], "prompt_number": 7 }, { "cell_type": "code", "collapsed": false, "input": [ "# search for an optimal value of K for KNN\n", "k_range = range(1, 31)\n", "k_scores = []\n", "for k in k_range:\n", " knn = KNeighborsClassifier(n_neighbors=k)\n", " scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')\n", " k_scores.append(scores.mean())\n", "print k_scores" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[0.95999999999999996, 0.95333333333333337, 0.96666666666666656, 0.96666666666666656, 0.96666666666666679, 0.96666666666666679, 0.96666666666666679, 0.96666666666666679, 0.97333333333333338, 0.96666666666666679, 0.96666666666666679, 0.97333333333333338, 0.98000000000000009, 0.97333333333333338, 0.97333333333333338, 0.97333333333333338, 0.97333333333333338, 0.98000000000000009, 0.97333333333333338, 0.98000000000000009, 0.96666666666666656, 0.96666666666666656, 0.97333333333333338, 0.95999999999999996, 0.96666666666666656, 0.95999999999999996, 0.96666666666666656, 0.95333333333333337, 0.95333333333333337, 0.95333333333333337]\n" ] } ], "prompt_number": 8 }, { "cell_type": "code", "collapsed": false, "input": [ "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "\n", "# plot the value of K for KNN (x-axis) versus the cross-validated accuracy (y-axis)\n", "plt.plot(k_range, k_scores)\n", "plt.xlabel('Value of K for KNN')\n", "plt.ylabel('Cross-Validated Accuracy')" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 9, "text": [ "" ] }, { "metadata": {}, "output_type": "display_data", "png": "iVBORw0KGgoAAAANSUhEUgAAAZEAAAEPCAYAAACDTflkAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzt3Xm8lNWd5/HPFwRlcYEgi4CCCoqggCgCxuQaoyGJmsT0\naEhmNIlj7Ok2msTudskixpkeTaIJia92iSZtMtHYS4zaaaN24nW6mwuyXBZZxW3YRIXIIrL/5o/z\nlBRFVd2nnqqntvt7v173daue5dR5KG796pzfc86RmeGcc84l0aXWFXDOOde4PIg455xLzIOIc865\nxDyIOOecS8yDiHPOucQ8iDjnnEss1SAiaaqk5ZJeknRDnv19JD0maaGk2ZJGZ+27SdISSYslPSzp\n0Gh7X0nPSlop6RlJR6V5Dc455wpLLYhI6grcDUwFTgGmSRqVc9jNwHwzGwtcDsyIzh0GXAWcbman\nAl2Bz0Xn3Ag8a2YjgT9Ez51zztVAmi2RicAqM3vNzHYDvwY+lXPMKOA5ADNbAQyTdDSwBdgN9JR0\nCNATWBudczHwUPT4IeDTKV6Dc865ItIMIoOB1VnP10Tbsi0ELgGQNBE4DhhiZpuAO4H/B6wDNpvZ\nv0XnDDCzDdHjDcCAdKrvnHOuI2kGkTjzqdwOHCWpHbgGaAf2SjoB+BowDDgG6CXpCwe9QJizxedt\ncc65GjkkxbLXAkOzng8ltEbeZ2ZbgS9nnkt6FXgF+CQw08w2Rtt/A0wBfgVskDTQzN6QNAh4M9+L\nS/Lg4pxzJTIzlXJ8mi2RucAIScMkdQcuA57IPkDSkdE+JF0FPG9m24AVwCRJPSQJ+CiwNDrtCeCK\n6PEVwG8LVcDMmvLnlltuqXkdmvn6fve70MC9+urmvL7vfjdc3/33V7bcF18M5ba01Pb60v6p9fuX\n5k8SqQURM9tD6KJ6mhAAHjWzZZKulnR1dNgpwGJJy4GPAddF5y4AfkEIRIuiY++Pft8OnC9pJfCR\n6LlzFTNzJlxwQfjdjNK6vpkzoVs32Ly5suW6+pZmdxZm9hTwVM62+7IetwEnFTj3e8D38mzfRGiZ\nOJeKtja49lr4whdgyxY44oha16hy9u2DWbPg0UfDNVZSWxt87GOwbFlly3X1zUesN6CWlpZaVyFV\ntby+PXvghRfgnHPg9NNh9uzKv0Ytr2/5cujbF847D9atg40bK1f2zJlw6aWwd29L5QqtQ83+91cq\nDyINqNn/E9fy+l58EYYMCR+0U6aEb9eVVsvra2sL19W1K0ycGFollbBxYwhKU6fCpk0tlSm0TjX7\n31+pPIg4l2XmTJg8OTyePLn58iLZ1zdlSuWub9asEJT69QutuS1bKlOuq38eRJzLMnNm+HCF8GE7\na1bIIzSL3OurVBDJlCvB0KGwenXH57jm4EHEuSyZ7h6A/v3h6KObJ1G8aROsXQtjxoTnkybB3Lmh\n5VCu7H83DyKdiwcR5yIbNoQP2pNP3r+tmbq0Zs2CM8+EQ6J7Mvv0CR/4ixYVP68je/bAnDlw1lnh\n+dChsGZN8XNc8/Ag4lykrS18O++S9VeRVnK9FrJbCxmVuL5Fi+DYY0NQgnBjgrdEOg8PIs5FspPO\nGc3UEsl3fZVIrre1HViud2d1Lh5EnItkJ50zxoyp/HiKWsh0OU2adOD2SgTJ3H83DyKdiwcR54Bd\nu6C9Pdymmq1r19DXX6nxFLWSPf4l20knhWlK1q9PXnZuN5kHkc7Fg4hzhABy4on5pzhphi6tfF1Z\nEPI/kycnz4usXw/vvAMjR+7flkmsJ5zPzzUYDyLOkT/pnNEMyfVi11dOEMnkQ7JvRjjiiPD8nXeS\nlekaiwcR5yj8TR1Cd9acOZUZT1Erxa6vnOR6blI9w7u0Og8PIs6RP6me0adPuIW13PEUtZJv/Eu2\niRNhwQLYubP0sgv9u3kQ6Tw8iLhOb/XqkFg/4YTCxzRyl1a+LqdsvXuHnEZ7e2nl7twJCxcefDMC\neBDpTDyIuE4v09WjIouCNnJyvVhXVkaSLq32dhgxIgShXD5qvfPwIOI6vWJJ54xGb4l0dH1JkuvF\nugB91Hrn4UHEdXpxvqmPHBnuNipnPEUtFBr/kivTEinlttxCSXXw7qzOxIOI69Teew+WLIEzzih+\nXLnjKWolM/7l8MOLHzd8OOzdG/+D36x4S8SDSOfhQcR1anPnwujR0LNnx8c2YpdWnK4sCPmgUvI+\nq1eHoDN8eP79PuCw8/Ag4jq1OF1ZGY2YXC/l+kpJrnd0M0KvXtCjR+PPOeY65kHEdWpxv6lDyCss\nXJhsPEWtlHJ9pXTXFevKyvDkeufgQcR1Wh316+dKOp6iVjLjX44/Pt7xEybA0qWwfXvHxxZLqmd4\nXqRz8CDiOq1XXoHu3cOHXVyN1KUVZ/xLth494NRTQ56omO3bQ7CZMKH4cR5EOgcPIq7TKqUVktFI\nyfVSurIy4gTJuXNDsOnRo/hxHkQ6Bw8irtMqJemckfmQbYS7jpJcX5wgGbdcH7XeOXgQcZ1Wkm/q\npY6nqJW4419yxQmScVtwnljvHDyIuE5p61ZYtQrGjy/tPKky65Knbe7csLRvR11OuYYMCeesWpV/\nv1m8pDp4d1ZnkWoQkTRV0nJJL0m6Ic/+PpIek7RQ0mxJo6PtJ0lqz/rZLOnaaN90SWuy9k1N8xpc\nc3rhBRg3LiTWS9UIyfUkXVkZxbq0Vq0KQWbIkI7LGTIE1q6FffuS1cM1htSCiKSuwN3AVOAUYJqk\nUTmH3QzMN7OxwOXADAAzW2Fm481sPDAB2A48Fp1jwF2Z/Wb2+7SuwTWvJEn1jEZIrifpqssoFiRL\n+Xfr0SOscvjmm8nq4RpDmi2RicAqM3vNzHYDvwY+lXPMKOA5CIEDGCbp6JxjPgq8bGbZDeOYNy06\nl18539RLGU9RC5nxL2m0ROJ2ZWV4cr35pRlEBgPZH/xrom3ZFgKXAEiaCBwH5DaUPwc8nLPtq1EX\n2IOSjqpclV1nsG8fzJqV/EP2sMPijaeolVdegUMPLW38S7Zx4+Dll2HLloP3ldqC8+R680sziMS5\nCfJ24ChJ7cA1QDuwN7NTUnfgIuAfs865BxgOjAPWA3dWqsKuc1i+HPr2hYEDk5dRz8n1crrqALp1\ng9NPh9mzD9y+ZUsIUGPHxi/Lk+vN75AUy14LZH8XGkpojbzPzLYCX848l/Qq8ErWIR8H5pnZW1nn\nvJl1/APAk4UqMH369Pcft7S00NLSUuIluGZUapdMPpMnwy9/WZn6VFo5XVkZmS6t88/fv2327BBc\nSrkZwYNIfWttbaW1tbWsMtIMInOBEZKGAeuAy4Bp2QdIOhJ4z8x2SboKeN7MtmUdMg14JOecQWaW\nWRroM8DiQhXIDiLOZZT7TR3Ch/Rf/EXIP8SdVqRa2trgS18qr4zJk+Geew7cluTfbehQWLCgvLq4\n9OR+ub711ltLLiO17iwz20PoonoaWAo8ambLJF0t6erosFOAxZKWAx8DrsucL6kXIan+m5yi75C0\nSNJC4MPA19O6BtecKtES6Wg8Ra1kxr+MG1deOZMnh5ZH9u25Sf7dPLHe/NJsiWBmTwFP5Wy7L+tx\nG3BSgXPfBfrl2X55havpOpFNm0L3yqmnll9WpstnxIjyy6qUF14IAyiTjH/J1r8/9OsHy5aFRbsy\nNyP84helleOJ9ebnI9ZdpzJrVlgX5JAKfH2qx+R6JbrqMrLXF1m2DI4+OgSXUgweHNal37u342Nd\nY/Ig4jqVSnRlZdTjyPVKJNUzsoNk0nIPPTTcCffGG5Wpk6s/HkRcp1LJb+rjxoVbXvONp6iFcse/\n5MoNIkn/3fwOrebmQcR1Gnv2wJw5MGlSZcorNJ6iVjLjXwYMqEx5o0eHrqiNG8trwXlyvbl5EHGd\nxosvhj76vn0rV2Y9zaNVznxZ+XTtGvJHv/sdrFsXZgVOwpPrzc2DiOs0KtmVlVFPyfU0rm/yZPjh\nD+Gss0JQScK7s5qbBxHXaVQyqZ4xaVLIQ9TDdOeVTKpnTJkSBguWU64HkeaW6jgR5wDefjt8yJZ6\ne2ilzZwJN91U2TL79w+3vj7wAAwaVNmyS7FzZ1i7I2mXUyGZ/FE5LZxqB5Hdu+H11+HEE6v3mp2Z\nBxGXuttuCx9y995buzps2BAGGp58cuXLvu46eLLgDG7Vc/31lRn/ku2oo+Cb34Szz05eRrUT648/\nDt//fv3c8NDsZMUWUwYk3QU8aGZLqlOlypBkHV2bq44zzwxBZNGi2tXht7+F++6Dp57q+FhXWbt3\nQ69eYf2VSge5fL7xDXjkkXBnmSuNJMyspNng4uRElgH3S3pB0p9HkyY6F8v27WEBp1degc2ba1eP\nNJLOLp5u3UKX37p11Xm9mTNDy3PXruq8XmfXYRAxs5+a2dmE5WuHESZMfFjSuWlXzjW+uXNDP/3p\np4d5nWoljaS6i69aeZEdO2Dx4jDv19q16b+ei3l3VrRe+smE5WzfIqxI+A1Jj6ZYN9cEMi2AWt4K\nu2sXtLeHMQ+uNqoVRObNg1GjwqSYPsCxOjrsoZT0Q8Lqgn8E/peZZb5P3iFpRZqVc42vrQ2+8IUw\nh1Lu+hTVsmABnHACHHFEbV7fVS+5nmlxvvWW31ZcLXFaIouAsWb2lawAknFWCnVyTcJsf0tk8uTa\njafwfEjtVWvUeua99rEp1RMniGwGumWeSDpK0qcBzOydtCrmGt+qVWHhpiFD9o+nWLas+vXwIFJ7\n1fhQN9vfEvEgUj1xgsgt2cEiejw9tRq5ppGbzK7V1OmeVK+9anyov/ZaWKr4uON8vq5qihNE8t0z\nnHAWHdeZ5LYAapFcX706jFE54YTqvq47UDWCSOb/m+QzB1dTnCAyT9Jdkk6QdGKUaJ+XdsVc48vX\nEqn2jLeZOqik4VOu0gYODDMGpDl2I/v/m3dnVU+cIPJVYDfwKPBrYAfwl2lWyjW+LVvg5ZfDwk0Z\nY8aEAWcbN1avHp4PqQ9du4ZAkubYjez3un//8H9wx470Xs8FcQYbbjOzG8zsjOjnJjN7txqVc41r\n9uwwwLB79/3bunYNU4rPmlW9engQqR9ptg62bYMVK8L/OYAuXeCYY7xLqxo6DCKS+kv6gaR/lfRc\n9PPHalTONa5CyexqJtffew+WLIEzzqjO67ni0gwic+bA2LFhPFKGJ9erI0531q+A5cDxhLuyXgPm\nplcl1wwKtQCqmVyfOzcs8dqjR3VezxWXZhDJ9//Nk+vVESeIfMDMHgB2mdnzZvYl4CMp18s1sH37\nQndWvpbIWWeFD/c9e9Kvh9/aW1/S/FDP9157cr064gSRzP0Ub0i6UNLpQJ8U6+Qa3LJl8IEP5F+E\nqk8fOPbY6kwL7/mQ+pJW91L2IMNsHkSqI04Q+Z+SjgKuB/4KeAD4eqq1cg2tow/vKVPSv9U3e8oV\nVx/S+lBfuRIOPzwk0qvxeu5ARYNINHvvSDN7x8wWm1mLmZ1uZk9UqX6uAXXUjVSN5Porr4Q7w4YO\nTfd1XHxpfagX+rLgifXqKBpEzGwvMK1KdXFNIk5LJO0g4q2Q+pMZu/Hee5Utt9B77Yn16ojTnfUf\nku6WdI6k0yVNiPIizh1k48YwoHDMmMLHjBwZVjlMc/nSmTM9qV5vunSBwYMrP+CwUMu3X7+wsub2\n7ZV9PXegOEFkPDAa+C5wJ/CD6HeHJE2VtFzSS5JuyLO/j6THJC2UNFvS6Gj7SZLas342S7o22tdX\n0rOSVkp6JsrXuDoxa1ZY/KlrkdnVunRJfwqUtjZvidSjSncxvfMOvP46nHbawfsk79Kqhjgj1lvM\n7Nzcn47Oi/IpdwNTgVOAaZJG5Rx2MzDfzMYSlt+dEb3mCjMbb2bjgQnAduCx6JwbgWfNbCTwh+i5\nqxNxu5HSTK5v3RqmoR8/Pp3yXXKVzovMng0TJoR13Kvxeu5gcVY2vAUwwmy+ltluZt/t4NSJwCoz\ney0q59fAp4DsFSVGAbdH5a2QNEzS0Wb2VtYxHwVeNrPMf4WLgQ9Hjx8CWvFAUjfa2uCv/7rj4yZP\nhm9/O506vPBCmLMre8oVVx8q/aHe0ZcWb4mkL0531rvRzzZgH/AJYFiM8wYD2W/fmmhbtoXAJQCS\nJgLHAUNyjvkc8HDW8wFmtiF6vAEYEKMurgr27AnTT0ya1PGxEyfCwoVhmvZK86R6/ap0EOmo29KT\n6+nrsCViZj/Ifi7p+8AzMcq2jg/hdmCGpHZgMdAO7M16re6E9d0PyqdEdTNJBV9n+vTp7z9uaWmh\npaUlRpVcUosWhT/aPjGGovbuDSNGQHt7vKBTipkz4StfqWyZrjKGDoWnnqpMWXv3hu6sYv9/hg6F\n+fMr83rNqLW1ldbW1rLK6DCI5NGLg1sU+awFsu/SH0pojbzPzLYCX848l/Qq8ErWIR8H5uV0b22Q\nNNDM3pA0CHizUAWyg4hLX6nJ7MytvpUMIvv2heT+z39euTJd5VSye2nJkjC9fL9+hY8ZOhQef7wy\nr9eMcr9c33rrrSWXEWcW38VZP0uAFUQJ8A7MBUZEeY7uwGXAAYMUJR0Z7UPSVcDzZrYt65BpwCM5\n5T4BXBE9vgL4bYy6uCootRspjeT68uWhJTRwYGXLdZVRye6sOHOjeWI9fXFaIhdlPd4DbDCz3R2d\nZGZ7JF0DPE1YTvdBM1sm6epo/32Eu7b+PuqSehG4MnO+pF6EpPpVOUXfDvyDpCsJMwpfGuMaXBW0\ntcG3vhX/+MmT4W/+JkxRUqmVB/3W3vrWr18YbPjuu9CrV3llzZwJZ59d/BhPrKdPZsVTF5ImAUvN\nbEv0/AhglJnNrkL9EpNkHV2bq5z168O062+/HcaBxGEGgwaFu6mOPbYy9bjyynDL51/8RWXKc5U3\nYgQ8+SScfHJ55YwcCb/5TfGBrWYh/7Z+PRxxRHmv1xlIwsxK+koX58/9XsKdWRnvRtuce19bW8ht\nxA0gEFoflZ5Hy0eq179K3DH11lvw5ptwyinFj5P8Dq20xfqTN7N9WY/3ErqnnHtf0m6kSs6jtWlT\n+LA49dTKlOfSUYkupra2sDZNnC8tnhdJV5wg8qqkayV1k9Rd0nUceAeVc4nHZlQyuT5rFpx5JhyS\n5J5DVzWV+FAvZcExDyLpihNE/hw4m3DL7hpgEuB34bv37dwJCxaEAYSlmjABli6tzCR5nlRvDJX4\nUC/lS4sn19MVZ+6sDWZ2mZn1j36mmVnBsRmu82lvD0nO3r1LP/eww0L309y55dfDR6o3hnKDyO7d\nYQDhWWfFfz3PiaQnzjiRX2TPlBvNvPuzdKvlGkm5yexKJNf37Al3eVV69LurvHI/1BcuhGHD4Mgj\n47+et0TSE6c76zQzeyfzxMz+BPh6Iu595XYjVSK5/uKLoduib9/yynHpK7d7qdQWpweRdMUJIpLU\nN+tJX/zuLBepxFrmmbVFyhnW47f2No4+fULLccuWZOeXklSH/UHEh42lI04QuRNok3SbpP8JtAHf\nT7darlGsXh0+EIYPT17GkCHQo0dYAyQpT6o3jszYjaStg1K/tBxxRLgV+J13Oj7WlS5OYv0XhOna\n3wTeAD4TbXPu/T/ocqctKfdWX0+qN5akQWTt2jBlyogRpb+eJ9fTEXew4RIz+wnwe+Cz0USMzlWs\nG6mc5PqGDWGgYbnTaLjqSfqhnunKKvVLi+dF0hPn7qzBkr4haQ5hksSuhIWinKtYN1I5LZEkU664\n2kqaXE/6/82DSHoK/tlJulpSK/AscBRh3Y/1ZjbdzBZXqX6ujr37bhgoOGFC+WWNGwcvv5ws2epJ\n9caT9EM96XvtQSQ9xb673Q1sBaaZ2Xc8cLhcc+eGGVR79Ci/rG7d4PTTw0p1pfJ8SONJ8qG+Y0dY\nPfPMM0t/PR+1np5iQWQQ8K/AjyUtk3Qb0K061XKNoNJ3RCXp0tq1K4yYTzLliqudJEFk/nwYNSrZ\nOiSeWE9PwSBiZm+b2T1m9mHgAmAzYWna5ZL+tmo1dHWr0t1ISZLr7e1w4om+VkSjyXyolzJ2o5z/\nb96dlZ64d2etNrMfmNkE4GJgR7rVcvXOrPItkcmTQ3fWvn0dH5vh40MaU5KxG+W810mCloun5PtZ\nzGylmX03jcq4xrFqVZg8cciQypXZvz984AOwbFn8czyp3rhKaR1kZkZI+l736hX+v27cmOx8V5jf\nFOkSSSuZXeo8Wp5Ub1ylBJHXXgtjQ447LvnreXI9Hb58T0paW+HOO2tdi/QsXw7XXFP5cqdMgdtv\nhyee6PjYvXtDYv2EEypfD5e+Y4+Fb34T7o2x2PZbb5U/M0KmS2v8+ORl3Hgj3HJLZe5IzPbww2EU\nfpI7z2qtYBCRNAEwQNHvA5jZ/BTr1fB+//vQNfPZz9a6Julpaal8mf/tv4VvjHH7rocMKX/KFVcb\n3/lOuDEirrFjy3u9cpPr69bBHXfAJz4BH/pQeXXJ9ZOfwEUXNVkQIUy8aEAPYAKwKNp+GjAX8J7o\nIlavhqlTw38MF1+vXnDhhbWuhauGIUMqm1PrSLlBJHP7+cyZlQ0iO3bAvHlhwG0jKnaLb4uZnQus\nA043swnR3Vnjo22uiDVrqvsH4pwrrhJBZNy48iYKzWf+/LBaY6OOY4mTWD85e7S6mb0IjEqvSs1h\n9erwn9Y5Vx8qsRjW9deH35W8VXjmTDj77MZN+scJIoskPSCpRdK5kn4KLEy7Yo1s374wZbW3RJyr\nH+WMWt+5MyzL++lPl7/2Ta6ZM+HSS5s7iHwJWApcB1wbPf5SmpVqdG++GdZ/PuywWtfEOZcxZEj4\nclfKYNaM+fPhpJOgd+/9K3FWQmbQ7kUXwfbt4afRxFmU6j3gXuAmM/uMmf3QzHzEehHeleVc/enR\nAw4/PNwuXKrs8UiljmUqJjP+ZdgwGDy4MVsjcdYTuRhoJyxIhaTxkmLcxd95eVLdufqUNLmePeVK\nuatw5is3s2RwIybX43RnTQfOAv4EYGbtwPFxCpc0NZqw8SVJN+TZ30fSY5IWSpotaXTWvqMk/VM0\ng/BSSWdF26dLWiOpPfqZGqcu1eQtEefqU5Lkuhn853/un3Jl7Njka9/kyp7KpVEniYwTRHabWe40\naR32KkrqSliTZCpwCjBNUu5dXTcD881sLHA5MCNr3wzgX81sFGFsyvJouwF3mdn46Of3Ma6hqjyI\nOFefknzbf/318HvYsPC7e/fka9/kyu4ma+YgskTSF4BDJI2Q9BMgTo/gRGCVmb1mZruBXwOfyjlm\nFPAcgJmtAIZJOlrSkcA5ZvazaN8eM9ucdV5dj1H2IOJcfUryQZ1vXfdKJNe3bYMVK0JASlq3ehAn\niHwVGA3sBB4BtgBfi3HeYCD7n2RNtC3bQuASAEkTgeOAIcBw4C1JP5c0X9JPJfXMrlPUBfagpKNi\n1KWqPIg4V5+SfFDnm+SzEsn1OXNC19ihh4bnQ4Y0Zk4kzgSMnzCzmwldTwBI+i/AP3ZwXpzhOLcD\nMyS1A4sJCfy9QHfgdOAaM5sj6UfAjcB3gHuAzFT0txGmZ7kyX+HTp09//3FLSwstaUz2lIcn1p2r\nT0mDyLRpB26bPBm++MVwu3CXhHOh566PUouWSGtrK62trWWVIetg6KWkdjMb39G2POdNAqab2dTo\n+U3APjO7o8g5rwKnAr2BNjMbHm3/IHCjmV2Yc/ww4EkzOzVPWdbRtaVh717o2RO2bg19p865+vHK\nK3DuufvzHB15992wzs3GjQeP+zrxRHj8cRg9Ov+5HbnwQvjSl/ZP0rppExx/fGkLdVWaJMyspHRB\nsVl8Pw58Ahgs6cfsz0McDuyOUfZcYET0Qb8OuAw4IJ5HuY/3zGyXpKuA581sG7BN0mpJI81sJfBR\nYEl0ziAzWx8V8RlCC6ZuvPEG9O3rAcS5ejR4MKxfH77sde3a8fFz5sBpp+UfOJzp0koSRDKDDO+/\nf/+2Pn3CHFpbt4bxLI2iWENsHTCPsBTuvKyfJ4CPdVSwme0BrgGeJoxyf9TMlkm6WtLV0WGnAIsl\nLY/KvC6riK8Cv5K0kHB3VmZd9zskLYq2fxj4eqwrrRLPhzhXvw49NHzJ27Ah3vGZpHo+5STXV64M\ngeKYY/Zvy4wVabTkesGWiJktBBZKetjMdiUp3MyeAp7K2XZf1uM24KQir3/Q7PpmdnmSulSLBxHn\n6lvmgzr7A7yQmTPhiivy75syBWbMyL8vTrn5VuTMJNdPOSVZubUQJyU0LBr0t1TSq9HPK6nXrEF5\nUt25+hb3236my6nQ8stjxoSusSTrthcqtxFbInGCyM8Jc2ftAVqAh4BfpVinhuYtEefqW9wP6pde\nCoukFWqxdO0aViKcNav0OmSPVE9St3oSJ4j0MLN/I9zJ9bqZTQc+mW61GpcHEefqW9zxGIW6nLIl\nGS/yzjvh7rDTTjt4X7MGkR3RFCarJF0j6RKgV8r1algeRJyrb3E/qIsl1TOSJNdnz4YJE6Bbt+R1\nqydxgsjXgJ6EtUTOAP4rUCDV5DyIOFff4n5Qx2mJTJoUbgPesyf+6xcrtxFHrcdZT+QFM9tqZqvN\n7ItmdomZJegFbH67d4e1CgYNqnVNnHOFxAkimzfDq6+GaUmK6dMHjj0WFpcwWq1Ysj5TtxqMk06s\n2GDDJ7OeGvsHGxqAmV2cYr0a0vr1YXTrIXEmk3HO1cSgQWH10T17Cv+tFutyyjV5cmhdjC86h0ew\nd28oe9Kk/PuPPDKMF9m8GY6qu1kB8yvWErkz+nkFeA+4H/gp8G60zeXwrizn6l+3bnD00eFLXyFx\nurIySkmuL1kCAwdCv36Fj2m0vEjBIGJmrWbWCnzQzC4zsyfN7AkzmwacU7UaNhAPIs41ho4+qOMk\n1TNKSa7HKbdpgkiWnpJOyDyRdDwh0e5yeBBxrjEU+6DeuzeM/YgbRE46Kdy2W6xlkxGnhdNoyfU4\nQeTrwHOSnpf0PGERqTjriXQ6PlrducZQLIgsXQoDBoQurzi6dInfGokTRJquJRItPzuSMDnitcBI\nM3s67Yo1Im+JONcYin1Ql9KVlREniLz1VvjpaF6spgkiks6Lfn+WMCX8CcCJwCejAYcuhwcR5xpD\nsS6jUpLUZhRZAAAWUUlEQVTqGXGS621tcNZZHS9i1WhBpNjNqB8C/gBcRP5VCn+TSo0amAcR5xpD\nRy2Rb3yjtPImToQFC2Dnzv3L3eYrN04Lp2mCiJndEv3+YtVq08B27gwrkw0YUOuaOOc6UuiD+u23\nw8JypS401bs3jBwJ7e2Fx4DMnAnf/GbHZWVaSWZhzEi9KzbY8Po8mzODDs3M7kqtVg1o3bowiCnO\namnOudoaODB86du168BVSDNdTkn+jjNdWvmCyO7dMG9eKLsjvXuH1symTfCBD5Rej2or1jt3OGGt\n8+yfw7N+XBbvynKucXTtGgLJ2rUHbk+SVM8ollxfuBCGDw8j0uNopC6tYt1Z06tYj4bnQcS5xpLp\nNho+fP+2mTPhppuSlTdlCtxwQ/5uqFKT9ZkgMm5csrpUU4ezPEnqAVxJWA+9B/vnzvpyulVrLB5E\nnGssud/2S+lyymf48DAf1+rVYVLGbG1t8LGPJa9bPYsz2PCXwABgKtAKDAW2pVinhuRBxLnGkvtB\nvWgRHHdc8okPpcK3+pbaEmmkUetxgsiJZvZtYJuZPUQYM5IwVjcvH63uXGPJDSJJxofkyhdE1q6F\nd9+FESOS162exQkiu6LfmyWdChwFxJwQoPPwlohzjSX3g7qcpHpGvuR6ptxSbtdttiDyU0l9gW8B\nTwBLge+lWqsG5EHEucaS22VUiZbIhAlh7q3t28srtymCiKSlkr4F/NHMNpnZ82Y23MyONrN7q1jH\nuvfee7B1a/wJ25xztZf9Qb1uXfgbHjmyvDJ79IAxY2Du3P3bkrRwhgwJ3WD79pVXn2oo1hL5PGFs\nyDOS5kj6uqRjqlSvhrJmDQwe3PGcOM65+tG/P2zZAjt2JOtyKiQ7L7JjR0jYn3lmaWX06BEGHb79\ndvn1SVuxRakWmNmNZnYC8FXgOGCWpOckfaVqNWwAnlR3rvF06QLHHBP+fivRlZWRHUTmzYNRo6BX\nr9LLaZQurVjfnc1sFmFdkSuAPsDdaVaq0Xg+xLnGlPmgrkRSPSOTXDcrr9ymCSKSJkq6C3gdmA7c\nC3i3VhYPIs41piFD4OWXw7QkpXY5FSuzRw9Ytaq8Fk7DBxFJfyvpZeDvgLXAFDP7sJnda2axeuok\nTZW0XNJLkm7Is7+PpMckLZQ0W9LorH1HSfonScuiJP+kaHtfSc9KWinpGUkJhwZVjgcR5xrT0KHw\n+ONhidvevStX7uTJIYB09pbITmCqmZ1hZnea2RpJF8YtWFJXQrfXVMKUKdMkjco57GZgvpmNBS4H\nZmTtmwH8q5mNAk4DlkXbbwSeNbORhPVOboxbp7R4EHGuMQ0dCk8/Xbl8SMaUKfDwwyFRf9xxycpo\nlFHrxRLrt5rZSzmbbyuh7InAKjN7zcx2A78GPpVzzCjCmu2Y2QpgmKSjJR0JnGNmP4v27TGzzdE5\nFwMPRY8fAj5dQp1S4Yl15xrT0KFhzqw0gsgzz4TfSe/4aoaWSLkGA9n/BGuibdkWApdAyL0Q7gAb\nAgwH3pL0c0nzJf1UUs/onAFmtiF6vIEwr1dimzaFBaXK4S0R5xpT5u+2Ukn1jLFj4bDDyiu3UYJI\nh7P45ri6hGPzLamb63ZghqR2YDHQDuwFugOnA9eY2RxJPyJ0W33ngBcwM0kFX2f69OnvP25paaGl\npeWgYz79afj2t+H882PUNo933w33gjfC4jHOuQMdfzxccAEMG1bZcrt3h8suC2UnNXgwrF8Pe/em\nt9hda2srra2tZZUhs+Kf9ZIuBX5vZlskfZvw4X6bmc3v4LxJwHQzmxo9vwnYZ2Z3FDnnVeBUwiDH\nNjMbHm0/B7jBzC6UtBxoMbM3JA0CnjOzk/OUZR1dG4T5/3v2hFtu6fDQvJYvh4svhpUrk53vnHOF\nDBwI8+eH8SzVIAkzK6kDLk531rejAPJB4DzgQeCeGOfNBUZIGiapO3AZYe6t7AofGe1D0lXA82a2\nzczeAFZLykxCcB6wJHr8BGG8CtHv38aoS0FTphRejSwO78pyzqWlEZLrcYLI3uj3hcBPzexfCN1N\nRZnZHuAa4GnCpI2PmtkySVdLynSLnQIsjloXHwOuyyriq8CvJC0k3J31t9H224HzJa0EPhI9T2zy\nZJg1K/kcNZ5Ud86lpRHyInFyImsl3Q+cD9wu6TDij3R/CngqZ9t9WY/bgJMKnLsQOGj4j5ltAj4a\n5/Xj6N8f+vWDZctg9OiOj8/lLRHnXFoaIYjECQaXEloTF5jZO4RpT/461VpVWaHVyOLwIOKcS0uz\nBJGBwO/M7CVJ5xKCygvpVqu6MqNLk/Ag4pxLS7MEkd8AeySdCNxHGMfxcKq1qrJykuseRJxzaWmW\nxPq+KEl+CfATM/trYFC61aquMWPCojQbN5Z+rifWnXNpaZaWyC5JnyfMbfUv0bZu6VWp+rp2hYkT\nw11apdiyJQwEOqrmU0A655rRMcfAhg2wZ0+ta1JYnCDyZWAy8L/M7FVJxwO/TLda1ZckuZ7pyqrE\namjOOZerW7ew7Pb69bWuSWEdBhEzWwL8FfCipDHA6mKjzhtVkuS650Occ2mr9y6tDseJSGohzJb7\nerTpWElXmNnzaVas2iZNgrlzQ7PxkJgzinkQcc6lrd6T63G6s+4ijBH5kJl9CLgA+GG61aq+Pn3g\n2GNh0aL453hS3TmXtnpvicQJIodEa30AYGYrKX3234aQWRs5Lm+JOOfS1gxBZJ6kByS1SDpX0gOE\nyRWbTqnJdQ8izrm0NUMQ+XPC0rTXEiZFXAL8jzQrVSveEnHO1Zt6DyJF1xORdAjwYr71Oupd3PVE\nsu3bFyZjXLIEBnUwnNIMeveGN96Aww8vo6LOOVfEmjVhHNu6dem/VsXXE4lGqq+QlHCp+cbSpUv8\n1sg774S7uDyAOOfSNGgQvP027NpV65rkF6c7qy+wRNIfJT0Z/TzR4VkNKm4Q8a4s51w1dO0aVjis\nRkskiTh3WX0753lpfUQNZsqUsOZ6RzyIOOeqJZMXqfRa8JVQMIhIGgEMMLPWnO0fBOp4EH55Jk6E\nBQtg50449NDCx3kQcc5VSz0n14t1Z/0I2JJn+5ZoX1Pq3RtGjoT29uLHeRBxzlVLPY9aLxZEBpjZ\nQeO3o23D06tS7cUZL+Kj1Z1z1dKoLZFiE5wfVumK1JM4yXVviTjnqqVRg8hcSV/J3SjpKmBeelWq\nvUxLpNgwEw8izrlqqecgUnCwoaSBwGPALvYHjQnAocBnzKyuk+tJBhtmmIVb6ubMCZMy5tvfs2dY\nCbFnzzIr6pxzHXjjDTjtNHjzzXRfJ8lgw4J3Z5nZG5KmAOcCYwi39v6Lmf2xvGrWP2l/ayRfEHn7\n7RA8PIA456qhf3/YvBl27IDD6iyZ0NGIdTOzP5rZj83sJ50hgGQUS657Ut05V01duoSlcteurXVN\nDhZnxHqnVCy57vkQ51y11WtexINIARMmwNKlsH37wfs8iDjnqs2DSIPp0QPGjAlL5ubyIOKcqzYP\nIg2oUF7Eg4hzrtqGDq3PUeupBhFJUyUtl/SSpBvy7O8j6TFJCyXNljQ6a99rkhZJapf0Qtb26ZLW\nRNvbJU1Nq/6Fgogn1p1z1TZkSCdriUjqCtwNTAVOAaZJGpVz2M3AfDMbC1wOzMjaZ0CLmY03s4k5\n2++Kto83s9+ndQ2Z5HrucBNviTjnqq0zdmdNBFaZ2Wtmthv4NfCpnGNGAc8BmNkKYJiko7P2Fxr0\nUtJgmKSGDAm5kVWr9m/bty/cZuctEedcNXXGIDIYyL7kNdG2bAuBSwAkTQSOAzIfzwb8m6S50VQr\n2b4adYE9KKnYHF9ly73V98034cgj62/Aj3OuufXrF+4WzXfHaC2lGUTizDlyO3CUpHbgGqAd2Bvt\n+6CZjQc+DvylpHOi7fcQZhEeR1jX5M6K1jpHbl7Eu7Kcc7Ug1eeU8HFWNkxqLZD9cTuU0Bp5n5lt\nBb6ceS7pVeCVaN+66Pdbkh4jdI/9u5m9mXX8A8CThSowffr09x+3tLTQ0tJS8kVMmQIPPrj/uSfV\nnXO1kkmujxxZmfJaW1tpbW0tq4yCEzCWS9IhwArgPGAd8AIwzcyWZR1zJPCeme2KuqzONrMvSuoJ\ndDWzrZJ6Ac8At5rZM5IGZSZ/lPR14Ewz+3ye1088AWO2Xbugb9+wvvERR8CPfwwrV8Ldd5ddtHPO\nleTyy+EjH4EvfjGd8is6AWO5zGyPpGuAp4GuwINmtkzS1dH++wh3bf29JANeBK6MTh8APCYpU8df\nmdkz0b47JI0jdJe9Clyd1jUAdO8Op58Os2fD+ed7d5ZzrnbqMbmeZncWZvYU8FTOtvuyHrcBJ+U5\n71VCziNfmZdXuJodyiTXM0Fk/Phq18A550IQmT+/1rU4kI9YjyE7ue4tEedcrdTjqHUPIjFMngyz\nZoUxIp5Yd87VSj2OWvcgEkP//uEe7RdfDCuMDc4d7eKcc1VQjzkRDyIxTZkCjz0W7tTq3r3WtXHO\ndUZ9+sDu3bB1a61rsp8HkZgmT4ZHH/V8iHOudqT6a414EIlpyhRYtsyDiHOutuotue5BJKYxY6B3\nb0+qO+dqq96S66mOE2kmXbvCWWd5S8Q5V1tDh8LDD9dPIPEgUoLbboNBg2pdC+dcZ/b5z0OXOupD\nSm3urFqr1NxZzjnXWSSZO6uO4plzzrlG40HEOedcYh5EnHPOJeZBxDnnXGIeRJxzziXmQcQ551xi\nHkScc84l5kHEOedcYh5EnHPOJeZBxDnnXGIeRJxzziXmQcQ551xiHkScc84l5kHEOedcYh5EnHPO\nJeZBxDnnXGIeRJxzziXmQcQ551xiqQYRSVMlLZf0kqQb8uzvI+kxSQslzZY0Omvfa5IWSWqX9ELW\n9r6SnpW0UtIzko5K8xqcc84VlloQkdQVuBuYCpwCTJM0Kuewm4H5ZjYWuByYkbXPgBYzG29mE7O2\n3wg8a2YjgT9EzzuV1tbWWlchVX59jc2vr3NJsyUyEVhlZq+Z2W7g18Cnco4ZBTwHYGYrgGGSjs7a\nn2/B+IuBh6LHDwGfrmitG0Cz/yf262tsfn2dS5pBZDCwOuv5mmhbtoXAJQCSJgLHAUOifQb8m6S5\nkq7KOmeAmW2IHm8ABlS64s455+I5JMWyLcYxtwMzJLUDi4F2YG+074Nmti5qmTwrabmZ/fsBL2Bm\nkuK8jnPOuRTILJ3PYEmTgOlmNjV6fhOwz8zuKHLOq8CpZrYtZ/stwFYzu0vSckKu5A1Jg4DnzOzk\nPGV5cHHOuRKZWb40QkFptkTmAiMkDQPWAZcB07IPkHQk8J6Z7Yq6rJ43s22SegJdzWyrpF7ABcCt\n0WlPAFcAd0S/f5vvxUv9h3DOOVe61IKIme2RdA3wNNAVeNDMlkm6Otp/H+Gurb+PWg0vAldGpw8A\nHpOUqeOvzOyZaN/twD9IuhJ4Dbg0rWtwzjlXXGrdWc4555pf041Y72iAY6MrNAizUUn6maQNkhZn\nbWuaAaUFrm+6pDXRe9guaWot65iUpKGSnpO0RNKLkq6NtjfF+1fk+prl/TssGuS9QNJSSf872l7S\n+9dULZFogOMK4KPAWmAOMM3MltW0YhUU3Xwwwcw21boulSDpHGAb8AszOzXa9j3gbTP7XvRFoI+Z\nNeSg0gLX9/6NIjWtXJkkDQQGmtkCSb2BeYRxW1+iCd6/Itd3KU3w/gFI6mlm2yUdAvwH8FeEsXix\n379ma4nEGeDYDJrmpoHotu0/5WxumgGlBa4PmuA9NLM3zGxB9HgbsIwwFqwp3r8i1wdN8P4BmNn2\n6GF3Qu76T5T4/jVbEIkzwLHRFRqE2Uw6w4DSr0Zzxj3YqN092aK7MMcDs2nC9y/r+mZFm5ri/ZPU\nRdICwvv0nJktocT3r9mCSPP0zRV2tpmNBz4O/GXUXdK0LPS3Ntv7eg8wHBgHrAfurG11yhN19fwz\ncJ2Zbc3e1wzvX3R9/0S4vm000ftnZvvMbBxhppAPSTo3Z3+H71+zBZG1wNCs50MJrZGmYWbro99v\nAY8RuvCazYaoP5poQOmbNa5PRZnZmxYBHqCB30NJ3QgB5Jdmlhmz1TTvX9b1/Z/M9TXT+5dhZpuB\n3wETKPH9a7Yg8v4AR0ndCQMcn6hxnSpGUk9Jh0ePM4MwFxc/qyFlBpRCkQGljSr6w8z4DA36HioM\n5HoQWGpmP8ra1RTvX6Hra6L3r1+mK05SD+B8wtRTJb1/TXV3FoCkjwM/Yv8Ax/9d4ypVjKThhNYH\n7B+E2dDXJ+kR4MNAP0L/63eAx4F/AI4lGlBqZu/Uqo7lyHN9twAthK4QA14Frs7qg24Ykj4I/F9g\nEfu7PG4CXqAJ3r8C13czYeaNZnj/TiUkzrtEP780s+9L6ksJ71/TBRHnnHPV02zdWc4556rIg4hz\nzrnEPIg455xLzIOIc865xDyIOOecS8yDiHPOucQ8iLiGJumPki7I2fY1SX9X5JxWSRNSrtcj0dxK\n1+Vsny7p+ujxYdGU29/Jc/5/iabn/kMZddiW9fgTklZIOjaqw7uSji5w7D5JP8h6/lfRzMPOHcSD\niGt0jwCfy9l2GfBwkXNSnc8pmjLiDDMba2Yz8r12NKPCPwNzzOy7eYq5EvjvZnZezNfMt0qpRfvO\nA2YAU83s/0X73gauzz02sgv4jKQP5Nnn3AE8iLhG98/AJzMfotFsq8eY2X9IukfSnGhBoen5Ts75\nBv5nkn4ePT5a0j9JeiH6mZLn3MMk/VxhkbD5klqiXc8Ag6MFiz6Y52W7EZYpWGFmN+cp9zvA2cDP\nJN0h6dB8ryPpi5KeiForzxa4vg8B9wOfNLNXo80G/Ay4rMAMtLujc76er0znsnkQcQ0tWpzrBeAT\n0abPAY9Gj282szOBscCHo2keDiqiwOMZwA/NbCLwZ4SJ9nL9JbDXzE4jTIXxUNTCuAh42czGm9l/\n5Jwj4G+AnWb2jQLX9F3CPHCfN7MbgGvyvM6h0eHjgc+a2bl5ijqMME3Op8xsZc6+bYRA8rV8dQD+\nDviCpCMK7HcO8CDimkN2l9Zl0XMI37TnAfOB0cCoEsr8KHC3pHbCXF6HS+qZc8zZwP8BMLMVwOvA\nSIovWGSEFeSmSBoRsy6FXseAZ4vMa7QL+E/gvxeox4+BK6Kpzg/cGaZ0/wVwbcw6uk7Kg4hrBk8A\n50kaD/Q0s/ZossrrgY+Y2VjCNNeH5Tk3u/XRI+uxgLOi1sR4MxuatQocOceV6v8Suoqeyky5HUOh\n13m3yDn7CEu5TpR0U2550fTfDxNaOvn8iJCb6RWzjq4T8iDiGl60UNBzwM/Zn1A/gvABu0XSAMIi\nXvlskHSypC6Eab0zQeUZsr6FSxqX59x/B74Q7R9JmPV0Rcw6/wb4AfB7SUd2cHi+11lOjABmZjuA\nTxK6pr6c55C7gKsJs0LnnvsnwmyuV+LJdVeABxHXLB4BTo1+Y2YLCWsjLAd+RehCyudG4F8I3T7r\nsrZfC5wR3aa7BPhKnnP/DugiaREhUX6Fme2O9hX70LWojvcSchZPZOU48in0Oh3dZZZ5nT8BU4Fv\nSbooZ99G4DeENbYPOC9yJ2Eae+fy8qngnXPOJeYtEeecc4l5EHHOOZeYBxHnnHOJeRBxzjmXmAcR\n55xziXkQcc45l5gHEeecc4l5EHHOOZfY/weIvyYcr3CuNwAAAABJRU5ErkJggg==\n", "text": [ "" ] } ], "prompt_number": 9 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cross-validation example: model selection" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Goal:** Compare the best KNN model with logistic regression on the iris dataset" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# 10-fold cross-validation with the best KNN model\n", "knn = KNeighborsClassifier(n_neighbors=20)\n", "print cross_val_score(knn, X, y, cv=10, scoring='accuracy').mean()" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "0.98\n" ] } ], "prompt_number": 10 }, { "cell_type": "code", "collapsed": false, "input": [ "# 10-fold cross-validation with logistic regression\n", "from sklearn.linear_model import LogisticRegression\n", "logreg = LogisticRegression()\n", "print cross_val_score(logreg, X, y, cv=10, scoring='accuracy').mean()" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "0.953333333333\n" ] } ], "prompt_number": 11 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cross-validation example: feature selection" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Goal**: Select whether the Newspaper feature should be included in the linear regression model on the advertising dataset" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import pandas as pd\n", "import numpy as np\n", "from sklearn.linear_model import LinearRegression" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 12 }, { "cell_type": "code", "collapsed": false, "input": [ "# read in the advertising dataset\n", "data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv', index_col=0)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 13 }, { "cell_type": "code", "collapsed": false, "input": [ "# create a Python list of three feature names\n", "feature_cols = ['TV', 'Radio', 'Newspaper']\n", "\n", "# use the list to select a subset of the DataFrame (X)\n", "X = data[feature_cols]\n", "\n", "# select the Sales column as the response (y)\n", "y = data.Sales" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 14 }, { "cell_type": "code", "collapsed": false, "input": [ "# 10-fold cross-validation with all three features\n", "lm = LinearRegression()\n", "scores = cross_val_score(lm, X, y, cv=10, scoring='mean_squared_error')\n", "print scores" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[-3.56038438 -3.29767522 -2.08943356 -2.82474283 -1.3027754 -1.74163618\n", " -8.17338214 -2.11409746 -3.04273109 -2.45281793]\n" ] } ], "prompt_number": 15 }, { "cell_type": "code", "collapsed": false, "input": [ "# fix the sign of MSE scores\n", "mse_scores = -scores\n", "print mse_scores" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[ 3.56038438 3.29767522 2.08943356 2.82474283 1.3027754 1.74163618\n", " 8.17338214 2.11409746 3.04273109 2.45281793]\n" ] } ], "prompt_number": 16 }, { "cell_type": "code", "collapsed": false, "input": [ "# convert from MSE to RMSE\n", "rmse_scores = np.sqrt(mse_scores)\n", "print rmse_scores" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[ 1.88689808 1.81595022 1.44548731 1.68069713 1.14139187 1.31971064\n", " 2.85891276 1.45399362 1.7443426 1.56614748]\n" ] } ], "prompt_number": 17 }, { "cell_type": "code", "collapsed": false, "input": [ "# calculate the average RMSE\n", "print rmse_scores.mean()" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "1.69135317081\n" ] } ], "prompt_number": 18 }, { "cell_type": "code", "collapsed": false, "input": [ "# 10-fold cross-validation with two features (excluding Newspaper)\n", "feature_cols = ['TV', 'Radio']\n", "X = data[feature_cols]\n", "print np.sqrt(-cross_val_score(lm, X, y, cv=10, scoring='mean_squared_error')).mean()" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "1.67967484191\n" ] } ], "prompt_number": 19 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Improvements to cross-validation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Repeated cross-validation**\n", "\n", "- Repeat cross-validation multiple times (with **different random splits** of the data) and average the results\n", "- More reliable estimate of out-of-sample performance by **reducing the variance** associated with a single trial of cross-validation\n", "\n", "**Creating a hold-out set**\n", "\n", "- \"Hold out\" a portion of the data **before** beginning the model building process\n", "- Locate the best model using cross-validation on the remaining data, and test it **using the hold-out set**\n", "- More reliable estimate of out-of-sample performance since hold-out set is **truly out-of-sample**\n", "\n", "**Feature engineering and selection within cross-validation iterations**\n", "\n", "- Normally, feature engineering and selection occurs **before** cross-validation\n", "- Instead, perform all feature engineering and selection **within each cross-validation iteration**\n", "- More reliable estimate of out-of-sample performance since it **better mimics** the application of the model to out-of-sample data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Resources\n", "\n", "- scikit-learn documentation: [Cross-validation](http://scikit-learn.org/stable/modules/cross_validation.html), [Model evaluation](http://scikit-learn.org/stable/modules/model_evaluation.html)\n", "- scikit-learn issue on GitHub: [MSE is negative when returned by cross_val_score](https://github.com/scikit-learn/scikit-learn/issues/2439)\n", "- Section 5.1 of [An Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/) (11 pages) and related videos: [K-fold and leave-one-out cross-validation](https://www.youtube.com/watch?v=nZAM5OXrktY) (14 minutes), [Cross-validation the right and wrong ways](https://www.youtube.com/watch?v=S06JpVoNaA0) (10 minutes)\n", "- Scott Fortmann-Roe: [Accurately Measuring Model Prediction Error](http://scott.fortmann-roe.com/docs/MeasuringError.html)\n", "- Machine Learning Mastery: [An Introduction to Feature Selection](http://machinelearningmastery.com/an-introduction-to-feature-selection/)\n", "- Harvard CS109: [Cross-Validation: The Right and Wrong Way](http://nbviewer.ipython.org/github/cs109/content/blob/master/lec_10_cross_val.ipynb)\n", "- Journal of Cheminformatics: [Cross-validation pitfalls when selecting and assessing regression and classification models](http://www.jcheminf.com/content/pdf/1758-2946-6-10.pdf)" ] } ], "metadata": {} } ] }