{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Cross-validation for parameter tuning, model selection, and feature selection ([video #7](https://www.youtube.com/watch?v=6dbrR-WymjI&list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A&index=7))\n", "\n", "Created by [Data School](https://www.dataschool.io). Watch all 10 videos on [YouTube](https://www.youtube.com/playlist?list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A). Download the notebooks from [GitHub](https://github.com/justmarkham/scikit-learn-videos).\n", "\n", "**Note:** This notebook uses Python 3.9.1 and scikit-learn 0.23.2. The original notebook (shown in the video) used Python 2.7 and scikit-learn 0.16." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Agenda\n", "\n", "- What is the drawback of using the **train/test split** procedure for model evaluation?\n", "- How does **K-fold cross-validation** overcome this limitation?\n", "- How can cross-validation be used for selecting **tuning parameters**, choosing between **models**, and selecting **features**?\n", "- What are some possible **improvements** to cross-validation?" 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Review of model evaluation procedures" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Motivation:** Need a way to choose between Machine Learning models\n", "\n", "- Goal is to estimate likely performance of a model on **out-of-sample data**\n", "\n", "**Initial idea:** Train and test on the same data\n", "\n", "- But, maximizing **training accuracy** rewards overly complex models which **overfit** the training data\n", "\n", "**Alternative idea:** Train/test split\n", "\n", "- Split the dataset into two pieces, so that the model can be trained and tested on **different data**\n", "- **Testing accuracy** is a better estimate than training accuracy of out-of-sample performance\n", "- But, it provides a **high variance** estimate since changing which observations happen to be in the testing set can significantly change testing accuracy" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# added empty cell so that the cell numbering matches the video" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import load_iris\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.neighbors import KNeighborsClassifier\n", "from sklearn import metrics" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# read in the iris data\n", "iris = load_iris()\n", "\n", "# create X (features) and y (response)\n", "X = iris.data\n", "y = iris.target" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.9736842105263158\n" ] } ], "source": [ "# use train/test split with different random_state values\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=4)\n", "\n", "# check classification accuracy of KNN with K=5\n", "knn = 
KNeighborsClassifier(n_neighbors=5)\n", "knn.fit(X_train, y_train)\n", "y_pred = knn.predict(X_test)\n", "print(metrics.accuracy_score(y_test, y_pred))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question:** What if we created a bunch of train/test splits, calculated the testing accuracy for each, and averaged the results together?\n", "\n", "**Answer:** That's the essence of cross-validation!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Steps for K-fold cross-validation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. Split the dataset into K **equal** partitions (or \"folds\").\n", "2. Use fold 1 as the **testing set** and the union of the other folds as the **training set**.\n", "3. Calculate **testing accuracy**.\n", "4. Repeat steps 2 and 3 K times, using a **different fold** as the testing set each time.\n", "5. Use the **average testing accuracy** as the estimate of out-of-sample accuracy." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Diagram of **5-fold cross-validation:**\n", "\n", "![5-fold cross-validation](images/07_cross_validation_diagram.png)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# added empty cell so that the cell numbering matches the video" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# added empty cell so that the cell numbering matches the video" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Iteration Training set observations Testing set observations\n", " 1 [ 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24] [0 1 2 3 4] \n", " 2 [ 0 1 2 3 4 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24] [5 6 7 8 9] \n", " 3 [ 0 1 2 3 4 5 6 7 8 9 15 16 17 18 19 20 21 22 23 24] [10 11 12 13 14] \n", " 4 [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 20 21 22 23 24] [15 16 17 18 19] \n", " 5 [ 0 1 2 3 4 5 
6 7 8 9 10 11 12 13 14 15 16 17 18 19] [20 21 22 23 24] \n" ] } ], "source": [ "# simulate splitting a dataset of 25 observations into 5 folds\n", "from sklearn.model_selection import KFold\n", "kf = KFold(n_splits=5, shuffle=False).split(range(25))\n", "\n", "# print the contents of each training and testing set\n", "print('{} {:^61} {}'.format('Iteration', 'Training set observations', 'Testing set observations'))\n", "for iteration, data in enumerate(kf, start=1):\n", " print('{:^9} {} {:^25}'.format(iteration, data[0], str(data[1])))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Dataset contains **25 observations** (numbered 0 through 24)\n", "- 5-fold cross-validation, thus it runs for **5 iterations**\n", "- For each iteration, every observation is either in the training set or the testing set, **but not both**\n", "- Every observation is in the testing set **exactly once**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Comparing cross-validation to train/test split" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Advantages of **cross-validation:**\n", "\n", "- More accurate estimate of out-of-sample accuracy\n", "- More \"efficient\" use of data (every observation is used for both training and testing)\n", "\n", "Advantages of **train/test split:**\n", "\n", "- Runs K times faster than K-fold cross-validation\n", "- Simpler to examine the detailed results of the testing process" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cross-validation recommendations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. K can be any number, but **K=10** is generally recommended\n", "2. 
For classification problems, **stratified sampling** is recommended for creating the folds\n", " - Each response class should be represented with approximately equal proportions in each of the K folds\n", " - scikit-learn's `cross_val_score` function does this by default when the model is a classifier and `cv` is an integer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cross-validation example: parameter tuning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Goal:** Select the best tuning parameters (aka \"hyperparameters\") for KNN on the iris dataset" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import cross_val_score" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1. 0.93333333 1. 1. 0.86666667 0.93333333\n", " 0.93333333 1. 1. 1. ]\n" ] } ], "source": [ "# 10-fold cross-validation with K=5 for KNN (the n_neighbors parameter)\n", "knn = KNeighborsClassifier(n_neighbors=5)\n", "scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')\n", "print(scores)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.9666666666666668\n" ] } ], "source": [ "# use average accuracy as an estimate of out-of-sample accuracy\n", "print(scores.mean())" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0.96, 0.9533333333333334, 0.9666666666666666, 0.9666666666666666, 0.9666666666666668, 0.9666666666666668, 0.9666666666666668, 0.9666666666666668, 0.9733333333333334, 0.9666666666666668, 0.9666666666666668, 0.9733333333333334, 0.9800000000000001, 0.9733333333333334, 0.9733333333333334, 0.9733333333333334, 0.9733333333333334, 0.9800000000000001, 0.9733333333333334, 0.9800000000000001, 0.9666666666666666, 0.9666666666666666, 0.9733333333333334, 0.96, 0.9666666666666666, 0.96, 
0.9666666666666666, 0.9533333333333334, 0.9533333333333334, 0.9533333333333334]\n" ] } ], "source": [ "# search for an optimal value of K for KNN\n", "k_range = list(range(1, 31))\n", "k_scores = []\n", "for k in k_range:\n", " knn = KNeighborsClassifier(n_neighbors=k)\n", " scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')\n", " k_scores.append(scores.mean())\n", "print(k_scores)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Text(0, 0.5, 'Cross-Validated Accuracy')" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "\n", "# plot the value of K for KNN (x-axis) versus the cross-validated accuracy (y-axis)\n", "plt.plot(k_range, k_scores)\n", "plt.xlabel('Value of K for KNN')\n", "plt.ylabel('Cross-Validated Accuracy')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cross-validation example: model selection" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Goal:** Compare the best KNN model with logistic regression on the iris dataset" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.9800000000000001\n" ] } ], "source": [ "# 10-fold cross-validation with the best KNN model\n", "knn = KNeighborsClassifier(n_neighbors=20)\n", "print(cross_val_score(knn, X, y, cv=10, scoring='accuracy').mean())" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.9533333333333334\n" ] } ], "source": [ "# 10-fold cross-validation with logistic regression\n", "from sklearn.linear_model import LogisticRegression\n", "logreg = LogisticRegression(solver='liblinear')\n", "print(cross_val_score(logreg, X, y, cv=10, scoring='accuracy').mean())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cross-validation example: feature selection" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Goal**: Select whether the Newspaper feature should be included in the linear regression model on the advertising dataset" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "from sklearn.linear_model import LinearRegression" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "# read in the advertising 
dataset\n", "data = pd.read_csv('data/Advertising.csv', index_col=0)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "# create a Python list of three feature names\n", "feature_cols = ['TV', 'Radio', 'Newspaper']\n", "\n", "# use the list to select a subset of the DataFrame (X)\n", "X = data[feature_cols]\n", "\n", "# select the Sales column as the response (y)\n", "y = data.Sales" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[-3.56038438 -3.29767522 -2.08943356 -2.82474283 -1.3027754 -1.74163618\n", " -8.17338214 -2.11409746 -3.04273109 -2.45281793]\n" ] } ], "source": [ "# 10-fold cross-validation with all three features\n", "lm = LinearRegression()\n", "scores = cross_val_score(lm, X, y, cv=10, scoring='neg_mean_squared_error')\n", "print(scores)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[3.56038438 3.29767522 2.08943356 2.82474283 1.3027754 1.74163618\n", " 8.17338214 2.11409746 3.04273109 2.45281793]\n" ] } ], "source": [ "# fix the sign of MSE scores\n", "mse_scores = -scores\n", "print(mse_scores)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1.88689808 1.81595022 1.44548731 1.68069713 1.14139187 1.31971064\n", " 2.85891276 1.45399362 1.7443426 1.56614748]\n" ] } ], "source": [ "# convert from MSE to RMSE\n", "rmse_scores = np.sqrt(mse_scores)\n", "print(rmse_scores)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1.6913531708051797\n" ] } ], "source": [ "# calculate the average RMSE\n", "print(rmse_scores.mean())" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ 
"1.6796748419090768\n" ] } ], "source": [ "# 10-fold cross-validation with two features (excluding Newspaper)\n", "feature_cols = ['TV', 'Radio']\n", "X = data[feature_cols]\n", "print(np.sqrt(-cross_val_score(lm, X, y, cv=10, scoring='neg_mean_squared_error')).mean())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Improvements to cross-validation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Repeated cross-validation**\n", "\n", "- Repeat cross-validation multiple times (with **different random splits** of the data) and average the results\n", "- More reliable estimate of out-of-sample performance by **reducing the variance** associated with a single trial of cross-validation\n", "\n", "**Creating a hold-out set**\n", "\n", "- \"Hold out\" a portion of the data **before** beginning the model building process\n", "- Locate the best model using cross-validation on the remaining data, and test it **using the hold-out set**\n", "- More reliable estimate of out-of-sample performance since the hold-out set is **truly out-of-sample**\n", "\n", "**Feature engineering and selection within cross-validation iterations**\n", "\n", "- Normally, feature engineering and selection occur **before** cross-validation\n", "- Instead, perform all feature engineering and selection **within each cross-validation iteration**\n", "- More reliable estimate of out-of-sample performance since it **better mimics** the application of the model to out-of-sample data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Resources\n", "\n", "- scikit-learn documentation: [Cross-validation](https://scikit-learn.org/stable/modules/cross_validation.html), [Model evaluation](https://scikit-learn.org/stable/modules/model_evaluation.html)\n", "- scikit-learn issue on GitHub: [MSE is negative when returned by cross_val_score](https://github.com/scikit-learn/scikit-learn/issues/2439)\n", "- Section 5.1 of [An Introduction to Statistical 
Learning](https://www.statlearning.com/) (11 pages) and related videos: [K-fold and leave-one-out cross-validation](https://www.youtube.com/watch?v=rSGzUy13F_0&list=PL5-da3qGB5IA6E6ZNXu7dp89_uv8yocmf&index=2) (14 minutes), [Cross-validation the right and wrong ways](https://www.youtube.com/watch?v=r64tRyHFAJ8&list=PL5-da3qGB5IA6E6ZNXu7dp89_uv8yocmf&index=3) (10 minutes)\n", "- Scott Fortmann-Roe: [Accurately Measuring Model Prediction Error](http://scott.fortmann-roe.com/docs/MeasuringError.html)\n", "- Machine Learning Mastery: [An Introduction to Feature Selection](https://machinelearningmastery.com/an-introduction-to-feature-selection/)\n", "- Harvard CS109: [Cross-Validation: The Right and Wrong Way](https://github.com/cs109/content/blob/master/lec_10_cross_val.ipynb)\n", "- Journal of Cheminformatics: [Cross-validation pitfalls when selecting and assessing regression and classification models](https://jcheminf.biomedcentral.com/track/pdf/10.1186/1758-2946-6-10.pdf)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Comments or Questions?\n", "\n", "- Email: \n", "- Website: https://www.dataschool.io\n", "- Twitter: [@justmarkham](https://twitter.com/justmarkham)\n", "\n", "© 2021 [Data School](https://www.dataschool.io). All rights reserved." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.4" } }, "nbformat": 4, "nbformat_minor": 1 }