{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Cross Validation\n", "> Holdout sets are a great start to model validation. However, using a single train and test set is often not enough. Cross-validation is considered the gold standard when it comes to validating model performance and is almost always used when tuning model hyper-parameters. This chapter focuses on performing cross-validation to validate model performance. This is a summary of the lecture \"Model Validation in Python\", via DataCamp.\n", "\n", "- toc: true \n", "- badges: true\n", "- comments: true\n", "- author: Chanseok Kang\n", "- categories: [Python, Datacamp, Machine_Learning]\n", "- image: images/loocv.png" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "plt.rcParams['figure.figsize'] = (8, 8)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The problems with holdout sets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Two samples\n", "After building several classification models based on the `tic_tac_toe` dataset, you realize that some models do not generalize as well as others. You have created training and testing splits just as you have been taught, so you are curious why your validation process is not working.\n", "\n", "After trying a different train/test split, you noticed differing accuracies for your machine learning model. Before getting too frustrated with the varying results, you have decided to see what else could be going on." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table>\n", "<thead>\n", "<tr><th></th><th>Top-Left</th><th>Top-Middle</th><th>Top-Right</th><th>Middle-Left</th><th>Middle-Middle</th><th>Middle-Right</th><th>Bottom-Left</th><th>Bottom-Middle</th><th>Bottom-Right</th><th>Class</th></tr>\n", "</thead>\n", "<tbody>\n",
"<tr><th>0</th><td>x</td><td>x</td><td>x</td><td>x</td><td>o</td><td>o</td><td>x</td><td>o</td><td>o</td><td>positive</td></tr>\n",
"<tr><th>1</th><td>x</td><td>x</td><td>x</td><td>x</td><td>o</td><td>o</td><td>o</td><td>x</td><td>o</td><td>positive</td></tr>\n",
"<tr><th>2</th><td>x</td><td>x</td><td>x</td><td>x</td><td>o</td><td>o</td><td>o</td><td>o</td><td>x</td><td>positive</td></tr>\n",
"<tr><th>3</th><td>x</td><td>x</td><td>x</td><td>x</td><td>o</td><td>o</td><td>o</td><td>b</td><td>b</td><td>positive</td></tr>\n",
"<tr><th>4</th><td>x</td><td>x</td><td>x</td><td>x</td><td>o</td><td>o</td><td>b</td><td>o</td><td>b</td><td>positive</td></tr>\n",
"</tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " Top-Left Top-Middle Top-Right Middle-Left Middle-Middle Middle-Right \\\n", "0 x x x x o o \n", "1 x x x x o o \n", "2 x x x x o o \n", "3 x x x x o o \n", "4 x x x x o o \n", "\n", " Bottom-Left Bottom-Middle Bottom-Right Class \n", "0 x o o positive \n", "1 o x o positive \n", "2 o o x positive \n", "3 o b b positive \n", "4 b o b positive " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tic_tac_toe = pd.read_csv('./dataset/tic-tac-toe.csv')\n", "tic_tac_toe.head()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "40\n", "positive 134\n", "negative 66\n", "Name: Class, dtype: int64\n", "positive 123\n", "negative 77\n", "Name: Class, dtype: int64\n" ] } ], "source": [ "# Create two different samples of 200 observations\n", "sample1 = tic_tac_toe.sample(n=200, random_state=1111)\n", "sample2 = tic_tac_toe.sample(n=200, random_state=1171)\n", "\n", "# Print the number of common observations\n", "print(len([index for index in sample1.index if index in sample2.index]))\n", "\n", "# Print the number of observations in the Class column for both samples\n", "print(sample1['Class'].value_counts())\n", "print(sample2['Class'].value_counts())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that the number of positive observations differs between the two samples (134 versus 123), and that only 40 observations appear in both. A single holdout sample is sometimes not enough to validate a model reliably. You need to use something more robust." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cross-validation\n", "![cv](image/cv.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### scikit-learn's KFold()\n", "You just finished running a colleague's code that creates a random forest model and calculates an out-of-sample accuracy. 
You noticed that your colleague's code did not have a random state, and the errors you found were completely different than the errors your colleague reported.\n", "\n", "To get a better estimate for how accurate this random forest model will be on new data, you have decided to generate some indices to use for KFold cross-validation." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table>\n", "<thead>\n", "<tr><th></th><th>competitorname</th><th>chocolate</th><th>fruity</th><th>caramel</th><th>peanutyalmondy</th><th>nougat</th><th>crispedricewafer</th><th>hard</th><th>bar</th><th>pluribus</th><th>sugarpercent</th><th>pricepercent</th><th>winpercent</th></tr>\n", "</thead>\n", "<tbody>\n",
"<tr><th>0</th><td>100 Grand</td><td>1</td><td>0</td><td>1</td><td>0</td><td>0</td><td>1</td><td>0</td><td>1</td><td>0</td><td>0.732</td><td>0.860</td><td>66.971725</td></tr>\n",
"<tr><th>1</th><td>3 Musketeers</td><td>1</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td><td>0</td><td>1</td><td>0</td><td>0.604</td><td>0.511</td><td>67.602936</td></tr>\n",
"<tr><th>2</th><td>One dime</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0.011</td><td>0.116</td><td>32.261086</td></tr>\n",
"<tr><th>3</th><td>One quarter</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0.011</td><td>0.511</td><td>46.116505</td></tr>\n",
"<tr><th>4</th><td>Air Heads</td><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0.906</td><td>0.511</td><td>52.341465</td></tr>\n",
"</tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " competitorname chocolate fruity caramel peanutyalmondy nougat \\\n", "0 100 Grand 1 0 1 0 0 \n", "1 3 Musketeers 1 0 0 0 1 \n", "2 One dime 0 0 0 0 0 \n", "3 One quarter 0 0 0 0 0 \n", "4 Air Heads 0 1 0 0 0 \n", "\n", " crispedricewafer hard bar pluribus sugarpercent pricepercent \\\n", "0 1 0 1 0 0.732 0.860 \n", "1 0 0 1 0 0.604 0.511 \n", "2 0 0 0 0 0.011 0.116 \n", "3 0 0 0 0 0.011 0.511 \n", "4 0 0 0 0 0.906 0.511 \n", "\n", " winpercent \n", "0 66.971725 \n", "1 67.602936 \n", "2 32.261086 \n", "3 46.116505 \n", "4 52.341465 " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "candy = pd.read_csv('./dataset/candy-data.csv')\n", "candy.head()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "X = candy.drop(['competitorname', 'winpercent'], axis=1).to_numpy()\n", "y = candy['winpercent'].to_numpy()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of training indices: 68\n", "Number of validation indices: 17\n", "Number of training indices: 68\n", "Number of validation indices: 17\n", "Number of training indices: 68\n", "Number of validation indices: 17\n", "Number of training indices: 68\n", "Number of validation indices: 17\n", "Number of training indices: 68\n", "Number of validation indices: 17\n" ] } ], "source": [ "from sklearn.model_selection import KFold\n", "\n", "# Use KFold\n", "kf = KFold(n_splits=5, shuffle=True, random_state=1111)\n", "\n", "# Create splits\n", "splits = kf.split(X)\n", "\n", "# Print the number of indices\n", "for train_index, val_index in splits:\n", " print(\"Number of training indices: %s\" % len(train_index))\n", " print(\"Number of validation indices: %s\" % len(val_index))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This dataset has 85 rows. 
You have created five splits, each containing 68 training and 17 validation indices. You can use these indices to complete 5-fold cross-validation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using KFold indices\n", "You have already created `splits`, which contains indices for the candy-data dataset to complete 5-fold cross-validation. To get a better estimate for how well a colleague's random forest model will perform on new data, you want to run this model on the five different training and validation indices you just created.\n", "\n", "In this exercise, you will use these indices to check the error (mean squared error) of this model on each of the five splits. A for loop has been provided to assist with this process." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "# Create splits again (kf.split returns a generator, which the previous loop exhausted)\n", "splits = kf.split(X)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Split MSE: 151.5028145199104\n", "Split MSE: 173.4624060357644\n", "Split MSE: 132.7340977072911\n", "Split MSE: 81.50364942339418\n", "Split MSE: 217.17904656079338\n" ] } ], "source": [ "from sklearn.ensemble import RandomForestRegressor\n", "from sklearn.metrics import mean_squared_error\n", "\n", "rfc = RandomForestRegressor(n_estimators=25, random_state=1111)\n", "\n", "# Access the training and validation indices of splits\n", "for train_index, val_index in splits:\n", "    # Setup the training and validation data\n", "    X_train, y_train = X[train_index], y[train_index]\n", "    X_val, y_val = X[val_index], y[val_index]\n", "\n", "    # Fit the random forest model\n", "    rfc.fit(X_train, y_train)\n", "\n", "    # Make predictions, and print the mean squared error for this split\n", "    predictions = rfc.predict(X_val)\n", "    print(\"Split MSE: \" + str(mean_squared_error(y_val, predictions)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`KFold()` is a 
great method for accessing individual indices when completing cross-validation. One drawback is needing a for loop to work through the indices though." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## sklearn's cross_val_score()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### scikit-learn's methods\n", "You have decided to build a regression model to predict the number of new employees your company will successfully hire next month. You open up a new Python script to get started, but you quickly realize that sklearn has a lot of different modules. Let's make sure you understand the names of the modules, the methods, and which module contains which method.\n", "\n", "Follow the instructions below to load in all of the necessary methods for completing cross-validation using sklearn. You will use modules:\n", "\n", "- metrics\n", "- model_selection\n", "- ensemble" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import cross_val_score\n", "from sklearn.ensemble import RandomForestRegressor\n", "from sklearn.metrics import mean_squared_error, make_scorer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Implement cross_val_score()\n", "Your company has created several new candies to sell, but they are not sure if they should release all five of them. To predict the popularity of these new candies, you have been asked to build a regression model using the candy dataset. Remember that the response value is a head-to-head win-percentage against other candies.\n", "\n", "Before you begin trying different regression models, you have decided to run cross-validation on a simple random forest model to get a baseline error to compare with any future results." 
] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "130.91371947185584\n" ] } ], "source": [ "rfc = RandomForestRegressor(n_estimators=25, random_state=1111)\n", "mse = make_scorer(mean_squared_error)\n", "\n", "# Setup cross_val_score\n", "# Note: X_train and y_train still hold the training rows from the last KFold split above\n", "cv = cross_val_score(estimator=rfc, X=X_train, y=y_train, cv=10, scoring=mse)\n", "\n", "# Print the mean error\n", "print(cv.mean())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You now have a baseline score to build on. If you decide to build additional models or try new techniques, you should try to get a mean squared error lower than 130.91. Lower errors indicate that your popularity predictions are improving." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Leave-one-out-cross-validation (LOOCV)\n", "- LOOCV\n", "![loocv](image/loocv.png)\n", "- When to use LOOCV?\n", "    - The amount of training data is limited\n", "    - You want the absolute best error estimate for new data\n", "- Be cautious when:\n", "    - Computation resources are limited\n", "    - You have a lot of data\n", "    - You have a lot of parameters to test" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Leave-one-out-cross-validation\n", "Let's assume your favorite candy is not in the candy dataset, and that you are interested in the popularity of this candy. Using 5-fold cross-validation will train on only 80% of the data at a time. The candy dataset only has 85 rows though, and leaving out 20% of the data could hinder the model. 
However, using leave-one-out-cross-validation lets you make the most of your limited dataset and gives you the best estimate for your favorite candy's popularity!" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The mean of the errors is: 9.464989603398694.\n", "The standard deviation of the errors is: 7.265762094853885.\n" ] } ], "source": [ "from sklearn.metrics import mean_absolute_error\n", "# Create scorer\n", "mae_scorer = make_scorer(mean_absolute_error)\n", "\n", "rfr = RandomForestRegressor(n_estimators=15, random_state=1111)\n", "\n", "# Implement LOOCV: cv=85 matches the number of rows, so each fold holds out one observation\n", "scores = cross_val_score(rfr, X, y, cv=85, scoring=mae_scorer)\n", "\n", "# Print the mean and standard deviation\n", "print(\"The mean of the errors is: %s.\" % np.mean(scores))\n", "print(\"The standard deviation of the errors is: %s.\" % np.std(scores))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You have come a long way with model validation techniques. The final chapter will wrap up model validation by discussing how to select the best model and introducing parameter tuning." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }