{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Cross-Validation and the Test Set\n", "\n", "In the last lecture, we saw how keeping some data hidden from our model could help us to get a clearer understanding of whether or not the model was overfitting. This time, we'll introduce a common automated framework for handling this task, called **cross-validation**. We'll also incorporate a designated **test set**, which we won't touch until the very end of our analysis to get an overall view of the performance of our model. " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "from matplotlib import pyplot as plt\n", "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>Survived</th><th>Pclass</th><th>Name</th><th>Sex</th><th>Age</th><th>Siblings/Spouses Aboard</th><th>Parents/Children Aboard</th><th>Fare</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>0</td><td>3</td><td>Mr. Owen Harris Braund</td><td>male</td><td>22.0</td><td>1</td><td>0</td><td>7.2500</td></tr>\n",
"    <tr><th>1</th><td>1</td><td>1</td><td>Mrs. John Bradley (Florence Briggs Thayer) Cum...</td><td>female</td><td>38.0</td><td>1</td><td>0</td><td>71.2833</td></tr>\n",
"    <tr><th>2</th><td>1</td><td>3</td><td>Miss. Laina Heikkinen</td><td>female</td><td>26.0</td><td>0</td><td>0</td><td>7.9250</td></tr>\n",
"    <tr><th>3</th><td>1</td><td>1</td><td>Mrs. Jacques Heath (Lily May Peel) Futrelle</td><td>female</td><td>35.0</td><td>1</td><td>0</td><td>53.1000</td></tr>\n",
"    <tr><th>4</th><td>0</td><td>3</td><td>Mr. William Henry Allen</td><td>male</td><td>35.0</td><td>0</td><td>0</td><td>8.0500</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>882</th><td>0</td><td>2</td><td>Rev. Juozas Montvila</td><td>male</td><td>27.0</td><td>0</td><td>0</td><td>13.0000</td></tr>\n",
"    <tr><th>883</th><td>1</td><td>1</td><td>Miss. Margaret Edith Graham</td><td>female</td><td>19.0</td><td>0</td><td>0</td><td>30.0000</td></tr>\n",
"    <tr><th>884</th><td>0</td><td>3</td><td>Miss. Catherine Helen Johnston</td><td>female</td><td>7.0</td><td>1</td><td>2</td><td>23.4500</td></tr>\n",
"    <tr><th>885</th><td>1</td><td>1</td><td>Mr. Karl Howell Behr</td><td>male</td><td>26.0</td><td>0</td><td>0</td><td>30.0000</td></tr>\n",
"    <tr><th>886</th><td>0</td><td>3</td><td>Mr. Patrick Dooley</td><td>male</td><td>32.0</td><td>0</td><td>0</td><td>7.7500</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>887 rows × 8 columns</p>\n",
"</div>
" ], "text/plain": [ " Survived Pclass Name \\\n", "0 0 3 Mr. Owen Harris Braund \n", "1 1 1 Mrs. John Bradley (Florence Briggs Thayer) Cum... \n", "2 1 3 Miss. Laina Heikkinen \n", "3 1 1 Mrs. Jacques Heath (Lily May Peel) Futrelle \n", "4 0 3 Mr. William Henry Allen \n", ".. ... ... ... \n", "882 0 2 Rev. Juozas Montvila \n", "883 1 1 Miss. Margaret Edith Graham \n", "884 0 3 Miss. Catherine Helen Johnston \n", "885 1 1 Mr. Karl Howell Behr \n", "886 0 3 Mr. Patrick Dooley \n", "\n", " Sex Age Siblings/Spouses Aboard Parents/Children Aboard Fare \n", "0 male 22.0 1 0 7.2500 \n", "1 female 38.0 1 0 71.2833 \n", "2 female 26.0 0 0 7.9250 \n", "3 female 35.0 1 0 53.1000 \n", "4 male 35.0 0 0 8.0500 \n", ".. ... ... ... ... ... \n", "882 male 27.0 0 0 13.0000 \n", "883 female 19.0 0 0 30.0000 \n", "884 female 7.0 1 2 23.4500 \n", "885 male 26.0 0 0 30.0000 \n", "886 male 32.0 0 0 7.7500 \n", "\n", "[887 rows x 8 columns]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# assumes that you have run the function retrieve_data() \n", "# from \"Introduction to ML in Practice\" in ML_3.ipynb\n", "titanic = pd.read_csv(\"data.csv\")\n", "titanic" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We are again going to use the `train_test_split` function to divide our data in two. This time, however, we are not going to be using the holdout data to determine the model complexity. Instead, we are going to hide the holdout data until the very end of our analysis. We'll use a different technique for handling the model complexity. 
" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "np.random.seed(1234) # for reproducibility\n", "train, test = train_test_split(titanic, test_size = 0.2) # hold out 20% of data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We again need to clean our data: " ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "from sklearn import preprocessing\n", "def prep_titanic_data(data_df):\n", " df = data_df.copy()\n", " le = preprocessing.LabelEncoder()\n", " df['Sex'] = le.fit_transform(df['Sex'])\n", " df = df.drop(['Name'], axis = 1)\n", " \n", " X = df.drop(['Survived'], axis = 1).values\n", " y = df['Survived'].values\n", " \n", " return(X, y)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "X_train, y_train = prep_titanic_data(train)\n", "X_test, y_test = prep_titanic_data(test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## K-fold Cross-Validation\n", "\n", "The idea of k-fold cross validation is to take a small piece of our training data, say 10%, and use that as a mini test set. We train the model on the remaining 90%, and then evaluate on the 10%. We then take a *different* 10%, train on the remaining 90%, and so on. We do this many times, and finally average the results to get an overall average picture of how the model might be expected to perform on the real test set. Cross-validation is a highly efficient tool for estimating the optimal complexity of a model. \n", "\n", "
\n", "*Figure: illustration of k-fold cross-validation. Source: scikit-learn docs.*\n", "
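To make the procedure concrete, here is a minimal hand-rolled sketch of k-fold cross-validation. It runs on synthetic stand-in data (`X_demo`, `y_demo`) and uses a hypothetical helper, `manual_k_fold_scores`; neither is part of this lecture's Titanic analysis. For simplicity it uses a shuffled `KFold` split, which may differ from the splitting strategy that scikit-learn's own cross-validation helpers apply by default to classifiers.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption for illustration, not the Titanic set)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

def manual_k_fold_scores(model, X, y, k=10):
    # Split the rows into k folds; each fold serves exactly once as the mini test set
    folds = KFold(n_splits=k, shuffle=True, random_state=0)
    scores = []
    for train_idx, val_idx in folds.split(X):
        # Train on the other k-1 folds (fit discards any earlier fit)
        model.fit(X[train_idx], y[train_idx])
        # Evaluate accuracy on the held-out fold
        scores.append(model.score(X[val_idx], y[val_idx]))
    return np.array(scores)

demo_scores = manual_k_fold_scores(DecisionTreeClassifier(max_depth=3), X_demo, y_demo, k=10)
print(demo_scores.mean())  # average accuracy across the 10 folds
```

Each entry of `demo_scores` is the accuracy on one held-out fold, and their mean is the cross-validated estimate of how the model might perform on unseen data.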
\n", "\n", "The good folks at `scikit-learn` have implemented a function called `cross_val_score` which automates this entire process. It repeatedly selects holdout data, trains the model on the remaining data, and scores the model against the holdout data. While exceptions apply, you can often use `cross_val_score` as a plug-and-play replacement for `model.fit()` and `model.score()` during your model selection phase. " ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0.8028169 , 0.73239437, 0.76056338, 0.81690141, 0.83098592,\n", " 0.8028169 , 0.81690141, 0.78873239, 0.85915493, 0.84285714])" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.model_selection import cross_val_score\n", "from sklearn import tree\n", "\n", "# make a model\n", "T = tree.DecisionTreeClassifier(max_depth = 3)\n", "\n", "# 10-fold cross-validation: hold out 10%, train on the other 90%, repeat 10 times. \n", "cv_scores = cross_val_score(T, X_train, y_train, cv=10)\n", "cv_scores" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.8054124748490945" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cv_scores.mean()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "fig, ax = plt.subplots(1)\n", "\n", "best_score = 0\n", "\n", "# try each candidate depth and keep the one with the highest mean CV score\n", "for d in range(1, 30):\n", "    T = tree.DecisionTreeClassifier(max_depth = d)\n", "    cv_score = cross_val_score(T, X_train, y_train, cv=10).mean()\n", "    ax.scatter(d, cv_score, color = \"black\")\n", "    if cv_score > best_score:\n", "        best_depth = d\n", "        best_score = cv_score\n", "\n", "l = ax.set(title = \"Best Depth : \" + str(best_depth),\n", "           xlabel = \"Depth\", \n", "           ylabel = \"CV Score\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have a reasonable estimate of the optimal depth, we can try evaluating against the unseen testing data. " ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.8426966292134831" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "T = tree.DecisionTreeClassifier(max_depth = best_depth)\n", "T.fit(X_train, y_train)\n", "T.score(X_test, y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Great! We even got slightly higher accuracy on the test set than we did in validation. Don't count on this in general: both scores depend on the particular random split." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Machine Learning Workflow: The Big Picture\n", "\n", "We now have all of the elements that we need to execute the core machine learning workflow. At a high level, here's what should go into a machine learning task:\n", "\n", "1. Separate out the test set from your data. \n", "2. Clean and prepare your data if needed. It is best practice to clean your training and test data separately. It's convenient to write a function for this. \n", "3. Identify a set of candidate models (e.g. decision trees with depth up to 30, logistic models with between 1 and 3 variables, etc). \n", "4. 
Use a validation technique (k-fold cross-validation is usually sufficient) to estimate how your models will perform on the unseen test data. Select the best model as measured by validation. \n", "5. Finally, score the best model against the test set and report the result. \n", "\n", "Of course, this isn't all there is to data science -- you still need to do exploratory analysis, interpret your model, and more. \n", "\n", "We'll discuss model interpretation further in an upcoming lecture. " ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3" } }, "nbformat": 4, "nbformat_minor": 4 }