{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Overfitting II\n", "\n", "Last time, we saw a theoretical example of *overfitting*, in which we fit a machine learning model that perfectly fit the data it saw, but performed extremely poorly on fresh, unseen data. In this lecture, we'll observe overfitting in a more practical context, using the Titanic data set again. We'll then begin to study *validation* techniques for finding models with \"just the right amount\" of flexibility. " ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "from matplotlib import pyplot as plt\n", "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table>\n", "  <thead>\n",
"    <tr><th></th><th>Survived</th><th>Pclass</th><th>Name</th><th>Sex</th><th>Age</th><th>Siblings/Spouses Aboard</th><th>Parents/Children Aboard</th><th>Fare</th></tr>\n",
"  </thead>\n", "  <tbody>\n",
"    <tr><th>0</th><td>0</td><td>3</td><td>Mr. Owen Harris Braund</td><td>male</td><td>22.0</td><td>1</td><td>0</td><td>7.2500</td></tr>\n",
"    <tr><th>1</th><td>1</td><td>1</td><td>Mrs. John Bradley (Florence Briggs Thayer) Cum...</td><td>female</td><td>38.0</td><td>1</td><td>0</td><td>71.2833</td></tr>\n",
"    <tr><th>2</th><td>1</td><td>3</td><td>Miss. Laina Heikkinen</td><td>female</td><td>26.0</td><td>0</td><td>0</td><td>7.9250</td></tr>\n",
"    <tr><th>3</th><td>1</td><td>1</td><td>Mrs. Jacques Heath (Lily May Peel) Futrelle</td><td>female</td><td>35.0</td><td>1</td><td>0</td><td>53.1000</td></tr>\n",
"    <tr><th>4</th><td>0</td><td>3</td><td>Mr. William Henry Allen</td><td>male</td><td>35.0</td><td>0</td><td>0</td><td>8.0500</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>882</th><td>0</td><td>2</td><td>Rev. Juozas Montvila</td><td>male</td><td>27.0</td><td>0</td><td>0</td><td>13.0000</td></tr>\n",
"    <tr><th>883</th><td>1</td><td>1</td><td>Miss. Margaret Edith Graham</td><td>female</td><td>19.0</td><td>0</td><td>0</td><td>30.0000</td></tr>\n",
"    <tr><th>884</th><td>0</td><td>3</td><td>Miss. Catherine Helen Johnston</td><td>female</td><td>7.0</td><td>1</td><td>2</td><td>23.4500</td></tr>\n",
"    <tr><th>885</th><td>1</td><td>1</td><td>Mr. Karl Howell Behr</td><td>male</td><td>26.0</td><td>0</td><td>0</td><td>30.0000</td></tr>\n",
"    <tr><th>886</th><td>0</td><td>3</td><td>Mr. Patrick Dooley</td><td>male</td><td>32.0</td><td>0</td><td>0</td><td>7.7500</td></tr>\n",
"  </tbody>\n", "</table>\n", "<p>887 rows × 8 columns</p>\n", "</div>
" ], "text/plain": [ " Survived Pclass Name \\\n", "0 0 3 Mr. Owen Harris Braund \n", "1 1 1 Mrs. John Bradley (Florence Briggs Thayer) Cum... \n", "2 1 3 Miss. Laina Heikkinen \n", "3 1 1 Mrs. Jacques Heath (Lily May Peel) Futrelle \n", "4 0 3 Mr. William Henry Allen \n", ".. ... ... ... \n", "882 0 2 Rev. Juozas Montvila \n", "883 1 1 Miss. Margaret Edith Graham \n", "884 0 3 Miss. Catherine Helen Johnston \n", "885 1 1 Mr. Karl Howell Behr \n", "886 0 3 Mr. Patrick Dooley \n", "\n", " Sex Age Siblings/Spouses Aboard Parents/Children Aboard Fare \n", "0 male 22.0 1 0 7.2500 \n", "1 female 38.0 1 0 71.2833 \n", "2 female 26.0 0 0 7.9250 \n", "3 female 35.0 1 0 53.1000 \n", "4 male 35.0 0 0 8.0500 \n", ".. ... ... ... ... ... \n", "882 male 27.0 0 0 13.0000 \n", "883 female 19.0 0 0 30.0000 \n", "884 female 7.0 1 2 23.4500 \n", "885 male 26.0 0 0 30.0000 \n", "886 male 32.0 0 0 7.7500 \n", "\n", "[887 rows x 8 columns]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# assumes that you have run the function retrieve_data() \n", "# from \"Introduction to ML in Practice\" in ML_3.ipynb\n", "\n", "titanic = pd.read_csv(\"data.csv\")\n", "titanic" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Recall that we diagnosed overfitting by testing our model against some new data. In this case, we don't have any more data. So, what we can do instead is *hold out* some data that we won't let our model see at first. This holdout data is called the *validation* or *testing* data, depending on the use to which we put it. In contrast, the data that we allow our model to see is called the *training* data. `sklearn` provides a convenient function for partitioning our data into training and holdout sets called `train_test_split`. The default and generally most useful behavior is to randomly select rows of the data frame to be in each set. 
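
As a minimal sketch of this behavior, the snippet below splits a small made-up data frame and checks that the two pieces share no rows and together cover the whole frame. The `toy` frame, its column names, and the variable names are hypothetical and are not part of the Titanic data. Passing `random_state` is one way to make the random split reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# hypothetical toy data, used only to illustrate the split
toy = pd.DataFrame({'x': range(10), 'y': [0, 1] * 5})

# hold out 30% of the rows; random_state fixes the random shuffle
toy_train, toy_holdout = train_test_split(toy, test_size=0.3, random_state=0)

# the two pieces share no rows and together cover the whole frame
overlap = set(toy_train.index) & set(toy_holdout.index)
covered = set(toy_train.index) | set(toy_holdout.index)
print(toy_train.shape, toy_holdout.shape, len(overlap), len(covered))
```

Because the rows are selected at random, the holdout set is not simply the last 30% of the frame.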
" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((620, 8), (267, 8))" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "np.random.seed(1234)\n", "train, test = train_test_split(titanic, test_size = 0.3) # hold out 30% of the data\n", "train.shape, test.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we have two data frames. As you may recall from a previous lecture, we need to do some data cleaning, and split each one into predictor variables `X` and a target variable `y`. " ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "from sklearn import preprocessing\n", "\n", "def prep_titanic_data(data_df):\n", "    df = data_df.copy()\n", "\n", "    # encode Sex as integers (LabelEncoder sorts labels, so female -> 0, male -> 1)\n", "    le = preprocessing.LabelEncoder()\n", "    df['Sex'] = le.fit_transform(df['Sex'])\n", "\n", "    # the Name column is not used as a predictor\n", "    df = df.drop(['Name'], axis = 1)\n", "\n", "    # split into predictors X and target y\n", "    X = df.drop(['Survived'], axis = 1)\n", "    y = df['Survived']\n", "\n", "    return X, y" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "X_train, y_train = prep_titanic_data(train)\n", "X_test, y_test = prep_titanic_data(test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we're able to train our model on the `train` data, and then evaluate its performance on the held-out `test` data. This will help us to diagnose and avoid overfitting.\n", "\n", "Let's try using the decision tree classifier again. As you may remember, the `DecisionTreeClassifier()` class takes an argument `max_depth` that governs how many layers of decisions the tree is allowed to make. Larger `max_depth` values correspond to more complicated trees. 
In this way, `max_depth` is a model complexity parameter, much like the `degree` in polynomial regression. \n", "\n", "For example, with a small `max_depth`, the model's scores on the training and test data are relatively close. " ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(0.8290322580645161, 0.8164794007490637)" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn import tree\n", "\n", "T = tree.DecisionTreeClassifier(max_depth = 3)\n", "\n", "T.fit(X_train, y_train)\n", "T.score(X_train, y_train), T.score(X_test, y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "On the other hand, if we use a much higher `max_depth`, we can achieve a substantially better score on the training data, but our performance on the test data does not improve, and can even get worse. " ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(0.9903225806451613, 0.7602996254681648)" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "T = tree.DecisionTreeClassifier(max_depth = 20)\n", "\n", "T.fit(X_train, y_train)\n", "T.score(X_train, y_train), T.score(X_test, y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That looks like overfitting! The model achieves a near-perfect score on the training data, but a much lower one on the test data. " ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[Text(0, 0.5, 'Performance (score)'), Text(0.5, 0, 'Complexity (depth)')]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "fig, ax = plt.subplots(1, figsize = (10, 7))\n", "\n", "# fit one tree per depth and plot its training (black) and test (red) scores\n", "for d in range(1, 30):\n", "    T = tree.DecisionTreeClassifier(max_depth = d)\n", "\n", "    T.fit(X_train, y_train)\n", "\n", "    ax.scatter(d, T.score(X_train, y_train), color = \"black\")\n", "    ax.scatter(d, T.score(X_test, y_test), color = \"firebrick\")\n", "\n", "ax.set(xlabel = \"Complexity (depth)\", ylabel = \"Performance (score)\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Observe that the training score (black) climbs toward a near-perfect value as the depth grows, while the test score (red) tops out around 83% and then even begins to trail off slightly. Increasing performance on the training set combined with decreasing performance on the test set is the trademark of overfitting. \n", "\n", "It looks like the optimal depth might be around 5-7 or so, but random noise in the test scores prevents us from determining exactly what the optimal depth is. This noise reflects the fact that we took a single, random subset of the data for testing. In a more systematic experiment, we would draw many different subsets of the data for each value of depth and average over them. This is what *cross-validation* does, and we'll talk about it in the next lecture." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3" } }, "nbformat": 4, "nbformat_minor": 4 }