{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 4 Pre-Processing and Training Data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4.1 Contents\n",
"* [4 Pre-Processing and Training Data](#4_Pre-Processing_and_Training_Data)\n",
"  * [4.1 Contents](#4.1_Contents)\n",
"  * [4.2 Introduction](#4.2_Introduction)\n",
"  * [4.3 Imports](#4.3_Imports)\n",
"  * [4.4 Load Data](#4.4_Load_Data)\n",
"  * [4.5 Extract Big Mountain Data](#4.5_Extract_Big_Mountain_Data)\n",
"  * [4.6 Train/Test Split](#4.6_Train/Test_Split)\n",
"  * [4.7 Initial Not-Even-A-Model](#4.7_Initial_Not-Even-A-Model)\n",
"    * [4.7.1 Metrics](#4.7.1_Metrics)\n",
"      * [4.7.1.1 R-squared, or coefficient of determination](#4.7.1.1_R-squared,_or_coefficient_of_determination)\n",
"      * [4.7.1.2 Mean Absolute Error](#4.7.1.2_Mean_Absolute_Error)\n",
"      * [4.7.1.3 Mean Squared Error](#4.7.1.3_Mean_Squared_Error)\n",
"    * [4.7.2 sklearn metrics](#4.7.2_sklearn_metrics)\n",
"      * [4.7.2.0.1 R-squared](#4.7.2.0.1_R-squared)\n",
"      * [4.7.2.0.2 Mean absolute error](#4.7.2.0.2_Mean_absolute_error)\n",
"      * [4.7.2.0.3 Mean squared error](#4.7.2.0.3_Mean_squared_error)\n",
"    * [4.7.3 Note On Calculating Metrics](#4.7.3_Note_On_Calculating_Metrics)\n",
"  * [4.8 Initial Models](#4.8_Initial_Models)\n",
"    * [4.8.1 Imputing missing feature (predictor) values](#4.8.1_Imputing_missing_feature_(predictor)_values)\n",
"      * [4.8.1.1 Impute missing values with median](#4.8.1.1_Impute_missing_values_with_median)\n",
"        * [4.8.1.1.1 Learn the values to impute from the train set](#4.8.1.1.1_Learn_the_values_to_impute_from_the_train_set)\n",
"        * [4.8.1.1.2 Apply the imputation to both train and test splits](#4.8.1.1.2_Apply_the_imputation_to_both_train_and_test_splits)\n",
"        * [4.8.1.1.3 Scale the data](#4.8.1.1.3_Scale_the_data)\n",
"        * [4.8.1.1.4 Train the model on the train split](#4.8.1.1.4_Train_the_model_on_the_train_split)\n",
"        * [4.8.1.1.5 Make predictions using the model on both train and test splits](#4.8.1.1.5_Make_predictions_using_the_model_on_both_train_and_test_splits)\n",
"        * [4.8.1.1.6 Assess model performance](#4.8.1.1.6_Assess_model_performance)\n",
"      * [4.8.1.2 Impute missing values with the mean](#4.8.1.2_Impute_missing_values_with_the_mean)\n",
"        * [4.8.1.2.1 Learn the values to impute from the train set](#4.8.1.2.1_Learn_the_values_to_impute_from_the_train_set)\n",
"        * [4.8.1.2.2 Apply the imputation to both train and test splits](#4.8.1.2.2_Apply_the_imputation_to_both_train_and_test_splits)\n",
"        * [4.8.1.2.3 Scale the data](#4.8.1.2.3_Scale_the_data)\n",
"        * [4.8.1.2.4 Train the model on the train split](#4.8.1.2.4_Train_the_model_on_the_train_split)\n",
"        * [4.8.1.2.5 Make predictions using the model on both train and test splits](#4.8.1.2.5_Make_predictions_using_the_model_on_both_train_and_test_splits)\n",
"        * [4.8.1.2.6 Assess model performance](#4.8.1.2.6_Assess_model_performance)\n",
"    * [4.8.2 Pipelines](#4.8.2_Pipelines)\n",
"      * [4.8.2.1 Define the pipeline](#4.8.2.1_Define_the_pipeline)\n",
"      * [4.8.2.2 Fit the pipeline](#4.8.2.2_Fit_the_pipeline)\n",
"      * [4.8.2.3 Make predictions on the train and test sets](#4.8.2.3_Make_predictions_on_the_train_and_test_sets)\n",
"      * [4.8.2.4 Assess performance](#4.8.2.4_Assess_performance)\n",
"  * [4.9 Refining The Linear Model](#4.9_Refining_The_Linear_Model)\n",
"    * [4.9.1 Define the pipeline](#4.9.1_Define_the_pipeline)\n",
"    * [4.9.2 Fit the pipeline](#4.9.2_Fit_the_pipeline)\n",
"    * [4.9.3 Assess performance on the train and test set](#4.9.3_Assess_performance_on_the_train_and_test_set)\n",
"    * [4.9.4 Define a new pipeline to select a different number of features](#4.9.4_Define_a_new_pipeline_to_select_a_different_number_of_features)\n",
"    * [4.9.5 Fit the pipeline](#4.9.5_Fit_the_pipeline)\n",
"    * [4.9.6 Assess performance on train and test data](#4.9.6_Assess_performance_on_train_and_test_data)\n",
"    * [4.9.7 Assessing performance using cross-validation](#4.9.7_Assessing_performance_using_cross-validation)\n",
"    * [4.9.8 Hyperparameter search using GridSearchCV](#4.9.8_Hyperparameter_search_using_GridSearchCV)\n",
"  * [4.10 Random Forest Model](#4.10_Random_Forest_Model)\n",
"    * [4.10.1 Define the pipeline](#4.10.1_Define_the_pipeline)\n",
"    * [4.10.2 Fit and assess performance using cross-validation](#4.10.2_Fit_and_assess_performance_using_cross-validation)\n",
"    * [4.10.3 Hyperparameter search using GridSearchCV](#4.10.3_Hyperparameter_search_using_GridSearchCV)\n",
"  * [4.11 Final Model Selection](#4.11_Final_Model_Selection)\n",
"    * [4.11.1 Linear regression model performance](#4.11.1_Linear_regression_model_performance)\n",
"    * [4.11.2 Random forest regression model performance](#4.11.2_Random_forest_regression_model_performance)\n",
"    * [4.11.3 Conclusion](#4.11.3_Conclusion)\n",
"  * [4.12 Data quantity assessment](#4.12_Data_quantity_assessment)\n",
"  * [4.13 Save best model object from pipeline](#4.13_Save_best_model_object_from_pipeline)\n",
"  * [4.14 Summary](#4.14_Summary)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4.2 Introduction"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In preceding notebooks, we performed preliminary assessments of data quality and refined the question to be answered. We found a small number of observations that clearly indicated whether to replace values or drop a whole row. We determined that predicting the adult weekend ticket price was our primary aim. We threw away records with missing price data, but not before making the most of the other available data to look for any patterns among the states. We didn't see any and decided to treat all states equally; the state label didn't seem to be particularly useful.\n",
"\n",
"In this notebook, we'll start to build machine learning models. Before diving into a machine learning model, however, we'll start by considering how useful the mean value is as a predictor. We never want to go to stakeholders with a machine learning model only to have the CEO point out that it performs worse than just guessing the average! Our first model is always a baseline performance comparator for any subsequent model. Next, we'll build up the process of efficiently creating robust models to compare to our baseline forecast. We can validate steps with our own functions for checking expected equivalences between, say, pandas and sklearn implementations."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4.3 Imports"
]
},
{
"cell_type": "code",
"execution_count": 88,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import os\n",
"import pickle\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"from sklearn import __version__ as sklearn_version\n",
"from sklearn.decomposition import PCA\n",
"from sklearn.preprocessing import scale\n",
"from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV, learning_curve\n",
"from sklearn.preprocessing import StandardScaler, MinMaxScaler\n",
"from sklearn.dummy import DummyRegressor\n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.ensemble import RandomForestRegressor\n",
"from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error\n",
"from sklearn.pipeline import make_pipeline\n",
"from sklearn.impute import SimpleImputer\n",
"from sklearn.feature_selection import SelectKBest, f_regression\n",
"import datetime\n",
"\n",
"from library.sb_utils import save_file"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4.4 Load Data"
]
},
{
"cell_type": "code",
"execution_count": 89,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
" 124\n",
"Name Big Mountain Resort\n",
"Region Montana\n",
"state Montana\n",
"summit_elev 6817\n",
"vertical_drop 2353\n",
"base_elev 4464\n",
"trams 0\n",
"fastSixes 0\n",
"fastQuads 3\n",
"quad 2\n",
"triple 6\n",
"double 0\n",
"surface 3\n",
"total_chairs 14\n",
"Runs 105.0\n",
"TerrainParks 4.0\n",
"LongestRun_mi 3.3\n",
"SkiableTerrain_ac 3000.0\n",
"Snow Making_ac 600.0\n",
"daysOpenLastYear 123.0\n",
"yearsOpen 72.0\n",
"averageSnowfall 333.0\n",
"AdultWeekend 81.0\n",
"projectedDaysOpen 123.0\n",
"NightSkiing_ac 600.0\n",
"resorts_per_state 12\n",
"resorts_per_100kcapita 1.122778\n",
"resorts_per_100ksq_mile 8.161045\n",
"resort_skiable_area_ac_state_ratio 0.140121\n",
"resort_days_open_state_ratio 0.129338\n",
"resort_terrain_park_state_ratio 0.148148\n",
"resort_night_skiing_state_ratio 0.84507\n",
"total_chairs_runs_ratio 0.133333\n",
"total_chairs_skiable_ratio 0.004667\n",
"fastQuads_runs_ratio 0.028571\n",
"fastQuads_skiable_ratio 0.001"
]
},
"execution_count": 91,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"big_mountain.T"
]
},
{
"cell_type": "code",
"execution_count": 92,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(278, 36)"
]
},
"execution_count": 92,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ski_data.shape"
]
},
{
"cell_type": "code",
"execution_count": 93,
"metadata": {},
"outputs": [],
"source": [
"ski_data = ski_data[ski_data.Name != 'Big Mountain Resort']"
]
},
{
"cell_type": "code",
"execution_count": 94,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(277, 36)"
]
},
"execution_count": 94,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ski_data.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4.6 Train/Test Split"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So far, we've treated ski resort data as a single entity. In machine learning, when we train our model on all of our data, we end up with no data set aside to evaluate model performance. We could keep making more and more complex models that fit the data better and better and not realise we are overfitting the model. By partitioning the data into training and testing splits, without letting a model (or missing-value imputation) learn anything about the test split, we have a somewhat independent assessment of how our model might perform in the future. An often overlooked subtlety here is that people all too frequently use the test set to assess model performance _and then compare multiple models to pick the best_. This means their overall model selection process is flawed: The engineer picks the model sans help from the test set. Instead we use held-out data and/or k-fold cross-validation to simulate additional test sets and assess model performance. The formal test set is very useful as a final check on expected future performance."
]
},
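{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the distinction concrete, here is a minimal sketch of choosing between two candidate models with k-fold cross-validation on the training split only, keeping the test split for a single final check. The synthetic data from `make_regression`, the two candidates and all variable names are illustrative assumptions, not part of the ski resort analysis.\n",
"\n",
"```python\n",
"from sklearn.datasets import make_regression\n",
"from sklearn.dummy import DummyRegressor\n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.model_selection import cross_validate, train_test_split\n",
"\n",
"# Synthetic stand-in data (assumption: any tabular regression data would do)\n",
"X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)\n",
"\n",
"# Compare the candidates using cross-validation on the training split only\n",
"candidates = {'mean_baseline': DummyRegressor(strategy='mean'), 'linear': LinearRegression()}\n",
"cv_means = {}\n",
"for name, model in candidates.items():\n",
"    cv_results = cross_validate(model, X_train, y_train, cv=5, scoring='r2')\n",
"    cv_means[name] = cv_results['test_score'].mean()\n",
"\n",
"# Pick the winner without consulting the test split\n",
"best_name = max(cv_means, key=cv_means.get)\n",
"\n",
"# Use the test split once, as a final check on the chosen model\n",
"final_model = candidates[best_name].fit(X_train, y_train)\n",
"test_r2 = final_model.score(X_test, y_test)\n",
"```\n",
"\n",
"The test score is computed once, after the choice has been made, so it plays no part in selecting the model."
]
},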
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What partition sizes would we have with a 70/30 train/test split?"
]
},
{
"cell_type": "code",
"execution_count": 95,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(193.89999999999998, 83.1)"
]
},
"execution_count": 95,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(ski_data) * .7, len(ski_data) * .3"
]
},
{
"cell_type": "code",
"execution_count": 96,
"metadata": {},
"outputs": [],
"source": [
"# Generate train and test sets for the X and y variables; set random_state to get reproducible results\n",
"X_train, X_test, y_train, y_test = train_test_split(\n",
"    ski_data.drop(columns='AdultWeekend'),\n",
"    ski_data.AdultWeekend,\n",
"    test_size=0.3,\n",
"    random_state=47,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 97,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"((193, 35), (84, 35))"
]
},
"execution_count": 97,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Check the shapes of X train and test sets\n",
"X_train.shape, X_test.shape"
]
},
{
"cell_type": "code",
"execution_count": 98,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"((193,), (84,))"
]
},
"execution_count": 98,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Check the shapes of the y train and test sets\n",
"y_train.shape, y_test.shape"
]
},
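{
"cell_type": "markdown",
"metadata": {},
"source": [
"The split sizes above are whole numbers even though 30% of 277 rows is 83.1. A small sketch of the rounding follows. It relies on `train_test_split` rounding a fractional `test_size` up and giving the remaining rows to the train split, which matches the shapes seen above. The toy array `rows` is an illustrative assumption standing in for the real data.\n",
"\n",
"```python\n",
"import math\n",
"\n",
"import numpy as np\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"n_rows = 277  # number of resorts left after removing Big Mountain\n",
"rows = np.arange(n_rows)  # hypothetical stand-in for the real rows\n",
"\n",
"# Expected sizes: the test split is rounded up, the train split gets the rest\n",
"expected_test = math.ceil(0.3 * n_rows)\n",
"expected_train = n_rows - expected_test\n",
"\n",
"train_rows, test_rows = train_test_split(rows, test_size=0.3, random_state=47)\n",
"print(len(train_rows), len(test_rows))  # 193 84\n",
"```"
]
},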
{
"cell_type": "code",
"execution_count": 99,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"((193, 32), (84, 32))"
]
},
"execution_count": 99,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"names_list = ['Name', 'state', 'Region']\n",
"names_train = X_train[names_list]\n",
"names_test = X_test[names_list]\n",
"X_train.drop(columns=names_list, inplace=True)\n",
"X_test.drop(columns=names_list, inplace=True)\n",
"X_train.shape, X_test.shape"
]
},
{
"cell_type": "code",
"execution_count": 100,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"summit_elev int64\n",
"vertical_drop int64\n",
"base_elev int64\n",
"trams int64\n",
"fastSixes int64\n",
"fastQuads int64\n",
"quad int64\n",
"triple int64\n",
"double int64\n",
"surface int64\n",
"total_chairs int64\n",
"Runs float64\n",
"TerrainParks float64\n",
"LongestRun_mi float64\n",
"SkiableTerrain_ac float64\n",
"Snow Making_ac float64\n",
"daysOpenLastYear float64\n",
"yearsOpen float64\n",
"averageSnowfall float64\n",
"projectedDaysOpen float64\n",
"NightSkiing_ac float64\n",
"resorts_per_state int64\n",
"resorts_per_100kcapita float64\n",
"resorts_per_100ksq_mile float64\n",
"resort_skiable_area_ac_state_ratio float64\n",
"resort_days_open_state_ratio float64\n",
"resort_terrain_park_state_ratio float64\n",
"resort_night_skiing_state_ratio float64\n",
"total_chairs_runs_ratio float64\n",
"total_chairs_skiable_ratio float64\n",
"fastQuads_runs_ratio float64\n",
"fastQuads_skiable_ratio float64\n",
"dtype: object"
]
},
"execution_count": 100,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Check the `dtypes` attribute of `X_train` to verify all features are numeric\n",
"X_train.dtypes"
]
},
{
"cell_type": "code",
"execution_count": 101,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"summit_elev int64\n",
"vertical_drop int64\n",
"base_elev int64\n",
"trams int64\n",
"fastSixes int64\n",
"fastQuads int64\n",
"quad int64\n",
"triple int64\n",
"double int64\n",
"surface int64\n",
"total_chairs int64\n",
"Runs float64\n",
"TerrainParks float64\n",
"LongestRun_mi float64\n",
"SkiableTerrain_ac float64\n",
"Snow Making_ac float64\n",
"daysOpenLastYear float64\n",
"yearsOpen float64\n",
"averageSnowfall float64\n",
"projectedDaysOpen float64\n",
"NightSkiing_ac float64\n",
"resorts_per_state int64\n",
"resorts_per_100kcapita float64\n",
"resorts_per_100ksq_mile float64\n",
"resort_skiable_area_ac_state_ratio float64\n",
"resort_days_open_state_ratio float64\n",
"resort_terrain_park_state_ratio float64\n",
"resort_night_skiing_state_ratio float64\n",
"total_chairs_runs_ratio float64\n",
"total_chairs_skiable_ratio float64\n",
"fastQuads_runs_ratio float64\n",
"fastQuads_skiable_ratio float64\n",
"dtype: object"
]
},
"execution_count": 101,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Repeat this check for the test split in `X_test`\n",
"X_test.dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We have only numeric features in X now!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4.7 Initial Not-Even-A-Model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We'll begin by determining how good the mean is as a predictor. In other words, what if we simply say our best guess is the average price?"
]
},
{
"cell_type": "code",
"execution_count": 102,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"63.84730569948186"
]
},
"execution_count": 102,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Calculate the mean of `y_train`\n",
"train_mean = y_train.mean()\n",
"train_mean"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`sklearn`'s `DummyRegressor` easily does this:"
]
},
{
"cell_type": "code",
"execution_count": 103,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[63.8473057]])"
]
},
"execution_count": 103,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Sanity Check\n",
"dumb_reg = DummyRegressor(strategy='mean')\n",
"dumb_reg.fit(X_train, y_train)\n",
"dumb_reg.constant_"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Having established the grand mean, we need to determine how closely it matches, or explains, the actual values. There are many ways of assessing how well one set of values agrees with another, which brings us to the subject of metrics."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4.7.1 Metrics"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 4.7.1.1 R-squared, or coefficient of determination"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One measure is $R^2$, the [coefficient of determination](https://en.wikipedia.org/wiki/Coefficient_of_determination). This is a measure of the proportion of variance in the dependent variable (our ticket price) that is predicted by our \"model\". The linked Wikipedia article gives a nice explanation of how negative values can arise. This is frequently a cause of confusion for newcomers who, reasonably, ask how a squared value can be negative.\n",
"\n",
"Recall the mean can be denoted by $\\bar{y}$, where\n",
"\n",
"$$\\bar{y} = \\frac{1}{n}\\sum_{i=1}^ny_i$$\n",
"\n",
"and where $y_i$ are the individual values of the dependent variable.\n",
"\n",
"The total sum of squares (error) can be expressed as\n",
"\n",
"$$SS_{tot} = \\sum_i(y_i-\\bar{y})^2$$\n",
"\n",
"The above formula should be familiar as it's simply the variance without the denominator to scale (divide) by the sample size.\n",
"\n",
"The residual sum of squares is similarly defined to be\n",
"\n",
"$$SS_{res} = \\sum_i(y_i-\\hat{y}_i)^2$$\n",
"\n",
"where $\\hat{y}_i$ are our predicted values for the dependent variable.\n",
"\n",
"The coefficient of determination, $R^2$, here is given by\n",
"\n",
"$$R^2 = 1 - \\frac{SS_{res}}{SS_{tot}}$$\n",
"\n",
"Putting it into words, it's one minus the ratio of the residual variance to the original variance. Thus, the baseline model here, which always predicts $\\bar{y}$, should give $R^2=0$. A model that perfectly predicts the observed values would have no residual error and so give $R^2=1$. Models that do worse than predicting the mean will have increased the sum of squares of residuals and so produce a negative $R^2$."
]
},
{
"cell_type": "code",
"execution_count": 104,
"metadata": {},
"outputs": [],
"source": [
"# Calculate the R^2 as defined above\n",
"def r_squared(y, ypred):\n",
"    \"\"\"R-squared score.\n",
"\n",
"    Calculate the R-squared, or coefficient of determination, of the input.\n",
"\n",
"    Arguments:\n",
"    y -- the observed values\n",
"    ypred -- the predicted values\n",
"    \"\"\"\n",
"    ybar = np.mean(y)\n",
"    sum_sq_tot = np.sum((y - ybar)**2)\n",
"    sum_sq_res = np.sum((y - ypred)**2)\n",
"    R2 = 1.0 - sum_sq_res / sum_sq_tot\n",
"    return R2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We make our predictions by creating an array the same length as the training set, filled with the single value of the mean."
]
},
{
"cell_type": "code",
"execution_count": 105,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([63.8473057, 63.8473057, 63.8473057, 63.8473057, 63.8473057])"
]
},
"execution_count": 105,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y_tr_pred_ = train_mean * np.ones(len(y_train))\n",
"y_tr_pred_[:5]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Remember the `sklearn` dummy regressor? "
]
},
{
"cell_type": "code",
"execution_count": 106,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([63.8473057, 63.8473057, 63.8473057, 63.8473057, 63.8473057])"
]
},
"execution_count": 106,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y_tr_pred = dumb_reg.predict(X_train)\n",
"y_tr_pred[:5]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that `DummyRegressor` produces exactly the same results and saves us from having to broadcast the mean (or whichever other statistic we used - check out the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyRegressor.html) to see what's available) to an array of the appropriate length. It also gives us an object with `fit()` and `predict()` methods, so we can use it as conveniently as any other `sklearn` estimator."
]
},
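{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a brief illustration of those other statistics, the sketch below fits `DummyRegressor` with the `'mean'`, `'median'` and `'constant'` strategies on a made-up target. The toy values and variable names are assumptions, chosen so the results are easy to check by eye.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.dummy import DummyRegressor\n",
"\n",
"# Toy data (hypothetical): the features are ignored by DummyRegressor\n",
"X_toy = np.zeros((5, 1))\n",
"y_toy = np.array([10.0, 20.0, 30.0, 40.0, 100.0])\n",
"\n",
"mean_pred = DummyRegressor(strategy='mean').fit(X_toy, y_toy).predict(X_toy)\n",
"median_pred = DummyRegressor(strategy='median').fit(X_toy, y_toy).predict(X_toy)\n",
"fixed_pred = DummyRegressor(strategy='constant', constant=50.0).fit(X_toy, y_toy).predict(X_toy)\n",
"\n",
"print(mean_pred[0], median_pred[0], fixed_pred[0])  # 40.0 30.0 50.0\n",
"```\n",
"\n",
"With a skewed target like this one, the median baseline sits well below the mean baseline, which is worth keeping in mind when choosing what to compare a model against."
]
},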
{
"cell_type": "code",
"execution_count": 107,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.0"
]
},
"execution_count": 107,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"r_squared(y_train, y_tr_pred)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Exactly as expected, if we use the average value as our prediction, we get an $R^2$ of zero _on our training set_. What if we use this \"model\" to predict unseen values from the test set? Remember, of course, that our \"model\" is trained on the training set; we still use the training set mean as our prediction."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Make predictions by creating an array the same length as the test set, filled with the single value of the (training) mean."
]
},
{
"cell_type": "code",
"execution_count": 108,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"-0.0015364646867073173"
]
},
"execution_count": 108,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y_te_pred = train_mean * np.ones(len(y_test))\n",
"r_squared(y_test, y_te_pred)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Generally, you can expect performance on a test set to be slightly worse than on the training set. As you are getting an $R^2$ of zero on the training set, there's nowhere to go but negative!"
]
},
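{
"cell_type": "markdown",
"metadata": {},
"source": [
"A tiny worked example, using made-up numbers rather than the ski data, shows the three cases described for $R^2$: predicting the mean, predicting perfectly, and predicting worse than the mean.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.metrics import r2_score\n",
"\n",
"y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # toy observations (assumption)\n",
"\n",
"mean_pred = np.full_like(y_true, y_true.mean())  # always predict the mean\n",
"perfect_pred = y_true.copy()  # predict every value exactly\n",
"reversed_pred = y_true[::-1]  # worse than predicting the mean\n",
"\n",
"print(r2_score(y_true, mean_pred))  # 0.0\n",
"print(r2_score(y_true, perfect_pred))  # 1.0\n",
"print(r2_score(y_true, reversed_pred))  # -3.0\n",
"```\n",
"\n",
"For the reversed predictions the residual sum of squares is 40 against a total sum of squares of 10, so $R^2 = 1 - 40/10 = -3$."
]
},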
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Although $R^2$ is a common metric, and interpretable in terms of the amount of variance explained, it's less appealing if we want an idea of how \"close\" our predictions are to the true values. Metrics that summarise the difference between predicted and actual values are _mean absolute error_ and _mean squared error_."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 4.7.1.2 Mean Absolute Error"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is very simply the average of the absolute errors:\n",
"\n",
"$$MAE = \\frac{1}{n}\\sum_{i=1}^n|y_i - \\hat{y}_i|$$"
]
},
{
"cell_type": "code",
"execution_count": 109,
"metadata": {},
"outputs": [],
"source": [
"def mae(y, ypred):\n",
"    \"\"\"Mean absolute error.\n",
"\n",
"    Calculate the mean absolute error of the arguments\n",
"\n",
"    Arguments:\n",
"    y -- the observed values\n",
"    ypred -- the predicted values\n",
"    \"\"\"\n",
"    abs_error = np.abs(y - ypred)\n",
"    mae = np.mean(abs_error)\n",
"    return mae"
]
},
{
"cell_type": "code",
"execution_count": 110,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"18.149503610835193"
]
},
"execution_count": 110,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mae(y_train, y_tr_pred)"
]
},
{
"cell_type": "code",
"execution_count": 111,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"18.672179249938317"
]
},
"execution_count": 111,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mae(y_test, y_te_pred)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Mean absolute error is arguably the most intuitive of all the metrics. It essentially tells you that, on average, you might expect to be off by around \\\\$19 if you guessed ticket price based on an average of known values."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 4.7.1.3 Mean Squared Error"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Another common metric (and an important one internally for optimizing machine learning models) is the mean squared error. This is simply the average of the square of the errors:\n",
"\n",
"$$MSE = \\frac{1}{n}\\sum_{i=1}^n(y_i - \\hat{y}_i)^2$$"
]
},
{
"cell_type": "code",
"execution_count": 112,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# Calculate the MSE as defined above\n",
"def mse(y, ypred):\n",
"    \"\"\"Mean square error.\n",
"\n",
"    Calculate the mean square error of the arguments\n",
"\n",
"    Arguments:\n",
"    y -- the observed values\n",
"    ypred -- the predicted values\n",
"    \"\"\"\n",
"    sq_error = (y - ypred)**2\n",
"    mse = np.mean(sq_error)\n",
"    return mse"
]
},
{
"cell_type": "code",
"execution_count": 113,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"616.9493046578431"
]
},
"execution_count": 113,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mse(y_train, y_tr_pred)"
]
},
{
"cell_type": "code",
"execution_count": 114,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"574.1671108060107"
]
},
"execution_count": 114,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mse(y_test, y_te_pred)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So here, we get a slightly better MSE on the test set than we did on the train set. And what does a squared error mean anyway? To convert this back to our measurement space, we often take the square root, to form the _root mean square error_ thus:"
]
},
{
"cell_type": "code",
"execution_count": 115,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([24.83846422, 23.96178438])"
]
},
"execution_count": 115,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.sqrt([mse(y_train, y_tr_pred), mse(y_test, y_te_pred)])"
]
},
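{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because errors are squared before averaging, the root mean square error penalises a few large misses more heavily than the mean absolute error does. The sketch below, on invented numbers, compares two sets of predictions that have the same MAE but a different RMSE. All names and values are illustrative assumptions.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.metrics import mean_absolute_error, mean_squared_error\n",
"\n",
"y_true = np.array([10.0, 10.0, 10.0, 10.0])  # toy observations (assumption)\n",
"spread_out = np.array([11.0, 11.0, 11.0, 11.0])  # four small errors of 1\n",
"one_big = np.array([10.0, 10.0, 10.0, 14.0])  # a single error of 4\n",
"\n",
"scores = {}\n",
"for name, pred in [('spread_out', spread_out), ('one_big', one_big)]:\n",
"    mae_value = mean_absolute_error(y_true, pred)\n",
"    rmse_value = np.sqrt(mean_squared_error(y_true, pred))\n",
"    scores[name] = (mae_value, rmse_value)\n",
"    print(name, mae_value, rmse_value)\n",
"# spread_out 1.0 1.0\n",
"# one_big 1.0 2.0\n",
"```\n",
"\n",
"Both sets of predictions are off by 1 on average, but the single large miss doubles the RMSE."
]
},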
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4.7.2 sklearn metrics"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Functions are good, but we don't want to have to define functions every time we want to assess performance. `sklearn.metrics` provides many commonly used metrics, including the ones above."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### 4.7.2.0.1 R-squared"
]
},
{
"cell_type": "code",
"execution_count": 116,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(0.0, -0.0015364646867073173)"
]
},
"execution_count": 116,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"r2_score(y_train, y_tr_pred), r2_score(y_test, y_te_pred)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### 4.7.2.0.2 Mean absolute error"
]
},
{
"cell_type": "code",
"execution_count": 117,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(18.149503610835193, 18.672179249938317)"
]
},
"execution_count": 117,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mean_absolute_error(y_train, y_tr_pred), mean_absolute_error(y_test, y_te_pred)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### 4.7.2.0.3 Mean squared error"
]
},
{
"cell_type": "code",
"execution_count": 118,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(616.9493046578431, 574.1671108060107)"
]
},
"execution_count": 118,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mean_squared_error(y_train, y_tr_pred), mean_squared_error(y_test, y_te_pred)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4.7.3 Note On Calculating Metrics"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In a Jupyter code cell, running `r2_score?` will bring up the docstring for the function, and `r2_score??` will bring up the actual code of the function! The docstring also shows the expected argument order, `y_true` first and then `y_pred`. Here we compare `sklearn`'s function with ours, calling each with the arguments in the correct and in the swapped order."
]
},
{
"cell_type": "code",
"execution_count": 119,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(0.0, -3.054984985780873e+30)"
]
},
"execution_count": 119,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# train set - sklearn\n",
"# correct order, incorrect order\n",
"r2_score(y_train, y_tr_pred), r2_score(y_tr_pred, y_train)"
]
},
{
"cell_type": "code",
"execution_count": 120,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(-0.0015364646867073173, -2.8431378228302645e+30)"
]
},
"execution_count": 120,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# test set - sklearn\n",
"# correct order, incorrect order\n",
"r2_score(y_test, y_te_pred), r2_score(y_te_pred, y_test)"
]
},
{
"cell_type": "code",
"execution_count": 121,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(0.0, -3.054984985780873e+30)"
]
},
"execution_count": 121,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# train set - using our homebrew function\n",
"# correct order, incorrect order\n",
"r_squared(y_train, y_tr_pred), r_squared(y_tr_pred, y_train)"
]
},
{
"cell_type": "code",
"execution_count": 122,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(-0.0015364646867073173, -2.8431378228302645e+30)"
]
},
"execution_count": 122,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# test set - using our homebrew function\n",
"# correct order, incorrect order\n",
"r_squared(y_test, y_te_pred), r_squared(y_te_pred, y_test)"
]
},
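{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can get very different results swapping the argument order. Here the swapped calls treat the near-constant predictions as the observed values, so the total sum of squares in the denominator is almost zero and the score blows up to around $-10^{30}$. It's worth highlighting this because data scientists do this too much in the real world! Frequently the argument order doesn't matter, but it will bite when we do it with a function that does care. It's sloppy, bad practice, and if we don't make a habit of putting arguments in the right order, we are liable to slip up exactly when it matters!\n",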
"\n",
"Remember:\n",
"* argument order matters,\n",
"* check function syntax with `func?` in a code cell"
]
},
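{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to guard against this, sketched below on invented numbers, is to pass the arguments by keyword. `r2_score` names its parameters `y_true` and `y_pred`, so a keyword call no longer depends on position. The arrays `observed` and `predicted` are illustrative assumptions.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.metrics import r2_score\n",
"\n",
"observed = np.array([1.0, 2.0, 3.0, 4.0])  # toy values (assumption)\n",
"predicted = np.array([2.0, 2.0, 3.0, 5.0])\n",
"\n",
"correct = r2_score(observed, predicted)  # positional, correct order\n",
"swapped = r2_score(predicted, observed)  # positional, wrong order\n",
"by_keyword = r2_score(y_pred=predicted, y_true=observed)  # order no longer matters\n",
"\n",
"print(round(correct, 3), round(swapped, 3), round(by_keyword, 3))  # 0.6 0.667 0.6\n",
"```"
]
},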
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4.8 Initial Models"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4.8.1 Imputing missing feature (predictor) values"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Recall when performing EDA, we imputed (filled in) some missing values in pandas. We can also impute missing values using scikit-learn, but here we will learn the values to impute from the train split only and apply them to the test split, so we can then assess how well our imputation worked."
]
},
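{
"cell_type": "markdown",
"metadata": {},
"source": [
"The idea can be sketched with `SimpleImputer` on a made-up frame: the median is learned from the train rows only and then reused to fill gaps in the test rows. The column name and values are illustrative assumptions, and the comparison with pandas `fillna` is the kind of equivalence check mentioned in the introduction.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"from sklearn.impute import SimpleImputer\n",
"\n",
"# Toy splits (hypothetical values); NaN marks a missing entry\n",
"toy_train = pd.DataFrame({'runs': [1.0, np.nan, 3.0, 100.0]})\n",
"toy_test = pd.DataFrame({'runs': [np.nan, 2.0]})\n",
"\n",
"imputer = SimpleImputer(strategy='median')\n",
"imputer.fit(toy_train)  # learn the median from the train split only\n",
"train_filled = imputer.transform(toy_train)\n",
"test_filled = imputer.transform(toy_test)  # test gaps get the train median\n",
"\n",
"print(imputer.statistics_)  # [3.]\n",
"print(test_filled.ravel())  # [3. 2.]\n",
"\n",
"# The pandas approach gives the same filled values\n",
"pandas_filled = toy_test.fillna(toy_train.median())\n",
"```\n",
"\n",
"The missing test value becomes 3, the train median, even though the only observed test value is 2."
]
},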
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 4.8.1.1 Impute missing values with median"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We have missing values. Recall from our data exploration that many distributions were skewed. Our first thought might be to impute missing values using the median."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### 4.8.1.1.1 Learn the values to impute from the train set"
]
},
{
"cell_type": "code",
"execution_count": 123,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"summit_elev 2175.000000\n",
"vertical_drop 750.000000\n",
"base_elev 1280.000000\n",
"trams 0.000000\n",
"fastSixes 0.000000\n",
"fastQuads 0.000000\n",
"quad 0.000000\n",
"triple 1.000000\n",
"double 1.000000\n",
"surface 2.000000\n",
"total_chairs 6.000000\n",
"Runs 29.000000\n",
"TerrainParks 2.000000\n",
"LongestRun_mi 1.000000\n",
"SkiableTerrain_ac 170.000000\n",
"Snow Making_ac 96.500000\n",
"daysOpenLastYear 107.000000\n",
"yearsOpen 57.000000\n",
"averageSnowfall 120.000000\n",
"projectedDaysOpen 112.000000\n",
"NightSkiing_ac 70.000000\n",
"resorts_per_state 15.000000\n",
"resorts_per_100kcapita 0.248243\n",
"resorts_per_100ksq_mile 24.428973\n",
"resort_skiable_area_ac_state_ratio 0.050000\n",
"resort_days_open_state_ratio 0.070595\n",
"resort_terrain_park_state_ratio 0.069444\n",
"resort_night_skiing_state_ratio 0.066804\n",
"total_chairs_runs_ratio 0.200000\n",
"total_chairs_skiable_ratio 0.040323\n",
"fastQuads_runs_ratio 0.000000\n",
"fastQuads_skiable_ratio 0.000000\n",
"dtype: float64"
]
},
"execution_count": 123,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# These are the values we'll use to fill in any missing values\n",
"X_defaults_median = X_train.median()\n",
"X_defaults_median"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### 4.8.1.1.2 Apply the imputation to both train and test splits"
]
},
{
"cell_type": "code",
"execution_count": 124,
"metadata": {},
"outputs": [],
"source": [
"X_tr = X_train.fillna(X_defaults_median)\n",
"X_te = X_test.fillna(X_defaults_median)"
]
},
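{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of the two steps above, the toy example below learns medians from a train split and uses them to fill gaps in both splits. The column names and values are hypothetical and are not taken from the ski resort data. Note that the test split never contributes to the fill values.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Toy data (hypothetical values, for illustration only)\n",
"toy_train = pd.DataFrame({'runs': [10.0, 20.0, np.nan, 90.0],\n",
"                          'chairs': [2.0, np.nan, 4.0, 6.0]})\n",
"toy_test = pd.DataFrame({'runs': [np.nan, 30.0],\n",
"                         'chairs': [3.0, np.nan]})\n",
"\n",
"# Learn the fill values from the train split only\n",
"toy_medians = toy_train.median()\n",
"\n",
"# Apply the same fill values to both splits\n",
"toy_tr = toy_train.fillna(toy_medians)\n",
"toy_te = toy_test.fillna(toy_medians)\n",
"print(toy_te)\n",
"```\n",
"\n",
"Because `fillna` aligns the Series of medians with the DataFrame columns by name, each column is filled with its own train-split median."
]
},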
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### 4.8.1.1.3 Scale the data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As we have features measured in many different units, with numbers that vary by orders of magnitude, start off by scaling them to put them all on a consistent scale. The [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) scales each feature to zero mean and unit variance."
]
},
{
"cell_type": "code",
"execution_count": 125,
"metadata": {},
"outputs": [],
"source": [
"# Call the StandardScaler's `fit()` method on `X_tr` to fit the scaler,\n",
"# then use its `transform()` method to apply the scaling to both the train and test split\n",
"# data (`X_tr` and `X_te`), naming the results `X_tr_scaled` and `X_te_scaled`, respectively\n",
"scaler = StandardScaler()\n",
"scaler.fit(X_tr)\n",
"X_tr_scaled = scaler.transform(X_tr)\n",
"X_te_scaled = scaler.transform(X_te)"
]
},
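{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the scaling step concrete, here is a small sketch on toy arrays (hypothetical values, not the resort data). It checks that `StandardScaler` matches subtracting the train-split mean and dividing by the train-split standard deviation, and that the test split is scaled with the train statistics rather than its own.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"# Toy feature matrices (hypothetical values, for illustration only)\n",
"toy_train = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])\n",
"toy_test = np.array([[4.0, 200.0]])\n",
"\n",
"toy_scaler = StandardScaler().fit(toy_train)\n",
"\n",
"# The same transformation by hand, using train-split statistics only\n",
"train_mean = toy_train.mean(axis=0)\n",
"train_std = toy_train.std(axis=0)  # population standard deviation (ddof=0)\n",
"manual_test = (toy_test - train_mean) / train_std\n",
"\n",
"# Compare the scaler's output with the manual calculation\n",
"print(np.allclose(toy_scaler.transform(toy_test), manual_test))\n",
"```\n",
"\n",
"The learned statistics are stored on the fitted scaler as `mean_` and `scale_`."
]
},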
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### 4.8.1.1.4 Train the model on the train split"
]
},
{
"cell_type": "code",
"execution_count": 126,
"metadata": {},
"outputs": [],
"source": [
"lm = LinearRegression().fit(X_tr_scaled, y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### 4.8.1.1.5 Make predictions using the model on both train and test splits"
]
},
{
"cell_type": "code",
"execution_count": 127,
"metadata": {},
"outputs": [],
"source": [
"#Call the `predict()` method of the model (`lm`) on both the (scaled) train and test data\n",
"#Assign the predictions to `y_tr_pred` and `y_te_pred`, respectively\n",
"y_tr_pred = lm.predict(X_tr_scaled)\n",
"y_te_pred = lm.predict(X_te_scaled)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### 4.8.1.1.6 Assess model performance"
]
},
{
"cell_type": "code",
"execution_count": 128,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(0.8237204449411376, 0.7251410286259974)"
]
},
"execution_count": 128,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# r^2 - train, test\n",
"median_r2 = r2_score(y_train, y_tr_pred), r2_score(y_test, y_te_pred)\n",
"median_r2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Recall that we estimated ticket prices by simply using a known average. As expected, this produced an $R^2$ of essentially zero for both the training and test sets, because $R^2$ tells us how much of the variance we're explaining beyond that of using just the mean. Here, we see that our linear regression model explains over 80% of the variance on the train set and over 70% on the test set. Clearly, we are onto something, although the much lower value for the test set is indicative of overfitting. This isn't a surprise as we've made no effort to select a parsimonious set of features or deal with multicollinearity in our data."
]
},
{
"cell_type": "code",
"execution_count": 129,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(8.495768235382354, 9.696652536263656)"
]
},
"execution_count": 129,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Now calculate the mean absolute error scores using `sklearn`'s `mean_absolute_error` function as we did above for R^2\n",
"# MAE - train, test\n",
"median_mae = mean_absolute_error(y_train, y_tr_pred), mean_absolute_error(y_test, y_te_pred)\n",
"median_mae"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using this model, then, on average we'd expect to estimate a ticket price within \\\\$9 or so of the real price. This is much, much better than the \\\\$19 from just guessing using the average. There may be something to this machine learning lark after all!"
]
},
{
"cell_type": "code",
"execution_count": 130,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(108.75554891895914, 157.57287631288543)"
]
},
"execution_count": 130,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# And also do the same using `sklearn`'s `mean_squared_error`\n",
"# MSE - train, test\n",
"median_mse = mean_squared_error(y_train, y_tr_pred), mean_squared_error(y_test, y_te_pred)\n",
"median_mse"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 4.8.1.2 Impute missing values with the mean"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We chose to use the median for filling missing values because of the skew of many of our predictor feature distributions. For comparison, let's try the mean."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### 4.8.1.2.1 Learn the values to impute from the train set"
]
},
{
"cell_type": "code",
"execution_count": 131,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"summit_elev 4042.036269\n",
"vertical_drop 1057.264249\n",
"base_elev 2975.487047\n",
"trams 0.103627\n",
"fastSixes 0.093264\n",
"fastQuads 0.673575\n",
"quad 0.948187\n",
"triple 1.414508\n",
"double 1.746114\n",
"surface 2.476684\n",
"total_chairs 7.455959\n",
"Runs 41.387435\n",
"TerrainParks 2.447205\n",
"LongestRun_mi 1.301579\n",
"SkiableTerrain_ac 458.691099\n",
"Snow Making_ac 128.935294\n",
"daysOpenLastYear 109.761290\n",
"yearsOpen 56.895833\n",
"averageSnowfall 160.112903\n",
"projectedDaysOpen 114.900621\n",
"NightSkiing_ac 84.843478\n",
"resorts_per_state 16.523316\n",
"resorts_per_100kcapita 0.442984\n",
"resorts_per_100ksq_mile 42.862331\n",
"resort_skiable_area_ac_state_ratio 0.096680\n",
"resort_days_open_state_ratio 0.121639\n",
"resort_terrain_park_state_ratio 0.113116\n",
"resort_night_skiing_state_ratio 0.150272\n",
"total_chairs_runs_ratio 0.266321\n",
"total_chairs_skiable_ratio 0.070053\n",
"fastQuads_runs_ratio 0.010619\n",
"fastQuads_skiable_ratio 0.001700\n",
"dtype: float64"
]
},
"execution_count": 131,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# As we did for the median above, calculate mean values for imputing missing values\n",
"# These are the values we'll use to fill in any missing values\n",
"X_defaults_mean = X_train.mean()\n",
"X_defaults_mean"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"By eye, we can immediately tell that most of our replacement values are higher, often much higher, than those from using the median. This is what we would expect for right-skewed distributions."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### 4.8.1.2.2 Apply the imputation to both train and test splits"
]
},
{
"cell_type": "code",
"execution_count": 194,
"metadata": {},
"outputs": [],
"source": [
"X_tr = X_train.fillna(X_defaults_mean)\n",
"X_te = X_test.fillna(X_defaults_mean)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### 4.8.1.2.3 Scale the data"
]
},
{
"cell_type": "code",
"execution_count": 195,
"metadata": {},
"outputs": [],
"source": [
"scaler = StandardScaler()\n",
"scaler.fit(X_tr)\n",
"X_tr_scaled = scaler.transform(X_tr)\n",
"X_te_scaled = scaler.transform(X_te)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### 4.8.1.2.4 Train the model on the train split"
]
},
{
"cell_type": "code",
"execution_count": 196,
"metadata": {},
"outputs": [],
"source": [
"lm = LinearRegression().fit(X_tr_scaled, y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### 4.8.1.2.5 Make predictions using the model on both train and test splits"
]
},
{
"cell_type": "code",
"execution_count": 197,
"metadata": {},
"outputs": [],
"source": [
"y_tr_pred = lm.predict(X_tr_scaled)\n",
"y_te_pred = lm.predict(X_te_scaled)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### 4.8.1.2.6 Assess model performance"
]
},
{
"cell_type": "code",
"execution_count": 198,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(0.8221207605475709, 0.7290195691422242)"
]
},
"execution_count": 198,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"r2_score(y_train, y_tr_pred), r2_score(y_test, y_te_pred)"
]
},
{
"cell_type": "code",
"execution_count": 137,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(8.510780313012354, 9.565093916371973)"
]
},
"execution_count": 137,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mean_absolute_error(y_train, y_tr_pred), mean_absolute_error(y_test, y_te_pred)"
]
},
{
"cell_type": "code",
"execution_count": 138,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(109.74247309324214, 155.34936226136)"
]
},
"execution_count": 138,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mean_squared_error(y_train, y_tr_pred), mean_squared_error(y_test, y_te_pred)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"These results don't seem very different from the ones we obtained when using the median to impute missing values. Perhaps it doesn't make much difference here. Maybe our overfitting is worse than we thought. Maybe other feature transformations, such as taking the log, would help. We could try with just a subset of features rather than using all of them as inputs.\n",
"\n",
"To perform the median/mean comparison, we copied and pasted a lot of code just to change the function for imputing missing values. It would make more sense to write a function that performed the sequence of steps:\n",
"1. impute missing values\n",
"2. scale the features\n",
"3. train a model\n",
"4. calculate model performance\n",
"\n",
"These are common steps, and `sklearn` provides something much better than writing custom functions."
]
},
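{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, a hand-written version of the four steps listed above might look like the following minimal sketch. The function name `impute_scale_fit_score`, its `strategy` argument, and the synthetic data are all hypothetical and are used only for illustration.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.metrics import r2_score\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"\n",
"def impute_scale_fit_score(X_train, X_test, y_train, y_test, strategy='median'):\n",
"    # 1. impute missing values with statistics learned from the train split\n",
"    fill_values = X_train.median() if strategy == 'median' else X_train.mean()\n",
"    X_tr, X_te = X_train.fillna(fill_values), X_test.fillna(fill_values)\n",
"    # 2. scale the features\n",
"    scaler = StandardScaler().fit(X_tr)\n",
"    X_tr_scaled, X_te_scaled = scaler.transform(X_tr), scaler.transform(X_te)\n",
"    # 3. train a model\n",
"    model = LinearRegression().fit(X_tr_scaled, y_train)\n",
"    # 4. calculate model performance as (train R^2, test R^2)\n",
"    return (r2_score(y_train, model.predict(X_tr_scaled)),\n",
"            r2_score(y_test, model.predict(X_te_scaled)))\n",
"\n",
"\n",
"# Synthetic data (hypothetical, for illustration only)\n",
"rng = np.random.default_rng(0)\n",
"X = pd.DataFrame(rng.normal(size=(60, 3)), columns=['a', 'b', 'c'])\n",
"y = 2 * X['a'] - X['b'] + rng.normal(scale=0.1, size=60)\n",
"X.iloc[::7, 0] = np.nan  # knock out some values in the first column\n",
"\n",
"for strategy in ('median', 'mean'):\n",
"    print(strategy, impute_scale_fit_score(X[:40], X[40:], y[:40], y[40:], strategy))\n",
"```\n",
"\n",
"A helper like this removes the copy and paste, but every new preprocessing step means editing the function by hand."
]
},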
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4.8.2 Pipelines"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One of the most important and useful components of `sklearn` is the [pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html). In place of Pandas's `fillna` DataFrame method, there is `sklearn`'s `SimpleImputer`. Remember the first linear model above performed the steps:\n",
"\n",
"1. replace missing values with the median for each feature\n",
"2. scale the data to zero mean and unit variance\n",
"3. train a linear regression model\n",
"\n",
"and all these steps were trained on the `train split` and then applied to the `test split` for assessment.\n",
"\n",
"The pipeline below defines exactly those same steps. Crucially, the resultant `Pipeline` object has a `fit()` method and a `predict()` method, just like the `LinearRegression()` object itself. Just as we might create a linear regression model, train it with `.fit()`, and predict with `.predict()`, we can wrap the entire process of imputing, feature scaling, and regression in a single object that we can train with `.fit()` and predict with `.predict()`. And that's basically a pipeline: a model on steroids."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 4.8.2.1 Define the pipeline"
]
},
{
"cell_type": "code",
"execution_count": 139,
"metadata": {},
"outputs": [],
"source": [
"pipe = make_pipeline(\n",
" SimpleImputer(strategy='median'), \n",
" StandardScaler(), \n",
" LinearRegression()\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 140,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"sklearn.pipeline.Pipeline"
]
},
"execution_count": 140,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(pipe)"
]
},
{
"cell_type": "code",
"execution_count": 141,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(True, True)"
]
},
"execution_count": 141,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"hasattr(pipe, 'fit'), hasattr(pipe, 'predict')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 4.8.2.2 Fit the pipeline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here, a single call to the pipeline's `fit()` method combines the steps of learning the imputation (determining what values to use to fill the missing ones), learning the scaling (determining the mean to subtract and the standard deviation to divide by), and then training the model. It does this all in the one call with the training data as arguments."
]
},
{
"cell_type": "code",
"execution_count": 142,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')),\n",
" ('standardscaler', StandardScaler()),\n",
" ('linearregression', LinearRegression())])"
]
},
"execution_count": 142,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Call the pipe's `fit()` method with `X_train` and `y_train` as arguments\n",
"pipe.fit(X_train, y_train)"
]
},
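{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see what a single `fit()` call learns, the sketch below fits the same kind of pipeline on a tiny toy array (hypothetical values, not the resort data) and then inspects each fitted step through the pipeline's `named_steps` attribute.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.impute import SimpleImputer\n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.pipeline import make_pipeline\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"# Toy data (hypothetical values, for illustration only)\n",
"toy_X = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [np.nan, 50.0]])\n",
"toy_y = np.array([1.0, 2.0, 3.0, 4.0])\n",
"\n",
"toy_pipe = make_pipeline(SimpleImputer(strategy='median'), StandardScaler(), LinearRegression())\n",
"toy_pipe.fit(toy_X, toy_y)\n",
"\n",
"# Each fitted step keeps what it learned from the training data\n",
"print(toy_pipe.named_steps['simpleimputer'].statistics_)  # medians used to fill gaps\n",
"print(toy_pipe.named_steps['standardscaler'].mean_)       # means subtracted\n",
"print(toy_pipe.named_steps['standardscaler'].scale_)      # standard deviations divided by\n",
"print(toy_pipe.named_steps['linearregression'].coef_)     # fitted coefficients\n",
"```\n",
"\n",
"When `predict()` is later called on new data, these stored values are reused; nothing is re-learned from the new data."
]
},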
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 4.8.2.3 Make predictions on the train and test sets"
]
},
{
"cell_type": "code",
"execution_count": 143,
"metadata": {},
"outputs": [],
"source": [
"y_tr_pred = pipe.predict(X_train)\n",
"y_te_pred = pipe.predict(X_test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 4.8.2.4 Assess performance"
]
},
{
"cell_type": "code",
"execution_count": 144,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(0.8237204449411376, 0.7251410286259974)"
]
},
"execution_count": 144,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"r2_score(y_train, y_tr_pred), r2_score(y_test, y_te_pred)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And compare with our earlier (non-pipeline) result:"
]
},
{
"cell_type": "code",
"execution_count": 145,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(0.8237204449411376, 0.7251410286259974)"
]
},
"execution_count": 145,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"median_r2"
]
},
{
"cell_type": "code",
"execution_count": 146,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(8.495768235382354, 9.696652536263656)"
]
},
"execution_count": 146,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mean_absolute_error(y_train, y_tr_pred), mean_absolute_error(y_test, y_te_pred)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Compare with our earlier result:"
]
},
{
"cell_type": "code",
"execution_count": 147,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(8.495768235382354, 9.696652536263656)"
]
},
"execution_count": 147,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"median_mae"
]
},
{
"cell_type": "code",
"execution_count": 148,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(108.75554891895914, 157.57287631288543)"
]
},
"execution_count": 148,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mean_squared_error(y_train, y_tr_pred), mean_squared_error(y_test, y_te_pred)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Compare with our earlier result:"
]
},
{
"cell_type": "code",
"execution_count": 149,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(108.75554891895914, 157.57287631288543)"
]
},
"execution_count": 149,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"median_mse"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"These results confirm the pipeline is doing exactly what's expected, and the results are identical to those from our earlier steps. This allows us to move faster but with confidence."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4.9 Refining The Linear Model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We suspected the model was overfitting. This is no real surprise given the number of features we blindly used. It's likely a judicious subset of features would generalize better. `sklearn` has a number of feature selection functions available. The one we'll use here is `SelectKBest` which, as we might guess, selects the k best features. We can read about SelectKBest \n",
"[here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html#sklearn.feature_selection.SelectKBest). `f_regression` is just the [score function](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_regression.html#sklearn.feature_selection.f_regression) we're using because we're performing regression. It's important to choose an appropriate one for our machine learning task."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4.9.1 Define the pipeline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Redefine our pipeline to include this feature selection step:"
]
},
{
"cell_type": "code",
"execution_count": 150,
"metadata": {},
"outputs": [],
"source": [
"pipe = make_pipeline(\n",
" SimpleImputer(strategy='median'), \n",
" StandardScaler(),\n",
" SelectKBest(score_func=f_regression),\n",
" LinearRegression()\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4.9.2 Fit the pipeline"
]
},
{
"cell_type": "code",
"execution_count": 151,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')),\n",
" ('standardscaler', StandardScaler()),\n",
" ('selectkbest',\n",
" SelectKBest(score_func=<function f_regression at 0x...>)),\n",
" ('linearregression', LinearRegression())])"
]
},
"execution_count": 151,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pipe.fit(X_train, y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4.9.3 Assess performance on the train and test set"
]
},
{
"cell_type": "code",
"execution_count": 152,
"metadata": {},
"outputs": [],
"source": [
"y_tr_pred = pipe.predict(X_train)\n",
"y_te_pred = pipe.predict(X_test)"
]
},
{
"cell_type": "code",
"execution_count": 153,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(0.760478965339582, 0.681569974499793)"
]
},
"execution_count": 153,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"r2_score(y_train, y_tr_pred), r2_score(y_test, y_te_pred)"
]
},
{
"cell_type": "code",
"execution_count": 154,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(9.757130441228263, 10.585905291034962)"
]
},
"execution_count": 154,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mean_absolute_error(y_train, y_tr_pred), mean_absolute_error(y_test, y_te_pred)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This has made things worse! Clearly selecting a subset of features has an impact on performance. `SelectKBest` defaults to k=10. Let's create a new pipeline with a different value of k:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4.9.4 Define a new pipeline to select a different number of features"
]
},
{
"cell_type": "code",
"execution_count": 155,
"metadata": {},
"outputs": [],
"source": [
"# Modify the `SelectKBest` step to use a value of 15 for k\n",
"pipe15 = make_pipeline(\n",
" SimpleImputer(strategy='median'), \n",
" StandardScaler(),\n",
" SelectKBest(score_func=f_regression, k=15),\n",
" LinearRegression()\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4.9.5 Fit the pipeline"
]
},
{
"cell_type": "code",
"execution_count": 156,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')),\n",
" ('standardscaler', StandardScaler()),\n",
" ('selectkbest',\n",
" SelectKBest(k=15,\n",
" score_func=<function f_regression at 0x...>)),\n",
" ('linearregression', LinearRegression())])"
]
},
"execution_count": 156,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pipe15.fit(X_train, y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4.9.6 Assess performance on train and test data"
]
},
{
"cell_type": "code",
"execution_count": 157,
"metadata": {},
"outputs": [],
"source": [
"y_tr_pred = pipe15.predict(X_train)\n",
"y_te_pred = pipe15.predict(X_test)"
]
},
{
"cell_type": "code",
"execution_count": 158,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(0.7922946911681397, 0.66079117939879)"
]
},
"execution_count": 158,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"r2_score(y_train, y_tr_pred), r2_score(y_test, y_te_pred)"
]
},
{
"cell_type": "code",
"execution_count": 159,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(9.214834764542976, 10.496823817105572)"
]
},
"execution_count": 159,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mean_absolute_error(y_train, y_tr_pred), mean_absolute_error(y_test, y_te_pred)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We could keep going, trying different values of k, training a model, measuring performance on the test set, and then picking the model with the best test set performance. There's a fundamental problem with this approach: _we're tuning the model to the arbitrary test set_! If we continue this way we'll end up with a model that works well on the particular quirks of our test set _but fails to generalize to new data_. The whole point of keeping a test set is for it to be a set of unseen data on which to test performance.\n",
"\n",
"The way around this is a technique called _cross-validation_. We partition the training set into k folds (this k is the number of folds, not the k of `SelectKBest`), train our model on k-1 of those folds, and calculate performance on the fold not used in training. This procedure then cycles through k times with a different fold held back each time. Thus we end up building k models on k sets of data, with k estimates of how the model performs on unseen data, but without having to touch the test set."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4.9.7 Assessing performance using cross-validation"
]
},
{
"cell_type": "code",
"execution_count": 160,
"metadata": {},
"outputs": [],
"source": [
"# Run 5-Fold Cross validation\n",
"cv_results = cross_validate(pipe15, X_train, y_train, cv=5)"
]
},
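{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `cross_validate` call above hides the fold-by-fold loop. As a rough sketch of what happens for a regressor with `cv=5`, the example below writes the loop out by hand on synthetic data (hypothetical, not the resort data) and checks that it matches `cross_validate`. It assumes the default behavior, where an integer `cv` on a regressor uses unshuffled `KFold` splits.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.model_selection import KFold, cross_validate\n",
"\n",
"# Synthetic regression data (hypothetical, for illustration only)\n",
"rng = np.random.default_rng(0)\n",
"toy_X = rng.normal(size=(50, 3))\n",
"toy_y = toy_X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=50)\n",
"\n",
"manual_scores = []\n",
"for train_idx, val_idx in KFold(n_splits=5).split(toy_X):\n",
"    # Train on k-1 folds, then score (R^2) on the held-back fold\n",
"    fold_model = LinearRegression().fit(toy_X[train_idx], toy_y[train_idx])\n",
"    manual_scores.append(fold_model.score(toy_X[val_idx], toy_y[val_idx]))\n",
"\n",
"# cross_validate with cv=5 performs the same loop for a regressor\n",
"auto_scores = cross_validate(LinearRegression(), toy_X, toy_y, cv=5)['test_score']\n",
"print(np.allclose(manual_scores, auto_scores))\n",
"```\n",
"\n",
"Passing a whole pipeline instead of a bare model means the imputation, scaling, and feature selection are also re-learned inside each fold, so the held-back fold stays unseen."
]
},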
{
"cell_type": "code",
"execution_count": 161,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([0.60510478, 0.67731713, 0.75047442, 0.58935004, 0.50041885])"
]
},
"execution_count": 161,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# get scores\n",
"cv_scores = cv_results['test_score']\n",
"cv_scores"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
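"Note that `cv=5` here splits the training data into five consecutive folds without shuffling, so these scores are reproducible for a given train split. With a different train/test split, or with shuffled folds and a different random state, the actual numbers will be different."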
]
},
{
"cell_type": "code",
"execution_count": 162,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(0.6245330431201284, 0.08445948393083175)"
]
},
"execution_count": 162,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.mean(cv_scores), np.std(cv_scores)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"These results highlight that assessing model performance is inherently open to variability. We'll get different results depending on the quirks of which points are in which fold. An advantage of cross-validation is that we can also obtain an estimate of the variability, or uncertainty, in our performance estimate."
]
},
{
"cell_type": "code",
"execution_count": 163,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([0.46, 0.79])"
]
},
"execution_count": 163,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.round((np.mean(cv_scores) - 2 * np.std(cv_scores), np.mean(cv_scores) + 2 * np.std(cv_scores)), 2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4.9.8 Hyperparameter search using GridSearchCV"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Pulling the above together, we have:\n",
"* a pipeline that\n",
" * imputes missing values\n",
" * scales the data\n",
" * selects the k best features\n",
" * trains a linear regression model\n",
"* a technique (cross-validation) for estimating model performance\n",
"\n",
"Now we will run cross-validation for multiple values of k and pick the value of k that gives the best cross-validated performance. `make_pipeline` automatically names each step using the lowercased name of the step's class. Parameters of each step are then accessed by appending a double underscore followed by the parameter name. We know the name of the step will be 'selectkbest', and we know the parameter is 'k'."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also list the names of all the parameters in a pipeline as follows:"
]
},
{
"cell_type": "code",
"execution_count": 164,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dict_keys(['memory', 'steps', 'verbose', 'simpleimputer', 'standardscaler', 'selectkbest', 'linearregression', 'simpleimputer__add_indicator', 'simpleimputer__copy', 'simpleimputer__fill_value', 'simpleimputer__missing_values', 'simpleimputer__strategy', 'simpleimputer__verbose', 'standardscaler__copy', 'standardscaler__with_mean', 'standardscaler__with_std', 'selectkbest__k', 'selectkbest__score_func', 'linearregression__copy_X', 'linearregression__fit_intercept', 'linearregression__n_jobs', 'linearregression__normalize', 'linearregression__positive'])"
]
},
"execution_count": 164,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Call `pipe`'s `get_params()` method to get a dict of available parameters and print their names\n",
"# using dict's `keys()` method\n",
"pipe.get_params().keys()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The above can be particularly useful as our pipelines become more complex (we can even nest pipelines within pipelines)."
]
},
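{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a small sketch of the naming convention described earlier, the example below builds a two-step pipeline, lists its step names, and changes a step's parameter using the `<step name>__<parameter name>` form. The variable name `toy_pipe` is hypothetical.\n",
"\n",
"```python\n",
"from sklearn.feature_selection import SelectKBest, f_regression\n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.pipeline import make_pipeline\n",
"\n",
"toy_pipe = make_pipeline(SelectKBest(score_func=f_regression), LinearRegression())\n",
"\n",
"# Step names are the lowercased class names\n",
"print(list(toy_pipe.named_steps))\n",
"\n",
"# '<step name>__<parameter name>' addresses a parameter of one step\n",
"toy_pipe.set_params(selectkbest__k=3)\n",
"print(toy_pipe.get_params()['selectkbest__k'])\n",
"```\n",
"\n",
"The same double-underscore names are the keys we pass in a parameter grid."
]
},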
{
"cell_type": "code",
"execution_count": 165,
"metadata": {},
"outputs": [],
"source": [
"k = list(range(1, len(X_train.columns) + 1))\n",
"grid_params = {'selectkbest__k': k}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we have a range of `k` to investigate. Is 1 feature best? 2? 3? 4? All of them? We could write a for loop and iterate over each possible value, doing all the housekeeping ourselves to track the best value of k. But this is a common task, so `sklearn` has a built-in class for it. This is [`GridSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html).\n",
"\n",
"`GridSearchCV` takes the pipeline object; in fact, it takes any scikit-learn estimator, that is, an object with a `.fit()` method that can also be scored (via its own `.score()` method or a `scoring` argument). In simple cases with no feature selection, imputation, or feature scaling, we may see the classifier or regressor object itself passed directly into `GridSearchCV`. The other key input is the set of parameters and values to search over. Optional parameters include the cross-validation strategy and the number of CPUs to use."
]
},
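{
"cell_type": "markdown",
"metadata": {},
"source": [
"To illustrate the housekeeping that `GridSearchCV` takes off our hands, the sketch below runs the for loop by hand on synthetic data (hypothetical, not the resort data) and then runs the equivalent grid search. All variable names here are made up for the example.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.feature_selection import SelectKBest, f_regression\n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.model_selection import GridSearchCV, cross_validate\n",
"from sklearn.pipeline import make_pipeline\n",
"\n",
"# Synthetic data (hypothetical): only the first two of five features matter\n",
"rng = np.random.default_rng(0)\n",
"toy_X = rng.normal(size=(80, 5))\n",
"toy_y = 3 * toy_X[:, 0] - 2 * toy_X[:, 1] + rng.normal(scale=0.5, size=80)\n",
"\n",
"toy_pipe = make_pipeline(SelectKBest(score_func=f_regression), LinearRegression())\n",
"\n",
"# The housekeeping we would otherwise do by hand\n",
"manual_means = {}\n",
"for k in range(1, toy_X.shape[1] + 1):\n",
"    toy_pipe.set_params(selectkbest__k=k)\n",
"    manual_means[k] = cross_validate(toy_pipe, toy_X, toy_y, cv=5)['test_score'].mean()\n",
"best_k_manual = max(manual_means, key=manual_means.get)\n",
"\n",
"# GridSearchCV runs the same search and tracks the best value for us\n",
"toy_grid = GridSearchCV(toy_pipe, param_grid={'selectkbest__k': list(range(1, 6))}, cv=5)\n",
"toy_grid.fit(toy_X, toy_y)\n",
"print(best_k_manual, toy_grid.best_params_)\n",
"```\n",
"\n",
"Besides the best parameters, the fitted grid search keeps the per-candidate scores in `cv_results_` and, by default, refits the best pipeline on all of the data passed to `fit()`."
]
},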
{
"cell_type": "code",
"execution_count": 166,
"metadata": {},
"outputs": [],
"source": [
"lr_grid_cv = GridSearchCV(pipe, param_grid=grid_params, cv=5, n_jobs=-1)"
]
},
{
"cell_type": "code",
"execution_count": 167,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"GridSearchCV(cv=5,\n",
" estimator=Pipeline(steps=[('simpleimputer',\n",
" SimpleImputer(strategy='median')),\n",
" ('standardscaler', StandardScaler()),\n",
" ('selectkbest',\n",
" SelectKBest(score_func=<function f_regression at 0x...>)),\n",
" ('linearregression',\n",
" LinearRegression())]),\n",
" n_jobs=-1,\n",
" param_grid={'selectkbest__k': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,\n",
" 12, 13, 14, 15, 16, 17, 18, 19, 20,\n",
" 21, 22, 23, 24, 25, 26, 27, 28, 29,\n",
" 30, ...]})"
]
},
"execution_count": 167,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"lr_grid_cv.fit(X_train, y_train)"
]
},
{
"cell_type": "code",
"execution_count": 168,
"metadata": {},
"outputs": [],
"source": [
"score_mean = lr_grid_cv.cv_results_['mean_test_score']\n",
"score_std = lr_grid_cv.cv_results_['std_test_score']\n",
"cv_k = [k for k in lr_grid_cv.cv_results_['param_selectkbest__k']]"
]
},
{
"cell_type": "code",
"execution_count": 169,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'selectkbest__k': 6}"
]
},
"execution_count": 169,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Print the `best_params_` attribute of `lr_grid_cv`\n",
"lr_grid_cv.best_params_"
]
},
{
"cell_type": "code",
"execution_count": 170,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"