{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 4 Pre-Processing and Training Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4.1 Contents\n", "* [4 Pre-Processing and Training Data](#4_Pre-Processing_and_Training_Data)\n", " * [4.1 Contents](#4.1_Contents)\n", " * [4.2 Introduction](#4.2_Introduction)\n", " * [4.3 Imports](#4.3_Imports)\n", " * [4.4 Load Data](#4.4_Load_Data)\n", " * [4.5 Extract Big Mountain Data](#4.5_Extract_Big_Mountain_Data)\n", " * [4.6 Train/Test Split](#4.6_Train/Test_Split)\n", " * [4.7 Initial Not-Even-A-Model](#4.7_Initial_Not-Even-A-Model)\n", " * [4.7.1 Metrics](#4.7.1_Metrics)\n", " * [4.7.1.1 R-squared, or coefficient of determination](#4.7.1.1_R-squared,_or_coefficient_of_determination)\n", " * [4.7.1.2 Mean Absolute Error](#4.7.1.2_Mean_Absolute_Error)\n", " * [4.7.1.3 Mean Squared Error](#4.7.1.3_Mean_Squared_Error)\n", " * [4.7.2 sklearn metrics](#4.7.2_sklearn_metrics)\n", " * [4.7.2.0.1 R-squared](#4.7.2.0.1_R-squared)\n", " * [4.7.2.0.2 Mean absolute error](#4.7.2.0.2_Mean_absolute_error)\n", " * [4.7.2.0.3 Mean squared error](#4.7.2.0.3_Mean_squared_error)\n", " * [4.7.3 Note On Calculating Metrics](#4.7.3_Note_On_Calculating_Metrics)\n", " * [4.8 Initial Models](#4.8_Initial_Models)\n", " * [4.8.1 Imputing missing feature (predictor) values](#4.8.1_Imputing_missing_feature_(predictor)_values)\n", " * [4.8.1.1 Impute missing values with median](#4.8.1.1_Impute_missing_values_with_median)\n", " * [4.8.1.1.1 Learn the values to impute from the train set](#4.8.1.1.1_Learn_the_values_to_impute_from_the_train_set)\n", " * [4.8.1.1.2 Apply the imputation to both train and test splits](#4.8.1.1.2_Apply_the_imputation_to_both_train_and_test_splits)\n", " * [4.8.1.1.3 Scale the data](#4.8.1.1.3_Scale_the_data)\n", " * [4.8.1.1.4 Train the model on the train split](#4.8.1.1.4_Train_the_model_on_the_train_split)\n", " * [4.8.1.1.5 Make predictions using the model on both train and test 
splits](#4.8.1.1.5_Make_predictions_using_the_model_on_both_train_and_test_splits)\n", " * [4.8.1.1.6 Assess model performance](#4.8.1.1.6_Assess_model_performance)\n", " * [4.8.1.2 Impute missing values with the mean](#4.8.1.2_Impute_missing_values_with_the_mean)\n", " * [4.8.1.2.1 Learn the values to impute from the train set](#4.8.1.2.1_Learn_the_values_to_impute_from_the_train_set)\n", " * [4.8.1.2.2 Apply the imputation to both train and test splits](#4.8.1.2.2_Apply_the_imputation_to_both_train_and_test_splits)\n", " * [4.8.1.2.3 Scale the data](#4.8.1.2.3_Scale_the_data)\n", " * [4.8.1.2.4 Train the model on the train split](#4.8.1.2.4_Train_the_model_on_the_train_split)\n", " * [4.8.1.2.5 Make predictions using the model on both train and test splits](#4.8.1.2.5_Make_predictions_using_the_model_on_both_train_and_test_splits)\n", " * [4.8.1.2.6 Assess model performance](#4.8.1.2.6_Assess_model_performance)\n", " * [4.8.2 Pipelines](#4.8.2_Pipelines)\n", " * [4.8.2.1 Define the pipeline](#4.8.2.1_Define_the_pipeline)\n", " * [4.8.2.2 Fit the pipeline](#4.8.2.2_Fit_the_pipeline)\n", " * [4.8.2.3 Make predictions on the train and test sets](#4.8.2.3_Make_predictions_on_the_train_and_test_sets)\n", " * [4.8.2.4 Assess performance](#4.8.2.4_Assess_performance)\n", " * [4.9 Refining The Linear Model](#4.9_Refining_The_Linear_Model)\n", " * [4.9.1 Define the pipeline](#4.9.1_Define_the_pipeline)\n", " * [4.9.2 Fit the pipeline](#4.9.2_Fit_the_pipeline)\n", " * [4.9.3 Assess performance on the train and test set](#4.9.3_Assess_performance_on_the_train_and_test_set)\n", " * [4.9.4 Define a new pipeline to select a different number of features](#4.9.4_Define_a_new_pipeline_to_select_a_different_number_of_features)\n", " * [4.9.5 Fit the pipeline](#4.9.5_Fit_the_pipeline)\n", " * [4.9.6 Assess performance on train and test data](#4.9.6_Assess_performance_on_train_and_test_data)\n", " * [4.9.7 Assessing performance using 
cross-validation](#4.9.7_Assessing_performance_using_cross-validation)\n", " * [4.9.8 Hyperparameter search using GridSearchCV](#4.9.8_Hyperparameter_search_using_GridSearchCV)\n", " * [4.10 Random Forest Model](#4.10_Random_Forest_Model)\n", " * [4.10.1 Define the pipeline](#4.10.1_Define_the_pipeline)\n", " * [4.10.2 Fit and assess performance using cross-validation](#4.10.2_Fit_and_assess_performance_using_cross-validation)\n", " * [4.10.3 Hyperparameter search using GridSearchCV](#4.10.3_Hyperparameter_search_using_GridSearchCV)\n", " * [4.11 Final Model Selection](#4.11_Final_Model_Selection)\n", " * [4.11.1 Linear regression model performance](#4.11.1_Linear_regression_model_performance)\n", " * [4.11.2 Random forest regression model performance](#4.11.2_Random_forest_regression_model_performance)\n", " * [4.11.3 Conclusion](#4.11.3_Conclusion)\n", " * [4.12 Data quantity assessment](#4.12_Data_quantity_assessment)\n", " * [4.13 Save best model object from pipeline](#4.13_Save_best_model_object_from_pipeline)\n", " * [4.14 Summary](#4.14_Summary)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4.2 Introduction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In preceding notebooks, we performed preliminary assessments of data quality and refined the question to be answered. We found a small number of observations that clearly indicated whether to replace values or drop a whole row. We determined that predicting the adult weekend ticket price was our primary aim. We threw away records with missing price data, but not before making the most of the other available data to look for any patterns among the states. We didn't see any and decided to treat all states equally; the state label didn't seem to be particularly useful.\n", "\n", "In this notebook, we'll start to build machine learning models. Before diving into a machine learning model, however, we'll start by considering how useful the mean value is as a predictor. 
We never want to go to stakeholders with a machine learning model only to have the CEO point out that it performs worse than just guessing the average! Our first model is always a baseline performance comparator for any subsequent model. Next, we'll build up the process of efficiently creating robust models to compare to our baseline forecast. We can validate steps with our own functions for checking expected equivalences between, say, pandas and sklearn implementations." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4.3 Imports" ] }, { "cell_type": "code", "execution_count": 88, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import os\n", "import pickle\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "from sklearn import __version__ as sklearn_version\n", "from sklearn.decomposition import PCA\n", "from sklearn.preprocessing import scale\n", "from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV, learning_curve\n", "from sklearn.preprocessing import StandardScaler, MinMaxScaler\n", "from sklearn.dummy import DummyRegressor\n", "from sklearn.linear_model import LinearRegression\n", "from sklearn.ensemble import RandomForestRegressor\n", "from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error\n", "from sklearn.pipeline import make_pipeline\n", "from sklearn.impute import SimpleImputer\n", "from sklearn.feature_selection import SelectKBest, f_regression\n", "import datetime\n", "\n", "from library.sb_utils import save_file" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4.4 Load Data" ] }, { "cell_type": "code", "execution_count": 89, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
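<table>\n", "<thead><tr><th></th><th>0</th><th>1</th><th>2</th><th>3</th><th>4</th></tr></thead>\n", "<tbody>\n",
"<tr><th>Name</th><td>Alyeska Resort</td><td>Eaglecrest Ski Area</td><td>Hilltop Ski Area</td><td>Arizona Snowbowl</td><td>Sunrise Park Resort</td></tr>\n",
"<tr><th>Region</th><td>Alaska</td><td>Alaska</td><td>Alaska</td><td>Arizona</td><td>Arizona</td></tr>\n",
"<tr><th>state</th><td>Alaska</td><td>Alaska</td><td>Alaska</td><td>Arizona</td><td>Arizona</td></tr>\n",
"<tr><th>summit_elev</th><td>3939</td><td>2600</td><td>2090</td><td>11500</td><td>11100</td></tr>\n",
"<tr><th>vertical_drop</th><td>2500</td><td>1540</td><td>294</td><td>2300</td><td>1800</td></tr>\n",
"<tr><th>base_elev</th><td>250</td><td>1200</td><td>1796</td><td>9200</td><td>9200</td></tr>\n",
"<tr><th>trams</th><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
"<tr><th>fastSixes</th><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td></tr>\n",
"<tr><th>fastQuads</th><td>2</td><td>0</td><td>0</td><td>0</td><td>1</td></tr>\n",
"<tr><th>quad</th><td>2</td><td>0</td><td>0</td><td>2</td><td>2</td></tr>\n",
"<tr><th>triple</th><td>0</td><td>0</td><td>1</td><td>2</td><td>3</td></tr>\n",
"<tr><th>double</th><td>0</td><td>4</td><td>0</td><td>1</td><td>1</td></tr>\n",
"<tr><th>surface</th><td>2</td><td>0</td><td>2</td><td>2</td><td>0</td></tr>\n",
"<tr><th>total_chairs</th><td>7</td><td>4</td><td>3</td><td>8</td><td>7</td></tr>\n",
"<tr><th>Runs</th><td>76.0</td><td>36.0</td><td>13.0</td><td>55.0</td><td>65.0</td></tr>\n",
"<tr><th>TerrainParks</th><td>2.0</td><td>1.0</td><td>1.0</td><td>4.0</td><td>2.0</td></tr>\n",
"<tr><th>LongestRun_mi</th><td>1.0</td><td>2.0</td><td>1.0</td><td>2.0</td><td>1.2</td></tr>\n",
"<tr><th>SkiableTerrain_ac</th><td>1610.0</td><td>640.0</td><td>30.0</td><td>777.0</td><td>800.0</td></tr>\n",
"<tr><th>Snow Making_ac</th><td>113.0</td><td>60.0</td><td>30.0</td><td>104.0</td><td>80.0</td></tr>\n",
"<tr><th>daysOpenLastYear</th><td>150.0</td><td>45.0</td><td>150.0</td><td>122.0</td><td>115.0</td></tr>\n",
"<tr><th>yearsOpen</th><td>60.0</td><td>44.0</td><td>36.0</td><td>81.0</td><td>49.0</td></tr>\n",
"<tr><th>averageSnowfall</th><td>669.0</td><td>350.0</td><td>69.0</td><td>260.0</td><td>250.0</td></tr>\n",
"<tr><th>AdultWeekend</th><td>85.0</td><td>53.0</td><td>34.0</td><td>89.0</td><td>78.0</td></tr>\n",
"<tr><th>projectedDaysOpen</th><td>150.0</td><td>90.0</td><td>152.0</td><td>122.0</td><td>104.0</td></tr>\n",
"<tr><th>NightSkiing_ac</th><td>550.0</td><td>NaN</td><td>30.0</td><td>NaN</td><td>80.0</td></tr>\n",
"<tr><th>resorts_per_state</th><td>3</td><td>3</td><td>3</td><td>2</td><td>2</td></tr>\n",
"<tr><th>resorts_per_100kcapita</th><td>0.410091</td><td>0.410091</td><td>0.410091</td><td>0.027477</td><td>0.027477</td></tr>\n",
"<tr><th>resorts_per_100ksq_mile</th><td>0.450867</td><td>0.450867</td><td>0.450867</td><td>1.75454</td><td>1.75454</td></tr>\n",
"<tr><th>resort_skiable_area_ac_state_ratio</th><td>0.70614</td><td>0.280702</td><td>0.013158</td><td>0.492708</td><td>0.507292</td></tr>\n",
"<tr><th>resort_days_open_state_ratio</th><td>0.434783</td><td>0.130435</td><td>0.434783</td><td>0.514768</td><td>0.485232</td></tr>\n",
"<tr><th>resort_terrain_park_state_ratio</th><td>0.5</td><td>0.25</td><td>0.25</td><td>0.666667</td><td>0.333333</td></tr>\n",
"<tr><th>resort_night_skiing_state_ratio</th><td>0.948276</td><td>NaN</td><td>0.051724</td><td>NaN</td><td>1.0</td></tr>\n",
"<tr><th>total_chairs_runs_ratio</th><td>0.092105</td><td>0.111111</td><td>0.230769</td><td>0.145455</td><td>0.107692</td></tr>\n",
"<tr><th>total_chairs_skiable_ratio</th><td>0.004348</td><td>0.00625</td><td>0.1</td><td>0.010296</td><td>0.00875</td></tr>\n",
"<tr><th>fastQuads_runs_ratio</th><td>0.026316</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.015385</td></tr>\n",
"<tr><th>fastQuads_skiable_ratio</th><td>0.001242</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.00125</td></tr>\n",
"</tbody>\n", "</table>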
\n", "
" ], "text/plain": [ " 0 1 \\\n", "Name Alyeska Resort Eaglecrest Ski Area \n", "Region Alaska Alaska \n", "state Alaska Alaska \n", "summit_elev 3939 2600 \n", "vertical_drop 2500 1540 \n", "base_elev 250 1200 \n", "trams 1 0 \n", "fastSixes 0 0 \n", "fastQuads 2 0 \n", "quad 2 0 \n", "triple 0 0 \n", "double 0 4 \n", "surface 2 0 \n", "total_chairs 7 4 \n", "Runs 76.0 36.0 \n", "TerrainParks 2.0 1.0 \n", "LongestRun_mi 1.0 2.0 \n", "SkiableTerrain_ac 1610.0 640.0 \n", "Snow Making_ac 113.0 60.0 \n", "daysOpenLastYear 150.0 45.0 \n", "yearsOpen 60.0 44.0 \n", "averageSnowfall 669.0 350.0 \n", "AdultWeekend 85.0 53.0 \n", "projectedDaysOpen 150.0 90.0 \n", "NightSkiing_ac 550.0 NaN \n", "resorts_per_state 3 3 \n", "resorts_per_100kcapita 0.410091 0.410091 \n", "resorts_per_100ksq_mile 0.450867 0.450867 \n", "resort_skiable_area_ac_state_ratio 0.70614 0.280702 \n", "resort_days_open_state_ratio 0.434783 0.130435 \n", "resort_terrain_park_state_ratio 0.5 0.25 \n", "resort_night_skiing_state_ratio 0.948276 NaN \n", "total_chairs_runs_ratio 0.092105 0.111111 \n", "total_chairs_skiable_ratio 0.004348 0.00625 \n", "fastQuads_runs_ratio 0.026316 0.0 \n", "fastQuads_skiable_ratio 0.001242 0.0 \n", "\n", " 2 3 \\\n", "Name Hilltop Ski Area Arizona Snowbowl \n", "Region Alaska Arizona \n", "state Alaska Arizona \n", "summit_elev 2090 11500 \n", "vertical_drop 294 2300 \n", "base_elev 1796 9200 \n", "trams 0 0 \n", "fastSixes 0 1 \n", "fastQuads 0 0 \n", "quad 0 2 \n", "triple 1 2 \n", "double 0 1 \n", "surface 2 2 \n", "total_chairs 3 8 \n", "Runs 13.0 55.0 \n", "TerrainParks 1.0 4.0 \n", "LongestRun_mi 1.0 2.0 \n", "SkiableTerrain_ac 30.0 777.0 \n", "Snow Making_ac 30.0 104.0 \n", "daysOpenLastYear 150.0 122.0 \n", "yearsOpen 36.0 81.0 \n", "averageSnowfall 69.0 260.0 \n", "AdultWeekend 34.0 89.0 \n", "projectedDaysOpen 152.0 122.0 \n", "NightSkiing_ac 30.0 NaN \n", "resorts_per_state 3 2 \n", "resorts_per_100kcapita 0.410091 0.027477 \n", "resorts_per_100ksq_mile 0.450867 
1.75454 \n", "resort_skiable_area_ac_state_ratio 0.013158 0.492708 \n", "resort_days_open_state_ratio 0.434783 0.514768 \n", "resort_terrain_park_state_ratio 0.25 0.666667 \n", "resort_night_skiing_state_ratio 0.051724 NaN \n", "total_chairs_runs_ratio 0.230769 0.145455 \n", "total_chairs_skiable_ratio 0.1 0.010296 \n", "fastQuads_runs_ratio 0.0 0.0 \n", "fastQuads_skiable_ratio 0.0 0.0 \n", "\n", " 4 \n", "Name Sunrise Park Resort \n", "Region Arizona \n", "state Arizona \n", "summit_elev 11100 \n", "vertical_drop 1800 \n", "base_elev 9200 \n", "trams 0 \n", "fastSixes 0 \n", "fastQuads 1 \n", "quad 2 \n", "triple 3 \n", "double 1 \n", "surface 0 \n", "total_chairs 7 \n", "Runs 65.0 \n", "TerrainParks 2.0 \n", "LongestRun_mi 1.2 \n", "SkiableTerrain_ac 800.0 \n", "Snow Making_ac 80.0 \n", "daysOpenLastYear 115.0 \n", "yearsOpen 49.0 \n", "averageSnowfall 250.0 \n", "AdultWeekend 78.0 \n", "projectedDaysOpen 104.0 \n", "NightSkiing_ac 80.0 \n", "resorts_per_state 2 \n", "resorts_per_100kcapita 0.027477 \n", "resorts_per_100ksq_mile 1.75454 \n", "resort_skiable_area_ac_state_ratio 0.507292 \n", "resort_days_open_state_ratio 0.485232 \n", "resort_terrain_park_state_ratio 0.333333 \n", "resort_night_skiing_state_ratio 1.0 \n", "total_chairs_runs_ratio 0.107692 \n", "total_chairs_skiable_ratio 0.00875 \n", "fastQuads_runs_ratio 0.015385 \n", "fastQuads_skiable_ratio 0.00125 " ] }, "execution_count": 89, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ski_data = pd.read_csv('../data/ski_data_step3_features.csv')\n", "ski_data.head().T" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4.5 Extract Big Mountain Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Big Mountain is our resort. Separate it from the rest of the data to use later." 
] }, { "cell_type": "code", "execution_count": 90, "metadata": {}, "outputs": [], "source": [ "big_mountain = ski_data[ski_data.Name == 'Big Mountain Resort']" ] }, { "cell_type": "code", "execution_count": 91, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
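<table>\n", "<thead><tr><th></th><th>124</th></tr></thead>\n", "<tbody>\n",
"<tr><th>Name</th><td>Big Mountain Resort</td></tr>\n",
"<tr><th>Region</th><td>Montana</td></tr>\n",
"<tr><th>state</th><td>Montana</td></tr>\n",
"<tr><th>summit_elev</th><td>6817</td></tr>\n",
"<tr><th>vertical_drop</th><td>2353</td></tr>\n",
"<tr><th>base_elev</th><td>4464</td></tr>\n",
"<tr><th>trams</th><td>0</td></tr>\n",
"<tr><th>fastSixes</th><td>0</td></tr>\n",
"<tr><th>fastQuads</th><td>3</td></tr>\n",
"<tr><th>quad</th><td>2</td></tr>\n",
"<tr><th>triple</th><td>6</td></tr>\n",
"<tr><th>double</th><td>0</td></tr>\n",
"<tr><th>surface</th><td>3</td></tr>\n",
"<tr><th>total_chairs</th><td>14</td></tr>\n",
"<tr><th>Runs</th><td>105.0</td></tr>\n",
"<tr><th>TerrainParks</th><td>4.0</td></tr>\n",
"<tr><th>LongestRun_mi</th><td>3.3</td></tr>\n",
"<tr><th>SkiableTerrain_ac</th><td>3000.0</td></tr>\n",
"<tr><th>Snow Making_ac</th><td>600.0</td></tr>\n",
"<tr><th>daysOpenLastYear</th><td>123.0</td></tr>\n",
"<tr><th>yearsOpen</th><td>72.0</td></tr>\n",
"<tr><th>averageSnowfall</th><td>333.0</td></tr>\n",
"<tr><th>AdultWeekend</th><td>81.0</td></tr>\n",
"<tr><th>projectedDaysOpen</th><td>123.0</td></tr>\n",
"<tr><th>NightSkiing_ac</th><td>600.0</td></tr>\n",
"<tr><th>resorts_per_state</th><td>12</td></tr>\n",
"<tr><th>resorts_per_100kcapita</th><td>1.122778</td></tr>\n",
"<tr><th>resorts_per_100ksq_mile</th><td>8.161045</td></tr>\n",
"<tr><th>resort_skiable_area_ac_state_ratio</th><td>0.140121</td></tr>\n",
"<tr><th>resort_days_open_state_ratio</th><td>0.129338</td></tr>\n",
"<tr><th>resort_terrain_park_state_ratio</th><td>0.148148</td></tr>\n",
"<tr><th>resort_night_skiing_state_ratio</th><td>0.84507</td></tr>\n",
"<tr><th>total_chairs_runs_ratio</th><td>0.133333</td></tr>\n",
"<tr><th>total_chairs_skiable_ratio</th><td>0.004667</td></tr>\n",
"<tr><th>fastQuads_runs_ratio</th><td>0.028571</td></tr>\n",
"<tr><th>fastQuads_skiable_ratio</th><td>0.001</td></tr>\n",
"</tbody>\n", "</table>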
\n", "
" ], "text/plain": [ " 124\n", "Name Big Mountain Resort\n", "Region Montana\n", "state Montana\n", "summit_elev 6817\n", "vertical_drop 2353\n", "base_elev 4464\n", "trams 0\n", "fastSixes 0\n", "fastQuads 3\n", "quad 2\n", "triple 6\n", "double 0\n", "surface 3\n", "total_chairs 14\n", "Runs 105.0\n", "TerrainParks 4.0\n", "LongestRun_mi 3.3\n", "SkiableTerrain_ac 3000.0\n", "Snow Making_ac 600.0\n", "daysOpenLastYear 123.0\n", "yearsOpen 72.0\n", "averageSnowfall 333.0\n", "AdultWeekend 81.0\n", "projectedDaysOpen 123.0\n", "NightSkiing_ac 600.0\n", "resorts_per_state 12\n", "resorts_per_100kcapita 1.122778\n", "resorts_per_100ksq_mile 8.161045\n", "resort_skiable_area_ac_state_ratio 0.140121\n", "resort_days_open_state_ratio 0.129338\n", "resort_terrain_park_state_ratio 0.148148\n", "resort_night_skiing_state_ratio 0.84507\n", "total_chairs_runs_ratio 0.133333\n", "total_chairs_skiable_ratio 0.004667\n", "fastQuads_runs_ratio 0.028571\n", "fastQuads_skiable_ratio 0.001" ] }, "execution_count": 91, "metadata": {}, "output_type": "execute_result" } ], "source": [ "big_mountain.T" ] }, { "cell_type": "code", "execution_count": 92, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(278, 36)" ] }, "execution_count": 92, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ski_data.shape" ] }, { "cell_type": "code", "execution_count": 93, "metadata": {}, "outputs": [], "source": [ "ski_data = ski_data[ski_data.Name != 'Big Mountain Resort']" ] }, { "cell_type": "code", "execution_count": 94, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(277, 36)" ] }, "execution_count": 94, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ski_data.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4.6 Train/Test Split" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So far, we've treated ski resort data as a single entity. 
In machine learning, when we train our model on all of our data, we end up with no data set aside to evaluate model performance. We could keep making more and more complex models that fit the data better and better and not realise we are overfitting the model. By partitioning the data into training and testing splits, without letting a model (or missing-value imputation) learn anything about the test split, we have a somewhat independent assessment of how our model might perform in the future. An often overlooked subtlety here is that people all too frequently use the test set to assess model performance _and then compare multiple models to pick the best_. This means their overall model selection process is flawed: the test set has helped to pick the model, so it no longer gives an independent estimate of future performance. Instead, we use held-out data and/or k-fold cross-validation to simulate additional test sets and compare models. The formal test set is then very useful as a final check on expected future performance." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What partition sizes would we have with a 70/30 train/test split?" 
] }, { "cell_type": "code", "execution_count": 95, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(193.89999999999998, 83.1)" ] }, "execution_count": 95, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(ski_data) * .7, len(ski_data) * .3" ] }, { "cell_type": "code", "execution_count": 96, "metadata": {}, "outputs": [], "source": [ "# Split the features (X) and target (y) into train and test sets; fix random_state for reproducible results\n", "X_train, X_test, y_train, y_test = train_test_split(\n", "    ski_data.drop(columns='AdultWeekend'),\n", "    ski_data.AdultWeekend,\n", "    test_size=0.3,\n", "    random_state=47,\n", ")" ] }, { "cell_type": "code", "execution_count": 97, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((193, 35), (84, 35))" ] }, "execution_count": 97, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Check the shapes of the X train and test sets\n", "X_train.shape, X_test.shape" ] }, { "cell_type": "code", "execution_count": 98, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((193,), (84,))" ] }, "execution_count": 98, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Check the shapes of the y train and test sets\n", "y_train.shape, y_test.shape" ] }, { "cell_type": "code", "execution_count": 99, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((193, 32), (84, 32))" ] }, "execution_count": 99, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Set aside the non-numeric identifier columns, then drop them from the features\n", "names_list = ['Name', 'state', 'Region']\n", "names_train = X_train[names_list]\n", "names_test = X_test[names_list]\n", "X_train = X_train.drop(columns=names_list)\n", "X_test = X_test.drop(columns=names_list)\n", "X_train.shape, X_test.shape" ] }, { "cell_type": "code", "execution_count": 100, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "summit_elev int64\n", "vertical_drop int64\n", "base_elev int64\n", "trams int64\n", "fastSixes int64\n", "fastQuads int64\n", "quad int64\n", "triple int64\n", "double 
int64\n", "surface int64\n", "total_chairs int64\n", "Runs float64\n", "TerrainParks float64\n", "LongestRun_mi float64\n", "SkiableTerrain_ac float64\n", "Snow Making_ac float64\n", "daysOpenLastYear float64\n", "yearsOpen float64\n", "averageSnowfall float64\n", "projectedDaysOpen float64\n", "NightSkiing_ac float64\n", "resorts_per_state int64\n", "resorts_per_100kcapita float64\n", "resorts_per_100ksq_mile float64\n", "resort_skiable_area_ac_state_ratio float64\n", "resort_days_open_state_ratio float64\n", "resort_terrain_park_state_ratio float64\n", "resort_night_skiing_state_ratio float64\n", "total_chairs_runs_ratio float64\n", "total_chairs_skiable_ratio float64\n", "fastQuads_runs_ratio float64\n", "fastQuads_skiable_ratio float64\n", "dtype: object" ] }, "execution_count": 100, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Check the `dtypes` attribute of `X_train` to verify all features are numeric\n", "X_train.dtypes" ] }, { "cell_type": "code", "execution_count": 101, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "summit_elev int64\n", "vertical_drop int64\n", "base_elev int64\n", "trams int64\n", "fastSixes int64\n", "fastQuads int64\n", "quad int64\n", "triple int64\n", "double int64\n", "surface int64\n", "total_chairs int64\n", "Runs float64\n", "TerrainParks float64\n", "LongestRun_mi float64\n", "SkiableTerrain_ac float64\n", "Snow Making_ac float64\n", "daysOpenLastYear float64\n", "yearsOpen float64\n", "averageSnowfall float64\n", "projectedDaysOpen float64\n", "NightSkiing_ac float64\n", "resorts_per_state int64\n", "resorts_per_100kcapita float64\n", "resorts_per_100ksq_mile float64\n", "resort_skiable_area_ac_state_ratio float64\n", "resort_days_open_state_ratio float64\n", "resort_terrain_park_state_ratio float64\n", "resort_night_skiing_state_ratio float64\n", "total_chairs_runs_ratio float64\n", "total_chairs_skiable_ratio float64\n", "fastQuads_runs_ratio float64\n", "fastQuads_skiable_ratio float64\n", 
"dtype: object" ] }, "execution_count": 101, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Repeat this check for the test split in `X_test`\n", "X_test.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have only numeric features in X now!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4.7 Initial Not-Even-A-Model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll begin by determining how good the mean is as a predictor. In other words, what if we simply say our best guess is the average price?" ] }, { "cell_type": "code", "execution_count": 102, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "63.84730569948186" ] }, "execution_count": 102, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Calculate the mean of `y_train`\n", "train_mean = y_train.mean()\n", "train_mean" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`sklearn`'s `DummyRegressor` easily does this:" ] }, { "cell_type": "code", "execution_count": 103, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[63.8473057]])" ] }, "execution_count": 103, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Sanity check: the fitted constant should equal the training mean\n", "dumb_reg = DummyRegressor(strategy='mean')\n", "dumb_reg.fit(X_train, y_train)\n", "dumb_reg.constant_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Having established the grand mean, we need to determine how closely it matches, or explains, the actual values. There are many ways of assessing how well one set of values agrees with another, which brings us to the subject of metrics." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.7.1 Metrics" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 4.7.1.1 R-squared, or coefficient of determination" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One measure is $R^2$, the [coefficient of determination](https://en.wikipedia.org/wiki/Coefficient_of_determination). This is a measure of the proportion of variance in the dependent variable (our ticket price) that is predicted by our \"model\". The linked Wikipedia article gives a nice explanation of how negative values can arise. This is frequently a cause of confusion for newcomers who, reasonably, ask how a squared value can be negative.\n", "\n", "Recall the mean can be denoted by $\\bar{y}$, where\n", "\n", "$$\\bar{y} = \\frac{1}{n}\\sum_{i=1}^ny_i$$\n", "\n", "and where $y_i$ are the individual values of the dependent variable.\n", "\n", "The total sum of squares (error) can be expressed as\n", "\n", "$$SS_{tot} = \\sum_i(y_i-\\bar{y})^2$$\n", "\n", "The above formula should be familiar as it's simply the variance without the denominator to scale (divide) by the sample size.\n", "\n", "The residual sum of squares is similarly defined to be\n", "\n", "$$SS_{res} = \\sum_i(y_i-\\hat{y}_i)^2$$\n", "\n", "where $\\hat{y}_i$ are our predicted values for the dependent variable.\n", "\n", "The coefficient of determination, $R^2$, here is given by\n", "\n", "$$R^2 = 1 - \\frac{SS_{res}}{SS_{tot}}$$\n", "\n", "Putting it into words, it's one minus the ratio of the residual variance to the original variance. Thus, the baseline model here, which always predicts $\\bar{y}$, should give $R^2=0$. A model that perfectly predicts the observed values would have no residual error and so give $R^2=1$. Models that do worse than predicting the mean will have increased the sum of squares of residuals and so produce a negative $R^2$." 
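, "\n", "\n", "As a quick check of the baseline claim, here is a short worked derivation. It is a sketch that assumes a constant prediction $c$ for every observation, with $\\bar{y}$ the mean of whichever data are being scored (the symbol $c$ is introduced here for illustration only). When $c = \\bar{y}$, the residual and total sums of squares coincide:\n", "\n", "$$SS_{res} = \\sum_i(y_i-\\bar{y})^2 = SS_{tot} \\quad\\Rightarrow\\quad R^2 = 1 - \\frac{SS_{tot}}{SS_{tot}} = 0$$\n", "\n", "For any other constant, expanding the square (the cross term vanishes because $\\sum_i(y_i-\\bar{y}) = 0$) gives\n", "\n", "$$SS_{res} = \\sum_i(y_i-c)^2 = SS_{tot} + n(c-\\bar{y})^2 \\geq SS_{tot} \\quad\\Rightarrow\\quad R^2 = -\\frac{n(c-\\bar{y})^2}{SS_{tot}} \\leq 0$$\n", "\n", "So a constant prediction can score at best zero, and scores below zero whenever it differs from the mean of the data it is scored on, which is how a \"squared\" quantity ends up negative."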
] }, { "cell_type": "code", "execution_count": 104, "metadata": {}, "outputs": [], "source": [ "#Calculate the R^2 as defined above\n", "def r_squared(y, ypred):\n", " \"\"\"R-squared score.\n", " \n", " Calculate the R-squared, or coefficient of determination, of the input.\n", " \n", " Arguments:\n", " y -- the observed values\n", " ypred -- the predicted values\n", " \"\"\"\n", " ybar = np.mean(y)\n", " sum_sq_tot = np.sum((y - ybar)**2)\n", " sum_sq_res = np.sum((y - ypred)**2)\n", " R2 = 1.0 - sum_sq_res / sum_sq_tot\n", " return R2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We make our predictions by creating an array of length the size of the training set with the single value of the mean." ] }, { "cell_type": "code", "execution_count": 105, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([63.8473057, 63.8473057, 63.8473057, 63.8473057, 63.8473057])" ] }, "execution_count": 105, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_tr_pred_ = train_mean * np.ones(len(y_train))\n", "y_tr_pred_[:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Remember the `sklearn` dummy regressor? " ] }, { "cell_type": "code", "execution_count": 106, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([63.8473057, 63.8473057, 63.8473057, 63.8473057, 63.8473057])" ] }, "execution_count": 106, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_tr_pred = dumb_reg.predict(X_train)\n", "y_tr_pred[:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that `DummyRegressor` produces exactly the same results and saves us from having to broadcast the mean (or whichever other statistic we used - check out the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyRegressor.html) to see what's available) to an array of the appropriate length. 
It also gives us an object with `fit()` and `predict()` methods, so we can use it as conveniently as any other `sklearn` estimator." ] }, { "cell_type": "code", "execution_count": 107, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.0" ] }, "execution_count": 107, "metadata": {}, "output_type": "execute_result" } ], "source": [ "r_squared(y_train, y_tr_pred)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Exactly as expected, if we use the average value as our prediction, we get an $R^2$ of zero _on our training set_. What if we use this \"model\" to predict unseen values from the test set? Remember, of course, that our \"model\" is trained on the training set; we still use the training set mean as our prediction." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Make predictions by creating an array the length of the test set, filled with the single value of the (training) mean." ] }, { "cell_type": "code", "execution_count": 108, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "-0.0015364646867073173" ] }, "execution_count": 108, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_te_pred = train_mean * np.ones(len(y_test))\n", "r_squared(y_test, y_te_pred)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Generally, you can expect performance on a test set to be slightly worse than on the training set. As you are getting an $R^2$ of zero on the training set, there's nowhere to go but negative!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Although $R^2$ is a common metric, and interpretable in terms of the amount of variance explained, it's less appealing if we want an idea of how \"close\" our predictions are to the true values. Two metrics that summarise the difference between predicted and actual values are _mean absolute error_ and _mean squared error_." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 4.7.1.2 Mean Absolute Error" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is very simply the average of the absolute errors:\n", "\n", "$$MAE = \\frac{1}{n}\\sum_{i=1}^n|y_i - \\hat{y}_i|$$" ] }, { "cell_type": "code", "execution_count": 109, "metadata": {}, "outputs": [], "source": [ "def mae(y, ypred):\n", "    \"\"\"Mean absolute error.\n", "\n", "    Calculate the mean absolute error of the arguments.\n", "\n", "    Arguments:\n", "    y -- the observed values\n", "    ypred -- the predicted values\n", "    \"\"\"\n", "    abs_error = np.abs(y - ypred)\n", "    return np.mean(abs_error)" ] }, { "cell_type": "code", "execution_count": 110, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "18.149503610835193" ] }, "execution_count": 110, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mae(y_train, y_tr_pred)" ] }, { "cell_type": "code", "execution_count": 111, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "18.672179249938317" ] }, "execution_count": 111, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mae(y_test, y_te_pred)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Mean absolute error is arguably the most intuitive of all the metrics. It essentially tells you that, on average, you might expect to be off by around \\\\$19 if you guessed ticket price based on an average of known values." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 4.7.1.3 Mean Squared Error" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another common metric (and an important one internally for optimizing machine learning models) is the mean squared error. 
This is simply the average of the square of the errors:\n", "\n", "$$MSE = \\frac{1}{n}\\sum_i^n(y_i - \\hat{y}_i)^2$$" ] }, { "cell_type": "code", "execution_count": 112, "metadata": { "scrolled": true }, "outputs": [], "source": [ "# Calculate the MSE as defined above\n", "def mse(y, ypred):\n", " \"\"\"Mean square error.\n", " \n", " Calculate the mean square error of the arguments\n", "\n", " Arguments:\n", " y -- the observed values\n", " ypred -- the predicted values\n", " \"\"\"\n", " sq_error = (y - ypred)**2\n", " mse = np.mean(sq_error)\n", " return mse" ] }, { "cell_type": "code", "execution_count": 113, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "616.9493046578431" ] }, "execution_count": 113, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mse(y_train, y_tr_pred)" ] }, { "cell_type": "code", "execution_count": 114, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "574.1671108060107" ] }, "execution_count": 114, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mse(y_test, y_te_pred)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So here, we get a slightly better MSE on the test set than we did on the train set. And what does a squared error mean anyway? To convert this back to our measurement space, we often take the square root, to form the _root mean square error_ thus:" ] }, { "cell_type": "code", "execution_count": 115, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([24.83846422, 23.96178438])" ] }, "execution_count": 115, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.sqrt([mse(y_train, y_tr_pred), mse(y_test, y_te_pred)])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.7.2 sklearn metrics" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Functions are good, but we don't want to have to define functions every time we want to assess performance. 
`sklearn.metrics` provides many commonly used metrics, including the ones above." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### 4.7.2.0.1 R-squared" ] }, { "cell_type": "code", "execution_count": 116, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(0.0, -0.0015364646867073173)" ] }, "execution_count": 116, "metadata": {}, "output_type": "execute_result" } ], "source": [ "r2_score(y_train, y_tr_pred), r2_score(y_test, y_te_pred)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### 4.7.2.0.2 Mean absolute error" ] }, { "cell_type": "code", "execution_count": 117, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(18.149503610835193, 18.672179249938317)" ] }, "execution_count": 117, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mean_absolute_error(y_train, y_tr_pred), mean_absolute_error(y_test, y_te_pred)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### 4.7.2.0.3 Mean squared error" ] }, { "cell_type": "code", "execution_count": 118, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(616.9493046578431, 574.1671108060107)" ] }, "execution_count": 118, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mean_squared_error(y_train, y_tr_pred), mean_squared_error(y_test, y_te_pred)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.7.3 Note On Calculating Metrics" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In a Jupyter code cell, running `r2_score?` will bring up the docstring for the function, and `r2_score??` will bring up the actual code of the function! Here we try it and compare the source for `sklearn`'s function with ours."
] }, { "cell_type": "code", "execution_count": 119, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(0.0, -3.054984985780873e+30)" ] }, "execution_count": 119, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# train set - sklearn\n", "# correct order, incorrect order\n", "r2_score(y_train, y_tr_pred), r2_score(y_tr_pred, y_train)" ] }, { "cell_type": "code", "execution_count": 120, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(-0.0015364646867073173, -2.8431378228302645e+30)" ] }, "execution_count": 120, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# test set - sklearn\n", "# correct order, incorrect order\n", "r2_score(y_test, y_te_pred), r2_score(y_te_pred, y_test)" ] }, { "cell_type": "code", "execution_count": 121, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(0.0, -3.054984985780873e+30)" ] }, "execution_count": 121, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# train set - using our homebrew function\n", "# correct order, incorrect order\n", "r_squared(y_train, y_tr_pred), r_squared(y_tr_pred, y_train)" ] }, { "cell_type": "code", "execution_count": 122, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(-0.0015364646867073173, -2.8431378228302645e+30)" ] }, "execution_count": 122, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# test set - using our homebrew function\n", "# correct order, incorrect order\n", "r_squared(y_test, y_te_pred), r_squared(y_te_pred, y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can get very different results swapping the argument order. It's worth highlighting this because data scientists do this too much in the real world! Frequently the argument order doesn't matter, but it will bite when we do it with a function that does care. 
It's sloppy, bad practice, and if we don't make a habit of putting arguments in the right order, we stand to forget!\n", "\n", "Remember:\n", "* argument order matters,\n", "* check function syntax with `func?` in a code cell" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4.8 Initial Models" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.8.1 Imputing missing feature (predictor) values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Recall that when performing EDA, we imputed (filled in) some missing values in Pandas. We can impute missing values using scikit-learn, but we will learn the imputation values from the train split and apply them to the test split, so we can then assess how well our imputation worked." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 4.8.1.1 Impute missing values with median" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have missing values. Recall from our data exploration that many distributions were skewed. Our first thought might be to impute missing values using the median."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### 4.8.1.1.1 Learn the values to impute from the train set" ] }, { "cell_type": "code", "execution_count": 123, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "summit_elev 2175.000000\n", "vertical_drop 750.000000\n", "base_elev 1280.000000\n", "trams 0.000000\n", "fastSixes 0.000000\n", "fastQuads 0.000000\n", "quad 0.000000\n", "triple 1.000000\n", "double 1.000000\n", "surface 2.000000\n", "total_chairs 6.000000\n", "Runs 29.000000\n", "TerrainParks 2.000000\n", "LongestRun_mi 1.000000\n", "SkiableTerrain_ac 170.000000\n", "Snow Making_ac 96.500000\n", "daysOpenLastYear 107.000000\n", "yearsOpen 57.000000\n", "averageSnowfall 120.000000\n", "projectedDaysOpen 112.000000\n", "NightSkiing_ac 70.000000\n", "resorts_per_state 15.000000\n", "resorts_per_100kcapita 0.248243\n", "resorts_per_100ksq_mile 24.428973\n", "resort_skiable_area_ac_state_ratio 0.050000\n", "resort_days_open_state_ratio 0.070595\n", "resort_terrain_park_state_ratio 0.069444\n", "resort_night_skiing_state_ratio 0.066804\n", "total_chairs_runs_ratio 0.200000\n", "total_chairs_skiable_ratio 0.040323\n", "fastQuads_runs_ratio 0.000000\n", "fastQuads_skiable_ratio 0.000000\n", "dtype: float64" ] }, "execution_count": 123, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# These are the values we'll use to fill in any missing values\n", "X_defaults_median = X_train.median()\n", "X_defaults_median" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### 4.8.1.1.2 Apply the imputation to both train and test splits" ] }, { "cell_type": "code", "execution_count": 124, "metadata": {}, "outputs": [], "source": [ "X_tr = X_train.fillna(X_defaults_median)\n", "X_te = X_test.fillna(X_defaults_median)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### 4.8.1.1.3 Scale the data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we have features measured in many different units, 
with numbers that vary by orders of magnitude, start off by scaling them to put them all on a consistent scale. The [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) scales each feature to zero mean and unit variance." ] }, { "cell_type": "code", "execution_count": 125, "metadata": {}, "outputs": [], "source": [ "#Call the StandardScaler's `fit()` method on `X_tr` to fit the scaler\n", "#then use its `transform()` method to apply the scaling to both the train and test split\n", "#data (`X_tr` and `X_te`), naming the results `X_tr_scaled` and `X_te_scaled`, respectively\n", "scaler = StandardScaler()\n", "scaler.fit(X_tr)\n", "X_tr_scaled = scaler.transform(X_tr)\n", "X_te_scaled = scaler.transform(X_te)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### 4.8.1.1.4 Train the model on the train split" ] }, { "cell_type": "code", "execution_count": 126, "metadata": {}, "outputs": [], "source": [ "lm = LinearRegression().fit(X_tr_scaled, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### 4.8.1.1.5 Make predictions using the model on both train and test splits" ] }, { "cell_type": "code", "execution_count": 127, "metadata": {}, "outputs": [], "source": [ "#Call the `predict()` method of the model (`lm`) on both the (scaled) train and test data\n", "#Assign the predictions to `y_tr_pred` and `y_te_pred`, respectively\n", "y_tr_pred = lm.predict(X_tr_scaled)\n", "y_te_pred = lm.predict(X_te_scaled)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### 4.8.1.1.6 Assess model performance" ] }, { "cell_type": "code", "execution_count": 128, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(0.8237204449411376, 0.7251410286259974)" ] }, "execution_count": 128, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# r^2 - train, test\n", "median_r2 = r2_score(y_train, y_tr_pred), r2_score(y_test, y_te_pred)\n", "median_r2" ] }, { "cell_type": 
"markdown", "metadata": {}, "source": [ "Recall that we estimated ticket prices by simply using a known average. As expected, this produced an $R^2$ of zero for both the training and test set, because $R^2$ tells us how much of the variance we've explaining beyond that of using just the mean. Here, we see that our simple linear regression model explains over 80% of the variance on the train set and over 70% on the test set. Clearly, we are onto something, although the much lower value for the test set is indicative of overfitting. This isn't a surprise as we've made no effort to select a parsimonious set of features or deal with multicollinearity in our data." ] }, { "cell_type": "code", "execution_count": 129, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(8.495768235382354, 9.696652536263656)" ] }, "execution_count": 129, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Now calculate the mean absolute error scores using `sklearn`'s `mean_absolute_error` function as we did above for R^2\n", "# MAE - train, test\n", "median_mae = mean_absolute_error(y_train, y_tr_pred), mean_absolute_error(y_test, y_te_pred)\n", "median_mae" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using this model, then, on average we'd expect to estimate a ticket price within \\\\$9 or so of the real price. This is much, much better than the \\\\$19 from just guessing using the average. There may be something to this machine learning lark after all!" 
] }, { "cell_type": "code", "execution_count": 130, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(108.75554891895914, 157.57287631288543)" ] }, "execution_count": 130, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# And also do the same using `sklearn`'s `mean_squared_error`\n", "# MSE - train, test\n", "median_mse = mean_squared_error(y_train, y_tr_pred), mean_squared_error(y_test, y_te_pred)\n", "median_mse" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 4.8.1.2 Impute missing values with the mean" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We chose to use the median for filling missing values because of the skew of many of our predictor feature distributions. For comparison, let's try the mean." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### 4.8.1.2.1 Learn the values to impute from the train set" ] }, { "cell_type": "code", "execution_count": 131, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "summit_elev 4042.036269\n", "vertical_drop 1057.264249\n", "base_elev 2975.487047\n", "trams 0.103627\n", "fastSixes 0.093264\n", "fastQuads 0.673575\n", "quad 0.948187\n", "triple 1.414508\n", "double 1.746114\n", "surface 2.476684\n", "total_chairs 7.455959\n", "Runs 41.387435\n", "TerrainParks 2.447205\n", "LongestRun_mi 1.301579\n", "SkiableTerrain_ac 458.691099\n", "Snow Making_ac 128.935294\n", "daysOpenLastYear 109.761290\n", "yearsOpen 56.895833\n", "averageSnowfall 160.112903\n", "projectedDaysOpen 114.900621\n", "NightSkiing_ac 84.843478\n", "resorts_per_state 16.523316\n", "resorts_per_100kcapita 0.442984\n", "resorts_per_100ksq_mile 42.862331\n", "resort_skiable_area_ac_state_ratio 0.096680\n", "resort_days_open_state_ratio 0.121639\n", "resort_terrain_park_state_ratio 0.113116\n", "resort_night_skiing_state_ratio 0.150272\n", "total_chairs_runs_ratio 0.266321\n", "total_chairs_skiable_ratio 0.070053\n", "fastQuads_runs_ratio 0.010619\n", "fastQuads_skiable_ratio 0.001700\n", 
"dtype: float64" ] }, "execution_count": 131, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# As we did for the median above, calculate mean values for imputing missing values\n", "# These are the values we'll use to fill in any missing values\n", "X_defaults_mean = X_train.mean()\n", "X_defaults_mean" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By eye, we can immediately tell that our replacement values are much higher than those from using the median." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### 4.8.1.2.2 Apply the imputation to both train and test splits" ] }, { "cell_type": "code", "execution_count": 194, "metadata": {}, "outputs": [], "source": [ "X_tr = X_train.fillna(X_defaults_mean)\n", "X_te = X_test.fillna(X_defaults_mean)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### 4.8.1.2.3 Scale the data" ] }, { "cell_type": "code", "execution_count": 195, "metadata": {}, "outputs": [], "source": [ "scaler = StandardScaler()\n", "scaler.fit(X_tr)\n", "X_tr_scaled = scaler.transform(X_tr)\n", "X_te_scaled = scaler.transform(X_te)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### 4.8.1.2.4 Train the model on the train split" ] }, { "cell_type": "code", "execution_count": 196, "metadata": {}, "outputs": [], "source": [ "lm = LinearRegression().fit(X_tr_scaled, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### 4.8.1.2.5 Make predictions using the model on both train and test splits" ] }, { "cell_type": "code", "execution_count": 197, "metadata": {}, "outputs": [], "source": [ "y_tr_pred = lm.predict(X_tr_scaled)\n", "y_te_pred = lm.predict(X_te_scaled)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### 4.8.1.2.6 Assess model performance" ] }, { "cell_type": "code", "execution_count": 198, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(0.8221207605475709, 0.7290195691422242)" ] }, "execution_count": 198, "metadata": 
{}, "output_type": "execute_result" } ], "source": [ "r2_score(y_train, y_tr_pred), r2_score(y_test, y_te_pred)" ] }, { "cell_type": "code", "execution_count": 137, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(8.510780313012354, 9.565093916371973)" ] }, "execution_count": 137, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mean_absolute_error(y_train, y_tr_pred), mean_absolute_error(y_test, y_te_pred)" ] }, { "cell_type": "code", "execution_count": 138, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(109.74247309324214, 155.34936226136)" ] }, "execution_count": 138, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mean_squared_error(y_train, y_tr_pred), mean_squared_error(y_test, y_te_pred)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These results don't seem very different from the ones we got when using the median for imputing missing values. Perhaps it doesn't make much difference here. Maybe our overfitting is worse than we thought. Maybe other feature transformations, such as taking the log, would help. We could try with just a subset of features rather than using all of them as inputs.\n", "\n", "To perform the median/mean comparison, we copied and pasted a lot of code just to change the function for imputing missing values. It would make more sense to write a function that performed the sequence of steps:\n", "1. impute missing values\n", "2. scale the features\n", "3. train a model\n", "4. calculate model performance\n", "\n", "These are common steps, and `sklearn` provides something much better than writing custom functions." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.8.2 Pipelines" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One of the most important and useful components of `sklearn` is the [pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html). 
In place of Pandas's `fillna` DataFrame method, there is `sklearn`'s `SimpleImputer`. Remember the first linear model above performed the steps:\n", "\n", "1. replace missing values with the median for each feature\n", "2. scale the data to zero mean and unit variance\n", "3. train a linear regression model\n", "\n", "and all these steps were trained on the `train split` and then applied to the `test split` for assessment.\n", "\n", "The pipeline below defines exactly those same steps. Crucially, the resultant `Pipeline` object has a `fit()` method and a `predict()` method, just like the `LinearRegression()` object itself. Just as we might create a linear regression model and train it with `.fit()` and predict with `.predict()`, we can wrap the entire process of imputing and feature scaling and regression in a single object you can train with `.fit()` and predict with `.predict()`. And that's basically a pipeline: a model on steroids." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 4.8.2.1 Define the pipeline" ] }, { "cell_type": "code", "execution_count": 139, "metadata": {}, "outputs": [], "source": [ "pipe = make_pipeline(\n", " SimpleImputer(strategy='median'), \n", " StandardScaler(), \n", " LinearRegression()\n", ")" ] }, { "cell_type": "code", "execution_count": 140, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "sklearn.pipeline.Pipeline" ] }, "execution_count": 140, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(pipe)" ] }, { "cell_type": "code", "execution_count": 141, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(True, True)" ] }, "execution_count": 141, "metadata": {}, "output_type": "execute_result" } ], "source": [ "hasattr(pipe, 'fit'), hasattr(pipe, 'predict')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 4.8.2.2 Fit the pipeline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, a single call to the pipeline's `fit()` method combines the steps 
of learning the imputation (determining what values to use to fill the missing ones), the scaling (determining the mean to subtract and the standard deviation to divide by), and then training the model. It does this all in the one call with the training data as arguments." ] }, { "cell_type": "code", "execution_count": 142, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')),\n", " ('standardscaler', StandardScaler()),\n", " ('linearregression', LinearRegression())])" ] }, "execution_count": 142, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Call the pipe's `fit()` method with `X_train` and `y_train` as arguments\n", "pipe.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 4.8.2.3 Make predictions on the train and test sets" ] }, { "cell_type": "code", "execution_count": 143, "metadata": {}, "outputs": [], "source": [ "y_tr_pred = pipe.predict(X_train)\n", "y_te_pred = pipe.predict(X_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 4.8.2.4 Assess performance" ] }, { "cell_type": "code", "execution_count": 144, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(0.8237204449411376, 0.7251410286259974)" ] }, "execution_count": 144, "metadata": {}, "output_type": "execute_result" } ], "source": [ "r2_score(y_train, y_tr_pred), r2_score(y_test, y_te_pred)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And compare with our earlier (non-pipeline) result:" ] }, { "cell_type": "code", "execution_count": 145, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(0.8237204449411376, 0.7251410286259974)" ] }, "execution_count": 145, "metadata": {}, "output_type": "execute_result" } ], "source": [ "median_r2" ] }, { "cell_type": "code", "execution_count": 146, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(8.495768235382354, 9.696652536263656)" ] }, "execution_count": 146, "metadata": {}, 
"output_type": "execute_result" } ], "source": [ "mean_absolute_error(y_train, y_tr_pred), mean_absolute_error(y_test, y_te_pred)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Compare with our earlier result:" ] }, { "cell_type": "code", "execution_count": 147, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(8.495768235382354, 9.696652536263656)" ] }, "execution_count": 147, "metadata": {}, "output_type": "execute_result" } ], "source": [ "median_mae" ] }, { "cell_type": "code", "execution_count": 148, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(108.75554891895914, 157.57287631288543)" ] }, "execution_count": 148, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mean_squared_error(y_train, y_tr_pred), mean_squared_error(y_test, y_te_pred)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Compare with our earlier result:" ] }, { "cell_type": "code", "execution_count": 149, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(108.75554891895914, 157.57287631288543)" ] }, "execution_count": 149, "metadata": {}, "output_type": "execute_result" } ], "source": [ "median_mse" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These results confirm the pipeline is doing exactly what's expected, and results are identical to our earlier steps. This allows we to move faster but with confidence." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4.9 Refining The Linear Model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We suspected the model was overfitting. This is no real surprise given the number of features we blindly used. It's likely a judicious subset of features would generalize better. `sklearn` has a number of feature selection functions available. The one we'll use here is `SelectKBest` which, as we might guess, selects the k best features. 
We can read about SelectKBest \n", "[here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html#sklearn.feature_selection.SelectKBest). `f_regression` is just the [score function](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_regression.html#sklearn.feature_selection.f_regression) we're using because we're performing regression. It's important to choose an appropriate one for our machine learning task." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.9.1 Define the pipeline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Redefine our pipeline to include this feature selection step:" ] }, { "cell_type": "code", "execution_count": 150, "metadata": {}, "outputs": [], "source": [ "pipe = make_pipeline(\n", " SimpleImputer(strategy='median'), \n", " StandardScaler(),\n", " SelectKBest(score_func=f_regression),\n", " LinearRegression()\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.9.2 Fit the pipeline" ] }, { "cell_type": "code", "execution_count": 151, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')),\n", " ('standardscaler', StandardScaler()),\n", " ('selectkbest',\n", " SelectKBest(score_func=)),\n", " ('linearregression', LinearRegression())])" ] }, "execution_count": 151, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pipe.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.9.3 Assess performance on the train and test set" ] }, { "cell_type": "code", "execution_count": 152, "metadata": {}, "outputs": [], "source": [ "y_tr_pred = pipe.predict(X_train)\n", "y_te_pred = pipe.predict(X_test)" ] }, { "cell_type": "code", "execution_count": 153, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(0.760478965339582, 0.681569974499793)" ] }, "execution_count": 153, "metadata": {}, "output_type": 
"execute_result" } ], "source": [ "r2_score(y_train, y_tr_pred), r2_score(y_test, y_te_pred)" ] }, { "cell_type": "code", "execution_count": 154, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(9.757130441228263, 10.585905291034962)" ] }, "execution_count": 154, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mean_absolute_error(y_train, y_tr_pred), mean_absolute_error(y_test, y_te_pred)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This has made things worse! Clearly selecting a subset of features has an impact on performance. `SelectKBest` defaults to k=10. Let's create a new pipeline with a different value of k:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.9.4 Define a new pipeline to select a different number of features" ] }, { "cell_type": "code", "execution_count": 155, "metadata": {}, "outputs": [], "source": [ "# Modify the `SelectKBest` step to use a value of 15 for k\n", "pipe15 = make_pipeline(\n", " SimpleImputer(strategy='median'), \n", " StandardScaler(),\n", " SelectKBest(score_func=f_regression, k=15),\n", " LinearRegression()\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.9.5 Fit the pipeline" ] }, { "cell_type": "code", "execution_count": 156, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')),\n", " ('standardscaler', StandardScaler()),\n", " ('selectkbest',\n", " SelectKBest(k=15,\n", " score_func=)),\n", " ('linearregression', LinearRegression())])" ] }, "execution_count": 156, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pipe15.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.9.6 Assess performance on train and test data" ] }, { "cell_type": "code", "execution_count": 157, "metadata": {}, "outputs": [], "source": [ "y_tr_pred = pipe15.predict(X_train)\n", "y_te_pred = pipe15.predict(X_test)" ] }, { "cell_type": 
"code", "execution_count": 158, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(0.7922946911681397, 0.66079117939879)" ] }, "execution_count": 158, "metadata": {}, "output_type": "execute_result" } ], "source": [ "r2_score(y_train, y_tr_pred), r2_score(y_test, y_te_pred)" ] }, { "cell_type": "code", "execution_count": 159, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(9.214834764542976, 10.496823817105572)" ] }, "execution_count": 159, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mean_absolute_error(y_train, y_tr_pred), mean_absolute_error(y_test, y_te_pred)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We could keep going, trying different values of k, training a model, measuring performance on the test set, and then picking the model with the best test set performance. There's a fundamental problem with this approach: _we're tuning the model to the arbitrary test set_! If we continue this way we'll end up with a model works well on the particular quirks of our test set _but fails to generalize to new data_. The whole point of keeping a test set is for it to be a set of unseen data on which to test performance.\n", "\n", "The way around this is a technique called _cross-validation_. We partition the training set into k folds, train our model on k-1 of those folds, and calculate performance on the fold not used in training. This procedure then cycles through k times with a different fold held back each time. Thus we end up building k models on k sets of data with k estimates of how the model performs on unseen data but without having to touch the test set." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.9.7 Assessing performance using cross-validation" ] }, { "cell_type": "code", "execution_count": 160, "metadata": {}, "outputs": [], "source": [ "# Run 5-Fold Cross validation\n", "cv_results = cross_validate(pipe15, X_train, y_train, cv=5)" ] }, { "cell_type": "code", "execution_count": 161, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0.60510478, 0.67731713, 0.75047442, 0.58935004, 0.50041885])" ] }, "execution_count": 161, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# get scores\n", "cv_scores = cv_results['test_score']\n", "cv_scores" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Without using the same random state for initializing the CV folds, our actual numbers will be different." ] }, { "cell_type": "code", "execution_count": 162, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(0.6245330431201284, 0.08445948393083175)" ] }, "execution_count": 162, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.mean(cv_scores), np.std(cv_scores)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These results highlight that assessing model performance is inherently open to variability. We'll get different results depending on the quirks of which points are in which fold. An advantage of this is that we can also obtain an estimate of the variability, or uncertainty, in our performance estimate."
] }, { "cell_type": "code", "execution_count": 163, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0.46, 0.79])" ] }, "execution_count": 163, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.round((np.mean(cv_scores) - 2 * np.std(cv_scores), np.mean(cv_scores) + 2 * np.std(cv_scores)), 2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.9.8 Hyperparameter search using GridSearchCV" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pulling the above together, we have:\n", "* a pipeline that\n", " * imputes missing values\n", " * scales the data\n", " * selects the k best features\n", " * trains a linear regression model\n", "* a technique (cross-validation) for estimating model performance\n", "\n", "Now we will run cross-validation for multiple values of k and pick the value of k that gives the best performance. `make_pipeline` automatically names each step using the lowercased class name. Parameters of each step are then accessed by appending a double underscore followed by the parameter name. We know the name of the step will be 'selectkbest', and we know the parameter is 'k'."
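, "\n", "\n",
"As a minimal sketch of this naming convention, a step's parameter can be read and changed through its `step__parameter` name. The pipeline `demo_pipe` below is an illustrative stand-in (an assumption for this example), not the `pipe` object used in this notebook:\n",
"\n",
"```python\n",
"from sklearn.feature_selection import SelectKBest, f_regression\n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.pipeline import make_pipeline\n",
"\n",
"# Hypothetical two-step pipeline, used only to show the naming scheme\n",
"demo_pipe = make_pipeline(SelectKBest(score_func=f_regression), LinearRegression())\n",
"\n",
"# Step names are the lowercased class names\n",
"step_names = list(demo_pipe.named_steps)\n",
"\n",
"# Read the default k, then change it via the double-underscore name\n",
"default_k = demo_pipe.get_params()['selectkbest__k']\n",
"demo_pipe.set_params(selectkbest__k=5)\n",
"new_k = demo_pipe.get_params()['selectkbest__k']\n",
"```\n",
"\n",
"The same `selectkbest__k` key is what a parameter grid refers to when searching over values of k."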
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also list the names of all the parameters in a pipeline as follows:" ] }, { "cell_type": "code", "execution_count": 164, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dict_keys(['memory', 'steps', 'verbose', 'simpleimputer', 'standardscaler', 'selectkbest', 'linearregression', 'simpleimputer__add_indicator', 'simpleimputer__copy', 'simpleimputer__fill_value', 'simpleimputer__missing_values', 'simpleimputer__strategy', 'simpleimputer__verbose', 'standardscaler__copy', 'standardscaler__with_mean', 'standardscaler__with_std', 'selectkbest__k', 'selectkbest__score_func', 'linearregression__copy_X', 'linearregression__fit_intercept', 'linearregression__n_jobs', 'linearregression__normalize', 'linearregression__positive'])" ] }, "execution_count": 164, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Call `pipe`'s `get_params()` method to get a dict of available parameters and print their names\n", "# using dict's `keys()` method\n", "pipe.get_params().keys()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The above can be particularly useful as our pipelines become more complex (we can even nest pipelines within pipelines)." ] }, { "cell_type": "code", "execution_count": 165, "metadata": {}, "outputs": [], "source": [ "k = [k+1 for k in range(len(X_train.columns))]\n", "grid_params = {'selectkbest__k': k}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we have a range of `k` to investigate. Is 1 feature best? 2? 3? 4? All of them? We could write a for loop and iterate over each possible value, doing all the housekeeping ourselves to track the best value of k. But this is a common task, so there's a built-in tool for it in `sklearn`. 
This is [`GridSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html).\n", "\n", "This takes the pipeline object; in fact, it takes anything with a `.fit()` and `.predict()` method. In simple cases with no feature selection, imputation, or feature scaling, we may see the classifier or regressor object itself passed directly into `GridSearchCV`. The other key input is the set of parameters and values to search over. Optional parameters include the cross-validation strategy and the number of CPUs to use." ] }, { "cell_type": "code", "execution_count": 166, "metadata": {}, "outputs": [], "source": [ "lr_grid_cv = GridSearchCV(pipe, param_grid=grid_params, cv=5, n_jobs=-1)" ] }, { "cell_type": "code", "execution_count": 167, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "GridSearchCV(cv=5,\n", " estimator=Pipeline(steps=[('simpleimputer',\n", " SimpleImputer(strategy='median')),\n", " ('standardscaler', StandardScaler()),\n", " ('selectkbest',\n", " SelectKBest(score_func=)),\n", " ('linearregression',\n", " LinearRegression())]),\n", " n_jobs=-1,\n", " param_grid={'selectkbest__k': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,\n", " 12, 13, 14, 15, 16, 17, 18, 19, 20,\n", " 21, 22, 23, 24, 25, 26, 27, 28, 29,\n", " 30, ...]})" ] }, "execution_count": 167, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lr_grid_cv.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 168, "metadata": {}, "outputs": [], "source": [ "score_mean = lr_grid_cv.cv_results_['mean_test_score']\n", "score_std = lr_grid_cv.cv_results_['std_test_score']\n", "cv_k = [k for k in lr_grid_cv.cv_results_['param_selectkbest__k']]" ] }, { "cell_type": "code", "execution_count": 169, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'selectkbest__k': 6}" ] }, "execution_count": 169, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Print the `best_params_` attribute of `lr_grid_cv`\n", 
"lr_grid_cv.best_params_" ] }, { "cell_type": "code", "execution_count": 170, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Assign the value of k from the above dict of `best_params_` and assign it to `best_k`\n", "best_k = lr_grid_cv.best_params_['selectkbest__k']\n", "plt.subplots(figsize=(10, 5))\n", "plt.errorbar(cv_k, score_mean, yerr=score_std)\n", "plt.axvline(x=best_k, c='r', ls='--', alpha=.5)\n", "plt.xlabel('k')\n", "plt.ylabel('CV score (r-squared)')\n", "plt.title('Pipeline mean CV score (error bars +/- 1sd)');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The above suggests a good value for k is 8. There was an initial rapid increase with k, followed by a slow decline. Also noticeable is the variance of the results greatly increase above k=8. As we increasingly overfit, expect greater swings in performance as different points move in and out of the train/test folds." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Which features were most useful? Step into our best model, shown below. Starting with the fitted grid search object, you get the best estimator, then the named step 'selectkbest', for which you can its `get_support()` method for a logical mask of the features selected." 
] }, { "cell_type": "code", "execution_count": 171, "metadata": {}, "outputs": [], "source": [ "selected = lr_grid_cv.best_estimator_.named_steps.selectkbest.get_support()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Similarly, instead of using the 'selectkbest' named step, we access the named step for the linear regression model and, from that, grab the model coefficients via its `coef_` attribute:" ] }, { "cell_type": "code", "execution_count": 172, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "vertical_drop 10.333065\n", "Snow Making_ac 6.376653\n", "fastQuads 4.980331\n", "total_chairs 3.416636\n", "Runs 3.233735\n", "SkiableTerrain_ac -3.259420\n", "dtype: float64" ] }, "execution_count": 172, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Get the linear model coefficients from the `coef_` attribute and store in `coefs`,\n", "# get the matching feature names from the column names of the dataframe,\n", "# and display the results as a pandas Series with `coefs` as the values and `features` as the index,\n", "# sorting the values in descending order\n", "coefs = lr_grid_cv.best_estimator_.named_steps.linearregression.coef_\n", "features = X_train.columns[selected]\n", "pd.Series(coefs, index=features).sort_values(ascending=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These results suggest that vertical drop is our biggest positive feature. This makes intuitive sense and is consistent with what you saw during the EDA work. Also, we see the area covered by snow making equipment is a strong positive as well. People like guaranteed skiing! The skiable terrain area is negatively associated with ticket price! This seems odd. People will pay less for larger resorts? There could be all manner of reasons for this. It could be an effect whereby larger resorts can host more visitors at any one time and so can charge less per ticket. 
\n", "\n", "As has been mentioned previously, the data are missing information about visitor numbers. Bear in mind, the coefficient for skiable terrain is negative _for this model_. For example, if you kept the total number of chairs and fastQuads constant, but increased the skiable terrain extent, you might imagine the resort is worse off because the chairlift capacity is stretched thinner." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4.10 Random Forest Model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A model that can work very well in a lot of cases is the random forest. For regression, this is provided by `sklearn`'s `RandomForestRegressor` class.\n", "\n", "Time to stop the bad practice of repeatedly checking performance on the test split. Instead, go straight from defining the pipeline to assessing performance using cross-validation. `cross_validate` will perform the fitting as part of the process. This uses the default settings for the random forest so you'll then proceed to investigate some different hyperparameters." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.10.1 Define the pipeline" ] }, { "cell_type": "code", "execution_count": 173, "metadata": {}, "outputs": [], "source": [ "# Define a pipeline comprising the steps:\n", "# SimpleImputer() with a strategy of 'median'\n", "# StandardScaler(),\n", "# and then RandomForestRegressor() with a random state of 47\n", "RF_pipe = make_pipeline(\n", " SimpleImputer(strategy='median'),\n", " StandardScaler(),\n", " RandomForestRegressor(random_state=47)\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.10.2 Fit and assess performance using cross-validation" ] }, { "cell_type": "code", "execution_count": 174, "metadata": {}, "outputs": [], "source": [ "# Call `cross_validate` to estimate the pipeline's performance.\n", "# Pass it the random forest pipe object, `X_train` and `y_train`,\n", "# and get it to use 5-fold cross-validation\n", "rf_default_cv_results = cross_validate(RF_pipe, X_train, y_train, cv=5)" ] }, { "cell_type": "code", "execution_count": 175, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0.6711019 , 0.78433505, 0.75960138, 0.59978811, 0.56699816])" ] }, "execution_count": 175, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rf_cv_scores = rf_default_cv_results['test_score']\n", "rf_cv_scores" ] }, { "cell_type": "code", "execution_count": 176, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(0.6763649193918256, 0.08536820259910885)" ] }, "execution_count": 176, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.mean(rf_cv_scores), np.std(rf_cv_scores)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.10.3 Hyperparameter search using GridSearchCV" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Random forest has a number of hyperparameters that can be explored. Here, however, we'll limit ourselves to exploring some different values for the number of trees. 
We'll try it with and without feature scaling, and try both the mean and median as strategies for imputing missing values." ] }, { "cell_type": "code", "execution_count": 177, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'randomforestregressor__n_estimators': [10,\n", " 12,\n", " 16,\n", " 20,\n", " 26,\n", " 33,\n", " 42,\n", " 54,\n", " 69,\n", " 88,\n", " 112,\n", " 143,\n", " 183,\n", " 233,\n", " 297,\n", " 379,\n", " 483,\n", " 615,\n", " 784,\n", " 1000],\n", " 'standardscaler': [StandardScaler(), None],\n", " 'simpleimputer__strategy': ['mean', 'median']}" ] }, "execution_count": 177, "metadata": {}, "output_type": "execute_result" } ], "source": [ "n_est = [int(n) for n in np.logspace(start=1, stop=3, num=20)]\n", "grid_params = {\n", " 'randomforestregressor__n_estimators': n_est,\n", " 'standardscaler': [StandardScaler(), None],\n", " 'simpleimputer__strategy': ['mean', 'median']\n", "}\n", "grid_params" ] }, { "cell_type": "code", "execution_count": 178, "metadata": {}, "outputs": [], "source": [ "# Call `GridSearchCV` with the random forest pipeline, passing in the above `grid_params`\n", "# dict for parameters to evaluate, 5-fold cross-validation, and all available CPU cores (if desired)\n", "rf_grid_cv = GridSearchCV(RF_pipe, param_grid=grid_params, cv=5, n_jobs=-1)" ] }, { "cell_type": "code", "execution_count": 179, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "GridSearchCV(cv=5,\n", " estimator=Pipeline(steps=[('simpleimputer',\n", " SimpleImputer(strategy='median')),\n", " ('standardscaler', StandardScaler()),\n", " ('randomforestregressor',\n", " RandomForestRegressor(random_state=47))]),\n", " n_jobs=-1,\n", " param_grid={'randomforestregressor__n_estimators': [10, 12, 16, 20,\n", " 26, 33, 42, 54,\n", " 69, 88, 112,\n", " 143, 183, 233,\n", " 297, 379, 483,\n", " 615, 784,\n", " 1000],\n", " 'simpleimputer__strategy': ['mean', 'median'],\n", " 'standardscaler': [StandardScaler(), None]})" ] }, "execution_count": 179, 
"metadata": {}, "output_type": "execute_result" } ], "source": [ "# Now call the `GridSearchCV`'s `fit()` method with `X_train` and `y_train` as arguments\n", "# to actually start the grid search. This may take a minute or two.\n", "rf_grid_cv.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 180, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'randomforestregressor__n_estimators': 233,\n", " 'simpleimputer__strategy': 'mean',\n", " 'standardscaler': None}" ] }, "execution_count": 180, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Print the best params (`best_params_` attribute) from the grid search\n", "rf_grid_cv.best_params_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It looks like imputing with the median helps, but scaling the features doesn't." ] }, { "cell_type": "code", "execution_count": 181, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0.67199826, 0.78788539, 0.7776494 , 0.61844222, 0.60114645])" ] }, "execution_count": 181, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rf_best_cv_results = cross_validate(rf_grid_cv.best_estimator_, X_train, y_train, cv=5)\n", "rf_best_scores = rf_best_cv_results['test_score']\n", "rf_best_scores" ] }, { "cell_type": "code", "execution_count": 182, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(0.691424343745292, 0.07822193232981417)" ] }, "execution_count": 182, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.mean(rf_best_scores), np.std(rf_best_scores)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We've marginally improved upon the default CV results. Random forest has many more hyperparameters you could tune, but we won't dive into that here." ] }, { "cell_type": "code", "execution_count": 183, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Plot a barplot of the random forest's feature importances,\n", "# assigning the `feature_importances_` attribute of \n", "# `rf_grid_cv.best_estimator_.named_steps.randomforestregressor` to the name `imps` to then\n", "# create a pandas Series object of the feature importances, with the index given by the\n", "# training data column names, sorting the values in descending order\n", "plt.subplots(figsize=(10, 5))\n", "imps = rf_grid_cv.best_estimator_.named_steps.randomforestregressor.feature_importances_\n", "rf_feat_imps = pd.Series(imps, index=X_train.columns).sort_values(ascending=False)\n", "rf_feat_imps.plot(kind='bar')\n", "plt.xlabel('features')\n", "plt.ylabel('importance')\n", "plt.title('Best random forest regressor feature importances');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Encouragingly, the dominant top four features are in common with our linear model:\n", "* fastQuads\n", "* Runs\n", "* Snow Making_ac\n", "* vertical_drop" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4.11 Final Model Selection" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Time to select our final model to use for further business modeling! It would be good to revisit the above model selection; there is undoubtedly more that could be done to explore possible hyperparameters.\n", "It would also be worthwhile to investigate removing the least useful features. 
Gathering or calculating, and storing, features adds business cost and dependencies, so if features genuinely are not needed they should be removed.\n", "Building a simpler model with fewer features can also have the advantage of being easier to sell (and/or explain) to stakeholders.\n", "Certainly there seem to be four strong features here and so a model using only those would probably work well.\n", "However, we want to explore some different scenarios where other features vary so keep the fuller \n", "model for now. \n", "The business is waiting for this model and we have something that we are confident is much better than guessing with the average price.\n", "\n", "Or, rather, we have two \"somethings\". We built a best linear model and a best random forest model. We need to finally choose between them. We can calculate the mean absolute error using cross-validation. Although `cross_validate` defaults to the $R^2$ [metric for scoring](https://scikit-learn.org/stable/modules/model_evaluation.html#scoring) regression, we can specify the mean absolute error as an alternative via\n", "the `scoring` parameter." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.11.1 Linear regression model performance" ] }, { "cell_type": "code", "execution_count": 184, "metadata": {}, "outputs": [], "source": [ "# 'neg_mean_absolute_error' uses the (negative of) the mean absolute error\n", "lr_neg_mae = cross_validate(lr_grid_cv.best_estimator_, X_train, y_train, \n", " scoring='neg_mean_absolute_error', cv=5, n_jobs=-1)" ] }, { "cell_type": "code", "execution_count": 185, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(10.891687453692906, 1.867599419260583)" ] }, "execution_count": 185, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lr_mae_mean = np.mean(-1 * lr_neg_mae['test_score'])\n", "lr_mae_std = np.std(-1 * lr_neg_mae['test_score'])\n", "lr_mae_mean, lr_mae_std" ] }, { "cell_type": "code", "execution_count": 186, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "10.314388905802957" ] }, "execution_count": 186, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mean_absolute_error(y_test, lr_grid_cv.best_estimator_.predict(X_test))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.11.2 Random forest regression model performance" ] }, { "cell_type": "code", "execution_count": 187, "metadata": {}, "outputs": [], "source": [ "rf_neg_mae = cross_validate(rf_grid_cv.best_estimator_, X_train, y_train, \n", " scoring='neg_mean_absolute_error', cv=5, n_jobs=-1)" ] }, { "cell_type": "code", "execution_count": 188, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(9.829508627130718, 1.0814411685916026)" ] }, "execution_count": 188, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rf_mae_mean = np.mean(-1 * rf_neg_mae['test_score'])\n", "rf_mae_std = np.std(-1 * rf_neg_mae['test_score'])\n", "rf_mae_mean, rf_mae_std" ] }, { "cell_type": "code", "execution_count": 189, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "9.539958614347025" ] }, "execution_count": 189, 
"metadata": {}, "output_type": "execute_result" } ], "source": [ "mean_absolute_error(y_test, rf_grid_cv.best_estimator_.predict(X_test))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.11.3 Conclusion" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The random forest model has a lower cross-validation mean absolute error by almost ~$1. It also exhibits less variability. Verifying performance on the test set produces performance consistent with the cross-validation results." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4.12 Data quantity assessment" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we need to advise the business whether it needs to undertake further data collection. Would more data be useful? We're often led to believe more data is always good, but gathering data invariably has a cost associated with it. Assess this trade off by seeing how performance varies with differing data set sizes. The `learning_curve` function does this conveniently." ] }, { "cell_type": "code", "execution_count": 190, "metadata": {}, "outputs": [], "source": [ "fractions = [.2, .25, .3, .35, .4, .45, .5, .6, .75, .8, 1.0]\n", "train_size, train_scores, test_scores = learning_curve(pipe, X_train, y_train, train_sizes=fractions)\n", "train_scores_mean = np.mean(train_scores, axis=1)\n", "train_scores_std = np.std(train_scores, axis=1)\n", "test_scores_mean = np.mean(test_scores, axis=1)\n", "test_scores_std = np.std(test_scores, axis=1)" ] }, { "cell_type": "code", "execution_count": 191, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plt.subplots(figsize=(10, 5))\n", "plt.errorbar(train_size, test_scores_mean, yerr=test_scores_std)\n", "plt.xlabel('Training set size')\n", "plt.ylabel('CV scores')\n", "plt.title('Cross-validation score as training set size increases');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This shows that you seem to have plenty of data. There's an initial rapid improvement in model scores as one would expect, but it's essentially levelled off by around a sample size of 40-50." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4.13 Save best model object from pipeline" ] }, { "cell_type": "code", "execution_count": 192, "metadata": {}, "outputs": [], "source": [ "best_model = rf_grid_cv.best_estimator_\n", "best_model.version = '1.0'\n", "best_model.pandas_version = pd.__version__\n", "best_model.numpy_version = np.__version__\n", "best_model.sklearn_version = sklearn_version\n", "best_model.X_columns = [col for col in X_train.columns]\n", "best_model.build_datetime = datetime.datetime.now()" ] }, { "cell_type": "code", "execution_count": 193, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "A file already exists with this name.\n", "\n", "Do you want to overwrite? (Y/N)Y\n", "Writing file. 
\"../models/ski_resort_pricing_model.pkl\"\n" ] } ], "source": [ "# save the model\n", "\n", "modelpath = '../models'\n", "save_file(best_model, 'ski_resort_pricing_model.pkl', modelpath)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.13" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": true }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }