--- title: "Machine Learning Workflow: Boosting for Prediction" subtitle: "Econ 425T" author: "Dr. Hua Zhou @ UCLA" date: "`r format(Sys.time(), '%d %B, %Y')`" format: html: theme: cosmo number-sections: true toc: true toc-depth: 4 toc-location: left code-fold: false engine: knitr knitr: opts_chunk: fig.align: 'center' # fig.width: 6 # fig.height: 4 message: FALSE cache: false --- Display system information for reproducibility. ::: {.panel-tabset} #### Python ```{python} import IPython print(IPython.sys_info()) ``` ::: ## Overview ![](https://www.tidymodels.org/start/resampling/img/resampling.svg) We illustrate the typical machine learning workflow for random forests using the `Hitters` data set from R `ISLR2` package. 1. Initial splitting to test and non-test sets. 2. Pre-processing of data: not much is needed for regression trees. 3. Tune the cost complexity pruning hyper-parameter(s) using 10-fold cross-validation (CV) on the non-test data. 4. Choose the best model by CV and refit it on the whole non-test data. 5. Final prediction on the test data. ## Hitters data A documentation of the `Hitters` data is [here](https://www.rdocumentation.org/packages/ISLR2/versions/1.3-2/topics/Hitters). The goal is to predict the log(Salary) (at opening of 1987 season) of MLB players from their performance metrics in the 1986-7 season. ::: {.panel-tabset} #### Python ```{python} # Load the pandas library import pandas as pd # Load numpy for array manipulation import numpy as np # Load seaborn plotting library import seaborn as sns import matplotlib.pyplot as plt # Set font sizes in plots sns.set(font_scale = 1.2) # Display all columns pd.set_option('display.max_columns', None) Hitters = pd.read_csv("../data/Hitters.csv") Hitters ``` ```{python} # Numerical summaries Hitters.describe() ``` Graphical summary: ```{python} #| eval: false # Graphical summaries plt.figure() sns.pairplot(data = Hitters); plt.show() ``` There are 59 `NA`s for the salary. Let’s drop those cases. We are left with 263 data points. ```{python} Hitters.dropna(inplace = True) Hitters.shape ``` ::: ## Initial split into test and non-test sets We randomly split the data in half of test data and another half of non-test data. ::: {.panel-tabset} #### Python ```{python} from sklearn.model_selection import train_test_split Hitters_other, Hitters_test = train_test_split( Hitters, train_size = 0.5, random_state = 425, # seed ) Hitters_test.shape Hitters_other.shape ``` Separate $X$ and $y$. We will use 9 of the features. ```{python} features = ['Assists', 'AtBat', 'Hits', 'HmRun', 'PutOuts', 'RBI', 'Runs', 'Walks', 'Years'] # Non-test X and y X_other = Hitters_other[features] y_other = np.log(Hitters_other.Salary) # Test X and y X_test = Hitters_test[features] y_test = np.log(Hitters_test.Salary) ``` #### R ::: ## Preprocessing (Python) or recipe (R) ::: {.panel-tabset} #### Python Not much preprocessing is needed here since all predictors are quantitative. #### R ::: ## Model ::: {.panel-tabset} #### Python ```{python} from sklearn.ensemble import AdaBoostRegressor from sklearn.tree import DecisionTreeRegressor bst_mod = AdaBoostRegressor( # Default base estimator is DecisionTreeRegressor with max_depth = 3 estimator = DecisionTreeRegressor(max_depth = 3), # Number of trees (to be tuned) n_estimators = 50, # Learning rate (to be tuned) learning_rate = 1.0, random_state = 425 ) ``` #### R ::: ## Pipeline (Python) or workflow (R) Here we bundle the preprocessing step (Python) or recipe (R) and model. ::: {.panel-tabset} #### Python ```{python} from sklearn.pipeline import Pipeline pipe = Pipeline(steps = [ ("model", bst_mod) ]) pipe ``` #### R ::: ## Tuning grid Here we tune the number of trees `n_estimators` and the learning rate `learning_rate`. ::: {.panel-tabset} #### Python ```{python} # Tune hyper-parameter(s) d_grid = [ DecisionTreeRegressor(max_depth = 1), DecisionTreeRegressor(max_depth = 2), DecisionTreeRegressor(max_depth = 3), DecisionTreeRegressor(max_depth = 4) ] B_grid = [50, 100, 150, 200, 250, 300, 350, 400] lambda_grid = [0.2, 0.4, 0.6, 0.8, 1.0] tuned_parameters = { "model__estimator": d_grid, "model__n_estimators": B_grid, "model__learning_rate": lambda_grid } tuned_parameters ``` #### R ::: ## Cross-validation (CV) ::: {.panel-tabset} #### Python Set up CV partitions and CV criterion. ```{python} from sklearn.model_selection import GridSearchCV # Set up CV n_folds = 6 search = GridSearchCV( pipe, tuned_parameters, cv = n_folds, scoring = "neg_root_mean_squared_error", # Refit the best model on the whole data set refit = True ) ``` Fit CV. This is typically the most time-consuming step. ```{python} # Fit CV search.fit(X_other, y_other) ``` Visualize CV results. ```{python} #| eval: true #| code-fold: true cv_res = pd.DataFrame({ "B": np.array(search.cv_results_["param_model__n_estimators"]), "rmse": -search.cv_results_["mean_test_score"], "lambda": search.cv_results_["param_model__learning_rate"], "depth": search.cv_results_["param_model__estimator"], }) plt.figure() sns.relplot( # kind = "line", data = cv_res, x = "B", y = "rmse", hue = "lambda", style = "depth" ).set( xlabel = "B", ylabel = "CV RMSE" ); plt.show() ``` Best CV RMSE: ```{python} -search.best_score_ ``` #### R ::: ## Finalize our model Now we are done tuning. Finally, let’s fit this final model to the whole training data and use our test data to estimate the model performance we expect to see with new data. ::: {.panel-tabset} #### Python Since we called `GridSearchCV` with `refit = True`, the best model fit on the whole non-test data is readily available. ```{python} search.best_estimator_ ``` The final prediction RMSE on the test set is ```{python} from sklearn.metrics import mean_squared_error mean_squared_error( y_test, search.best_estimator_.predict(X_test), squared = False ) ``` #### R :::