--- title: "Machine Learning Workflow: Boosting for Prediction" subtitle: "Biostat 274" author: "Dr. Jin Zhou @ UCLA" date: today format: html: theme: cosmo embed-resources: true number-sections: true toc: true toc-depth: 4 toc-location: left code-fold: false engine: knitr knitr: opts_chunk: fig.align: 'center' # fig.width: 6 # fig.height: 4 message: FALSE cache: false --- Display system information for reproducibility. ::: {.panel-tabset} #### Python ```{python} import IPython print(IPython.sys_info()) ``` ::: ## Overview ![](https://www.tidymodels.org/start/resampling/img/resampling.svg) We illustrate the typical machine learning workflow for random forests using the `Hitters` data set from R `ISLR2` package. 1. Initial splitting to test and non-test sets. 2. Pre-processing of data: not much is needed for regression trees. 3. Tune the cost complexity pruning hyper-parameter(s) using 10-fold cross-validation (CV) on the non-test data. 4. Choose the best model by CV and refit it on the whole non-test data. 5. Final prediction on the test data. ## Hitters data A documentation of the `Hitters` data is [here](https://www.rdocumentation.org/packages/ISLR2/versions/1.3-2/topics/Hitters). The goal is to predict the log(Salary) (at opening of 1987 season) of MLB players from their performance metrics in the 1986-7 season. ::: {.panel-tabset} #### R - There are 59 `NA`s for the salary. Let’s drop those cases. We are left with 263 data points. ```{r} library(GGally) library(gtsummary) library(ranger) library(tidyverse) library(tidymodels) library(ISLR2) # Numerical summaries stratified by the outcome `AHD`. Hitters %>% tbl_summary() Hitters <- Hitters %>% filter(!is.na(Salary)) %>% select(Salary, Assists, AtBat, Hits, HmRun, PutOuts, RBI, Runs, Walks, Years) ``` #### Python ```{python} # Load the pandas library import pandas as pd # Load numpy for array manipulation import numpy as np # Load seaborn plotting library import seaborn as sns import matplotlib.pyplot as plt # Set font sizes in plots sns.set(font_scale = 1.2) # Display all columns pd.set_option('display.max_columns', None) Hitters = pd.read_csv("./slides/data/Hitters.csv") Hitters ``` ```{python} # Numerical summaries Hitters.describe() ``` Graphical summary: ```{python} #| eval: false # Graphical summaries plt.figure() sns.pairplot(data = Hitters); plt.show() ``` There are 59 `NA`s for the salary. Let’s drop those cases. We are left with 263 data points. ```{python} Hitters.dropna(inplace = True) Hitters.shape ``` ::: ## Initial split into test and non-test sets We randomly split the data in half of test data and another half of non-test data. ::: {.panel-tabset} ```{r} # For reproducibility set.seed(203) data_split <- initial_split( Hitters, prop = 0.5 ) data_split Hitters_other <- training(data_split) dim(Hitters_other) Hitters_test <- testing(data_split) dim(Hitters_test) ``` #### Python ```{python} from sklearn.model_selection import train_test_split Hitters_other, Hitters_test = train_test_split( Hitters, train_size = 0.5, random_state = 425, # seed ) Hitters_test.shape Hitters_other.shape ``` Separate $X$ and $y$. We will use 9 of the features. ```{python} features = ['Assists', 'AtBat', 'Hits', 'HmRun', 'PutOuts', 'RBI', 'Runs', 'Walks', 'Years'] # Non-test X and y X_other = Hitters_other[features] y_other = np.log(Hitters_other.Salary) # Test X and y X_test = Hitters_test[features] y_test = np.log(Hitters_test.Salary) ``` #### R ::: ## Preprocessing (Python) or recipe (R) ::: {.panel-tabset} #### R ```{r} gb_recipe <- recipe( Salary ~ ., data = Hitters_other ) %>% # # create traditional dummy variables (not necessary for random forest in R) # step_dummy(all_nominal()) %>% step_naomit(Salary) %>% # zero-variance filter step_zv(all_numeric_predictors()) # %>% # # center and scale numeric data (not necessary for random forest) # step_normalize(all_numeric_predictors()) %>% # estimate the means and standard deviations # prep(training = Hitters_other, retain = TRUE) gb_recipe ``` #### Python Not much preprocessing is needed here since all predictors are quantitative. ::: ## Model ::: {.panel-tabset} #### R ```{r} gb_mod <- boost_tree( mode = "regression", trees = 1000, tree_depth = tune(), learn_rate = tune() ) %>% set_engine("xgboost") gb_mod ``` #### Python ```{python} from sklearn.ensemble import AdaBoostRegressor from sklearn.tree import DecisionTreeRegressor bst_mod = AdaBoostRegressor( # Default base estimator is DecisionTreeRegressor with max_depth = 3 estimator = DecisionTreeRegressor(max_depth = 3), # Number of trees (to be tuned) n_estimators = 50, # Learning rate (to be tuned) learning_rate = 1.0, random_state = 425 ) ``` ::: ## Pipeline (Python) or workflow (R) Here we bundle the preprocessing step (Python) or recipe (R) and model. ::: {.panel-tabset} #### R ```{r} gb_wf <- workflow() %>% add_recipe(gb_recipe) %>% add_model(gb_mod) gb_wf ``` #### Python ```{python} from sklearn.pipeline import Pipeline pipe = Pipeline(steps = [ ("model", bst_mod) ]) pipe ``` ::: ## Tuning grid Here we tune the number of trees `n_estimators` and the learning rate `learning_rate`. ::: {.panel-tabset} #### R Here we tune the number of trees `trees` and the number of features to use in each split `mtry`. ```{r} param_grid <- grid_regular( tree_depth(range = c(1L, 4L)), learn_rate(range = c(-3, -0.5), trans = log10_trans()), levels = c(4, 10) ) param_grid ``` #### Python ```{python} # Tune hyper-parameter(s) d_grid = [ DecisionTreeRegressor(max_depth = 1), DecisionTreeRegressor(max_depth = 2), DecisionTreeRegressor(max_depth = 3), DecisionTreeRegressor(max_depth = 4) ] B_grid = [50, 100, 150, 200, 250, 300, 350, 400] lambda_grid = [0.2, 0.4, 0.6, 0.8, 1.0] tuned_parameters = { "model__estimator": d_grid, "model__n_estimators": B_grid, "model__learning_rate": lambda_grid } tuned_parameters ``` ::: ## Cross-validation (CV) ::: {.panel-tabset} #### R Set cross-validation partitions. ```{r} set.seed(203) folds <- vfold_cv(Hitters_other, v = 5) folds ``` Fit cross-validation. ```{r} gb_fit <- gb_wf %>% tune_grid( resamples = folds, grid = param_grid, metrics = metric_set(rmse, rsq) ) gb_fit ``` Visualize CV results: ```{r} gb_fit %>% collect_metrics() %>% print(width = Inf) %>% filter(.metric == "rmse") %>% ggplot(mapping = aes(x = learn_rate, y = mean, color = factor(tree_depth))) + geom_point() + geom_line() + labs(x = "Learning Rate", y = "CV AUC") + scale_x_log10() ``` Show the top 5 models. ```{r} gb_fit %>% show_best(metric = "rmse") ``` Let's select the best model. ```{r} best_gb <- gb_fit %>% select_best(metric = "rmse") best_gb ``` #### Python Set up CV partitions and CV criterion. ```{python} from sklearn.model_selection import GridSearchCV # Set up CV n_folds = 6 search = GridSearchCV( pipe, tuned_parameters, cv = n_folds, scoring = "neg_root_mean_squared_error", # Refit the best model on the whole data set refit = True ) ``` Fit CV. This is typically the most time-consuming step. ```{python} # Fit CV search.fit(X_other, y_other) ``` Visualize CV results. ```{python} #| eval: true #| code-fold: true cv_res = pd.DataFrame({ "B": np.array(search.cv_results_["param_model__n_estimators"]), "rmse": -search.cv_results_["mean_test_score"], "lambda": search.cv_results_["param_model__learning_rate"], "depth": search.cv_results_["param_model__estimator"], }) plt.figure() sns.relplot( # kind = "line", data = cv_res, x = "B", y = "rmse", hue = "lambda", style = "depth" ).set( xlabel = "B", ylabel = "CV RMSE" ); plt.show() ``` Best CV RMSE: ```{python} -search.best_score_ ``` ::: ## Finalize our model Now we are done tuning. Finally, let’s fit this final model to the whole training data and use our test data to estimate the model performance we expect to see with new data. ::: {.panel-tabset} #### R ```{r} # Final workflow final_wf <- gb_wf %>% finalize_workflow(best_gb) final_wf ``` ```{r} # Fit the whole training set, then predict the test cases final_fit <- final_wf %>% last_fit(data_split) final_fit ``` ```{r} # Test metrics final_fit %>% collect_metrics() ``` #### Python Since we called `GridSearchCV` with `refit = True`, the best model fit on the whole non-test data is readily available. ```{python} search.best_estimator_ ``` The final prediction RMSE on the test set is ```{python} from sklearn.metrics import mean_squared_error mean_squared_error( y_test, search.best_estimator_.predict(X_test) ) ``` ::: ## Visualize the final model ```{r} #library(rpart.plot) final_tree <- extract_workflow(final_fit) final_tree ``` ```{r} library(vip) final_tree %>% extract_fit_parsnip() %>% vip() ```