---
title: "Machine Learning Workflow: Boosting for Prediction"
subtitle: "Biostat 274"
author: "Dr. Jin Zhou @ UCLA"
date: today
format:
  html:
    theme: cosmo
    embed-resources: true
    number-sections: true
    toc: true
    toc-depth: 4
    toc-location: left
    code-fold: false
engine: knitr
knitr:
  opts_chunk: 
    fig.align: 'center'
    # fig.width: 6
    # fig.height: 4
    message: FALSE
    cache: false
---

Display system information for reproducibility.

::: {.panel-tabset}

#### Python

```{python}
import IPython
print(IPython.sys_info())
```

:::

## Overview

![](https://www.tidymodels.org/start/resampling/img/resampling.svg)


We illustrate the typical machine learning workflow for random forests using the `Hitters` data set from R `ISLR2` package. 

1. Initial splitting to test and non-test sets.

2. Pre-processing of data: not much is needed for regression trees.

3. Tune the cost complexity pruning hyper-parameter(s) using 10-fold cross-validation (CV) on the non-test data.

4. Choose the best model by CV and refit it on the whole non-test data.

5. Final prediction on the test data.

## Hitters data

A documentation of the `Hitters` data is [here](https://www.rdocumentation.org/packages/ISLR2/versions/1.3-2/topics/Hitters). The goal is to predict the log(Salary) (at opening of 1987 season) of MLB players from their performance metrics in the 1986-7 season.

::: {.panel-tabset}
#### R

- There are 59 `NA`s for the salary. Let’s drop those cases. We are left with 263 data points.

```{r}
library(GGally)
library(gtsummary)
library(ranger)
library(tidyverse)
library(tidymodels)
library(ISLR2)

# Numerical summaries stratified by the outcome `AHD`.
Hitters %>% tbl_summary()
Hitters <- Hitters %>% filter(!is.na(Salary)) %>%
  select(Salary, Assists, 
         AtBat, Hits, HmRun, 
         PutOuts, RBI, Runs, Walks, Years)
```


#### Python

```{python}
# Load the pandas library
import pandas as pd
# Load numpy for array manipulation
import numpy as np
# Load seaborn plotting library
import seaborn as sns
import matplotlib.pyplot as plt

# Set font sizes in plots
sns.set(font_scale = 1.2)
# Display all columns
pd.set_option('display.max_columns', None)

Hitters = pd.read_csv("./slides/data/Hitters.csv")
Hitters
```

```{python}
# Numerical summaries
Hitters.describe()
```
Graphical summary:
```{python}
#| eval: false
# Graphical summaries
plt.figure()
sns.pairplot(data = Hitters);
plt.show()
```

There are 59 `NA`s for the salary. Let’s drop those cases. We are left with 263 data points.

```{python}
Hitters.dropna(inplace = True)
Hitters.shape
```

:::


## Initial split into test and non-test sets

We randomly split the data in half of test data and another half of non-test data.

::: {.panel-tabset}

```{r}
# For reproducibility
set.seed(203)

data_split <- initial_split(
  Hitters, 
  prop = 0.5
  )
data_split

Hitters_other <- training(data_split)
dim(Hitters_other)

Hitters_test <- testing(data_split)
dim(Hitters_test)
```

#### Python

```{python}
from sklearn.model_selection import train_test_split

Hitters_other, Hitters_test = train_test_split(
  Hitters, 
  train_size = 0.5,
  random_state = 425, # seed
  )
Hitters_test.shape
Hitters_other.shape
```

Separate $X$ and $y$. We will use 9 of the features.
```{python}
features = ['Assists', 'AtBat', 'Hits', 'HmRun', 'PutOuts', 'RBI', 'Runs', 'Walks', 'Years']
# Non-test X and y
X_other = Hitters_other[features]
y_other = np.log(Hitters_other.Salary)
# Test X and y
X_test = Hitters_test[features]
y_test = np.log(Hitters_test.Salary)
```

#### R

:::

## Preprocessing (Python) or recipe (R)

::: {.panel-tabset}
#### R
```{r}
gb_recipe <- 
  recipe(
    Salary ~ ., 
    data = Hitters_other
  ) %>%
  # # create traditional dummy variables (not necessary for random forest in R)
  # step_dummy(all_nominal()) %>%
  step_naomit(Salary) %>%
  # zero-variance filter
  step_zv(all_numeric_predictors()) # %>% 
  # # center and scale numeric data (not necessary for random forest)
  # step_normalize(all_numeric_predictors()) %>%
  # estimate the means and standard deviations
  # prep(training = Hitters_other, retain = TRUE)
gb_recipe
```

#### Python

Not much preprocessing is needed here since all predictors are quantitative.


:::

## Model

::: {.panel-tabset}

#### R

```{r}
gb_mod <- 
  boost_tree(
    mode = "regression",
    trees = 1000, 
    tree_depth = tune(),
    learn_rate = tune()
  ) %>% 
  set_engine("xgboost")
gb_mod
```

#### Python

```{python}
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor

bst_mod =  AdaBoostRegressor(
  # Default base estimator is DecisionTreeRegressor with max_depth = 3
  estimator = DecisionTreeRegressor(max_depth = 3),
  # Number of trees (to be tuned)
  n_estimators = 50, 
  # Learning rate (to be tuned)
  learning_rate = 1.0,
  random_state = 425
  )
```

:::

## Pipeline (Python) or workflow (R)

Here we bundle the preprocessing step (Python) or recipe (R) and model.

::: {.panel-tabset}
#### R
```{r}
gb_wf <- workflow() %>%
  add_recipe(gb_recipe) %>%
  add_model(gb_mod)
gb_wf
```

#### Python

```{python}
from sklearn.pipeline import Pipeline

pipe = Pipeline(steps = [
  ("model", bst_mod)
  ])
pipe
```

:::

## Tuning grid

Here we tune the number of trees `n_estimators` and the learning rate `learning_rate`.

::: {.panel-tabset}

#### R

Here we tune the number of trees `trees` and the number of features to use in each split `mtry`.

```{r}
param_grid <- grid_regular(
  tree_depth(range = c(1L, 4L)),
  learn_rate(range = c(-3, -0.5), trans = log10_trans()),
  levels = c(4, 10)
  )
param_grid
```

#### Python

```{python}
# Tune hyper-parameter(s)
d_grid = [
  DecisionTreeRegressor(max_depth = 1),
  DecisionTreeRegressor(max_depth = 2),
  DecisionTreeRegressor(max_depth = 3),
  DecisionTreeRegressor(max_depth = 4)
  ]
B_grid = [50, 100, 150, 200, 250, 300, 350, 400]
lambda_grid = [0.2, 0.4, 0.6, 0.8, 1.0]
tuned_parameters = {
  "model__estimator": d_grid,
  "model__n_estimators": B_grid,
  "model__learning_rate": lambda_grid
  }
tuned_parameters  
```

:::

## Cross-validation (CV)

::: {.panel-tabset}

#### R
Set cross-validation partitions.
```{r}
set.seed(203)

folds <- vfold_cv(Hitters_other, v = 5)
folds
```

Fit cross-validation.
```{r}
gb_fit <- gb_wf %>%
  tune_grid(
    resamples = folds,
    grid = param_grid,
    metrics = metric_set(rmse, rsq)
    )
gb_fit
```

Visualize CV results:
```{r}
gb_fit %>%
  collect_metrics() %>%
  print(width = Inf) %>%
  filter(.metric == "rmse") %>%
  ggplot(mapping = aes(x = learn_rate, y = mean, color = factor(tree_depth))) +
  geom_point() +
  geom_line() +
  labs(x = "Learning Rate", y = "CV AUC") +
  scale_x_log10()
```

Show the top 5 models.
```{r}
gb_fit %>%
  show_best(metric = "rmse")
```
Let's select the best model.
```{r}
best_gb <- gb_fit %>%
  select_best(metric = "rmse")
best_gb
```

#### Python

Set up CV partitions and CV criterion.

```{python}
from sklearn.model_selection import GridSearchCV

# Set up CV
n_folds = 6
search = GridSearchCV(
  pipe,
  tuned_parameters,
  cv = n_folds, 
  scoring = "neg_root_mean_squared_error",
  # Refit the best model on the whole data set
  refit = True
  )
```

Fit CV. This is typically the most time-consuming step.
```{python}
# Fit CV
search.fit(X_other, y_other)
```

Visualize CV results.
```{python}
#| eval: true
#| code-fold: true
cv_res = pd.DataFrame({
  "B": np.array(search.cv_results_["param_model__n_estimators"]),
  "rmse": -search.cv_results_["mean_test_score"],
  "lambda": search.cv_results_["param_model__learning_rate"],
  "depth": search.cv_results_["param_model__estimator"],
  })

plt.figure()
sns.relplot(
  # kind = "line",
  data = cv_res,
  x = "B",
  y = "rmse",
  hue = "lambda",
  style = "depth"
  ).set(
    xlabel = "B",
    ylabel = "CV RMSE"
);
plt.show()
```

Best CV RMSE:
```{python}
-search.best_score_
```


:::

## Finalize our model

Now we are done tuning. Finally, let’s fit this final model to the whole training data and use our test data to estimate the model performance we expect to see with new data.

::: {.panel-tabset}

#### R
```{r}
# Final workflow
final_wf <- gb_wf %>%
  finalize_workflow(best_gb)
final_wf
```

```{r}
# Fit the whole training set, then predict the test cases
final_fit <- 
  final_wf %>%
  last_fit(data_split)
final_fit
```

```{r}
# Test metrics
final_fit %>% 
  collect_metrics()
```

#### Python

Since we called `GridSearchCV` with `refit = True`, the best model fit on the whole non-test data is readily available.
```{python}
search.best_estimator_
```


The final prediction RMSE on the test set is
```{python}
from sklearn.metrics import mean_squared_error

mean_squared_error(
  y_test, 
  search.best_estimator_.predict(X_test)
  )
```

:::


## Visualize the final model
```{r}
#library(rpart.plot)
final_tree <- extract_workflow(final_fit)
final_tree


```

```{r}
library(vip)

final_tree %>% 
  extract_fit_parsnip() %>% 
  vip()
```