---
title: "Machine Learning Workflow: Random Forest for Classification (Heart Data)"
subtitle: "Biostat 203B"
author: "Dr. Hua Zhou @ UCLA"
date: "`r format(Sys.time(), '%d %B, %Y')`"
format:
  html:
    theme: cosmo
    embed-resources: true
    number-sections: true
    toc: true
    toc-depth: 4
    toc-location: left
    code-fold: false
engine: knitr
knitr:
  opts_chunk: 
    fig.align: 'center'
    # fig.width: 6
    # fig.height: 4
    message: FALSE
    cache: false
---

Display system information for reproducibility.

::: {.panel-tabset}

#### R

```{r}
sessionInfo()
```

#### Python

```{python}
import IPython
print(IPython.sys_info())
```

:::

## Overview

![](https://www.tidymodels.org/start/resampling/img/resampling.svg)


We illustrate the typical machine learning workflow for random forest using the `Heart` data set. The outcome is `AHD` (`Yes` or `No`).

1. Initial splitting to test and non-test sets.

2. Pre-processing of data: dummy coding categorical variables, standardizing numerical variables, imputing missing values, ...

3. Tune the random forest using 5-fold cross-validation (CV) on the non-test data.

4. Choose the best model by CV and refit it on the whole non-test data.

5. Final classification on the test data.

## Heart data

The goal is to predict the binary outcome `AHD` (`Yes` or `No`) of patients.

::: {.panel-tabset}

#### R

```{r}
# Load libraries
library(GGally)
library(gtsummary)
library(ranger)
library(tidyverse)
library(tidymodels)

# Load the `Heart.csv` data.
Heart <- read_csv("Heart.csv") %>% 
  # first column is patient ID, which we don't need
  select(-1) %>%
  # RestECG is categorical with value 0, 1, 2
  mutate(RestECG = as.factor(RestECG)) %>%
  print(width = Inf)

# Numerical summaries stratified by the outcome `AHD`.
Heart %>% tbl_summary(by = AHD)

# Graphical summary:
# Heart %>% ggpairs()
```

#### Python

```{python}
# Load the pandas library
import pandas as pd
# Load numpy for array manipulation
import numpy as np
# Load seaborn plotting library
import seaborn as sns
import matplotlib.pyplot as plt

# Set font sizes in plots
sns.set(font_scale = 1.2)
# Display all columns
pd.set_option('display.max_columns', None)

Heart = pd.read_csv("Heart.csv")
Heart
```

```{python}
# Numerical summaries
Heart.describe(include = 'all')
```

Graphical summary:
```{python}
#| eval: false
# Graphical summaries
plt.figure()
sns.pairplot(data = Heart);
plt.show()
```

:::

## Initial split into test and non-test sets

We randomly split the data into 25% test data and 75% non-test data. Stratify on `AHD`.

::: {.panel-tabset}

#### R

```{r}
# For reproducibility
set.seed(203)

data_split <- initial_split(
  Heart, 
  # stratify by AHD
  strata = "AHD", 
  prop = 0.75
  )
data_split

Heart_other <- training(data_split)
dim(Heart_other)

Heart_test <- testing(data_split)
dim(Heart_test)
```

#### Python

```{python}
from sklearn.model_selection import train_test_split

Heart_other, Heart_test = train_test_split(
  Heart, 
  train_size = 0.75,
  random_state = 425, # seed
  stratify = Heart.AHD
  )
Heart_test.shape
Heart_other.shape
```

Separate $X$ and $y$. We will use 13 features.
```{python}
num_features = ['Age', 'Sex', 'RestBP', 'Chol', 'Fbs', 'RestECG', 'MaxHR', 'ExAng', 'Oldpeak', 'Slope', 'Ca']
cat_features = ['ChestPain', 'Thal']
features = np.concatenate([num_features, cat_features])
# Non-test X and y
X_other = Heart_other[features]
y_other = Heart_other.AHD
# Test X and y
X_test = Heart_test[features]
y_test = Heart_test.AHD
```

:::

## Recipe (R) and Preprocessing (Python)

- A data dictionary (roughly) is at <https://keras.io/examples/structured_data/structured_data_classification_with_feature_space/>.

- We have following features: 

    - Numerical features: `Age`, `RestBP`, `Chol`, `Slope` (1, 2 or 3), `MaxHR`, `ExAng`, `Oldpeak`, `Ca` (0, 1, 2 or 3).

    - Categorical features coded as integer: `Sex` (0 or 1), `Fbs` (0 or 1), `RestECG` (0, 1 or 2).

    - Categorical features coded as string: `ChestPain`, `Thal`

- There are missing values in `Ca` and `Thal`. Since missing proportion is not high, we will use simple mean (for numerical feature `Ca`) and mode (for categorical feature `Thal`) imputation.

::: {.panel-tabset}

#### R

```{r}
rf_recipe <- 
  recipe(
    AHD ~ ., 
    data = Heart_other
  ) %>%
  # # create traditional dummy variables (not necessary for random forest in R)
  # step_dummy(all_nominal()) %>%
  # mean imputation for Ca
  step_impute_mean(Ca) %>%
  # mode imputation for Thal
  step_impute_mode(Thal) %>%
  # zero-variance filter
  step_zv(all_numeric_predictors()) %>% 
  # # center and scale numeric data (not necessary for random forest)
  # step_normalize(all_numeric_predictors()) %>%
  # estimate the means and standard deviations
  prep(training = Heart_other, retain = TRUE)
rf_recipe
```

#### Python

There are missing values in `Ca` (quantitative) and `Thal` (qualitative) variables. We are going to use simple `mean` imputation for `Ca` and `most_frequent` imputation for `Thal`. This is suboptimal. Better strategy is to use multiple imputation.
```{python}
# How many NaNs
Heart.isna().sum()
```

In principle, decision trees should be able to handle categorical predictors. However scikit-learn and xgboost implementations don't allow categorical predictors and require one-hot encoding.

```{python}
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Transformer for categorical variables
categorical_tf = Pipeline(steps = [
  ("cat_impute", SimpleImputer(strategy = 'most_frequent')),
  ("encoder", OneHotEncoder())
])

# Transformer for continuous variables
numeric_tf = Pipeline(steps = [
  ("num_impute", SimpleImputer(strategy = 'mean')),
])

# Column transformer
col_tf = ColumnTransformer(transformers = [
  ('num', numeric_tf, num_features),
  ('cat', categorical_tf, cat_features)
])
```

:::

## Model

::: {.panel-tabset}

#### R

```{r}
rf_mod <- 
  rand_forest(
    mode = "classification",
    # Number of predictors randomly sampled in each split
    mtry = tune(),
    # Number of trees in ensemble
    trees = tune()
  ) %>% 
  set_engine("ranger")
rf_mod
```

#### Python

```{python}
from sklearn.ensemble import RandomForestClassifier

rf_mod = RandomForestClassifier(
  # Number of trees
  n_estimators = 100, 
  criterion = 'gini',
  # Number of features to use in each split
  max_features = 'sqrt',
  oob_score = True,
  random_state = 425
  )
```

:::

## Workflow in R and pipeline in Python

Here we bundle the preprocessing step (Python) or recipe (R) and model.

::: {.panel-tabset}

#### R

```{r}
rf_wf <- workflow() %>%
  add_recipe(rf_recipe) %>%
  add_model(rf_mod)
rf_wf
```

#### Python

```{python}
from sklearn.pipeline import Pipeline

pipe = Pipeline(steps = [
  ("col_tf", col_tf),
  ("model", rf_mod)
  ])
pipe
```

:::

## Tuning grid

::: {.panel-tabset}

#### R

Here we tune the number of trees `trees` and the number of features to use in each split `mtry`.

```{r}
param_grid <- grid_regular(
  trees(range = c(100L, 300L)), 
  mtry(range = c(1L, 5L)),
  levels = c(3, 5)
  )
param_grid
```

#### Python

In general, it's not necessary to tune a random forest. Using the default of `n_estimators=100` and `max_features=1.0` (bagging) or `max_features='sqrt'` works well.

Here we tune the number of trees `n_estimators` and the number of features to use in each split `max_features`.

```{python}
# Tune hyper-parameter(s)
B_grid = [50, 100, 150, 200, 250, 300]
m_grid = ['sqrt', 'log2', 1.0] # max_features = 1.0 uses all features
tuned_parameters = {
  "model__n_estimators": B_grid,
  "model__max_features": m_grid
  }
tuned_parameters 
```

:::

## Cross-validation (CV)

::: {.panel-tabset}

#### R

Set cross-validation partitions.
```{r}
set.seed(203)

folds <- vfold_cv(Heart_other, v = 5)
folds
```

Fit cross-validation.
```{r}
rf_fit <- rf_wf %>%
  tune_grid(
    resamples = folds,
    grid = param_grid,
    metrics = metric_set(roc_auc, accuracy)
    )
rf_fit
```

Visualize CV results:
```{r}
rf_fit %>%
  collect_metrics() %>%
  print(width = Inf) %>%
  filter(.metric == "roc_auc") %>%
  ggplot(mapping = aes(x = trees, y = mean, color = mtry)) +
  geom_point() + 
  # geom_line() + 
  labs(x = "Num. of Trees", y = "CV AUC")
```

Show the top 5 models.
```{r}
rf_fit %>%
  show_best("roc_auc")
```
Let's select the best model.
```{r}
best_rf <- rf_fit %>%
  select_best("roc_auc")
best_rf
```

#### Python

Set up CV partitions and CV criterion.

```{python}
from sklearn.model_selection import GridSearchCV

# Set up CV
n_folds = 5
search = GridSearchCV(
  pipe,
  tuned_parameters,
  cv = n_folds, 
  scoring = "roc_auc",
  # Refit the best model on the whole data set
  refit = True
  )
```

Fit CV. This is typically the most time-consuming step.
```{python}
# Fit CV
search.fit(X_other, y_other)
```

Visualize CV results.
```{python}
#| eval: true
#| code-fold: true
cv_res = pd.DataFrame({
  "B": np.array(search.cv_results_["param_model__n_estimators"]),
  "auc": search.cv_results_["mean_test_score"],
  "m": search.cv_results_["param_model__max_features"]
  })

plt.figure()
sns.relplot(
  # kind = "line",
  data = cv_res,
  x = "B",
  y = "auc",
  hue = "m"
  ).set(
    xlabel = "B",
    ylabel = "CV AUC"
);
plt.show()
```

Best CV AUC:
```{python}
search.best_score_
```

The training accuracy is
```{python}
from sklearn.metrics import accuracy_score, roc_auc_score

accuracy_score(
  y_other,
  search.best_estimator_.predict(X_other)
  )
```

:::

## Finalize our model

Now we are done tuning. Finally, let’s fit this final model to the whole training data and use our test data to estimate the model performance we expect to see with new data.

::: {.panel-tabset}

#### R

```{r}
# Final workflow
final_wf <- rf_wf %>%
  finalize_workflow(best_rf)
final_wf
```

```{r}
# Fit the whole training set, then predict the test cases
final_fit <- 
  final_wf %>%
  last_fit(data_split)
final_fit
```

```{r}
# Test metrics
final_fit %>% 
  collect_metrics()
```

#### Python

Since we called `GridSearchCV` with `refit = True`, the best model fit on the whole non-test data is readily available.
```{python}
search.best_estimator_
```

Feature importances:
```{python}
features = np.concatenate([
    features[:-2], 
    ['ChestPain:asymptomatic', 'ChestPain:nonanginal', 'ChestPain:nontypical', 'ChestPain:typical'],
    ['Thal:fixed', 'Thal:normal', 'Thal:reversable']
    ])

vi_df = pd.DataFrame({
  "feature": features,
  "vi": search.best_estimator_['model'].feature_importances_,
  "vi_std": np.std([tree.feature_importances_ for tree in search.best_estimator_['model'].estimators_], axis = 0)
  })

plt.figure()
vi_df.plot.bar(x = "feature", y = "vi", yerr = "vi_std")
plt.xticks(rotation = 90);
plt.show()
```

The final AUC on the test set is
```{python}
roc_auc_score(
  y_test,
  search.best_estimator_.predict_proba(X_test)[:, 1]
  )
```

The final classification accuracy on the test set is
```{python}
accuracy_score(
  y_test,
  search.best_estimator_.predict(X_test)
  )
```

:::