---
title: "Machine Learning Workflow: Lasso Regression"
subtitle: "Biostat 212A"
author: "Dr. Jin Zhou @ UCLA"
date: today
format:
  html:
    theme: cosmo
    embed-resources: true
    number-sections: true
    toc: true
    toc-depth: 4
    toc-location: left
    code-fold: false
engine: knitr
knitr:
  opts_chunk: 
    fig.align: 'center'
    # fig.width: 6
    # fig.height: 4
    message: FALSE
    cache: false
---

Display system information for reproducibility.

::: {.panel-tabset}


#### R

```{r}
sessionInfo()
```

#### Python

```{python}
import IPython
print(IPython.sys_info())
```


:::

## What is tidymodels?
The tidymodels framework is a package ecosystem, in which all steps of the machine learning workflow are implemented through dedicated R packages. The consistency of these packages ensures their interoperability and ease of use. Most importantly, the framework should make your machine learning workflow easier to understand and faster to implement.

Below you can see the basic machine learning workflow and how it maps to existing packages from tidymodels:

![](tidymodels_process.svg)

1. Preprocess: Transform and prepare data for modeling -> `recipes`

2. Model: Select and specify a model for a specific problem -> `parsnip`

3. Measure: Evaluate the performance of the model -> `yardstick`

4. Sample: Split and sample input data to evaluate models -> `rsample`

5. Tune: Adjust and improve the model on input data -> `tune`


## Overview

![](https://www.tidymodels.org/start/resampling/img/resampling.svg)


We illustrate the typical machine learning workflow for regression problems using the `Hitters` data set from R `ISLR2` package. The steps are

1. Initial splitting to test and non-test sets.

2. Pre-processing of data (pipeline in Python, recipe in R).

3. Choose a learner/method Lasso in this example.

4. Tune the hyper-parameter(s) ($\lambda$ in this example) using $K$-fold cross-validation (CV) on the non-test data.

5. Choose the best model by CV and refit it on the whole non-test data.

6. Final prediction on the test data.

These steps completes the process of training and evaluating **one** machine learning method (lasso in this case). We repeat the same process for other learners, e.g., random forest or neural network, using the same test/non-test and CV split. The final report compares the learners based on CV and test errors.

## Hitters data

![](https://cdn.shopify.com/s/files/1/1878/8625/products/5ebb9f1f-776a-4fc4-b010-8a4c5b844aa2_1.f33650b4d5ee5c4b2e3856ebcbedaf52_1024x1024@2x.jpg)

A documentation of the `Hitters` data is [here](https://www.rdocumentation.org/packages/ISLR2/versions/1.3-2/topics/Hitters). The goal is to predict the salary (at opening of 1987 season) of MLB players from their performance metrics in the 1986-7 season.

::: {.panel-tabset}
#### R

```{r}
library(GGally)
library(ISLR2)
library(tidymodels)
library(tidyverse)

Hitters <- as_tibble(Hitters) %>% print(width = Inf)

# Numerical summaries
summary(Hitters)

```
Graphical summary takes longer to run so suppressed here.
```{r}
#| eval: false
# Graphical summaries
ggpairs(
  data = Hitters, 
  mapping = aes(alpha = 0.25), 
  lower = list(continuous = "smooth")
  ) + 
  labs(title = "Hitters Data")
```

There are 59 `NA`s for the salary. Let's drop those cases. We are left with 263 data points.
```{r}
Hitters <- Hitters %>%
  drop_na()
dim(Hitters)
```

#### Python

```{python}
# Load the pandas library
import pandas as pd
# Load numpy for array manipulation
import numpy as np
# Load seaborn plotting library
import seaborn as sns
import matplotlib.pyplot as plt

# Set font sizes in plots
sns.set(font_scale = 2)
# Display all columns
pd.set_option('display.max_columns', None)

Hitters = pd.read_csv("../data/Hitters.csv")
Hitters
```

```{python}
# Numerical summaries
Hitters.info()
Hitters.describe()
```
Graphical summary takes longer to run so suppressed here.
```{python}
#| eval: false
# Graphical summaries
sns.pairplot(data = Hitters)
```

There are 59 `NA`s for the salary. Let's drop those cases. We are left with 263 data points.
```{python}
Hitters.dropna(inplace = True)
Hitters.shape
```


:::


## Initial split into test and non-test sets

::: {.panel-tabset}


#### R

```{r}
# For reproducibility
set.seed(425)
data_split <- initial_split(
  Hitters, 
  # # stratify by percentilesk
  # strata = "Salary", 
  prop = 0.75
  )

Hitters_other <- training(data_split)
dim(Hitters_other)
Hitters_test <- testing(data_split)
dim(Hitters_test)
```

#### Python

```{python}
from sklearn.model_selection import train_test_split

Hitters_other, Hitters_test = train_test_split(
  Hitters, 
  train_size = 0.75,
  random_state = 425, # seed
  )
Hitters_test.shape
Hitters_other.shape
```

Separate $X$ and $y$.
```{python}
# Non-test X and y
X_other = Hitters_other.drop('Salary', axis = 1)
y_other = Hitters_other.Salary
# Test X and y
X_test = Hitters_test.drop('Salary', axis = 1)
y_test = Hitters_test.Salary
```

:::

## Preprocessing (Python) or recipe (R)

For regularization methods such as ridge and lasso, it is essential to center and scale predictors. 

::: {.panel-tabset}


#### R

```{r}
norm_recipe <- 
  recipe(
    Salary ~ ., 
    data = Hitters_other
  ) %>%
  # create traditional dummy variables
  step_dummy(all_nominal()) %>%
  # zero-variance filter
  step_zv(all_predictors()) %>% 
  # center and scale numeric data
  step_normalize(all_predictors()) # %>%
  # step_log(Salary, base = 10) %>%
  # estimate the means and standard deviations
  # prep(training = Hitters_other, retain = TRUE)
norm_recipe
```

#### Python

Pre-processor for one-hot coding of categorical variables and then standardizing all numeric predictors.
```{python}
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import make_column_transformer
from sklearn.pipeline import Pipeline

# OHE transformer for categorical variables
cattf = make_column_transformer(
  (OneHotEncoder(drop = 'first'), ['League', 'Division', 'NewLeague']),
  remainder = 'passthrough'
)
# Standardization transformer
scalar = StandardScaler()
```


:::

## Model

- We use the `glmnet` package in R for ridge and lasso regression.
- `parsnip`: <https://parsnip.tidymodels.org/reference/linear_reg.html> for R.

::: {.panel-tabset}

We use lasso in this example. See tutorial here for `glmnet` in R: <https://glmnet.stanford.edu/articles/glmnet.html>


#### R

```{r}
lasso_mod <- 
  # mixture = 0 (ridge), mixture = 1 (lasso)
  linear_reg(penalty = tune(), mixture = 1.0) %>% 
  set_engine("glmnet")
lasso_mod
```

#### Python

```{python}
from sklearn.linear_model import Lasso

lasso = Lasso(max_iter = 10000)
lasso
```


:::

## Pipeline (Python) or workflow (R)

Here we bundle the preprocessing step (Python) or recipe (R) and model.

::: {.panel-tabset}


#### R

```{r}
lr_wf <- 
  workflow() %>%
  add_model(lasso_mod) %>%
  add_recipe(norm_recipe)
lr_wf
```

#### Python

```{python}
pipe = Pipeline(steps = [
  ("cat_tf", cattf),
  ("std_tf", scalar), 
  ("model", lasso)
  ])
pipe
```

:::

## Tuning grid

Set up the grid for tuning in the range of $10^{-2}-10^3$.

::: {.panel-tabset}


#### R
<https://dials.tidymodels.org/reference/dials-package.html>
```{r}
lambda_grid <-
  grid_regular(penalty(range = c(-2, 1.5), trans = log10_trans()), levels = 100)
lambda_grid
```

#### Python

```{python}
# Tune hyper-parameter(s)
alphas = np.logspace(start = -3, stop = 2, base = 10, num = 100)
tuned_parameters = {"model__alpha": alphas}
```

:::

## Cross-validation (CV)

::: {.panel-tabset}


#### R

Set cross-validation partitions.
```{r}
set.seed(250)
folds <- vfold_cv(Hitters_other, v = 10)
folds
```

Fit cross-validation.
```{r, warning=FALSE, message=F}
lasso_fit <- 
  lr_wf %>%
  tune_grid(
    resamples = folds,
    grid = lambda_grid)
lasso_fit
```

Visualize CV criterion.
```{r}
lasso_fit %>%
  collect_metrics() %>%
  print(width = Inf) %>%
  filter(.metric == "rmse") %>%
  ggplot(mapping = aes(x = penalty, y = mean)) + 
  geom_point() + 
  geom_line() + 
  labs(x = "Penalty", y = "CV RMSE") + 
  scale_x_log10(labels = scales::label_number())
```

Show the top 5 models ($\lambda$ values)
```{r}
lasso_fit %>%
  show_best(metric = "rmse")
```

Let's select the best model
```{r}
best_lasso <- lasso_fit %>%
  select_best(metric = "rmse")
best_lasso
```

#### Python

Set up CV partitions and CV criterion.
```{python}
from sklearn.model_selection import GridSearchCV

# Set up CV
n_folds = 10
search = GridSearchCV(
  pipe, 
  tuned_parameters, 
  cv = n_folds, 
  scoring = "neg_root_mean_squared_error",
  # Refit the best model on the whole data set
  refit = True 
  )
```

Fit CV. This is typically the most time-consuming step.
```{python}
# Fit CV
search.fit(X_other, y_other)
```

CV results.
```{python}
cv_res = pd.DataFrame({
  "alpha": alphas,
  "rmse": -search.cv_results_["mean_test_score"]
  })

plt.figure()
sns.relplot(
  data = cv_res,
  x = "alpha",
  y = "rmse"
  ).set(
    xlabel = "alpha",
    ylabel = "CV RMSE",
    xscale = "log"
);
plt.show()
```

Best CV RMSE:
```{python}
-search.best_score_
```

:::

## Finalize our model

Now we are done tuning. Finally, let’s fit this final model to the whole training data and use our test data to estimate the model performance we expect to see with new data.

::: {.panel-tabset}


#### R

```{r}
# Final workflow
final_wf <- lr_wf %>%
  finalize_workflow(best_lasso)
final_wf

# Fit the whole training set, then predict the test cases
final_fit <- 
  final_wf %>%
  last_fit(data_split)
final_fit

# Test metrics
final_fit %>% collect_metrics()
```

#### Python

Since we called `GridSearchCV` with `refit = True`, the best model fit on the whole non-test data is readily available.
```{python}
search.best_estimator_
```

The final prediction RMSE on the test set is
```{python}
from sklearn.metrics import mean_squared_error

mean_squared_error(y_test, search.best_estimator_.predict(X_test))
```

Test RMSE seems to be a bit off.
:::