---
title: "Machine Learning Workflow: SVM with Polynomial Kernel (Heart Data)"
subtitle: "Biostat 212A"
author: "Dr. Jin Zhou @ UCLA"
date: "`r format(Sys.time(), '%d %B, %Y')`"
format:
  html:
    theme: cosmo
    embed-resources: true
    number-sections: true
    toc: true
    toc-depth: 4
    toc-location: left
    code-fold: false
engine: knitr
knitr:
  opts_chunk: 
    fig.align: 'center'
    # fig.width: 6
    # fig.height: 4
    message: FALSE
    cache: false
---

Display system information for reproducibility.

::: panel-tabset
#### R

```{r}
sessionInfo()
```


:::

## Overview

![](https://www.tidymodels.org/start/resampling/img/resampling.svg)

We illustrate the typical machine learning workflow for support vector machines using the `Heart` data set. The outcome is `AHD` (`Yes` or `No`).

1.  Initial splitting to test and non-test sets.

2.  Pre-processing of data: dummy coding categorical variables, standardizing numerical variables, imputing missing values, ...

3.  Tune the SVM algorithm using 5-fold cross-validation (CV) on the non-test data.

4.  Choose the best model by CV and refit it on the whole non-test data.

5.  Final classification on the test data.

## Heart data

The goal is to predict the binary outcome `AHD` (`Yes` or `No`) of patients.

::: panel-tabset
#### R

```{r}
# Load libraries
library(GGally)
library(gtsummary)
library(kernlab)
library(tidyverse)
library(tidymodels)

# Load the `Heart.csv` data.
Heart <- read_csv("../data/Heart.csv") %>% 
  # first column is patient ID, which we don't need
  select(-1) %>%
  # RestECG is categorical with value 0, 1, 2
  mutate(RestECG = as.character(RestECG)) %>%
  print(width = Inf)

# Numerical summaries stratified by the outcome `AHD`.
Heart %>% tbl_summary(by = AHD)

# Graphical summary:
# Heart %>% ggpairs()
```
:::

## Initial split into test and non-test sets

We randomly split the data into 25% test data and 75% non-test data. Stratify on `AHD`.

::: panel-tabset
#### R

```{r}
# For reproducibility
set.seed(203)

data_split <- initial_split(
  Heart, 
  # stratify by AHD
  strata = "AHD", 
  prop = 0.75
  )
data_split

Heart_other <- training(data_split)
dim(Heart_other)

Heart_test <- testing(data_split)
dim(Heart_test)
```


:::

## Recipe (R) and Preprocessing (Python)

-   A data dictionary (roughly) is at <https://keras.io/examples/structured_data/structured_data_classification_with_feature_space/>.

-   We have following features:

    -   Numerical features: `Age`, `RestBP`, `Chol`, `Slope` (1, 2 or 3), `MaxHR`, `ExAng`, `Oldpeak`, `Ca` (0, 1, 2 or 3).

    -   Categorical features coded as integer: `Sex` (0 or 1), `Fbs` (0 or 1), `RestECG` (0, 1 or 2).

    -   Categorical features coded as string: `ChestPain`, `Thal`.

-   There are missing values in `Ca` and `Thal`. Since missing proportion is not high, we will use simple mean (for numerical feature `Ca`) and mode (for categorical feature `Thal`) imputation.

::: panel-tabset
#### R

```{r}
svm_recipe <- 
  recipe(
    AHD ~ ., 
    data = Heart_other
  ) %>%
  # mean imputation for Ca
  step_impute_mean(Ca) %>%
  # mode imputation for Thal
  step_impute_mode(Thal) %>%
  # create traditional dummy variables (necessary for svm)
  step_dummy(all_nominal_predictors()) %>%
  # zero-variance filter
  step_zv(all_numeric_predictors()) %>% 
  # center and scale numeric data
  step_normalize(all_numeric_predictors()) # %>%
  # estimate the means and standard deviations
  # prep(training = Heart_other, retain = TRUE)
svm_recipe
```


:::

## Model

::: panel-tabset
#### R

```{r}
svm_mod <- 
  svm_poly(
    mode = "classification",
    cost = tune(),
    degree = tune(),
    # scale_factor = tune()
  ) %>% 
  set_engine("kernlab")
svm_mod
```
:::

## Workflow in R and pipeline in Python

Here we bundle the preprocessing step (Python) or recipe (R) and model.

::: panel-tabset
#### R

```{r}
svm_wf <- workflow() %>%
  add_recipe(svm_recipe) %>%
  add_model(svm_mod)
svm_wf
```
:::

## Tuning grid

::: panel-tabset
#### R

Here we tune the `cost` and radial scale `rbf_sigma`.

```{r}
param_grid <- grid_regular(
  cost(range = c(-3, 2)),
  degree(range = c(1, 5)),
  #scale_factor(range = c(-1, 1)),
  levels = c(5)
  )
param_grid
```
:::

## Cross-validation (CV)

::: panel-tabset
#### R

Set cross-validation partitions.

```{r}
set.seed(203)

folds <- vfold_cv(Heart_other, v = 5)
folds
```

Fit cross-validation.

```{r}
svm_fit <- svm_wf %>%
  tune_grid(
    resamples = folds,
    grid = param_grid,
    metrics = metric_set(roc_auc, accuracy)
    )
svm_fit
```

Visualize CV results:

```{r}
svm_fit %>%
  collect_metrics() %>%
  print(width = Inf) %>%
  filter(.metric == "roc_auc" ) %>%
  ggplot(mapping = aes(x = degree, y = mean)) +
  geom_point() +
  geom_line() +
  labs(x = "Cost", y = "CV AUC") +
  scale_x_log10()
```

Show the top 5 models.

```{r}
svm_fit %>%
  show_best(metric = "roc_auc")
```

Let's select the best model.

```{r}
best_svm <- svm_fit %>%
  select_best(metric ="roc_auc")
best_svm
```
:::

## Finalize our model

Now we are done tuning. Finally, let’s fit this final model to the whole training data and use our test data to estimate the model performance we expect to see with new data.

::: panel-tabset
#### R

```{r}
# Final workflow
final_wf <- svm_wf %>%
  finalize_workflow(best_svm)
final_wf
```

```{r}
# Fit the whole training set, then predict the test cases
final_fit <- 
  final_wf %>%
  last_fit(data_split)
final_fit
```

```{r}
# Test metrics
final_fit %>% 
  collect_metrics()
```


:::

```{r, eval = T}
library(doParallel)
set.seed(101)
split_obj <- initial_split(data = Heart, prop = 0.7, strata = AHD)
train <- training(split_obj)
test <- testing(split_obj)


# Create the recipe
recipe(AHD ~ ., data = train) %>%
  # mean imputation for Ca
  step_impute_mean(Ca) %>%
  # mode imputation for Thal
  step_impute_mode(Thal) %>%
  # create traditional dummy variables (necessary for svm)
  step_dummy(all_nominal_predictors()) %>%
  # zero-variance filter
  step_zv(all_numeric_predictors()) %>% 
  # center and scale numeric data
  step_normalize(all_numeric_predictors()) %>%
  # estimate the means and standard deviations
  prep() -> recipe_obj

# Bake
train <- bake(recipe_obj, new_data=train)
test <- bake(recipe_obj, new_data=test)
```

```{r}
library(vip)
final_fit %>% 
  pluck(".workflow", 1) %>%   
  pull_workflow_fit() %>% 
  # vip(method = "permute", train= Heart)
  vip(method = "permute", 
      target = "AHD", metric = "accuracy",
      pred_wrapper = kernlab::predict, train = train)
```


```{r}

svm_rbf_spec <- svm_rbf() %>%
  set_mode("classification") %>%
  set_engine("kernlab")

svm_rbf_fit <- svm_rbf_spec %>%
  fit(AHD ~ ., data = train[, c('Ca', 'ExAng', 'AHD')])

svm_rbf_fit %>%
  extract_fit_engine() %>%
  plot()
```