---
title: "Machine Learning Workflow: SVM with RBF Kernel (Heart Data)"
subtitle: "Biostat 203B"
author: "Dr. Hua Zhou @ UCLA"
date: "`r format(Sys.time(), '%d %B, %Y')`"
format:
  html:
    theme: cosmo
    embed-resources: true
    number-sections: true
    toc: true
    toc-depth: 4
    toc-location: left
    code-fold: false
engine: knitr
knitr:
  opts_chunk:
    fig.align: 'center'
    # fig.width: 6
    # fig.height: 4
    message: FALSE
    cache: false
---

Display system information for reproducibility.

::: {.panel-tabset}

#### R

```{r}
sessionInfo()
```

#### Python

```{python}
import IPython
print(IPython.sys_info())
```

:::

## Overview

![](https://www.tidymodels.org/start/resampling/img/resampling.svg)

We illustrate the typical machine learning workflow for support vector machines using the `Heart` data set. The outcome is `AHD` (`Yes` or `No`).

1. Initial splitting into test and non-test sets.

2. Pre-processing of data: dummy coding categorical variables, standardizing numerical variables, imputing missing values, ...

3. Tune the SVM algorithm using 5-fold cross-validation (CV) on the non-test data.

4. Choose the best model by CV and refit it on the whole non-test data.

5. Final classification on the test data.

## Heart data

The goal is to predict the binary outcome `AHD` (`Yes` or `No`) of patients.

::: {.panel-tabset}

#### R

```{r}
# Load libraries
library(GGally)
library(gtsummary)
library(kernlab)
library(tidyverse)
library(tidymodels)

# Load the `Heart.csv` data.
Heart <- read_csv("Heart.csv") %>%
  # first column is patient ID, which we don't need
  select(-1) %>%
  # RestECG is categorical with values 0, 1, 2
  mutate(RestECG = as.character(RestECG)) %>%
  print(width = Inf)

# Numerical summaries stratified by the outcome `AHD`.
Heart %>% tbl_summary(by = AHD)

# Graphical summary:
# Heart %>% ggpairs()
```

#### Python

```{python}
# Load the pandas library
import pandas as pd
# Load numpy for array manipulation
import numpy as np
# Load seaborn plotting library
import seaborn as sns
import matplotlib.pyplot as plt

# Set font sizes in plots
sns.set(font_scale = 1.2)
# Display all columns
pd.set_option('display.max_columns', None)

Heart = pd.read_csv("Heart.csv")
Heart
```

```{python}
# Numerical summaries
Heart.describe(include = 'all')
```

Graphical summary:

```{python}
#| eval: false
# Graphical summaries
plt.figure()
sns.pairplot(data = Heart);
plt.show()
```

:::

## Initial split into test and non-test sets

We randomly split the data into 25% test data and 75% non-test data, stratified on `AHD`.

::: {.panel-tabset}

#### R

```{r}
# For reproducibility
set.seed(203)

data_split <- initial_split(
  Heart,
  # stratify by AHD
  strata = "AHD",
  prop = 0.75
)
data_split

Heart_other <- training(data_split)
dim(Heart_other)

Heart_test <- testing(data_split)
dim(Heart_test)
```

#### Python

```{python}
from sklearn.model_selection import train_test_split

Heart_other, Heart_test = train_test_split(
  Heart,
  train_size = 0.75,
  random_state = 425, # seed
  stratify = Heart.AHD
)
Heart_test.shape
Heart_other.shape
```

Separate $X$ and $y$. We will use 13 features.

```{python}
num_features = ['Age', 'Sex', 'RestBP', 'Chol', 'Fbs', 'RestECG',
                'MaxHR', 'ExAng', 'Oldpeak', 'Slope', 'Ca']
cat_features = ['ChestPain', 'Thal']
features = np.concatenate([num_features, cat_features])

# Non-test X and y
X_other = Heart_other[features]
y_other = Heart_other.AHD == 'Yes'
# Test X and y
X_test = Heart_test[features]
y_test = Heart_test.AHD == 'Yes'
```

:::
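As a quick sanity check (not part of the original workflow), we can verify that stratification preserves the outcome proportions in both subsets. A minimal sketch in Python, assuming the `Heart_other` and `Heart_test` data frames created above:

```{python}
#| eval: false
# Proportion of AHD == 'Yes'/'No' in each subset; with stratified
# splitting these proportions should be nearly identical
print(Heart_other.AHD.value_counts(normalize = True))
print(Heart_test.AHD.value_counts(normalize = True))
```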
## Recipe (R) and Preprocessing (Python)

- A data dictionary (roughly) is at .

- We have the following features:
    - Numerical features: `Age`, `RestBP`, `Chol`, `Slope` (1, 2 or 3), `MaxHR`, `ExAng`, `Oldpeak`, `Ca` (0, 1, 2 or 3).
    - Categorical features coded as integer: `Sex` (0 or 1), `Fbs` (0 or 1), `RestECG` (0, 1 or 2).
    - Categorical features coded as string: `ChestPain`, `Thal`.

- There are missing values in `Ca` and `Thal`. Since the missing proportion is not high, we will use simple mean imputation (for the numerical feature `Ca`) and mode imputation (for the categorical feature `Thal`).

::: {.panel-tabset}

#### R

```{r}
svm_recipe <- recipe(
  AHD ~ .,
  data = Heart_other
) %>%
  # mean imputation for Ca
  step_impute_mean(Ca) %>%
  # mode imputation for Thal
  step_impute_mode(Thal) %>%
  # create traditional dummy variables (necessary for svm)
  step_dummy(all_nominal_predictors()) %>%
  # zero-variance filter
  step_zv(all_numeric_predictors()) %>%
  # center and scale numeric data
  step_normalize(all_numeric_predictors()) %>%
  # estimate the means and standard deviations
  prep(training = Heart_other, retain = TRUE)
svm_recipe
```

#### Python

There are missing values in the `Ca` (quantitative) and `Thal` (qualitative) variables. We are going to use simple `mean` imputation for `Ca` and `most_frequent` imputation for `Thal`. This is suboptimal; a better strategy is multiple imputation.

```{python}
# How many NaNs
Heart.isna().sum()
```

```{python}
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Transformer for categorical variables
categorical_tf = Pipeline(steps = [
  ("cat_impute", SimpleImputer(strategy = 'most_frequent')),
  ("encoder", OneHotEncoder(drop = 'first'))
])

# Transformer for continuous variables
numeric_tf = Pipeline(steps = [
  ("num_impute", SimpleImputer(strategy = 'mean')),
])

# Column transformer
col_tf = ColumnTransformer(transformers = [
  ('num', numeric_tf, num_features),
  ('cat', categorical_tf, cat_features)
])
```

:::

## Model

::: {.panel-tabset}

#### R

```{r}
svm_mod <- svm_rbf(
  mode = "classification",
  cost = tune(),
  rbf_sigma = tune()
) %>%
  set_engine("kernlab")
svm_mod
```

#### Python

```{python}
from sklearn.svm import SVC

svm_mod = SVC(
  C = 1.0,
  kernel = 'rbf',
  gamma = 'scale', # 1 / (n_features * X.var())
  probability = True,
  random_state = 425
)
```

:::
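For reference (this is the standard definition, not spelled out in the notes above), both engines use the same radial basis function kernel
$$
K(x, x') = \exp\left(-\gamma \|x - x'\|^2\right),
$$
where larger $\gamma$ yields a more flexible (wigglier) decision boundary. kernlab calls this parameter `sigma`, hence tidymodels' `rbf_sigma`, while scikit-learn calls it `gamma`; the two tuning parameters play the same role.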
## Workflow in R and pipeline in Python

Here we bundle the recipe (R) or preprocessing steps (Python) with the model.

::: {.panel-tabset}

#### R

```{r}
svm_wf <- workflow() %>%
  add_recipe(svm_recipe) %>%
  add_model(svm_mod)
svm_wf
```

#### Python

```{python}
from sklearn.pipeline import Pipeline

pipe = Pipeline(steps = [
  ("col_tf", col_tf),
  ("model", svm_mod)
])
pipe
```

:::

## Tuning grid

::: {.panel-tabset}

#### R

Here we tune the cost parameter `cost` and the radial scale `rbf_sigma`.

```{r}
param_grid <- grid_regular(
  cost(range = c(-8, 5)),
  rbf_sigma(range = c(-5, -3)),
  levels = c(14, 5)
)
param_grid
```

#### Python

Here we tune the regularization parameter `C` and the kernel width `gamma`.

```{python}
# Tune hyper-parameter(s)
C_grid = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
gamma_grid = np.logspace(start = -8, stop = -2, base = 10, num = 10)
tuned_parameters = {
  "model__C": C_grid,
  "model__gamma": gamma_grid
}
tuned_parameters
```

:::

## Cross-validation (CV)

::: {.panel-tabset}

#### R

Set cross-validation partitions.

```{r}
set.seed(203)

folds <- vfold_cv(Heart_other, v = 5)
folds
```

Fit cross-validation.

```{r}
svm_fit <- svm_wf %>%
  tune_grid(
    resamples = folds,
    grid = param_grid,
    metrics = metric_set(roc_auc, accuracy)
  )
svm_fit
```

Visualize CV results:

```{r}
svm_fit %>%
  collect_metrics() %>%
  print(width = Inf) %>%
  filter(.metric == "roc_auc") %>%
  ggplot(mapping = aes(x = cost, y = mean, alpha = rbf_sigma)) +
  geom_point() +
  labs(x = "Cost", y = "CV AUC") +
  scale_x_log10()
```

Show the top 5 models.

```{r}
svm_fit %>%
  show_best(metric = "roc_auc")
```

Let's select the best model.

```{r}
best_svm <- svm_fit %>%
  select_best(metric = "roc_auc")
best_svm
```

#### Python

Set up CV partitions and the CV criterion.

```{python}
from sklearn.model_selection import GridSearchCV

# Set up CV
n_folds = 5
search = GridSearchCV(
  pipe,
  tuned_parameters,
  cv = n_folds,
  scoring = "roc_auc",
  # Refit the best model on the whole data set
  refit = True
)
```

Fit CV. This is typically the most time-consuming step.

```{python}
# Fit CV
search.fit(X_other, y_other)
```

Visualize CV results.

```{python}
#| eval: true
#| code-fold: true
cv_res = pd.DataFrame({
  "C": np.array(search.cv_results_["param_model__C"]),
  "auc": search.cv_results_["mean_test_score"],
  "gamma": search.cv_results_["param_model__gamma"]
})

plt.figure()
sns.relplot(
  # kind = "line",
  data = cv_res,
  x = "gamma",
  y = "auc",
  hue = "C"
).set(
  xscale = "log",
  xlabel = "gamma",
  ylabel = "CV AUC"
);
plt.show()
```

Best CV AUC:

```{python}
search.best_score_
```

The training accuracy is

```{python}
from sklearn.metrics import accuracy_score, roc_auc_score

accuracy_score(
  y_other,
  search.best_estimator_.predict(X_other)
)
```

:::

## Finalize our model

Now we are done tuning. Finally, let's fit this final model to the whole training data and use our test data to estimate the model performance we expect to see with new data.

::: {.panel-tabset}

#### R

```{r}
# Final workflow
final_wf <- svm_wf %>%
  finalize_workflow(best_svm)
final_wf
```

```{r}
# Fit the whole training set, then predict the test cases
final_fit <-
  final_wf %>%
  last_fit(data_split)
final_fit
```

```{r}
# Test metrics
final_fit %>%
  collect_metrics()
```

#### Python

Since we called `GridSearchCV` with `refit = True`, the best model fit on the whole non-test data is readily available.

```{python}
search.best_estimator_
```

The final AUC on the test set is

```{python}
roc_auc_score(
  y_test,
  search.best_estimator_.predict_proba(X_test)[:, 1]
)
```

The final classification accuracy on the test set is

```{python}
accuracy_score(
  y_test,
  search.best_estimator_.predict(X_test)
)
```

:::
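Beyond overall accuracy and AUC, it is often informative to see where the classifier errs. A minimal sketch (not part of the original workflow), assuming the fitted `search` object and the test arrays from above:

```{python}
#| eval: false
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

# Rows are true classes (No, Yes); columns are predicted classes
cm = confusion_matrix(y_test, search.best_estimator_.predict(X_test))
print(cm)

# Graphical display of the same information
ConfusionMatrixDisplay(cm, display_labels = ['No', 'Yes']).plot()
plt.show()
```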