--- title: "Machine Learning Workflow: Classifiers With Nonlinear Features" subtitle: "Econ 425T" author: "Dr. Hua Zhou @ UCLA" date: "`r format(Sys.time(), '%d %B, %Y')`" format: html: theme: cosmo number-sections: true toc: true toc-depth: 4 toc-location: left code-fold: false engine: knitr knitr: opts_chunk: fig.align: 'center' # fig.width: 6 # fig.height: 4 message: FALSE cache: false --- Display system information for reproducibility. ::: {.panel-tabset} #### Python ```{python} import IPython print(IPython.sys_info()) ``` #### R ```{r} sessionInfo() ``` ::: ## Overview ![](https://www.tidymodels.org/start/resampling/img/resampling.svg) We illustrate the typical machine learning workflow for regression problems using the `Default` data set from R `ISLR2` package. Our goal is to classify whether a credit card customer will default or not. The steps are 1. Initial splitting to test and non-test sets. 2. Pre-processing of data: one-hot-encoder for categorical variables, add **nonlinear features** (B-splines) for some continuous predictors. 3. Choose **a set of candidate classifiers**: logistic regression, LDA, QDA, NB, KNN. 4. Tune the hyper-parameter(s) (`n_knots` in SplineTransformer, classifier) using 10-fold cross-validation (CV) on the non-test data. 5. Choose the best model by CV and refit it on the whole non-test data. 6. Final classification on the test data. ## Default data A documentation of the `Default` data is [here](https://www.rdocumentation.org/packages/ISLR2/versions/1.3-2/topics/Default). The goal is to classify whether a credit card customer will default or not. ::: {.panel-tabset} #### Python ```{python} # Load the pandas library import pandas as pd # Load numpy for array manipulation import numpy as np # Load seaborn plotting library import seaborn as sns import matplotlib.pyplot as plt # Set font sizes in plots sns.set(font_scale = 1.2) # Display all columns pd.set_option('display.max_columns', None) Default = pd.read_csv("../data/Default.csv") Default ``` ```{python} # Numerical summaries Default.describe() ``` Graphical summary: ```{python} #| eval: true # Graphical summaries plt.figure() sns.pairplot(data = Default); plt.show() ``` #### R ```{r} #| eval: true library(GGally) library(ISLR2) library(tidymodels) library(tidyverse) Default <- as_tibble(Default) %>% print(width = Inf) # Numerical summaries summary(Default) ``` Graphical summary. ```{r} #| eval: false # Graphical summaries ggpairs( data = Default, mapping = aes(alpha = 0.25), lower = list(continuous = "smooth") ) + labs(title = "Default Data") ``` ::: ## Initial split into test and non-test sets It is a good idea to keep the proportions of `default = 'Yes'` to be roughly same between test and non-test data. ::: {.panel-tabset} #### Python ```{python} from sklearn.model_selection import train_test_split Default_other, Default_test = train_test_split( Default, train_size = 0.75, random_state = 425, # seed stratify = Default.default ) Default_test.shape Default_other.shape ``` Separate $X$ and $y$. ```{python} # Non-test X and y X_other = Default_other[['balance', 'income', 'student']] y_other = Default_other.default # Test X and y X_test = Default_test[['balance', 'income', 'student']] y_test = Default_test.default ``` #### R ```{r} #| eval: false # For reproducibility set.seed(425) data_split <- initial_split( Wage, # # stratify by percentiles # strata = "Salary", prop = 0.75 ) Wage_other <- training(data_split) dim(Wage_other) Wage_test <- testing(data_split) dim(Wage_test) ``` ::: ## Preprocessing (Python) or recipe (R) ::: {.panel-tabset} #### Python Pre-processor for one-hot coding of categorical variables and then standardizing all numeric predictors. ```{python} from sklearn.preprocessing import OneHotEncoder, StandardScaler, SplineTransformer from sklearn.compose import ColumnTransformer from sklearn.pipeline import Pipeline # Transformer for numeric variables numeric_tf = Pipeline(steps = [ # ("scalar", StandardScaler()), ("bspline", SplineTransformer(degree = 3, extrapolation = 'linear')) ]) # Transformer for categorical variables categorical_tf = Pipeline(steps = [ ("encoder", OneHotEncoder(drop = 'first')) ]) # Column transformer col_tf = ColumnTransformer(transformers = [ ('num', numeric_tf, ['balance', 'income']), ('cat', categorical_tf, ['student']) ]) ``` #### R ```{r} #| eval: false norm_recipe <- recipe( wage ~ year + age + education, data = Wage_other ) %>% # create traditional dummy variables step_dummy(all_nominal()) %>% # zero-variance filter step_zv(all_predictors()) %>% # B-splines of age step_bs(age, deg_free = 5) %>% # B-splines of year step_bs(year, deg_free = 4) %>% # center and scale numeric data step_normalize(all_predictors()) %>% # estimate the means and standard deviations prep(training = Wage_other, retain = TRUE) norm_recipe ``` ::: ## Model Let's skip this step for now, because the model (or classifer) itself is being tuned by CV. ::: {.panel-tabset} #### Python #### R ```{r} #| eval: false enet_mod <- # mixture = 0 (ridge), mixture = 1 (lasso) linear_reg(penalty = tune(), mixture = tune()) %>% set_engine("glmnet") enet_mod ``` ::: ## Pipeline (Python) or workflow (R) Here we bundle the preprocessing step (Python) or recipe (R) and model. Again remember KNN is a placeholder here. Later we will choose the better classifier according to cross validation. ::: {.panel-tabset} #### Python ```{python} from sklearn.neighbors import KNeighborsClassifier pipe = Pipeline(steps = [ ("col_tf", col_tf), ("model", KNeighborsClassifier()) ]) pipe ``` #### R ```{r} #| eval: false lr_wf <- workflow() %>% add_model(enet_mod) %>% add_recipe(norm_recipe) lr_wf ``` ::: ## Tuning grid Set up the 2D grid for tuning. ::: {.panel-tabset} #### Python ```{python} from sklearn.linear_model import LogisticRegression from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis from sklearn.naive_bayes import GaussianNB # Tune hyper-parameter(s) nknots_grid = np.array(range(2, 6)) classifers = [ LogisticRegression(), LinearDiscriminantAnalysis(), QuadraticDiscriminantAnalysis(), GaussianNB(), KNeighborsClassifier(n_neighbors = 5) ] tuned_parameters = { "col_tf__num__bspline__n_knots": nknots_grid, "model": classifers } tuned_parameters ``` #### R ```{r} #| eval: false param_grid <-grid_regular( penalty(range = c(-5, 0), trans = log10_trans()), mixture(range = c(0, 1)), levels = c(penalty = 50, mixture = 6) ) param_grid ``` ::: ## Cross-validation (CV) ::: {.panel-tabset} #### Python Set up CV partitions and CV criterion. Again it's a good idea to keep the case proportions roughly same between splits. According to the `GridSearchCV` [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html), `StratifiedKFold` is used automatically. ```{python} from sklearn.model_selection import GridSearchCV # Set up CV n_folds = 10 search = GridSearchCV( pipe, tuned_parameters, cv = n_folds, scoring = "roc_auc", # Refit the best model on the whole data set refit = True ) ``` Fit CV. This is typically the most time-consuming step. ```{python} # Fit CV search.fit(X_other, y_other) ``` Visualize CV results. ```{python} #| eval: true #| code-fold: true cv_res = pd.DataFrame({ "nknots": np.array(search.cv_results_["param_col_tf__num__bspline__n_knots"]), "classifier": np.array(search.cv_results_["param_model"]), "aucroc": search.cv_results_["mean_test_score"] }) plt.figure() sns.relplot( kind = "line", data = cv_res, x = "nknots", y = "aucroc", hue = "classifier" ).set( xlabel = "Number of Knots", ylabel = "CV AUC" ); plt.show() ``` Best CV AUC-ROC: ```{python} search.best_score_ ``` #### R Set cross-validation partitions. ```{r} #| eval: false set.seed(250) folds <- vfold_cv(Wage_other, v = 10) folds ``` Fit cross-validation. ```{r} #| eval: false enet_fit <- lr_wf %>% tune_grid( resamples = folds, grid = param_grid, ) enet_fit ``` Visualize CV criterion. ```{r} #| eval: false enet_fit %>% collect_metrics() %>% print(width = Inf) %>% filter(.metric == "rmse") %>% ggplot(mapping = aes(x = penalty, y = mean)) + geom_point() + geom_line(aes(group = mixture)) + labs(x = "Penalty", y = "CV RMSE") + scale_x_log10(labels = scales::label_number()) ``` Show the top 5 models ($\lambda$ values) ```{r} #| eval: false enet_fit %>% show_best("rmse") ``` Let's select the best model ```{r} #| eval: false best_enet <- enet_fit %>% select_best("rmse") best_enet ``` ::: ## Finalize our model Now we are done tuning. Finally, let’s fit this final model to the whole training data and use our test data to estimate the model performance we expect to see with new data. ::: {.panel-tabset} #### Python Since we called `GridSearchCV` with `refit = True`, the best model fit on the whole non-test data is readily available. ```{python} search.best_estimator_ ``` The final AUC on the test set is ```{python} from sklearn.metrics import roc_auc_score roc_auc_score( y_test, search.best_estimator_.predict_proba(X_test)[:, 1] ) ``` #### R ```{r} #| eval: false # Final workflow final_wf <- lr_wf %>% finalize_workflow(best_enet) final_wf # Fit the whole training set, then predict the test cases final_fit <- final_wf %>% last_fit(data_split) final_fit # Test metrics final_fit %>% collect_metrics() ``` :::