Statistical learning: classification and cross-validation

MACS 30500 University of Chicago

Should I Have a Cookie?

Interpreting a decision tree

A more complex tree

An even more complex tree

Benefits/drawbacks to decision trees

  • Easy to explain
  • Easy to interpret/visualize
  • Good for qualitative predictors
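  • Lower accuracy rates than other statistical learning methods
  • Non-robust: small changes in the data can produce a very different tree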

Random forests

Sampling with replacement

## [1] 24.23151

LOOCV in linear regression

LOOCV in classification

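library(modelr)  # crossv_kfold()
library(purrr)   # map(), map2_dbl()

# LOOCV is k-fold CV with one fold per observation (k = n)
titanic_loocv <- crossv_kfold(titanic, k = nrow(titanic))
# fit one logistic regression per training split
titanic_models <- map(titanic_loocv$train, ~ glm(Survived ~ Age * Sex,
                                                 data = .,
                                                 family = binomial))
# mse.glm() is not defined in this section; assumed to be a helper returning test-set MSE
titanic_mse <- map2_dbl(titanic_models, titanic_loocv$test, mse.glm)
mean(titanic_mse, na.rm = TRUE)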
## [1] 0.1703518
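
The code above calls `mse.glm()`, which is not defined in this section. One plausible shape for such a helper is sketched below: it compares the observed 0/1 outcome with the predicted probability and averages the squared differences. The name `mse_glm_sketch` is hypothetical, the response is assumed to be coded as numeric 0/1, and the built-in `mtcars` data stands in for `titanic`, so this is an illustration rather than the course's actual function.

```r
# Hypothetical stand-in for mse.glm(): mean squared error between the
# observed 0/1 response and the predicted probability from a binomial glm.
mse_glm_sketch <- function(model, data) {
  # as.data.frame() also converts modelr resample objects when modelr is loaded
  data <- as.data.frame(data)
  # the first variable in the model formula is the response
  response <- all.vars(formula(model))[1]
  observed <- data[[response]]
  # type = "response" returns probabilities rather than log-odds
  predicted <- predict(model, newdata = data, type = "response")
  # rows with missing predictors yield NA predictions and are dropped here
  mean((observed - predicted)^2, na.rm = TRUE)
}

# Usage on built-in data: transmission type (am, coded 0/1) modeled by weight
fit <- glm(am ~ wt, data = mtcars, family = binomial)
example_mse <- mse_glm_sketch(fit, mtcars)
```

Because both the outcome and the prediction lie between 0 and 1, this quantity is also bounded by 0 and 1, which is consistent with the values reported in this section.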

Exercise: LOOCV in linear regression

\(k\)-fold cross-validation

\[CV_{(k)} = \frac{1}{k} \sum_{i = 1}^{k}{MSE_i}\]

  • Split data into \(k\) folds
  • Repeat training/test process for each fold
  • LOOCV: \(k=n\)
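
The steps above can be sketched in base R. This is a minimal illustration under stated assumptions: the function name `kfold_mse` is hypothetical, the model is a plain linear regression, and the built-in `mtcars` data is used only so the example runs without extra packages.

```r
# Minimal k-fold CV sketch (illustrative, not the course's implementation).
kfold_mse <- function(data, formula, k) {
  n <- nrow(data)
  # randomly assign each row to one of k folds of (nearly) equal size
  fold <- sample(rep(seq_len(k), length.out = n))
  response <- all.vars(formula)[1]
  fold_mse <- vapply(seq_len(k), function(i) {
    train <- data[fold != i, , drop = FALSE]  # fit on the other k - 1 folds
    test  <- data[fold == i, , drop = FALSE]  # evaluate on the held-out fold
    fit   <- lm(formula, data = train)
    mean((test[[response]] - predict(fit, newdata = test))^2)
  }, numeric(1))
  mean(fold_mse)  # CV_(k) = (1 / k) * sum of MSE_i
}

set.seed(1234)
cv_10  <- kfold_mse(mtcars, mpg ~ wt, k = 10)
cv_loo <- kfold_mse(mtcars, mpg ~ wt, k = nrow(mtcars))  # LOOCV: k = n
```

With `k = nrow(data)` every fold holds exactly one observation, so the result no longer depends on the random fold assignment.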

k-fold CV in linear regression

cv10_data <- crossv_kfold(Auto, k = 10)
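
A fold object like this is typically used by fitting one model per training split and scoring it on the matching test split. The sketch below shows that pattern with `modelr` and `purrr`. The model formula is an assumption, since the section does not show which regression is fit to `Auto`, and the built-in `mtcars` data stands in for `Auto` so the example needs no extra data package.

```r
library(modelr)  # crossv_kfold(), mse()
library(purrr)   # map(), map2_dbl()

set.seed(1234)
# mtcars is a stand-in for Auto; mpg ~ wt is an assumed, illustrative model
cv10_sketch <- crossv_kfold(mtcars, k = 10)
# fit a linear regression on each of the 10 training splits
cv10_models <- map(cv10_sketch$train, ~ lm(mpg ~ wt, data = .))
# compute the test-set MSE of each model on its held-out fold
cv10_mse <- map2_dbl(cv10_models, cv10_sketch$test, mse)
# average the fold MSEs to get the 10-fold CV estimate
mean(cv10_mse)
```

Only 10 models are fit here, compared with one model per observation under LOOCV.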

Computational speed of LOOCV

Computational speed of 10-fold CV

k-fold CV in logistic regression

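# split the data into 10 folds instead of one fold per observation
titanic_kfold <- crossv_kfold(titanic, k = 10)
# fit one logistic regression per training split
titanic_models <- map(titanic_kfold$train, ~ glm(Survived ~ Age * Sex,
                                                 data = .,
                                                 family = binomial))
# mse.glm() is not defined in this section; assumed to be a helper returning test-set MSE
titanic_mse <- map2_dbl(titanic_models, titanic_kfold$test, mse.glm)
mean(titanic_mse, na.rm = TRUE)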
## [1] 0.1709727

Exercise: k-fold CV