---
title: "Classfication Exercise: Credit Scoring Kaggle Competition"
author: 'Chicago Booth ML Team'
output: pdf_document
fontsize: 12
geometry: margin=0.6in
---


_**Note**: In order to illustrate the best practices in life (remember our dear prof McC's worldview: "in life" = "in R"), this script utilizes the popular [**caret**](http://topepo.github.io/caret) package, which wraps around underlying algorithms such as randomForest and GBM with a consistent interface. It's not hard to figure out how you could have written all this with the original randomForest / GBM packages. We also illutrate the use of **multi-core parallel computation** to speed up computer run-time (and, yes, salvage a bit of your laptop's subsequent eBay / Craigslist value...)._


#

This script illustrates the use of various algorithms to build **classification** models, using data from a [**Credit Scoring** competition on Kaggle.com](http://www.kaggle.com/c/GiveMeSomeCredit).


# Load Libraries & Modules; Set Randomizer Seed

```{r message=FALSE, warning=FALSE}
library(caret)
library(data.table)
library(doParallel)

# load modules from the common HelpR repo
helpr_repo_raw_url <- 'https://raw.githubusercontent.com/ChicagoBoothML/HelpR/master'
source(file.path(helpr_repo_raw_url, 'EvaluationMetrics.R'))

# set randomizer's seed
set.seed(99)   # Gretzky was #99
```


# Parallel Computation Setup

Let's set up a parallel computing infrastructure (thanks to the excellent **`doParallel`** package by Microsoft subsidiary **Revolution Analytics**) to allow more efficient computation in the rest of this exercise:

```{r message=FALSE, warning=FALSE, results='hide'}
cl <- makeCluster(detectCores() - 2)   # create a compute cluster using all CPU cores but 2
clusterEvalQ(cl, library(foreach))
registerDoParallel(cl)   # register this cluster
```

We have set up a compute cluster with **`r getDoParWorkers()`** worker nodes for computing.


# Data Import & Pre-Processing

```{r}
# download data and read data into data.table format
y_var_name <- 'SeriousDlqin2yrs'
X_var_names <- c(
  'RevolvingUtilizationOfUnsecuredLines',
  'age',
  'NumberOfTime30-59DaysPastDueNotWorse',
  'DebtRatio',
  'MonthlyIncome',
  'NumberOfOpenCreditLinesAndLoans',
  'NumberOfTimes90DaysLate',
  'NumberRealEstateLoansOrLines',
  'NumberOfTime60-89DaysPastDueNotWorse',
  'NumberOfDependents')

cs <- fread(
  'https://raw.githubusercontent.com/ChicagoBoothML/DATA___Kaggle___GiveMeSomeCredit/master/CreditScoring.csv',
  drop=c(1),   # drop column #1, which just contains row numbers
  colClasses=c(
    'SeriousDlqin2yrs'='integer',
    'RevolvingUtilizationOfUnsecuredLines'='numeric',
    'age'='numeric',
    'NumberOfTime30-59DaysPastDueNotWorse'='numeric',
    'DebtRatio'='numeric',
    'MonthlyIncome'='numeric',
    'NumberOfOpenCreditLinesAndLoans'='numeric',
    'NumberOfTimes90DaysLate'='numeric',
    'NumberRealEstateLoansOrLines'='numeric',
    'NumberOfTime60-89DaysPastDueNotWorse'='numeric',
    'NumberOfDependents'='numeric'),
  na.strings='NA')

cs[ , SeriousDlqin2yrs := factor(SeriousDlqin2yrs,
                                 levels=c(0, 1), labels=c("ok", "delinquent"))]

nb_samples <- nrow(cs)

cs
```

Just to sanity-check, the classes of the variables are:

```{r}
sapply(cs, class)
```

Out of the **`r formatC(nb_samples, format='d', big.mark=',')`** samples, the incidence of loan delinquency is **`r formatC(100 * sum(cs$SeriousDlqin2yrs == 'delinquent') / nb_samples, format='f', digits=2, big.mark=',')`%**. Note that this creates a "**skewed classes**" problem: one of the classes of cases (here the "delinquent" class) is significantly rarer than the other.

_(**note**: in more extreme cases where one class is much, much rarer than the other to the order of 1000 or 10,000 times, our model fitting procedures would need to be tweaked; but this case is not so extreme)_

Let's split the data into a Training set and a Test set:

```{r}
train_proportion <- .2
train_indices <- createDataPartition(
  y=cs$SeriousDlqin2yrs,
  p=train_proportion,
  list=FALSE)

cs_train <- cs[train_indices, ]
cs_test <- cs[-train_indices, ]
```

Let's also split a bit of data from the Training set as a Validation set for the purpose of estimating OOS performance metrics:

```{r}
valid_proportion_of_train <- 1 / 3
valid_indices <- createDataPartition(
  y=cs_train$SeriousDlqin2yrs,
  p=valid_proportion_of_train,
  list=FALSE)

cs_valid <- cs_train[valid_indices, ]
cs_train <- cs_train[-valid_indices, ]
```

Just to sanity-check that the data sets have been split representatively by **`caret`**: the delinquency incidences in the Training, Validation and Test sets are **`r formatC(100 * sum(cs_train$SeriousDlqin2yrs == 'delinquent') / nrow(cs_train), format='f', digits=2, big.mark=',')`**, **`r formatC(100 * sum(cs_valid$SeriousDlqin2yrs == 'delinquent') / nrow(cs_valid), format='f', digits=2, big.mark=',')`** and **`r formatC(100 * sum(cs_test$SeriousDlqin2yrs == 'delinquent') / nrow(cs_test), format='f', digits=2, big.mark=',')`** respectively.

The proportions of missing data points per column in the Training set are as follows:

```{r}
sapply(cs_train, function(col) sum(is.na(col))) / nrow(cs_train)
```

Let's not throw away valuable data just because of _N.A._ values; let's impute _N.A._ with the means of the respective columns in the _Training_ set:

```{r}
cs_train_mean_MonthlyIncome = mean(cs_train$MonthlyIncome, na.rm=TRUE)
cs_train_mean_NumberOfDependents = mean(cs_train$NumberOfDependents, na.rm=TRUE)

cs_train[is.na(MonthlyIncome), MonthlyIncome := cs_train_mean_MonthlyIncome]
cs_train[is.na(NumberOfDependents), NumberOfDependents := cs_train_mean_NumberOfDependents]
```


# Classification Models

Let's train 3 types of classification models: a Random Forest, a Boosted Trees model and a Logistic Regression.

```{r}
caret_optimized_metric <- 'logLoss'   # equivalent to 1 / 2 of Deviance

caret_train_control <- trainControl(
  classProbs=TRUE,             # compute class probabilities
  summaryFunction=mnLogLoss,   # equivalent to 1 / 2 of Deviance
  method='repeatedcv',         # repeated Cross Validation
  number=5,                    # 5 folds
  repeats=2,                   # 2 repeats
  allowParallel=TRUE)
```

```{r message=FALSE, warning=FALSE}
B <- 600

rf_model <- train(
  x=cs_train[, X_var_names, with=FALSE],
  y=cs_train$SeriousDlqin2yrs,
  method='parRF',     # parallel Random Forest
  metric=caret_optimized_metric,
  ntree=B,            # number of trees in the Random Forest
  nodesize=100,       # minimum node size set small enough to allow for complex trees,
                      # but not so small as to require too large B to eliminate high variance
  importance=TRUE,    # evaluate importance of predictors
  keep.inbag=TRUE,
  trControl=caret_train_control,
  tuneGrid=NULL)
```

```{r message=FALSE, warning=FALSE}
B <- 1200

boost_model <- train(
  x=cs_train[, X_var_names, with=FALSE],
  y=cs_train$SeriousDlqin2yrs,
  method='gbm',       # Generalized Boosted Models
  metric=caret_optimized_metric,
  verbose=FALSE,
  trControl=caret_train_control,
  tuneGrid=expand.grid(
    n.trees=B,              # number of trees
    interaction.depth=10,   # max tree depth,
    n.minobsinnode=100,     # minimum node size
    shrinkage=0.01))        # shrinkage parameter, a.k.a. "learning rate"
```

```{r message=FALSE, warning=FALSE}
log_reg_model <- train(
  x=cs_train[, X_var_names, with=FALSE],
  y=cs_train$SeriousDlqin2yrs,
  preProcess=c('center', 'scale'), 
  method='plr',       # Penalized Logistic Regression
  metric=caret_optimized_metric,
  trControl=caret_train_control,
  tuneGrid=expand.grid(
    lambda=0,   # weight penalty parameter
    cp='aic'))     # complexity parameter (AIC / BIC)
```

We'll now evaluate the OOS performances of these 3 models on the Validation set to select a model we think is best:

```{r}
low_prob <- 1e-6
high_prob <- 1 - low_prob
log_low_prob <- log(low_prob)
log_high_prob <- log(high_prob)
log_prob_thresholds <- seq(from=log_low_prob, to=log_high_prob, length.out=1000)
prob_thresholds <- exp(log_prob_thresholds)

# Fill missing value for Validation data
cs_valid[is.na(MonthlyIncome), MonthlyIncome := cs_train_mean_MonthlyIncome]
cs_valid[is.na(NumberOfDependents), NumberOfDependents := cs_train_mean_NumberOfDependents]

# *** NOTE: **
# the below "bin_classif_eval" function is from the "EvaluationMetrics.R" helper script
# in the "HelpR" GitHub repo

rf_pred_probs <- predict(
  rf_model, newdata=cs_valid[ , X_var_names, with=FALSE], type='prob')
rf_oos_performance <- bin_classif_eval(
  rf_pred_probs$delinquent, cs_valid$SeriousDlqin2yrs, thresholds=prob_thresholds)

boost_pred_probs <- predict(
  boost_model, newdata=cs_valid[ , X_var_names, with=FALSE], type='prob')
boost_oos_performance <- bin_classif_eval(
  boost_pred_probs$delinquent, cs_valid$SeriousDlqin2yrs, thresholds=prob_thresholds)

log_reg_pred_probs <- predict(
  log_reg_model, newdata=cs_valid[, X_var_names, with=FALSE], type='prob')
log_reg_oos_performance <- bin_classif_eval(
  log_reg_pred_probs$delinquent, cs_valid$SeriousDlqin2yrs, thresholds=prob_thresholds)


plot(x=1 - rf_oos_performance$specificity,
     y=rf_oos_performance$sensitivity,
     type = "l", col='darkgreen', lwd=3,
     xlim = c(0., 1.), ylim = c(0., 1.),
     main = "ROC Curves (Validation Data)",
     xlab = "1 - Specificity", ylab = "Sensitivity")
abline(a=0,b=1,lty=2,col=8)
lines(x=1 - boost_oos_performance$specificity,
      y=boost_oos_performance$sensitivity,
      col='green', lwd=3)
lines(x=1 - log_reg_oos_performance$specificity,
      y=log_reg_oos_performance$sensitivity,
      col='red', lwd=3)
legend('right', c('Random Forest', 'Boosted Trees', 'Logistic Regression'), 
   lty=1, col=c('darkgreen', 'green', 'red'), lwd=3, cex=1.)
```

It seems the Boosted Trees model offers the best classification performance frontier. We now need to pick a decision threshold for the Boosted Trees model. If we are to be really rigorous, we'll need to pose this trade-off in the context of a financial firm extending loans, e.g. balancing the costs of bad debt and the costs of auditing loans that are healthy. Here, to make life simple, we'll pick a subjective threshold that enables us to anticipate **75%** of the delinquency cases:

```{r}
sensitivity_threshold <- .75
i <- min(which(boost_oos_performance$sensitivity < sensitivity_threshold)) - 1
selected_prob_threshold <- prob_thresholds[i]
```

The selected decision threshold is **`r formatC(selected_prob_threshold, format='f', digits=3)`** &ndash; meaning when we use the Boosted Tree model to predict on new data, we'll predict "Delinquent" when the predicted probability exceeds that threshold. The expected performance of the model at that threshold is as follows:

```{r}
boost_oos_performance[i, ]
```

Note that there is trade-off: the precision of the model at this sensitivity threshold is rather low, meaning that there'll be many false positives, i.e. cases with lower financial risk being classified as likely to be delinquent.


# Test Performance of Selected Model

Remember that the Test data may have some missing values; let's first impute those missing values by the relevant means we've derived from the Training data:

```{r}
cs_test[is.na(MonthlyIncome), MonthlyIncome := cs_train_mean_MonthlyIncome]
cs_test[is.na(NumberOfDependents), NumberOfDependents := cs_train_mean_NumberOfDependents]
```

Let's then evaluate the performance of the selected Boosted Trees model, with a decision threshold at **`r formatC(selected_prob_threshold, format='f', digits=3)`**:

```{r}
boost_test_pred_probs <- predict(
  boost_model, newdata=cs_test[, X_var_names, with=FALSE], type='prob')

boost_test_performance <- bin_classif_eval(
  boost_test_pred_probs$delinquent, cs_test$SeriousDlqin2yrs, thresholds=selected_prob_threshold)
boost_test_performance
```

We can see that the Test performance is very similar to what we've estimated from the Validation set. The selected model works as expected.


```{r}
stopCluster(cl)   # shut down the parallel computing cluster
```