---
title: 'Used Cars: Homework 03'
author: 'Chicago Booth ML Team'
output: pdf_document
fontsize: 12pt
geometry: margin=0.6in
---

_**Note**: in order to illustrate best practices, this homework answers script uses the popular [**caret**](http://topepo.github.io/caret) package, which wraps around underlying algorithms such as randomForest and GBM with a consistent interface. It is not hard to see how you could have written all of this with the original randomForest / GBM packages directly. We also illustrate the use of **multi-core parallel computation** to speed up run-time (and, yes, salvage a bit of your laptop's subsequent eBay / Craigslist value...)._

# Load Libraries & Modules; Set Randomizer Seed

```{r message=FALSE, warning=FALSE}
library(caret)
library(data.table)
library(doParallel)
library(tree)

# load modules from the common HelpR repo
helpr_repo_raw_url <- 'https://raw.githubusercontent.com/ChicagoBoothML/HelpR/master'
source(file.path(helpr_repo_raw_url, 'EvaluationMetrics.R'))

# set randomizer's seed
set.seed(99)   # Gretzky was #99
```

# Parallel Computation Setup

Let's set up a parallel computing infrastructure (thanks to the excellent **`doParallel`** package by Microsoft subsidiary **Revolution Analytics**) to allow more efficient computation in the rest of this exercise:

```{r message=FALSE, warning=FALSE, results='hide'}
cl <- makeCluster(detectCores() - 2)   # create a compute cluster using all but 2 CPU cores
clusterEvalQ(cl, library(foreach))
registerDoParallel(cl)   # register this cluster
```

We have set up a compute cluster with **`r getDoParWorkers()`** worker nodes for computing.

# Data Import

```{r}
# download data and read them into data.table format
used_cars <- fread(
  'https://raw.githubusercontent.com/ChicagoBoothML/DATA___UsedCars/master/UsedCars.csv',
  stringsAsFactors=TRUE,
  colClasses=c(price='numeric', mileage='numeric', year='numeric'))
used_cars[, displacement := as.numeric(as.character(displacement))]

# count number of samples
nb_samples <- nrow(used_cars)

used_cars
```

Just to sanity-check, the classes of the variables are:

```{r}
sapply(used_cars, class)
```

Let's now split the data set into a Training set for fitting models and a Test set for evaluating them:

_(**note**: here we skip splitting off a separate Validation set because we can rely on Out-of-Bag and Cross-Validation RMSE estimates instead)_

```{r}
train_proportion <- .8
train_indices <- createDataPartition(
  y=used_cars$price,
  p=train_proportion,
  list=FALSE)

used_cars_train <- used_cars[train_indices, ]
used_cars_test <- used_cars[-train_indices, ]
```

To sanity-check the representativeness of the split, we can compare the distributions of the _price_ variable in the full, Training and Test data sets:

```{r}
hist(used_cars$price)
hist(used_cars_train$price)
hist(used_cars_test$price)
```

# Models with 2 Predictor Variables _mileage_ & _year_

## Single Tree models

Let's try a rather small tree:

```{r}
mincut <- 3000    # minimum number of observations to include in either child node
minsize <- 6000   # smallest allowed node size; NOTE: minsize >= 2 x mincut
mindev <- 1e-6    # a node's deviance must be at least mindev x the root deviance for it to be split

tree_2vars_small <- tree(
  price ~ mileage + year,
  data=used_cars_train,
  mincut=mincut,
  minsize=minsize,
  mindev=mindev)

test_rmse_tree_2vars_small <- rmse(
  y_hat=predict(tree_2vars_small, newdata=used_cars_test),
  y=used_cars_test$price)
```

This small tree model has an OOS RMSE of $**`r formatC(test_rmse_tree_2vars_small, format='f', digits=0, big.mark=',')`**.
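The `rmse()` helper used above is sourced from the HelpR repo's `EvaluationMetrics.R` module. In case that module is unavailable, a minimal stand-in consistent with how it is called here could look like the following (this is an assumed implementation for illustration, not the module's actual code):

```{r eval=FALSE}
# assumed minimal stand-in for the sourced rmse() helper:
# root mean squared error of predictions y_hat against actual values y
rmse <- function(y_hat, y) {
  sqrt(mean((y_hat - y) ^ 2))
}
```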
And now a big tree:

```{r}
mincut <- 3    # minimum number of observations to include in either child node
minsize <- 6   # smallest allowed node size; NOTE: minsize >= 2 x mincut
mindev <- 1e-6 # a node's deviance must be at least mindev x the root deviance for it to be split

tree_2vars_big <- tree(
  price ~ mileage + year,
  data=used_cars_train,
  mincut=mincut,
  minsize=minsize,
  mindev=mindev)

test_rmse_tree_2vars_big <- rmse(
  y_hat=predict(tree_2vars_big, newdata=used_cars_test),
  y=used_cars_test$price)
```

This big tree has an OOS RMSE of $**`r formatC(test_rmse_tree_2vars_big, format='f', digits=0, big.mark=',')`**.

## Random Forest model

```{r message=FALSE, warning=FALSE, results='hide'}
B <- 300   # number of trees in the Random Forest

rf_2vars <- train(
  price ~ mileage + year,
  data=used_cars_train,
  method='parRF',    # parallel Random Forest
  ntree=B,           # number of trees in the Random Forest
  nodesize=30,       # minimum node size: small enough to allow complex trees,
                     # but not so small as to require too large a B to tame the variance
  importance=TRUE,   # evaluate importance of predictors
  keep.inbag=TRUE,
  trControl=trainControl(
    method='oob',    # Out-of-Bag RMSE estimation
    allowParallel=TRUE),
  tuneGrid=NULL)

test_rmse_rf_2vars <- rmse(
  y_hat=predict(rf_2vars, newdata=used_cars_test),
  y=used_cars_test$price)
```

This Random Forest model has an estimated OOB RMSE of $**`r formatC(min(rf_2vars$results$RMSE), format='f', digits=0, big.mark=',')`** based on the Training set, and a Test-set OOS RMSE of $**`r formatC(test_rmse_rf_2vars, format='f', digits=0, big.mark=',')`**.

## Boosted Trees model

```{r message=FALSE, warning=FALSE}
B <- 1000

boost_2vars <- train(
  price ~ mileage + year,
  data=used_cars_train,
  method='gbm',   # Generalized Boosted Models
  verbose=FALSE,
  trControl=trainControl(
    method='repeatedcv',   # repeated Cross Validation
    number=5,              # number of CV folds
    repeats=3,             # number of CV repeats
    allowParallel=TRUE),
  tuneGrid=expand.grid(
    n.trees=B,             # number of trees
    interaction.depth=5,   # max tree depth
    n.minobsinnode=100,    # minimum node size
    shrinkage=.01))        # shrinkage parameter, a.k.a. "learning rate"

test_rmse_boost_2vars <- rmse(
  y_hat=predict(boost_2vars, newdata=used_cars_test),
  y=used_cars_test$price)
```

This Boosted Trees model has an estimated Cross-Validation RMSE of $**`r formatC(boost_2vars$results$RMSE, format='f', digits=0, big.mark=',')`** based on the Training set, and a Test-set OOS RMSE of $**`r formatC(test_rmse_boost_2vars, format='f', digits=0, big.mark=',')`**.

# Models with All Predictor Variables

Let's not mess around with single trees here, and go straight to building Random Forest & Boosted Trees models that predict _price_ using all other variables.
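For reference, the formula `price ~ .` will use every other column in the data set as a candidate predictor; we can list those columns explicitly before fitting (using only objects already defined above):

```{r}
# list all columns that `price ~ .` will use as candidate predictors
setdiff(names(used_cars_train), 'price')
```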
## Random Forest model

```{r}
B <- 300   # number of trees in the Random Forest

rf_manyvars <- train(
  price ~ .,
  data=used_cars_train,
  method='parRF',    # parallel Random Forest
  ntree=B,           # number of trees in the Random Forest
  nodesize=30,       # minimum node size: small enough to allow complex trees,
                     # but not so small as to require too large a B to tame the variance
  importance=TRUE,   # evaluate importance of predictors
  keep.inbag=TRUE,
  trControl=trainControl(
    method='oob',    # Out-of-Bag RMSE estimation
    allowParallel=TRUE),
  tuneGrid=NULL)

test_rmse_rf_manyvars <- rmse(
  y_hat=predict(rf_manyvars, newdata=used_cars_test),
  y=used_cars_test$price)
```

This many-variable Random Forest model has an estimated OOB RMSE of $**`r formatC(min(rf_manyvars$results$RMSE), format='f', digits=0, big.mark=',')`** based on the Training set, and a Test-set OOS RMSE of $**`r formatC(test_rmse_rf_manyvars, format='f', digits=0, big.mark=',')`**.

## Boosted Trees model

```{r}
B <- 1000

boost_manyvars <- train(
  price ~ .,
  data=used_cars_train,
  method='gbm',   # Generalized Boosted Models
  verbose=FALSE,
  trControl=trainControl(
    method='repeatedcv',   # repeated Cross Validation
    number=5,              # number of CV folds
    repeats=3,             # number of CV repeats
    allowParallel=TRUE),
  tuneGrid=expand.grid(
    n.trees=B,             # number of trees
    interaction.depth=5,   # max tree depth
    n.minobsinnode=100,    # minimum node size
    shrinkage=.01))        # shrinkage parameter, a.k.a. "learning rate"

test_rmse_boost_manyvars <- rmse(
  y_hat=predict(boost_manyvars, newdata=used_cars_test),
  y=used_cars_test$price)
```

This many-variable Boosted Trees model has an estimated Cross-Validation RMSE of $**`r formatC(boost_manyvars$results$RMSE, format='f', digits=0, big.mark=',')`** based on the Training set, and a Test-set OOS RMSE of $**`r formatC(test_rmse_boost_manyvars, format='f', digits=0, big.mark=',')`**.

Overall, this exercise shows the power of simple but extremely flexible tree-based methods such as Random Forest and Boosted Trees. When we have many predictor variables, all we have to do is throw them into a tree ensemble!

```{r}
stopCluster(cl)   # shut down the parallel computing cluster
```
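Finally, as a quick side-by-side comparison (this step uses only objects already computed above and does not need the parallel cluster, so it can run after the cluster has been shut down), we can collect the Test-set RMSEs of all the models fitted in this exercise into one table; `test_rmse_summary` is just an illustrative name introduced here:

```{r}
# collect the Test-set RMSEs of all fitted models, sorted from best to worst
test_rmse_summary <- data.table(
  model=c('small tree (2 vars)', 'big tree (2 vars)',
          'Random Forest (2 vars)', 'Boosted Trees (2 vars)',
          'Random Forest (all vars)', 'Boosted Trees (all vars)'),
  test_rmse=c(test_rmse_tree_2vars_small, test_rmse_tree_2vars_big,
              test_rmse_rf_2vars, test_rmse_boost_2vars,
              test_rmse_rf_manyvars, test_rmse_boost_manyvars))
test_rmse_summary[order(test_rmse)]
```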