--- title: "Homework #6: Trees and Random Forest" author: "**Your Name Here**" format: sys6018hw-html --- ```{r config, include=FALSE} # Set global configurations and settings here knitr::opts_chunk$set() # set global chunk options ggplot2::theme_set(ggplot2::theme_bw()) # set ggplot2 theme ``` # Required R packages and Directories {.unnumbered .unlisted} ```{r packages, message=FALSE, warning=FALSE} data.dir = 'https://mdporter.github.io/DS6018/data/' # data directory library(tidyverse) # functions for data manipulation library(ranger) # fast random forest implementation library(modeldata) # for the ames housing data ``` # Problem 1: Tree Splitting for classification Consider the Gini index, classification error, and entropy impurity measures in a simple classification setting with two classes. Create a single plot that displays each of these quantities as a function of $p_m$, the estimated probability of an observation in node $m$ being from class 1. The x-axis should display $p_m$, ranging from 0 to 1, and the y-axis should display the value of the Gini index, classification error, and entropy. ::: {.callout-note title="Solution"} Solution here ::: # Problem 2: Combining bootstrap estimates ```{r, echo=FALSE} p_red = c(0.2, 0.25, 0.3, 0.4, 0.4, 0.45, 0.7, 0.85, 0.9, 0.9) ``` Suppose we produce ten bootstrapped samples from a data set containing red and green classes. We then apply a classification tree to each bootstrapped sample and, for a specific value of $X$, produce the following 10 estimates of $\Pr(\text{Class is Red} \mid X=x)$: $\{`r stringr::str_c(p_red, sep=", ")`\}$. ## a. Majority Vote ISLR 8.2 describes the *majority vote* approach for making a hard classification from a set of bagged classifiers. What is the final classification for this example using majority voting? ::: {.callout-note title="Solution"} Solution here ::: ## b. Average Probability An alternative is to base the final classification on the average probability. What is the final classification for this example using average probability? ::: {.callout-note title="Solution"} Solution here ::: ## c. Unequal misclassification costs Suppose the cost of mis-classifying a Red observation (as Green) is twice as costly as mis-classifying a Green observation (as Red). How would you modify both approaches to make better final classifications under these unequal costs? Report the final classifications. ::: {.callout-note title="Solution"} Solution here ::: # Problem 3: Random Forest Tuning Random forest has several tuning parameters that you will explore in this problem. We will use the `ames` housing data from the `modeldata` R package. There are several R packages for Random Forest. The `ranger::ranger()` function is much faster than `randomForest::randomForest()` so we will use this one. ## a. Random forest (`ranger`) tuning parameters List all of the random forest tuning parameters in the `ranger::ranger()` function. You don't need to list the parameters related to computation, special models (e.g., survival, maxstat), or anything that won't impact the predictive performance. Indicate the tuning parameters you think will be most important to optimize? ::: {.callout-note title="Solution"} Solution here ::: ## b. Implement Random Forest Use a random forest model to predict the sales price, `Sale_Price`. Use the default parameters and report the 10-fold cross-validation RMSE (square root of mean squared error). ::: {.callout-note title="Solution"} Solution here ::: ## c. 
## c. Random Forest Tuning

Now we will vary the tuning parameters `mtry` and `min.bucket` to see what effect they have on performance.

- Use a range of reasonable `mtry` and `min.bucket` values.
    - The valid `mtry` values are $\{1, 2, \ldots, p\}$ where $p$ is the number of predictor variables. However, the default value of `mtry = sqrt(p) =` `r sqrt(ncol(ames)-1) %>% floor()` is usually close to optimal, so you may want to focus your search around that value.
    - The default `min.bucket = 1` will grow deep trees. This usually works best when there are enough trees, but try some larger values and see how they impact predictive performance.
- Set `num.trees = 1000`, which is larger than the default of 500.
- Use 5 repeats of the out-of-bag (OOB) error to assess performance. That is, run random forest 5 times for each tuning combination, calculate the OOB MSE each time, and use the average as the MSE associated with those tuning parameters. One possible skeleton is sketched after the solution callout below.
- Use a plot to show the average MSE as a function of `mtry` and `min.bucket`.
- Report the best tuning parameter combination.
- Note: random forest is a stochastic model; it will be different every time it runs due to the bootstrap sampling and the random selection of features considered for splitting. Set the random seed to control the uncertainty associated with this stochasticity.
- Hint: if you use the `ranger` package, the `prediction.error` element in the output is the OOB MSE.

::: {.callout-note title="Solution"}
Solution here
:::
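For reference, here is one possible skeleton for the repeated-OOB grid search described above. The grid values, seed, and chunk label are illustrative assumptions; choose your own ranges based on the guidance in the bullet list.

```{r p3c-sketch, eval=FALSE}
# Sketch of the repeated-OOB grid search over mtry and min.bucket.
# Grid values and seed below are illustrative assumptions.
library(tidyverse)
library(ranger)
library(modeldata)
data(ames, package = "modeldata")

tune_grid = expand_grid(
  mtry = c(4, 8, 16, 26),        # example values around floor(sqrt(p)) = 8
  min.bucket = c(1, 5, 10, 25),  # example values; 1 is the default
  rep = 1:5                      # 5 repeated OOB evaluations per pair
)

set.seed(2026)                   # control the bootstrap/split randomness
results = tune_grid %>%
  mutate(oob_mse = map2_dbl(mtry, min.bucket, function(m, b) {
    ranger(Sale_Price ~ ., data = ames, num.trees = 1000,
           mtry = m, min.bucket = b)$prediction.error    # OOB MSE
  })) %>%
  group_by(mtry, min.bucket) %>%
  summarise(avg_mse = mean(oob_mse), .groups = "drop")

# average OOB MSE as a function of the two tuning parameters
ggplot(results, aes(factor(mtry), factor(min.bucket), fill = avg_mse)) +
  geom_tile() +
  labs(x = "mtry", y = "min.bucket", fill = "avg OOB MSE")

slice_min(results, avg_mse, n = 1)  # best tuning combination found
```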