--- title: "Homework #8: Feature Engineering and Importance" author: "**Your Name Here**" format: sys6018hw-html --- ```{r config, include=FALSE} # Set global configurations and settings here knitr::opts_chunk$set() # set global chunk options ggplot2::theme_set(ggplot2::theme_bw()) # set ggplot2 theme ``` # Required R packages and Directories {.unnumbered .unlisted} ```{r packages, message=FALSE, warning=FALSE} dir_data = 'https://mdporter.github.io/SYS6018/data/' # data directory library(tidyverse) # functions for data manipulation ``` # Problem 1: Permutation Feature Importance Vanderbilt Biostats has collected data on Titanic survivors (https://hbiostat.org/data/). I have done some simple processing and split into a training and test sets. - [titanic_train.csv](`r file.path(dir_data, "titanic_train.csv")`) - [titanic_test.csv](`r file.path(dir_data, "titanic_test.csv")`) We are going to use this data to investigate feature importance. Use `Class`, `Sex`, `Age`, `Fare`, `sibsp` (number of siblings or spouse on board), `parch` (number of parents or children on board), and `Joined` (city where passenger boarded) for the predictor variables (features) and `Survived` as the outcome variable. ## a. Method 1: Built-in importance scores Fit a tree ensemble model (e.g., Random Forest, boosted tree) on the training data. You are free to use any method to select the tuning parameters. Report the built-in feature importance scores and produce a barplot with feature on the x-axis and importance on the y-axis. ::: {.callout-note title="Solution"} Add solution here ::: ## b. Performance Report the performance of the model fit from (a.) on the test data. Use the log-loss (where $M$ is the size of the test data): $$ \text{log-loss}(\hat{p}) = - \frac{1}{M} \sum_{i=1}^m [y_i \log \, \hat{p}_i + (1 - y_i) \log \, (1 - \hat{p}_i)] $$ ::: {.callout-note title="Solution"} Add solution here ::: ## c. Method 2: Permute *after* fitting Use the fitted model from question (a.) to perform permutation feature importance. Shuffle/permute each variable individually on the *test set* before making predictions. Record the loss. Repeat $M=10$ times and produce a boxplot of the change in loss (change from reported loss from part b.). ::: {.callout-note title="Solution"} Add solution here ::: ## d. Method 3: Permute *before* fitting For this approach, shuffle/permute the *training data* and re-fit the ensemble model. Evaluate the predictions on the (unaltered) test data. Repeat $M=10$ times (for each predictor variable) and produce a boxplot of the change in loss. ::: {.callout-note title="Solution"} Add solution here ::: ## e. Understanding Describe the benefits of each of the three approaches to measure feature importance. ::: {.callout-note title="Solution"} Add solution here ::: # Problem 2: Effects of correlated predictors This problem will illustrate what happens to the importance scores when there are highly associated predictors. ## a. Create an almost duplicate feature Create a new feature `Sex2` that is 95% the same as `Sex`. Do this by selecting 5% of training ($n=50$) and testing ($n=15$) data and flip the `Sex` value. ::: {.callout-note title="Solution"} Add solution here ::: ## b. Method 1: Built-in importance Fit the same model as in Problem 1a, but use the new data that includes `Sex2` (i.e., use both `Sex` and `Sex2` in the model). Calculate the built-in feature importance score and produce a barplot. ::: {.callout-note title="Solution"} Add solution here ::: ## c. Method 2: Permute *after* fitting Redo Method 2 (problem 1c) on the new data/model and produce a boxplot of importance scores. The importance score is defined as the difference in loss. ::: {.callout-note title="Solution"} Add solution here ::: ## d. Method 3: Permute *before* fitting Redo Method 3 (problem 1d) on the new data and produce a boxplot of importance scores. The importance score is defined as the difference in loss. ::: {.callout-note title="Solution"} Add solution here ::: ## e. Understanding Describe how the addition of the almost duplicated predictor impacted the feature importance results. ::: {.callout-note title="Solution"} Add solution here :::