--- title: "Homework Unit 11: Model Comparisons and Other Explanatory Goals" author: "Your name here" date: "`r lubridate::today()`" format: html: embed-resources: true toc: true toc_depth: 4 editor_options: chunk_output_type: console --- ## Introduction To begin, download the following from the course web book (Unit 11): * `hw_unit_11_explanatory.qmd` (notebook for this assignment) * `student_perf_cln.csv` (data for this assignment) The data for this week's assignment have information about students and their final grade performance in a math class. The data will be cleaned using code included for you below, and you don't need to show any modeling EDA to keep the assignment from getting too long. *However, you should still always be checking your data as you progress through the assignment!* In this assignment, you will practice comparing models to determine the importance of a pair of focal variables. In particular, you will be determining whether parental education level (`mother_educ` and `father_educ`) matter for predicting student performance. You will then use the Bayesian approach to evaluate the effect of that focal variable. We will be using 30 splits (3 repeats of 10 fold cv) to tune and select the best GLM model configuration for two feature sets: a full model (with all of the predictors) and a compact model (with all predictors except the two focal variables: `mother_educ` and `father_educ`). We will also be introducing a new performance metric: **R-squared**. Yay! Let's get started! ----- ## Setup Set up your notebook in this section. ```{r} ``` ## Read in your data Read in the `student_perf_cln.csv` data file as `data_all`. Class variables as appropriate. Note the following levels for your focal variables (`mother_educ` and `father_educ`) are: 0 - none 1 - primary education (4th grade) 2 – 5th to 9th grade 3 – secondary education 4 – higher education ```{r} ``` ## Set up splits Split `data_all` into repeated k-fold cross-validation splits using 3 repeats and 10 folds. Save your splits as an object named `splits` ```{r} ``` ## Build recipes We will be fitting GLM models tuned on `penalty` and `mixture` for two feature sets: one for a "full" model (contains all predictors), and one for a compact model (contains all predictors except the two focal variables: `mother_educ` and `father_educ`). ### Recipe 1: Full model Create a recipe for setting up features for your full model. ```{r} rec_full <- ``` ### Recipe 2: Compact model Because your recipe for the compact model will only differ from your recipe for the full model by one step (removing your focal variables), you can start with `rec_full`. Then add a step that will remove the *dummy coded* focal variables `father_educ` and `mother_educ`. Remember you can make a feature matrix to see what these dummy coded variables end up being named! ```{r} rec_compact <- ``` ## Fit models ### Set up a hyperparameter tuning grid ```{r} tune_grid <- ``` ### Fit the full model Use `rec_full`, `splits`, and `tune_grid` to fit GLM's across folds. Use R squared (`rsq`) as your metric. Save your model fits as `fits_full`. ```{r} ``` Make sure you considered a good range of hyperparameter values ```{r} ``` Select best model configuration ```{r} ``` Print the mean R squared of the best model configuration across the 30 held-out folds. ```{r} collect_metrics(fits_full, summarize = TRUE) |> filter(.config == select_best(fits_full)$.config) ``` ### Fit the compact model Use `rec_compact`, `splits`, and `tune_grid` to fit GLM's across folds. Use R-squared as your metric. Save your model fits as `fits_compact`. ```{r} ``` Select best model configuration ```{r} ``` Print the mean R squared of the best model configuration across the 30 held-out folds. ```{r} collect_metrics(fits_full, summarize = TRUE) |> filter(.config == select_best(fits_compact)$.config) ``` ## Model comparison with the Bayesian approach You will now compare your models using the Bayesian parameter estimation ### Gather performance estimates First, make a data frame containing the 30 performance estimates from held out folds for your full and compact model. Hint you can use the code above but change `summarize = TRUE` to `summarize = FALSE`. Filter down so you only have the following variables (`id`, `id2`, `.estimate`). Rename `.estimate` to `full` or `compact` before joining your estimates. ```{r} ``` ### Posterior probabilities Next, derive the posterior probabilities for R squared of each of these two models ```{r} ``` ### Graph positerior probabilities Display your posterior probabilities using both a density plot and a histogram. Choose plots that would be the most useful to display your results to a collaborator. ```{r} ``` ### Determine if the full model has better performance Calculate the probability that the full model is performing better than the compact model. ```{r} ``` Type out a summary of what you find that includes: * Your interpretation of what the mean increase in R squared and 95% HDI values represent * Your conclusion if the full model is meaningfully better than your compact model and why *Type your summary here* ## Feature importance You will now look at feature importance in your full model using Shapely Values. ### Prep data Get your data ready for use with the `DALEX` package. Do the following: Create a feature matrix from the whole data set ```{r} ``` Pull out your features (`x`) and outcome (`y`) to be used in calculating feature importance ```{r} ``` Use the following code to define your predictor function (`predict_wrapper`) and explainer object (`explain_full`). ```{r} predict_wrapper <- function(model, newdata) { predict(model, newdata) |> pull(.pred) } explain_full <- explain_tidymodels(fit_all_data, data = x, y = y, predict_function = predict_wrapper) ``` ### Calculate Shapely Values for a single participant First, we will look at shapely values for a single participant. For this example, lets look at the last participant in our data set. Do the following: * Print the raw feature values for the last participant. * Generate Shapely Values for this participant (you might consider caching this!) * Use `ggplot()` to plot the Shapely Values for this participant ```{r} ``` ### Calculate mean absolute Shapely Values across all participants Now we will look at feature importance across all participants. Since this process is extremely computationally intensive, we are going to select a random set of participants to demonstrate this on. Include the following steps: * Use the function `slice_sample()` with `n` set to 20 to get a random subset of observations from `x`. Don't forget to set a seed first. Remember you can type `?slice_sample()` into your console to learn more about a function. * Calculate Shapely Values for each observation (in your reduced feature set) and `glimpse()` the resulting tibble * Plot the mean absolute Shapely Values across participants ```{r} ``` Write a brief description of your top takeaways from this plot. *Type your response here.*