Overview

Due before class May 7th.

Fork the hw06 repository

Go here to fork the repo for homework 06.

Part 1: Sexy Joe Biden

Former Vice President Joe Biden eating an ice cream cone

Using statistical learning and data from the 2008 American National Election Studies survey, evaluate whether Leslie Knope’s attitudes towards Joe Biden are part of a broader trend within the American public. Specifically, do women display higher feeling thermometer ratings for Joe Biden than men?[1] The file biden.csv contains a selection of variables from the larger survey, which also lets you test competing factors that may influence attitudes towards Joe Biden.

  1. Estimate a basic (single variable) linear regression model of the relationship between gender and feelings towards Joe Biden. Calculate predicted values, graph the relationship between the two variables using the predicted values, and determine whether there appears to be a significant relationship.
  2. Build the best predictive linear regression model of attitudes towards Joe Biden given the variables you have available. In this context, “best” is defined as the model with the lowest mean squared error (MSE). Compare at least three different model formulations (i.e., different combinations of variables). Use 10-fold cross-validation to estimate MSE on held-out data, since the MSE computed on the training data is optimistically biased.
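The two steps above can be sketched in base R. This is a minimal illustration under stated assumptions, not a reference solution: the data are simulated rather than read from biden.csv, the column names (therm, female, age, educ) are hypothetical, and cv_mse() is a helper written for this sketch.

```r
# Simulated stand-in for biden.csv. All column names here are hypothetical.
set.seed(1234)
n <- 500
sim <- data.frame(
  female = rbinom(n, 1, 0.5),
  age    = round(runif(n, 18, 90)),
  educ   = sample(8:17, n, replace = TRUE)
)
# Thermometer score on a 0 to 100 scale, clipped to that range.
raw <- 55 + 6 * sim$female + 0.05 * sim$age - 0.3 * sim$educ + rnorm(n, sd = 20)
sim$therm <- pmin(pmax(raw, 0), 100)

# Step 1: single-variable model and its predicted value for each group.
fit_gender <- lm(therm ~ female, data = sim)
predict(fit_gender, newdata = data.frame(female = c(0, 1)))

# Step 2: assign every row to one of k folds once, so that all candidate
# models are compared on exactly the same train/test splits.
k <- 10
folds <- sample(rep(seq_len(k), length.out = nrow(sim)))

# Hypothetical helper: average held-out MSE across the folds.
cv_mse <- function(formula, data, folds) {
  response <- all.vars(formula)[1]
  fold_mse <- vapply(sort(unique(folds)), function(i) {
    train <- data[folds != i, ]
    test  <- data[folds == i, ]
    fit   <- lm(formula, data = train)
    mean((test[[response]] - predict(fit, newdata = test))^2)
  }, numeric(1))
  mean(fold_mse)
}

formulas <- list(
  gender_only = therm ~ female,
  plus_age    = therm ~ female + age,
  full        = therm ~ female + age + educ
)
results <- vapply(formulas, cv_mse, numeric(1), data = sim, folds = folds)
results
```

Sharing one set of fold labels across models is a design choice: differences in MSE then reflect the model formulation rather than the luck of the split.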

Part 2: Revisiting the Titanic

We’ve looked a lot at the Titanic data set. Now I want you to make your own predictions about who lived and who died.

  1. Load the Titanic data from library(titanic). Use the titanic_train data frame.
  2. Estimate three different logistic regression models with Survived as the response variable. You may use any combination of the predictors to estimate these models. Don’t just reuse the models from the notes.
    1. Calculate the leave-one-out cross-validation (LOOCV) error rate for each model. Which model performs the best?
  3. Now estimate three random forest models. Generate random forests with 500 trees apiece.
    1. Generate variable importance plots for each random forest model. Which variables seem the most important?
    2. Calculate the out-of-bag error rate for each random forest model. Which performs the best?
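The LOOCV error rate for a logistic regression can be computed by hand in base R, as sketched below. This is only an illustration under assumptions: the data are simulated rather than taken from titanic_train, the column names (survived, sex_female, pclass, age) are hypothetical, loocv_error() is a helper written for this sketch, and the 0.5 classification threshold is a choice, not a requirement.

```r
# Simulated stand-in for the Titanic data. Column names are hypothetical.
set.seed(1234)
n <- 200
sim <- data.frame(
  sex_female = rbinom(n, 1, 0.35),
  pclass     = sample(1:3, n, replace = TRUE),
  age        = round(runif(n, 1, 70))
)
eta <- -0.5 + 2.5 * sim$sex_female - 0.8 * (sim$pclass - 2) - 0.01 * sim$age
sim$survived <- rbinom(n, 1, plogis(eta))

# Hypothetical helper: refit the model n times, each time holding out one
# row, classify that row, and return the share of rows misclassified.
loocv_error <- function(formula, data, threshold = 0.5) {
  response <- all.vars(formula)[1]
  wrong <- vapply(seq_len(nrow(data)), function(i) {
    fit  <- glm(formula, data = data[-i, ], family = binomial)
    prob <- predict(fit, newdata = data[i, , drop = FALSE], type = "response")
    as.numeric((prob > threshold) != data[[response]][i])
  }, numeric(1))
  mean(wrong)
}

formulas <- list(
  sex_only  = survived ~ sex_female,
  sex_class = survived ~ sex_female + pclass,
  full      = survived ~ sex_female + pclass + age
)
errors <- vapply(formulas, loocv_error, numeric(1), data = sim)
errors
```

Refitting once per observation is slow on large data sets but is cheap at this scale and makes the procedure explicit. With real data, rows with missing predictor values need to be handled before the loop.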

Submit the assignment

Your assignment should be submitted as a set of R scripts, R Markdown documents, data files, figures, etc. Follow the instructions on homework workflow. As part of the pull request, you’re encouraged to reflect on what was hard/easy, problems you solved, helpful tutorials you read, etc.

Rubric

Check minus: Code does not run or is poorly documented. No documentation in the README file. Severe misinterpretations of the results. Overall a shoddy or incomplete assignment.

Check: Solid effort. Hits all the elements. No clear mistakes. Easy to follow (both the code and the output). Nothing spectacular, either bad or good.

Check plus: Interpretation is clear and in-depth. Accurately interprets the results, with appropriate caveats for what the technique can and cannot do. Code is reproducible. Writes a user-friendly README file. Discusses the benefits and drawbacks of a specific method. Compares multiple models fitted to the same underlying dataset.


  1. Feeling thermometers are a common metric in survey research used to gauge attitudes or feelings of warmth towards individuals and institutions. They range from 0-100, with 0 indicating extreme coldness and 100 indicating extreme warmth.

This work is licensed under the CC BY-NC 4.0 Creative Commons License.