--- title: "Introduction to model evaluation" subtitle: "Data Science for Biologists, Spring 2020" author: "YOUR NAME GOES HERE" output: html_document: highlight: tango --- ```{r setup, include=FALSE} knitr::opts_chunk\$set(echo = TRUE) library(tidyverse) library(broom) library(modelr) library(patchwork) ``` ## Instructions Standard grading criteria apply, except there is no "answer style" - just write out answers normally! **Make sure your bulleted lists render appropriately in the knitted output!!!** This assignment will use an external dataset of various physical measurements from 250 adult males. Our goal for this assignment is to build and evaluate a model from this data to **predict body fat percentage** (column `Percent`) in adult males, and then use this model to predict future outcomes. Age is measured in years, weight in pounds, height in inches, and all other measurements are circumference measured in cm. ```{r, collapse=T} fatmen <- read_csv("https://raw.githubusercontent.com/sjspielman/datascience_for_biologists/master/data/bodyfat.csv") dplyr::glimpse(fatmen) ``` ## Part 1: Build a model using AIC stepwise model selection Using the `step()` function, determine the most appropriate model to explain variation in bodyfat percentage in this data. Examine the model output with the `summary` function, and answer questions below. **You will use this model (aka you will specify these predictors) for all model evaluation questions.** ```{r} ## Use step() to build and save a model to explain Percent. PLEASE use the argument trace=F when calling step()!! ## Examine output with summary OR broom functions tidy and glance ``` #### Part 1 questions: Answer the questions in the templated bullets! 1. In a bulleted list below, state the predictor variables for the final model and their P-values. You do not need to worry about coefficients!! + (replace with first predictor) + (replace with second predictor) + (etc) 2. What percentage of variation in bodyfat percentage is explained by this model? + (explained answer) 3. What percentage of variation in bodyfat percentage is UNEXPLAINED by this model? + (unexplained answer) 4. What is the RMSE of your model? Hint: you need to run some code! ```{r} ## code to get RMSE of model, using the function modelr::rmse() ``` + (RMSE answer) ## Part 2: Evaluate the model using several approaches ### Part 2.1: Training and testing approach **First, use a simple train/test approach**, where the training data is a random subset comprising 65% of the total dataset. Determine the R-squared (`modelr::rsquare()`) and RMSE (`modelr::rmse()`) as determined from the training AND testing data. ```{r} ## split data into train and test, using this variable as part of your code: training_frac <- 0.65 ## Train model on training data. DO NOT USE summary(), just fit the model with the training data. ## Determine metrics on TRAINING data (R-squared and RMSE), using the trained model ## Determine metrics on TESTING data (R-squared and RMSE), using the trained model ``` #### Part 2.1 questions: Answer the questions in the templated bullets! 1. Compare the training data \$R^2\$ to the testing data \$R^2\$. Which is higher (i.e., does the model run on training or testing data explain more variation in Percent), and is this outcome expected? + (Answer goes here) 2. Compare the training data *RMSE* to the testing data *RMSE*. Which is *lower* (i.e., is there more error from the model run on training or testing data), and is this outcome expected? + (Answer goes here) ### Part 2.2: K-fold cross validation Use k-fold cross validation with **15 folds** to evaluate the model. Determine the \$R^2\$ and RMSE for each fold, and *visualize* the distributions of \$R^2\$ and RMSE in two separate plots that you *add together with patchwork*. You should also calculate the mean \$R^2\$ and mean RMSE values. ```{r} ## First define the FUNCTION you will use with purrr::map which contains your linear model. ## Do NOT use step() in here - you should have used step in Part 1 to know which predictors should be included here my_bodyfat_model <- function(input_data){ # eh?? what goes in here?? } ## perform k-fold cross validation, using this variable in your code number_folds <- 15 ## Calculate the mean R^2 and RMSE ## Make figures for R^2 and RMSE, which clearly show the MEAN values for each distribution using stat_summary() or similar (unless you make a boxplot, which already shows the median) ``` #### Part 2.2 questions: Answer the questions in the templated bullets! 1. Examine your distribution of \$R^2\$ values. What is the average \$R^2\$, and how does it compare to the **testing \$R^2\$** from Part 1? + (Answer goes here) 2. Examine your distribution of *RMSE* values. What is the average *RMSE*, and how does it compare to the **testing RMSE** from Part 1? + (Answer goes here) ### Part 2.3: Leave-one-out cross validation (LOOCV) ```{r} ## perform LOOCV (using the function my_bodyfat_model defined in Part 2.2) ## Calculate the mean of RMSE ## Make figure of RMSE distribution, which clearly shows the MEAN value for the distribution using stat_summary() (unless you make a boxplot, which already shows the median) ``` #### Part 2.3 question: Answer the questions in the templated bullets! 1. Examine your distribution of *RMSE* values. What is the average *RMSE*, and how does it compare to the **testing RMSE** from Part 1? How does it compare to the average *RMSE* from k-fold cross validation? + (Answer goes here) ### Part 2.4: Wrap-up Considering all three approaches, do you believe this model is highly explanatory of Percent (e.g., how are the \$R^2\$ values)? Further, do you believe the error in this model is low, moderate or high (e.g., how are the RMSE values)? Answer in 1-2 sentences in the bullet: + (Answer goes here) ## Part 3: Predictions New men have arrived, and we want to use our model to predict their body fat percentages! Using the function `modelr::add_predictions()` use our model to predict what the body fat percentages will be for three men with the following physical attributes. + Bob + 37 years of Age + Weight of 195 pounds + 43.6 cm Neck circumference + 110.6 cm Abdomen circumference + 71.7 cm Thigh circumference + 31.2 Forearm circumference + 19.2 Wrist circumference + Bill + 65 years of Age + Weight of 183 pounds + 41.2 cm Neck circumference + 90.1 cm Abdomen circumference + 77.5 cm Thigh circumference + 32.2 cm Forearm circumference + 18.2 cm Wrist circumference + Fred + 19 years of Age + Weight of 121 pounds + 30.2 cm Neck circumference + 68 cm Abdomen circumference + 48.1 cm Thigh circumference + 23.8 cm Forearm circumference + 16.1 cm Wrist circumference ```{r} ## Make a SINGLE tibble with THREE ROWS (one per observed new man), and use this tibble to predict outcomes with `modelr::add_predictions() ## HINT: See the tidyr assignment for different ways to make a tibble directly within R ``` #### Part 3 answers: Stick the answer after the colon for each bullet **in bold**: + Bob's predicted body fat percent is: + Bill's predicted body fat percent is: + Fred's predicted body fat percent is: **BONUS QUESTION!** Which of the three predictions (Bob, Bill, and Fred) do you think is LEAST reliable? You may need some code to figure out which one, so add in below as needed!!