--- title: "Lab 4" author: "STAT 302" date: "Due Date Here" output: html_document --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` *If you collaborated with anyone, you must include "Collaborated with: FIRSTNAME LASTNAME" at the top of your lab!* ## Part 1. Training and Test Error (10 points) Use the following code to generate data: ```{r, message = FALSE} library(ggplot2) # generate data set.seed(302) n <- 30 x <- sort(runif(n, -3, 3)) y <- 2*x + 2*rnorm(n) x_test <- sort(runif(n, -3, 3)) y_test <- 2*x_test + 2*rnorm(n) df_train <- data.frame("x" = x, "y" = y) df_test <- data.frame("x" = x_test, "y" = y_test) # store a theme my_theme <- theme_bw(base_size = 16) + theme(plot.title = element_text(hjust = 0.5, face = "bold"), plot.subtitle = element_text(hjust = 0.5)) # generate plots g_train <- ggplot(df_train, aes(x = x, y = y)) + geom_point() + xlim(-3, 3) + ylim(min(y, y_test), max(y, y_test)) + labs(title = "Training Data") + my_theme g_test <- ggplot(df_test, aes(x = x, y = y)) + geom_point() + xlim(-3, 3) + ylim(min(y, y_test), max(y, y_test)) + labs(title = "Test Data") + my_theme g_train g_test ``` **1a.** For every k in between 1 and 10, fit a degree-k polynomial linear regression model with `y` as the response and `x` as the explanatory variable(s). (*Hint: Use *`poly()`*, as in the lecture slides.*) **1b.** For each model from (a), record the training error. Then predict `y_test` using `x_test` and also record the test error. **1c.** Present the 10 values for both training error and test error on a single table. Comment on what you notice about the relative magnitudes of training and test error, as well as the trends in both types of error as $k$ increases. **1d.** If you were going to choose a model based on training error, which would you choose? Plot the data, colored by split. Add a line to the plot representing your selection for model fit. Add a subtitle to this plot with the (rounded!) test error. (*Hint: See Lecture Slides 9 for example code.*) **1e.** If you were going to choose a model based on test error, which would you choose? Plot the data, colored by split. Add a line to the plot representing your selection for model fit. Add a subtitle to this plot with the (rounded!) test error. **1f.** What do you notice about the shape of the curves from part (d) and (e)? Which model do you think has lower bias? Lower variance? Why? ## Part 2. k-Nearest Neighbors Cross-Validation (10 points) For this part, note that there are tidyverse methods to perform cross-validation in R (see the `rsample` package). However, your goal is to understand and be able to implement the algorithm "by hand", meaning that automated procedures from the `rsample` package, or similar packages, will not be accepted. To begin, load in the popular `penguins` data set from the package `palmerpenguins`. ```{r} library(palmerpenguins) data(package = "palmerpenguins") ``` Our goal here is to predict output class `species` using covariates `bill_length_mm`, `bill_depth_mm`, `flipper_length_mm`, and `body_mass_g`. All your code should be within a function `my_knn_cv`. 
**Input:**

* `train`: input data frame
* `cl`: true class value of your training data
* `k_nn`: integer representing the number of neighbors
* `k_cv`: integer representing the number of folds

*Please note the distinction between `k_nn` and `k_cv`!*

**Output:** a list with objects

* `class`: a vector of the predicted class $\hat{Y}_{i}$ for all observations
* `cv_err`: a numeric with the cross-validation misclassification error

You will need to include the following steps:

* Within your function, define a variable `fold` that randomly assigns observations to folds $1, \ldots, k_{cv}$ with equal probability. (*Hint: see the example code on the slides for k-fold cross-validation.*)
* Iterate through $i = 1, \ldots, k_{cv}$.
* Within each iteration, use `knn()` from the `class` package to predict the class of the $i$th fold using all other folds as the training data.
* Also within each iteration, record the predictions and the misclassification rate (a value between 0 and 1 representing the proportion of observations that were classified **incorrectly**).
* After you have completed the above steps for all $k_{cv}$ iterations, store the vector `class` as the output of `knn()` with the full data as both the training and the test data, and store the value `cv_err` as the average misclassification rate from your cross-validation.

**Submission:** To prove your function works, apply it to the `penguins` data. Predict output class `species` using covariates `bill_length_mm`, `bill_depth_mm`, `flipper_length_mm`, and `body_mass_g`. Use 5-fold cross-validation (`k_cv = 5`). Use a table to show the `cv_err` values for 1-nearest neighbor and 5-nearest neighbors (`k_nn = 1` and `k_nn = 5`). Comment on which value had lower CV misclassification error and which had lower training set error (compare your output `class` to the true class, `penguins$species`).
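The skeleton below is one way the steps above could be put together. It is only a sketch, not the required implementation: it assumes `train` holds only numeric covariates with no missing values, and it assigns folds by shuffling a balanced sequence of fold labels.

```{r}
library(class)

my_knn_cv <- function(train, cl, k_nn, k_cv) {
  n <- nrow(train)
  # randomly assign each observation to a fold 1, ..., k_cv (fold sizes as equal as possible)
  fold <- sample(rep(1:k_cv, length.out = n))
  misclass <- numeric(k_cv)
  for (i in 1:k_cv) {
    # predict the i-th fold using all other folds as training data
    pred_i <- knn(train = train[fold != i, ],
                  test  = train[fold == i, ],
                  cl    = cl[fold != i],
                  k     = k_nn)
    # proportion of fold-i observations classified incorrectly
    misclass[i] <- mean(pred_i != cl[fold == i])
  }
  # full-data predictions and average CV misclassification rate
  list(class  = knn(train = train, test = train, cl = cl, k = k_nn),
       cv_err = mean(misclass))
}
```

An illustrative call, using the hypothetical `penguins_train` and `penguins_cl` objects from the earlier sketch, would be `my_knn_cv(penguins_train, penguins_cl, k_nn = 5, k_cv = 5)`. Shuffling `rep(1:k_cv, length.out = n)` keeps the folds roughly equal in size while still assigning observations at random; drawing fold labels independently with `sample(1:k_cv, n, replace = TRUE)` would also satisfy the "equal probability" description.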