--- title: "Homework #1: Supervised Learning" author: "**Your Name Here**" format: ds6030hw-html --- ```{r config} #| include: false # Set global configurations and settings here knitr::opts_chunk$set() # set global chunk options ggplot2::theme_set(ggplot2::theme_bw()) # set ggplot2 theme ``` # Required R packages and Directories {.unnumbered .unlisted} ```{r packages} #| message: false #| warning: false library(tidyverse) # functions for data manipulation ``` # Problem 1: Evaluating a Regression Model ## a. Data generating functions Create a set of functions to generate data from the following distributions: \begin{align*} X &\sim \mathcal{N}(0, 1) \\ Y &= -1 + .5X + .2X^2 + \epsilon \\ \epsilon &\sim \mathcal{N}(0,\, \sigma) \end{align*} ::: {.callout-note title="Solution"} Add solution here ::: ## b. Generate training data Simulate $n=100$ realizations from these distributions using $\sigma=3$. Produce a scatterplot and draw the true regression line $f(x) = E[Y \mid X=x]$. - Use `set.seed(611)` prior to generating the data. ::: {.callout-note title="Solution"} Add solution here ::: ## c. Fit three models Fit three polynomial regression models using least squares: linear, quadratic, and cubic. Produce another scatterplot, add the fitted lines and true population line $f(x)$ using different colors, and add a legend that maps the line color to a model. - Note: The true model is quadratic, but we are also fitting linear (less complex) and cubic (more complex) models. ::: {.callout-note title="Solution"} Add solution here ::: ## d. Predictive performance Generate a *test data* set of 10,000 observations from the same distributions. Use `set.seed(612)` prior to generating the test data. - Calculate the estimated mean squared error (MSE) for each model. - Are the results as expected? ::: {.callout-note title="Solution"} Add solution here ::: ## e. Optimal performance What is the best achievable MSE? That is, what is the MSE if the true $f(x)$ was used to evaluate the test set? How close does the best method come to achieving the optimum? ::: {.callout-note title="Solution"} Add solution here ::: ## f. Replication The MSE scores obtained in part *d* came from one realization of training data. Here will we explore how much variation there is in the MSE scores by replicating the simulation many times. - Re-run parts b. and c. (i.e., generate training data and fit models) 100 times. - Do not generate new testing data - Use `set.seed(613)` prior to running the simulation and do not set the seed in any other places. - Calculate the test MSE for all simulations. - Use the same test data from part d. (This question is only about the variability that comes from the *training data*). - Create kernel density or histogram plots of the resulting MSE values for each model. ::: {.callout-note title="Solution"} Add solution here ::: ## g. Best model Show a count of how many times each model was the best. That is, out of the 100 simulations, count how many times each model had the lowest MSE. ::: {.callout-note title="Solution"} Add solution here ::: ## h. Function to implement simulation Write a function that implements the simulation in part *f*. The function should have arguments for i) the size of the training data $n$, ii) the standard deviation of the random error $\sigma$, and iii) the test data. Use the same `set.seed(613)`. ::: {.callout-note title="Solution"} Add solution here ::: ## i. Performance when $\sigma=2$ Use your function to repeat the simulation in part *f*, but use $\sigma=2$. 
## i. Performance when $\sigma=2$

Use your function to repeat the simulation in part *f*, but use $\sigma=2$. Report the number of times each model was best (you do not need to produce any plots); a sketch of one possible call pattern appears at the end of this document.

- Be sure to generate new test data with $n = 10000$ and $\sigma = 2$, using `set.seed(612)`.

::: {.callout-note title="Solution"}
Add solution here
:::

## j. Performance when $\sigma=4$ and $n=300$

Repeat part *i*, but now use $\sigma=4$ and $n=300$.

- Be sure to generate new test data with $n = 10000$ and $\sigma = 4$, using `set.seed(612)`.

::: {.callout-note title="Solution"}
Add solution here
:::

## k. Understanding

Describe the effects that $\sigma$ and $n$ have on the selection of the best model. Why is the *true* model form (i.e., quadratic) not always the *best* model to use when prediction is the goal?

::: {.callout-note title="Solution"}
Add solution here
:::
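If the earlier sketch were used, parts *i* and *j* would amount to regenerating the test data with the new settings and calling the same function. The snippet below shows the part *i* settings only; it assumes the placeholder helpers (`sim_x()`, `sim_y()`, `run_simulation()`) defined in the sketch after part *h*.

```{r sim-sigma2-sketch}
#| eval: false
# Part i: new test data with n = 10000 and sigma = 2 (seed 612),
# then rerun the 100-replication simulation with sigma = 2.
set.seed(612)
x_test <- sim_x(10000)
test_2 <- tibble(x = x_test, y = sim_y(x_test, sd = 2))

set.seed(613)
results_2 <- run_simulation(n_train = 100, sigma = 2, data_test = test_2)

# Count how many times each model had the lowest test MSE
results_2 |>
  group_by(rep) |>
  slice_min(mse, n = 1) |>
  ungroup() |>
  count(model)
```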