---
title: "Homework #1: Supervised Learning"
author: "**Your Name Here**"
format: sys6018hw-html
---

```{r config}
#| include: false
# Set global configurations and settings here
knitr::opts_chunk$set()                 # set global chunk options
ggplot2::theme_set(ggplot2::theme_bw()) # set ggplot2 theme
```


# Required R packages and Directories  {.unnumbered .unlisted}

```{r packages}
#| message: false
#| warning: false
library(tidyverse) # functions for data manipulation
```


# Problem 1: Evaluating a Regression Model

## a. Data generating functions
Create a set of functions to generate data from the following distributions:

\begin{align*}
X &\sim \mathcal{N}(0, 1) \\
Y &= -1 + .5X + .2X^2 + \epsilon \\
\epsilon &\sim \mathcal{N}(0,\, \sigma)
\end{align*}


::: {.callout-note title="Solution"}
Add solution here
:::


## b. Generate training data

Simulate $n=100$ realizations from these distributions using $\sigma=3$. Produce a scatterplot and draw the true regression line $f(x) = E[Y \mid X=x]$.

- Use `set.seed(611)` prior to generating the data.


::: {.callout-note title="Solution"}
Add solution here
:::


## c. Fit three models

Fit three polynomial regression models using least squares: linear, quadratic, and cubic. Produce another scatterplot, add the fitted lines and true population line $f(x)$  using different colors, and add a legend that maps the line color to a model.

- Note: The true model is quadratic, but we are also fitting linear (less complex) and cubic (more complex) models.

::: {.callout-note title="Solution"}
Add solution here
:::

## d. Predictive performance

Generate a *test data* set of 10,000 observations from the same distributions. Use `set.seed(612)` prior to generating the test data.

- Calculate the estimated mean squared error (MSE) for each model.
- Are the results as expected?

::: {.callout-note title="Solution"}
Add solution here
:::

## e. Optimal performance

What is the best achievable MSE? That is, what is the MSE if the true $f(x)$ was used to evaluate the test set? How close does the best method come to achieving the optimum?

::: {.callout-note title="Solution"}
Add solution here
:::

## f. Replication

The MSE scores obtained in part *d* came from one realization of training data. Here will we explore how much variation there is in the MSE scores by replicating the simulation many times.

- Re-run parts b. and c. (i.e., generate training data and fit models) 100 times.
    - Do not generate new testing data
    - Use `set.seed(613)` prior to running the simulation and do not set the seed in any other places.
- Calculate the test MSE for all simulations.
    - Use the same test data from part d. (This question is only about the variability that comes from the *training data*).
- Create kernel density or histogram plots of the resulting MSE values for each model.

::: {.callout-note title="Solution"}
Add solution here
:::

## g. Best model

Show a count of how many times each model was the best. That is, out of the 100 simulations, count how many times each model had the lowest MSE.

::: {.callout-note title="Solution"}
Add solution here
:::

## h. Function to implement simulation

Write a function that implements the simulation in part *f*. The function should have arguments for i) the size of the training data $n$, ii) the standard deviation of the random error $\sigma$, and iii) the test data.  Use the same `set.seed(613)`. 

::: {.callout-note title="Solution"}
Add solution here
:::

## i. Performance when $\sigma=2$ 

Use your function to repeat the simulation in part *f*, but use $\sigma=2$. Report the number of times each model was best (you do not need to produce any plots). 

- Be sure to generate new test data with ($n = 10000$, $\sigma = 2$, using `seed = 612`). 

::: {.callout-note title="Solution"}
Add solution here
:::

## j. Performance when $\sigma=4$ and $n=300$

Repeat *i*, but now use $\sigma=4$ and $n=300$.

- Be sure to generate new test data with ($n = 10000$, $\sigma = 4$, using `seed = 612`). 

::: {.callout-note title="Solution"}
Add solution here
:::

## k. Understanding

Describe the effects $\sigma$ and $n$ has on selection of the best model? Why is the *true* model form (i.e., quadratic) not always the *best* model to use when prediction is the goal?

::: {.callout-note title="Solution"}
Add solution here
:::