---
title: "Homework #3: Penalized Regression" 
author: "**Your Name Here**"
format: sys6018hw-html
---

```{r config, include=FALSE}
# Set global configurations and settings here
knitr::opts_chunk$set()                 # set global chunk options
ggplot2::theme_set(ggplot2::theme_bw()) # set ggplot2 theme
```

# Required R packages and Directories {.unnumbered .unlisted}

```{r packages, message=FALSE, warning=FALSE}
dir_data = 'https://mdporter.github.io/teaching/data/' # data directory
library(mlbench)
library(glmnet)
library(tidymodels)# for optional tidymodels solutions
library(tidyverse) # functions for data manipulation  
```

# Problem 1: Optimal Tuning Parameters

In cross-validation, we discussed choosing the tuning parameter values that minimized the (average) cross-validation error. Another approach, called the "one-standard error" rule [ISL pg 214, ESL pg 61], uses the values corresponding to the least complex model whose average cv error is within one standard error of the best model. The goal of this assignment is to compare these two rules.

Use simulated data from `mlbench.friedman1(n, sd=2)` in the `mlbench` R package to fit *lasso models*. The tuning parameter $\lambda$ (corresponding to the penalty on the coefficient magnitude) is the one we will focus one. Generate training data, implement a resampling approach to get $\lambda_{\rm min}$ and $\lambda_{\rm 1SE}$, generate test data, make predictions for the test data, and compare performance of the two rules under a squared error loss using a hypothesis test.


Choose reasonable values for:

- Number of cv folds ($K$)
    - Note: you are free to use repeated CV, repeated hold-outs, or bootstrapping instead of plain cross-validation; just be sure to describe what do did so it will be easier to follow.
- Number of training and test observations
- Number of simulations
- If everyone uses different values, we will be able to see how the results change over the different settings.
- Don't forget to make your results reproducible (e.g., set seed)

This pseudo code (using k-fold cv) will get you started:
```yaml
library(mlbench)
library(glmnet)

#:-- Settings
n_train =        # number of training obs
n_test =         # number of test obs
K =              # number of CV folds
alpha =          # glmnet tuning alpha (1 = lasso, 0 = ridge)
M =              # number of simulations

#-- Data Generating Function
getData <- function(n) mlbench.friedman1(n, sd=2) # data generating function

#-- Simulations
# Set Seed Here

for(m in 1:M) {

# 1. Generate Training Data
# 2. Resample training data to estimate best tuning paramter(s) e.g., cv.glmnet()
# 3. Extract the lambda values that minimizes cv error and 1 SE rule
# 4. Generate Test Data
# 5. Predict y values for test data (for each model: min, 1SE)
# 6. Evaluate predictions, return performance metric(s)

}

#-- Compare
# Compare performance of the approaches / Statistical Test
```

## a. Code for the simulation and performance results

::: {.callout-note title="Solution"}
Add solution here
:::

## b. Hypothesis test

Provide results and discussion of a hypothesis test comparing the predictive performance using $\lambda_{\rm min}$ versus $\lambda_{\rm 1SE}$.

::: {.callout-note title="Solution"}
Add solution here
:::

# Problem 2 Prediction Contest: Real Estate Pricing

This problem uses the [realestate-train](`r file.path(dir_data, 'realestate-train.csv')`) and [realestate-test](`r file.path(dir_data, 'realestate-test.csv')`) (click on links for data).

The goal of this contest is to predict sale price (in thousands) (`price` column) using an *elastic net* model. Evaluation of the test data will be based on the root mean squared error ${\rm RMSE}= \sqrt{\frac{1}{m}\sum_i (y_i - \hat{y}_i)^2}$ for the $m$ test set observations.


## a. Load and pre-process data

Load the data and create necessary data structures for running *elastic net*.

- You are free to use any data transformation or feature engineering
- Note: there are some categorical predictors so at the least you will have to convert those to something numeric (e.g., one-hot or dummy coding).

::: {.callout-note title="Solution"}
Add solution here
:::


## b. Fit elastic net model

Use an *elastic net* model to predict the `price` of the test data.

- You are free to use any data transformation or feature engineering
- You are free to use any tuning parameters
- Report the $\alpha$ and $\lambda$ parameters you used to make your final predictions.
- Describe how you choose those tuning parameters

::: {.callout-note title="Solution"}
Add solution here
:::

## c. Submit predictions

Submit a .csv file (ensure comma separated format) named `lastname_firstname.csv` that includes your predictions in a column named *yhat*. We will use automated evaluation, so the format must be exact.

- You will receive credit for a proper submission; the top five scores will receive 2 bonus points.

::: {.callout-note title="Solution"}
Add solution here
:::

## d. Report anticpated performance

Report the anticipated performance of your method in terms of RMSE. We will see how close your performance assessment matches the actual value. 

::: {.callout-note title="Solution"}
Add solution here
:::