--- title: "Regularized Regression" output: pdf_document --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) library(dplyr) library(ggplot2) library(glmnet) set.seed(03082021) ``` In a regression framework the goal is to model the relationship between a variable of interest, $y$, and a set of covariates $\mathcal{X}$. Using the normal distributional assumptions then the joint distribution of the observed data, given the data $x_1, \dots, x_n$ along with $\beta$ and $\sigma^2$ can be written as: \begin{eqnarray} p(y_1, \dots, y_n|\tilde{x}_1, \dots, \tilde{x_n}, \tilde{\beta}, \sigma^2) &=& \prod_{i=1}^n p(y_i|\tilde{x}_i, \tilde{\beta}, \sigma^2)\\ &=& (2 \pi \sigma^2)^{-n/2} \exp\left[-\frac{1}{2 \sigma^2} \sum_{i=1}^n\left(y_i - \tilde{\beta}^T \tilde{x}_i \right)^2 \right]. \end{eqnarray} Note this is the same as the sampling, or generative model, that we have seen earlier in class. \vfill The model is often formulated using matrix expressions and a multivariate normal distribution. Let \vfill \vfill \vfill In a classical setting, typically least squares methods are used to compute the values of the covariates in a regression setting. Specifically, we seek to minimize the sum of squared residuals ($SSR$), where $SSR(\tilde{\beta}) = \left(\tilde{y} - X \tilde{\beta}\right)^T \left(\tilde{y} - X \tilde{\beta}\right).$ \vfill Thus we will take the derivative of this function with respect to $\beta$ to minimize this expression. \begin{eqnarray} \frac{d}{d\tilde{\beta}} SSR(\tilde{\beta}) &=& \frac{d}{d\tilde{\beta}} \left(\tilde{y} - X \tilde{\beta}\right)^T \left(\tilde{y} - X \tilde{\beta}\right)\\ &=& \frac{d}{d\tilde{\beta}} \left( \tilde{y}^T \tilde{y} - 2 \tilde{\beta}^TX^T\tilde{y} + \tilde{\beta}^T X^T X \tilde{\beta}\right)\\ &=& -2X^T \tilde{y} + 2 X^T X \tilde{\beta} \\ \text{ then set = 0}&\text{which implies}& X^TX \beta = X^t \tilde{y}\\ &\text{and}& \tilde{\beta} = (X^TX)^{-1} X^t \tilde{y}. \end{eqnarray} This value is the OLS estimate of $\tilde{\beta}_{OLS} = (X^TX)^{-1} X^T \tilde{y}$. \vfill \vfill \newpage ##### Bayesian Modeling and Regularization Ordinary Least Squares (OLS) regression can be written as: \begin{equation*} \hat{\tilde{\beta}}_{OLS} = \text{arg min}_{\hat{\tilde{\beta}}} \hspace{.2cm} ||\tilde{y} - X\hat{\tilde{\beta}}||^2_2 \rightarrow \hat{\tilde{\beta}} = (X^TX)^{-1} X^T \tilde{y}, \end{equation*} where $||\tilde{x}||_p=\left(|x_1|^p + \dots + |x_m|^p \right)^{1/p}$ is an LP norm. So the L2 norm is $||\tilde{x}||_2= \sqrt{x_1^2 + \dots x_m^2}.$ \vfill \vfill \vfill \vfill An alternative form of penalized regression is known as Least Absolute Shrinkage and Selection Operator (LASSO). The LASSO uses an L1 penalty such that: \vfill \vfill Consider the following prior $p(\tilde{\beta}) = N(0, I_p\tau^2)$. How does this relate to ridge regression? First compute the posterior distribution for $\tilde{\beta}$. 
\begin{eqnarray*}
p(\tilde{\beta}|-) &\propto& \exp \left[-\frac{1}{2}\left(\frac{1}{\sigma^2}\tilde{\beta}^T X^T X \tilde{\beta} - \frac{2}{\sigma^2} \tilde{\beta}^T X^T \tilde{y} + \tilde{\beta}^T \frac{I_p}{\tau^2} \tilde{\beta} \right) \right]\\
&\propto& \exp \left[-\frac{1}{2}\left(\frac{1}{\sigma^2}\tilde{\beta}^T\left( X^T X + \frac{\sigma^2}{\tau^2} I_p \right)\tilde{\beta} - \frac{2}{\sigma^2} \tilde{\beta}^T X^T \tilde{y} \right) \right]
\end{eqnarray*}
\vfill
\newpage

### Simulate Regression Data

```{r}
n <- 50
p <- 10
# simulate covariates and a sparse coefficient vector
X <- matrix(runif(n * p, min = -1, max = 1), n, p)
beta <- rep(0, p)
beta[c(1, 2)] <- c(3, 3)
y <- rnorm(n, mean = X %*% beta, sd = 3)
y_center <- y - mean(y)
summary(lm(y_center ~ X))
```

### Ridge Regression

```{r}
ridge_fit <- glmnet(X, y_center, alpha = 0, lambda = 0)
t(coef(ridge_fit))
```

\newpage

```{r}
t(coef(glmnet(X, y_center, alpha = 0, lambda = 1e6)))
ridge_fit_cv <- cv.glmnet(X, y_center, alpha = 0)
lambdas_seq <- 10^seq(-3, 5, length.out = 100)
ridge_cv <- cv.glmnet(X, y_center, alpha = 0, lambda = lambdas_seq)
plot(ridge_cv)
print(ridge_cv)
```

\newpage

```{r}
ridge_fit <- glmnet(X, y_center, alpha = 0, lambda = ridge_cv$lambda.min)
t(coef(ridge_fit))
```

### Lasso

```{r}
t(coef(glmnet(X, y_center, alpha = 0, lambda = 0)))
t(coef(glmnet(X, y_center, alpha = 1, lambda = 1e6)))
lasso_cv <- cv.glmnet(X, y_center, alpha = 1, lambda = lambdas_seq)
plot(lasso_cv)
lasso_fit <- glmnet(X, y_center, alpha = 1, lambda = lasso_cv$lambda.min)
coef(lasso_fit)
```
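
To tie the code back to the Bayesian derivation, the sketch below computes the closed-form ridge solution, i.e., the posterior mean $\left(X^TX + \frac{\sigma^2}{\tau^2} I_p\right)^{-1}X^T\tilde{y}$, directly. The value of $\tau^2$ is an assumed prior variance chosen only for illustration; note that `glmnet` scales its objective by $1/n$ and standardizes the columns of $X$ by default, so its `lambda` is not on the same scale as $\sigma^2/\tau^2$.

```{r}
# Posterior mean under the prior beta ~ N(0, tau^2 I_p):
# (X^T X + (sigma^2 / tau^2) I_p)^{-1} X^T y
sigma2 <- 3^2      # sigma = 3 was used to simulate y above
tau2 <- 1          # assumed prior variance (illustrative choice)
beta_ridge <- solve(t(X) %*% X + (sigma2 / tau2) * diag(p), t(X) %*% y_center)
round(t(beta_ridge), 3)
```

Larger values of $\tau^2$ correspond to weaker shrinkage; as $\tau^2 \rightarrow \infty$ the posterior mean approaches the OLS estimate.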