In this assignment, we will learn how to implement three important regression techniques in R: linear regression, ridge regression, and the LASSO. We'll apply these three modeling techniques to mtcars, a dataset that comes preloaded with R.
Linear regression is used to model a continuous response variable \(y\) as a linear combination of \(p\) predictors taking values \(x_1, \ldots, x_p\). Suppose that one observes \(n\) samples of \(y\) and the associated predictors (in this case \(y, x_1, \ldots, x_p\) are all \(n\)-dimensional vectors). Define \(X := (x_1 \hskip .1pc x_2 \ldots x_p)\) as the \(n \times p\) design matrix. The stochastic linear regression model of \(y\) on \(x_1, \ldots, x_p\) is given by \[ (1):~~y = X\beta + \epsilon \] where \(\epsilon = (\epsilon_1, \ldots, \epsilon_n)^T\) is a vector of uncorrelated errors with mean 0 and variance 1, and \(\beta = (\beta_1, \ldots, \beta_p)^T\) is the vector of unknown coefficients. We typically assume the stronger condition that \(\epsilon_i \stackrel{iid}{\sim} N(0,1)\) to simplify statistical inference on \(\beta_1, \ldots, \beta_p\). One should be careful when applying model (1), as there are several conditions that should be verified. We don't discuss these conditions here, but we recommend reading more about model assumptions and diagnostics for linear regression.
Recall from class that the least squares estimate \(\hat{\beta}\) is obtained by solving the normal equations: \[(2):~~\hat{\beta} = (X^TX)^{-1}X^Ty\] As we can see from (2), the calculation of \(\hat{\beta}\) relies on the invertibility of \(X^TX\). Even if we assume \(p < n\), we still require that there be no perfect linear dependence among the predictor vectors \(x_1, \ldots, x_p\). In other words, we require the design matrix \(X\) to have full rank \(p\). When \(\mathrm{rank}(X) < p\), \(X\) suffers from multicollinearity, in which case additional tools are needed to estimate \(\beta\). For example, penalization methods like ridge regression (squared \(L_2\) penalization) or the LASSO (\(L_1\) penalization) can be used to ``shrink'' the estimates of \(\beta\) towards the origin.
We will use the mtcars dataset available in R as an example throughout this assignment. This dataset describes various quantitative features of 32 different automobiles and contains 11 variables in total. We will study miles per gallon (mpg) as a function of four other predictors: displacement (disp), gross horsepower (hp), weight (wt), and rear axle ratio (drat).
Load and parse the data with the following code:
#load the data
data(mtcars)
#create design matrix
X = data.frame(disp = mtcars$disp, hp = mtcars$hp, wt = mtcars$wt, drat = mtcars$drat)
#create response
Y = mtcars$mpg
YOUR CODE HERE
YOUR ANSWER HERE
YOUR ANSWER HERE
The lm(y ~ x, data) command can be used to run a linear regression of \(y\) on \(x\). Here, \(y\) and \(x\) are both \(n\)-dimensional vectors. The data argument is optional and specifies the data frame in which \(x\) is contained. Once a linear regression has been run, we can use the summary() command to obtain coefficient estimates, standard errors of the estimates, and p-values measuring the significance of each coefficient in the fitted model. Type ?lm for more details. Fit and summarize a linear model of mpg on the remaining variables using the following code:
linear.reg = lm(Y ~ disp + hp + wt + drat, data = X)
summary(linear.reg)
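As an optional sanity check (not part of the assignment), the coefficients reported by lm() can be reproduced directly from equation (2). The sketch below assumes X and Y have been created as above, and prepends a column of ones so that the intercept is estimated as well:
#reproduce the lm() estimates via the normal equations in (2)
#prepend a column of ones so that an intercept is included
X.mat = cbind(intercept = 1, as.matrix(X))
beta.hat = solve(t(X.mat) %*% X.mat, t(X.mat) %*% Y)
#compare with the coefficients reported by lm()
cbind(beta.hat, coef(linear.reg))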
Now to demonstrate a situation of perfect collinearity, consider including a fifth covariate which is exactly twice the value of the hp variable. Construct a new design matrix and try fitting a linear model using the following code:
#construct a new design matrix
X.new = data.frame(X, two.hp = 2*X$hp)
#attempt to fit a linear model
linear.reg.fail = lm(Y ~ disp + hp + wt + drat + two.hp, data = X.new)
#summarize the regression
summary(linear.reg.fail)
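In the summary above, lm() reports an NA coefficient for two.hp: because two.hp is an exact linear combination of hp, R drops the redundant column rather than fail outright. The rank deficiency can also be confirmed numerically; a quick optional check:
#the design matrix has 5 columns but only rank 4
qr(as.matrix(X.new))$rank
ncol(X.new)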
The lm.ridge(y ~ x, data, lambda) command is used to run a ridge regression of \(y\) on \(x\) with ridge parameter \(\lambda\). Here, y, x, and data have the same interpretation as in the lm() function described above. The lm.ridge() command is available in the MASS package in R, so be sure to load this package before use. Importantly, one must specify the ridge parameter \(\lambda\) when fitting the model. One principled way to choose \(\lambda\) is through cross validation.
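For reference (this formula is not given in the assignment, but is standard), the ridge estimate solves a penalized version of the least squares problem, one that remains well defined even when \(X^TX\) is singular: \[ \hat{\beta}^{ridge} = \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2 = (X^TX + \lambda I_p)^{-1}X^Ty \] Note that lm.ridge() centers and scales the variables internally before applying this formula, so its reported coefficients will not match a naive hand computation on the raw data.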
For the moment, let's try a few fixed values of \(\lambda\). First, fit a ridge regression with a fixed value of \(\lambda = 1\). Also, calculate the distance of \(\hat{\beta}\) from the origin (i.e., the magnitude of \(\hat{\beta}\), here measured by the \(L_1\) norm, the sum of the absolute coefficients) using the following code:
#run ridge regression with lambda = 1
ridge.reg.1 = lm.ridge(Y ~ disp + hp + wt + drat, data = X, lambda = 1)
#extract the estimated coefficients of the fit
coef.ridge.1 = coef(ridge.reg.1)
#calculate the magnitude of the estimated coefficient vector
dist.reg.1 = sum(abs(coef.ridge.1))
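Though not required here, lm.ridge() also accepts a vector of \(\lambda\) values, which makes it easy to visualize how the coefficients shrink as \(\lambda\) grows. A minimal sketch, assuming MASS is loaded (the grid seq(0, 20, 0.1) is an arbitrary choice):
#fit ridge regressions over a grid of lambda values
ridge.grid = lm.ridge(Y ~ disp + hp + wt + drat, data = X, lambda = seq(0, 20, 0.1))
#plot the coefficient paths (the ridge trace)
plot(ridge.grid)
#print the lambda values suggested by the HKB, L-W, and GCV criteria
select(ridge.grid)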
YOUR CODE HERE
YOUR CODE HERE
The glmnet(X, y, lambda) command can be used to fit the LASSO model of \(y\) on \(X\) with specified penalty parameter lambda. The glmnet package contains the functions required to conduct the LASSO; install and load this package before proceeding using the install.packages() and library() commands. Like ridge regression, the LASSO relies on the choice of a penalty parameter (call it \(\lambda_1\)). In contrast to the lm() and lm.ridge() commands, the glmnet(X, y, lambda) command requires \(X\) to be an \(n \times p\) design matrix. As usual, \(y\) is the \(n\)-dimensional vector of responses. Once again, we can choose the ``best'' \(\lambda_1\) using cross validation, but we'll come back to this later. The coef() command is used to summarize the estimated coefficients of the model. Fit a LASSO model with \(\lambda_1 = 1\) to the mtcars data and calculate the magnitude of the estimated coefficients using the following commands:
#fit a LASSO model with lambda = 1
lasso.fit.1 = glmnet(as.matrix(X),Y,lambda = 1)
#summarize the estimated coefficients
coef.lasso.1 = coef(lasso.fit.1)
#calculate the magnitude of estimated coefficients
dist.lasso.1 = sum(abs(coef.lasso.1))
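For reference, and assuming glmnet's default Gaussian settings, the fitted values \((\hat{\beta}_0, \hat{\beta})\) minimize the penalized criterion \[ \frac{1}{2n}\sum_{i=1}^n (y_i - \beta_0 - x_i^T\beta)^2 + \lambda_1 \|\beta\|_1. \] Unlike the ridge penalty, the \(L_1\) penalty can set coefficients exactly to zero, so the LASSO performs variable selection as well as shrinkage; you may already see zeros in coef.lasso.1 above.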
YOUR CODE HERE
YOUR CODE HERE
We will repeat the analysis that was done in class. Start by uncommenting and running the following code to install the BiocManager and bcellViper packages. If you have any difficulty installing these packages, you can seek further instruction here.
# if (!requireNamespace("BiocManager", quietly = TRUE))
# install.packages("BiocManager")
#
# BiocManager::install("bcellViper")
# install.packages("HDCI")
library(bcellViper)
data(bcellViper)
gene_expressions = data.frame(t(assayDataElement(dset,'exprs')))
The installation of the above packages only needs to be done once; the installation code can be re-commented after this initial run.
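Before modeling, it may help to check the dimensions of the new data frame: for this gene expression data there are far more predictors (genes, the columns) than samples (rows), which is worth keeping in mind for the question below. A quick optional check:
#rows are samples, columns are genes
dim(gene_expressions)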
Fit a linear model to the gene_expressions data, using ADA as your response variable and the remaining variables as your predictors. Comment on any irregularities in the model output. Explain why you are not getting a reasonable number for your degrees of freedom in the model summary.
YOUR CODE, AND ANALYSIS, HERE
YOUR CODE, AND ANALYSIS, HERE