![](ISL_fig_6_2.pdf){width=600px height=350px}

$C_p$, BIC, and adjusted $R^2$ are shown for the best models of each size for the `Credit` data set. $C_p$ and BIC are estimates of test MSE. In the middle plot we see that the BIC estimate of test error shows an increase after four variables are selected. The other two plots are rather flat after four variables are included.
:::

- $C_p$, AIC, and BIC have theoretical justification: as the sample size $n$ grows very large, they are guaranteed to find the best model. Adjusted $R^2$ is less motivated by statistical theory.

- Indirect methods hold a great computational advantage: they require only a single fit to the training data.

### Direct approaches: validation and cross-validation

We'll discuss validation and cross-validation (CV) in detail in **Chapter 5**.

- This procedure has an advantage relative to AIC, BIC, $C_p$, and adjusted $R^2$, in that it provides a direct estimate of the test error, and _doesn't require an estimate of the error variance $\sigma^2$_.

- It can also be used in a wider range of model selection tasks, even in cases where it is hard to pinpoint the model degrees of freedom (e.g., the number of predictors in the model) or hard to estimate the error variance $\sigma^2$.

::: {#fig-bic-val-cv-credit}
![](ISL_fig_6_3.pdf){width=600px height=350px}

Square root of BIC, validation set error, and CV error for the `Credit` data.
:::

## Shrinkage methods

- The subset selection methods use least squares to fit a linear model that contains a subset of the predictors.

- As an alternative, we can fit a model containing all $p$ predictors using a technique that **constrains** or **regularizes** the coefficient estimates, or equivalently, that **shrinks** the coefficient estimates towards zero.

- It may not be immediately obvious why such a constraint should improve the fit, but it turns out that shrinking the coefficient estimates can significantly reduce their variance.

### Ridge regression

- Recall that the least squares fitting procedure estimates $\beta_0, \beta_1, \ldots, \beta_p$ using the values that minimize
$$
\text{RSS} = \sum_{i=1}^n \left( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \right)^2.
$$

- In contrast, the ridge regression coefficient estimates $\hat{\beta}^R$ are the values that minimize
$$
\text{RSS} + \lambda \sum_{j=1}^p \beta_j^2,
$$
where $\lambda \ge 0$ is a **tuning parameter**, to be determined separately.

- Similar to least squares, ridge regression seeks coefficient estimates that fit the data well, by making the RSS small.

- However, the second term, $\lambda \sum_{j} \beta_j^2$, called a **shrinkage penalty**, is small when $\beta_1, \ldots, \beta_p$ are close to zero, and so it has the effect of **shrinking** the estimates of $\beta_j$ towards zero.

::: {.panel-tabset}

#### Python

Since scikit-learn only takes arrays $X$ and $y$, we first apply transformers to transform the dataframe `Credit` into arrays. There are two ways to treat dummy coding of categorical variables: standardize or not.
For consistency with the ISL textbook, we standardize dummy coding here:

```{python}
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import make_column_transformer
from sklearn.pipeline import Pipeline

# Response vector
y = Credit.Balance
# Dummy encoding for categorical variables
cattr = make_column_transformer(
    (OneHotEncoder(drop = 'first'), ['Own', 'Student', 'Married', 'Region']),
    remainder = 'passthrough'
)
# Standardization for ALL variables including dummy coding
# Note the order of columns changed (dummy-coded columns come first)
pipe = Pipeline([('cat_tf', cattr), ('std_tf', StandardScaler())])
X = pipe.fit_transform(Credit.drop('Balance', axis = 1))
X.shape
```

Alternatively (and preferred), we don't standardize dummy coding for categorical variables.

```{python}
#| eval: false
#| code-fold: true
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import make_column_transformer

# Response vector
y = Credit.Balance
# Dummy encoding for categorical variables
# Standardization for continuous variables
coltr = make_column_transformer(
    (StandardScaler(), ['Income', 'Limit', 'Rating', 'Cards', 'Age', 'Education']),
    (OneHotEncoder(drop = 'first'), ['Own', 'Student', 'Married', 'Region'])
)
X = coltr.fit_transform(Credit.drop('Balance', axis = 1))
X.shape
```

Fit ridge regression at a grid of tuning/regularization parameter values. Note the regularization parameter in the [ridge regression in scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html) is called $\alpha$.
```{python}
import numpy as np
from sklearn.linear_model import Ridge

clf = Ridge()
# Ridge regularization parameter grid
alphas = np.logspace(start = -3, stop = 5, base = 10)

# Train the model with different regularization strengths
coefs = []
for a in alphas:
    clf.set_params(alpha = a)
    clf.fit(X, y)
    coefs.append(clf.coef_)
```

Visualize the ridge solution path.

```{python}
import matplotlib.pyplot as plt

# Visualize ridge solution path
plt.figure(figsize = (8, 8))
ax = plt.gca()
ax.plot(alphas, coefs)
ax.set_xscale("log")
ax.legend(
    labels = ["_", "Student", "_", "_", "_", "Income", "Limit", "Rating", "_", "_", "_"],
    fontsize = 16
)
plt.xlabel(r"$\alpha$")
plt.ylabel("Coefficients")
plt.title("Ridge solution path of Credit data")
plt.axis("tight")
plt.show()
```

#### R

Data pre-processing by recipe.

```{r}
library(tidymodels)

norm_recipe <- recipe(
  Balance ~ ., 
  data = Credit
  ) %>%
  # create traditional dummy variables
  step_dummy(all_nominal()) %>%
  # zero-variance filter
  step_zv(all_predictors()) %>%
  # center and scale numeric data
  step_normalize(all_predictors()) %>%
  # estimate the means and standard deviations
  prep(training = Credit, retain = TRUE)
norm_recipe
```

Set up the ridge model.

```{r}
lambda_grid <- c(0, 10^seq(-3, 5, length.out = 100))
ridge_mod <- 
  # mixture = 0 (ridge), mixture = 1 (lasso)
  linear_reg(penalty = 1, mixture = 0) %>% 
  set_engine("glmnet", path_values = lambda_grid)
ridge_mod
```

Bundle the recipe and model into a workflow.

```{r}
ridge_wf <- workflow() %>%
  add_model(ridge_mod) %>%
  add_recipe(norm_recipe)
ridge_wf
```

Fit ridge regression.
```{r}
fit_ridge <- ridge_wf %>% fit(data = Credit)
```

Visualize the ridge solution path.

```{r}
broom:::tidy.glmnet(fit_ridge$fit$fit$fit) %>%
  print(width = Inf) %>%
  filter(term != "(Intercept)") %>%
  ggplot() + 
  geom_line(mapping = aes(x = lambda, y = estimate, color = term)) + 
  scale_x_log10() +
  labs(
    x = quote(lambda),
    y = "Standardized Coefficients",
    title = "Ridge solution path for Credit data"
    )
```

:::

::: {#fig-ridge-credit}
![](ISL_fig_6_4.pdf){width=600px height=350px}

Ridge solution path for the `Credit` data (ISL2 Figure 6.4).
:::

- The standard least squares coefficient estimates are **scale equivariant**: multiplying $X_j$ by a constant $c$ simply leads to a scaling of the least squares coefficient estimates by a factor of $1/c$. In other words, regardless of how the $j$th predictor is scaled, $X_j \hat{\beta}_j$ will remain the same.

- In contrast, the ridge regression coefficient estimates can change **substantially** when multiplying a given predictor by a constant, due to the sum of squared coefficients term in the penalty part of the ridge regression objective function.

- Therefore, it is best to apply ridge regression after **standardizing** the predictors, using the formula
$$
\tilde{x}_{ij} = \frac{x_{ij}}{\sqrt{\frac{1}{n} \sum_{i=1}^n (x_{ij} - \bar{x}_j)^2}}.
$$

- Why does ridge regression improve over least squares? Answer: the bias-variance tradeoff.

::: {#fig-ridge-bias-variance-tradeoff}
![](ISL_fig_6_5.pdf){width=600px height=350px}

Simulated data with $n = 50$ observations and $p = 45$ predictors, all having nonzero coefficients. Squared bias (black), variance (green), and test mean squared error (purple) for the ridge regression predictions on a simulated data set, as a function of $\lambda$ and $\|\hat{\beta}_{\lambda}^R\|_2 / \|\hat{\beta}\|_2$. The horizontal dashed lines indicate the minimum possible MSE. The purple crosses indicate the ridge regression models for which the test MSE is smallest.
:::

- The tuning parameter $\lambda$ serves to control the relative impact of these two terms on the regression coefficient estimates.

- Selecting a good value for $\lambda$ is critical. **Cross-validation** can be used for this.

### Lasso

- Ridge regression does have one obvious disadvantage: unlike subset selection, which will generally select models that involve just a subset of the variables, ridge regression will include all $p$ predictors in the final model.

- The **lasso** is a relatively recent alternative to ridge regression that overcomes this disadvantage. The lasso coefficients, $\hat{\beta}_\lambda^L$, minimize the quantity
$$
\text{RSS} + \lambda \sum_{j=1}^p |\beta_j|.
$$

- In statistical parlance, the lasso uses an $\ell_1$ (pronounced "ell 1") penalty instead of an $\ell_2$ penalty. The $\ell_1$ norm of a coefficient vector $\beta$ is given by $\|\beta\|_1 = \sum_j |\beta_j|$.

- As with ridge regression, the lasso shrinks the coefficient estimates towards zero.

- However, in the case of the lasso, the $\ell_1$ penalty has the effect of forcing some of the coefficient estimates to be exactly equal to zero when the tuning parameter $\lambda$ is sufficiently large.

- Hence, much like best subset selection, the lasso performs **variable selection**.

- We say that the lasso yields **sparse** models: models that involve only a subset of the variables.

- As in ridge regression, selecting a good value of $\lambda$ for the lasso is critical. Cross-validation is again the method of choice.
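To see why the $\ell_1$ penalty produces exact zeros while the $\ell_2$ penalty does not, consider the special case of an orthonormal design matrix ($X^\top X = I$). Then both problems decouple coordinate-by-coordinate: minimizing $\text{RSS} + \lambda \sum_j \beta_j^2$ gives $\hat{\beta}_j / (1 + \lambda)$, while minimizing $\text{RSS} + \lambda \sum_j |\beta_j|$ gives the soft-thresholded value $\operatorname{sign}(\hat{\beta}_j) \max(|\hat{\beta}_j| - \lambda/2, 0)$, where $\hat{\beta}_j$ is the least squares estimate. A minimal numpy sketch (not part of the ISL labs; the coefficient values below are made up for illustration):

```{python}
import numpy as np

def soft_threshold(b, lam):
    # Lasso solution per coordinate (orthonormal design):
    # shrink |b| by lam / 2 and clip at zero
    return np.sign(b) * np.maximum(np.abs(b) - lam / 2, 0.0)

def ridge_shrink(b, lam):
    # Ridge solution per coordinate (orthonormal design):
    # proportional shrinkage, never exactly zero
    return b / (1 + lam)

b_ols = np.array([3.0, -0.4, 0.1])  # hypothetical least squares estimates
soft_threshold(b_ols, 1.0)  # the two small coefficients become exactly zero
```

Comparing `soft_threshold(b_ols, 1.0)` with `ridge_shrink(b_ols, 1.0)` shows the qualitative difference: the lasso zeros out the small coefficients, while ridge merely scales every coefficient by $1/(1+\lambda)$.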
::: {.panel-tabset}

#### Python

```{python}
from sklearn.linear_model import Lasso

clf = Lasso()
# Lasso regularization parameter grid
alphas = np.logspace(start = -3, stop = 4, base = 10)

# Train the model with different regularization strengths
coefs = []
for a in alphas:
    clf.set_params(alpha = a)
    clf.fit(X, y)
    coefs.append(clf.coef_)
```

```{python}
# Visualize lasso solution path
plt.figure(figsize = (10, 8))
ax = plt.gca()
ax.plot(alphas, coefs)
ax.set_xscale("log")
ax.legend(
    labels = ["_", "Student", "_", "_", "_", "Income", "Limit", "Rating", "_", "_", "_"],
    fontsize = 16
)
plt.xlabel(r"$\alpha$")
plt.ylabel("Coefficients")
plt.title("Lasso solution path of Credit data")
plt.axis("tight")
plt.show()
```

#### R

Set up the lasso model.

```{r}
lambda_grid <- c(0, 10^seq(-3, 4, length.out = 100))
lasso_mod <- 
  # mixture = 0 (ridge), mixture = 1 (lasso)
  linear_reg(penalty = 1, mixture = 1) %>% 
  set_engine("glmnet", path_values = lambda_grid)
lasso_mod
```

Bundle the recipe and model into a workflow.

```{r}
lasso_wf <- workflow() %>%
  add_model(lasso_mod) %>%
  add_recipe(norm_recipe)
lasso_wf
```

Fit lasso regression.

```{r}
fit_lasso <- lasso_wf %>% fit(data = Credit)
```

Visualize the lasso solution path.

```{r}
broom:::tidy.glmnet(fit_lasso$fit$fit$fit) %>%
  print(width = Inf) %>%
  filter(term != "(Intercept)") %>%
  ggplot() + 
  geom_line(mapping = aes(x = lambda, y = estimate, color = term)) + 
  scale_x_log10() +
  labs(
    x = quote(lambda),
    y = "Standardized Coefficients",
    title = "Lasso solution path for Credit data"
    )
```

:::

::: {#fig-lasso-credit}
![](ISL_fig_6_6.pdf){width=600px height=350px}

Lasso solution path for the `Credit` data (ISL Figure 6.6).
:::

- Why is it that the lasso, unlike ridge regression, results in coefficient estimates that are exactly equal to zero? One can show that the lasso and ridge regression coefficient estimates solve the problems
$$
\text{minimize } \text{RSS} \quad \text{ subject to } \sum_{j=1}^p |\beta_j| \le s
$$
and
$$
\text{minimize } \text{RSS} \quad \text{ subject to } \sum_{j=1}^p \beta_j^2 \le s,
$$
respectively.

::: {#fig-ridge-lasso-contour}
![](ISL_fig_6_7.pdf){width=600px height=450px}

Contours of the RSS and constraint functions for the lasso (left) and ridge regression (right). The solid blue areas are the constraint regions, $|\beta_1| + |\beta_2| \le s$ and $\beta_1^2 + \beta_2^2 \le s$, while the red ellipses are the contours of the RSS.
:::

### Comparing ridge and lasso

- Simulation example with $n=50$ and $p=45$. All 45 predictors are related to the response.

::: {#fig-ridge-lasso-compare-1}
![](ISL_fig_6_8.pdf){width=600px height=350px}

Left: Plots of squared bias (black), variance (green), and test MSE (purple) for the lasso. Right: Comparison of squared bias, variance, and test MSE between lasso (solid) and ridge (dashed). Both are plotted against their $R^2$ on the training data, as a common form of indexing. The crosses in both plots indicate the lasso model for which the MSE is smallest.
:::

- Simulation example with $n=50$ and $p=45$. Only 2 predictors are related to the response.

::: {#fig-ridge-lasso-compare-2}
![](ISL_fig_6_9.pdf){width=600px height=350px}

Left: Plots of squared bias (black), variance (green), and test MSE (purple) for the lasso. The simulated data are similar, except that now only two predictors are related to the response. Right: Comparison of squared bias, variance, and test MSE between lasso (solid) and ridge (dashed). Both are plotted against their $R^2$ on the training data, as a common form of indexing. The crosses in both plots indicate the lasso model for which the MSE is smallest.
:::

- These two examples illustrate that neither ridge regression nor the lasso will universally dominate the other.

- In general, one might expect the lasso to perform better when the response is a function of only a relatively small number of predictors.

- However, the number of predictors that is related to the response is never known _a priori_ for real data sets.

- A technique such as **cross-validation** can be used in order to determine which approach is better on a particular data set.

### Selecting the tuning parameter for ridge regression and lasso

- As for subset selection, for ridge regression and the lasso we require a method to determine which of the models under consideration is best.

- That is, we require a method for selecting a value of the tuning parameter $\lambda$, or equivalently, of the constraint $s$.

- **Cross-validation** provides a simple way to tackle this problem. We choose a grid of $\lambda$ values and compute the cross-validation error for each value of $\lambda$.

- We then select the tuning parameter value for which the cross-validation error is smallest.

- Finally, the model is re-fit using all of the available observations and the selected value of the tuning parameter.

::: {#fig-ridge-lasso-credit}
![](ISL_fig_6_12.pdf){width=600px height=350px}

Left: Cross-validation errors that result from applying ridge regression to the `Credit` data set with various values of $\lambda$. Right: The coefficient estimates as a function of $\lambda$. The vertical dashed lines indicate the value of $\lambda$ selected by cross-validation. (ISL Figure 6.12)
:::

## Dimension reduction methods

- The methods that we have discussed so far have involved fitting linear regression models, via least squares or a shrunken approach, **using the original predictors**, $X_1, X_2, \ldots, X_p$.

- We now explore a class of approaches that transform the predictors and then fit a least squares model using the **transformed variables**. We will refer to these techniques as **dimension reduction** methods.

- Let $Z_1, Z_2, \ldots, Z_M$ represent $M < p$ **linear combinations** of our original $p$ predictors. That is,
$$
Z_m = \sum_{j=1}^p \phi_{mj} X_j
$$ {#eq-dimred-zs}
for some constants $\phi_{m1}, \ldots, \phi_{mp}$.

- We can then fit the linear regression model,
$$
y_i = \theta_0 + \sum_{m=1}^M \theta_m z_{im} + \epsilon_i, \quad i = 1,\ldots,n,
$$ {#eq-dimred-model}
using ordinary least squares. Here $\theta_0, \theta_1, \ldots, \theta_M$ are the regression coefficients.

- Notice that
$$
\sum_{m=1}^M \theta_m z_{im} = \sum_{m=1}^M \theta_m \sum_{j=1}^p \phi_{mj} x_{ij} = \sum_{j=1}^p \sum_{m=1}^M \theta_m \phi_{mj} x_{ij} = \sum_{j=1}^p \beta_j x_{ij},
$$
where
$$
\beta_j = \sum_{m=1}^M \theta_m \phi_{mj}.
$$ {#eq-dimred-constraint}

- Hence model @eq-dimred-model can be thought of as a special case of the original linear regression model.

- Dimension reduction serves to constrain the estimated $\beta_j$ coefficients, since now they must take the form @eq-dimred-constraint.

- This constraint can win in the bias-variance tradeoff.

### Principal components regression

- **Principal components regression (PCR)** applies principal components analysis (PCA) to define the linear combinations of the predictors, for use in our regression.
- The first principal component $Z_1$ is the (normalized) linear combination of the variables with the largest variance.

- The second principal component $Z_2$ has the largest variance, subject to being uncorrelated with the first.

- And so on.

- Hence with many correlated original variables, we replace them with a small set of principal components that capture their joint variation.

::: {#fig-advertising-pc-1}
![](ISL_fig_6_14.pdf){width=500px height=400px}

The population size (`pop`) and ad spending (`ad`) for 100 different cities are shown as purple circles. The green solid line indicates the first principal component, and the blue dashed line indicates the second principal component.
:::

- The first principal component is
$$
Z_1 = 0.839 \times (\text{pop} - \bar{\text{pop}}) + 0.544 \times (\text{ad} - \bar{\text{ad}}).
$$

- Out of all possible linear combinations of `pop` and `ad` such that $\phi_{11}^2 + \phi_{21}^2 = 1$ (why do we need this constraint?), this particular combination yields the highest variance.

- There is also another interpretation for PCA: the first principal component vector defines the line that is as close as possible to the data.

::: {#fig-advertising-pc-2}
![](ISL_fig_6_15.pdf){width=800px height=350px}

A subset of the advertising data. Left: The first principal component, chosen to minimize the sum of the squared perpendicular distances to each point, is shown in green. These distances are represented using the black dashed line segments. Right: The left-hand panel has been rotated so that the first principal component lies on the $x$-axis.
:::

- The first principal component appears to capture most of the information contained in the `pop` and `ad` predictors.

::: {#fig-advertising-pc-3}
![](ISL_fig_6_16.pdf){width=800px height=350px}

Plots of the first principal component scores $z_{i1}$ versus `pop` and `ad`. The relationships are strong.
:::

- The second principal component is
$$
Z_2 = 0.544 \times (\text{pop} - \bar{\text{pop}}) - 0.839 \times (\text{ad} - \bar{\text{ad}}).
$$

::: {#fig-advertising-pc-4}
![](ISL_fig_6_17.pdf){width=800px height=350px}

Plots of the second principal component scores $z_{i2}$ versus `pop` and `ad`. The relationships are weak.
:::

- PCR applied to two simulated data sets with $n=50$ and $p=45$:

::: {#fig-sim-pcr}
![](ISL_fig_6_18.pdf){width=800px height=350px}

PCR was applied to two simulated data sets. The black, green, and purple lines correspond to squared bias, variance, and test mean squared error, respectively. Left: all 45 predictors are related to the response. Right: only 2 out of 45 predictors are related to the response.
:::

- PCR applied to a data set where the first 5 principal components are related to the response:

::: {#fig-sim-pcr-2}
![](ISL_fig_6_19.pdf){width=800px height=350px}

PCR, ridge regression, and the lasso were applied to a simulated data set in which the first five principal components of $X$ contain all the information about the response $Y$. In each panel, the irreducible error $\operatorname{Var}(\epsilon)$ is shown as a horizontal dashed line. Left: Results for PCR. Right: Results for lasso (solid) and ridge regression (dotted). The $x$-axis displays the shrinkage factor of the coefficient estimates, defined as the $\ell_2$ norm of the shrunken coefficient estimates divided by the $\ell_2$ norm of the least squares estimate.
:::

- PCR applied to the `Credit` data:

::: {#fig-credit-pcr}
![](ISL_fig_6_20.pdf){width=800px height=400px}

PCR was applied to the `Credit` data.
:::

### Partial least squares (PLS)

- PCR identifies linear combinations, or **directions**, that best represent the predictors $X_1, \ldots, X_p$.

- These directions are identified in an **unsupervised way**, since the response $Y$ is not used to help determine the principal component directions.

- That is, the response does not **supervise** the identification of the principal components.

- Consequently, PCR suffers from a potentially serious drawback: there is no guarantee that the directions that best explain the predictors will also be the best directions to use for predicting the response.

- Like PCR, partial least squares (PLS) is a dimension reduction method, which first identifies a new set of features $Z_1, \ldots, Z_M$ that are linear combinations of the original features, and then fits a linear model via OLS using these $M$ new features.

- But unlike PCR, PLS identifies these new features in a supervised way. That is, it makes use of the response $Y$ in order to identify new features that not only approximate the old features well, but also **are related to the response**.

- Roughly speaking, the PLS approach attempts to find directions that help explain both the response and the predictors.

- PLS algorithm:
    - After standardizing the $p$ predictors, PLS computes the first direction $Z_1$ by setting each $\phi_{1j}$ in @eq-dimred-zs equal to the coefficient from the simple linear regression of $Y$ onto $X_j$.
    - One can show that this coefficient is proportional to the correlation between $Y$ and $X_j$.
    - Hence, in computing $Z_1 = \sum_{j=1}^p \phi_{1j} X_j$, PLS places the highest weight on the variables that are most strongly related to the response.
    - Subsequent directions are found by taking residuals and then repeating the above procedure.

## Summary

- Model selection methods are an essential tool for data analysis, especially for big data sets involving many predictors.
- Research into methods that give **sparsity**, such as the lasso, is an especially hot area.

- Later, we will return to sparsity in more detail, and will describe related approaches such as the **elastic net**.
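As a concrete illustration of the cross-validation workflow for choosing $\lambda$ (and a preview of the elastic net), here is a minimal scikit-learn sketch on synthetic data mimicking the $n=50$, $p=45$ sparse simulation above. The data, penalty grid, and `l1_ratio` value are illustrative choices, not from the ISL labs:

```{python}
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV, LassoCV

# Synthetic sparse-signal data (NOT the Credit data): n = 50, p = 45,
# with only 2 informative predictors, mimicking the simulation above
X_sim, y_sim = make_regression(n_samples = 50, n_features = 45,
                               n_informative = 2, noise = 10.0,
                               random_state = 0)

# Choose lambda (alpha in scikit-learn) by 5-fold cross-validation
lasso_cv = LassoCV(cv = 5).fit(X_sim, y_sim)
# Elastic net blends the l1 (lasso) and l2 (ridge) penalties
enet_cv = ElasticNetCV(l1_ratio = 0.5, cv = 5).fit(X_sim, y_sim)

print(f"lasso: alpha = {lasso_cv.alpha_:.3f}, "
      f"nonzero coefficients = {np.sum(lasso_cv.coef_ != 0)}")
print(f"enet:  alpha = {enet_cv.alpha_:.3f}, "
      f"nonzero coefficients = {np.sum(enet_cv.coef_ != 0)}")
```

Both estimators refit on the full data at the selected penalty, matching the procedure described in the tuning parameter section; the lasso fit recovers a sparse model with far fewer than 45 nonzero coefficients.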