---
title: "Linear Regression (ISL 3)"
subtitle: "Econ 425T"
author: "Dr. Hua Zhou @ UCLA"
date: "`r format(Sys.time(), '%d %B, %Y')`"
format:
  html:
    theme: cosmo
    number-sections: true
    toc: true
    toc-depth: 4
    toc-location: left
    code-fold: false
engine: knitr
knitr:
  opts_chunk: 
    fig.align: 'center'
    # fig.width: 6
    # fig.height: 4
    message: FALSE
    cache: false
---

Credit: This note heavily uses material from the books [_An Introduction to Statistical Learning: with Applications in R_](https://www.statlearning.com/) (ISL2) and [_Elements of Statistical Learning: Data Mining, Inference, and Prediction_](https://hastie.su.domains/ElemStatLearn/) (ESL2).

Display system information for reproducibility.

::: {.panel-tabset}

## Python

```{python}
import IPython
print(IPython.sys_info())
```

## R

```{r}
sessionInfo()
```

## Julia

```{julia}
#| eval: false
using InteractiveUtils
versioninfo()
```

:::

## Linear regression

- This lecture serves as a review of your Econ 430+442A material.

- Linear regression is a simple approach to supervised learning. It assumes that the dependence of $Y$ on $X_1, X_2, \ldots, X_p$ is linear.

- True regression functions are never linear!

- Although it may seem overly simplistic, linear regression is extremely useful both conceptually and practically.

> Essentially, all models are wrong, but some are useful.
>
> George Box

## Advertising data

- Response variable: sales (in thousands of units).

- Predictor variables:
    - TV: budget for TV advertising (in thousands of dollars).
    - radio: budget for radio advertising (in thousands of dollars).
    - newspaper: budget for newspaper advertising (in thousands of dollars).
::: {.panel-tabset}

#### Python

```{python}
# Load the pandas library
import pandas as pd
# Load numpy for array manipulation
import numpy as np
# Load seaborn plotting library
import seaborn as sns
import matplotlib.pyplot as plt
# Set font sizes in plots
sns.set(font_scale = 2)
# Display all columns
pd.set_option('display.max_columns', None)

# Import Advertising data
Advertising = pd.read_csv("../data/Advertising.csv", index_col = 0)
Advertising
```

```{python}
# Visualization by pair plot
sns.pairplot(data = Advertising)
```

#### R

```{r}
library(GGally) # ggpairs function
library(tidyverse)

# Advertising
Advertising <- read_csv("../data/Advertising.csv", col_select = TV:sales) %>%
  print(width = Inf)

# Visualization by pair plot
ggpairs(
  data = Advertising,
  mapping = aes(alpha = 0.25),
  lower = list(continuous = "smooth")
) + 
  labs(title = "Advertising Data")
```

#### Julia

:::

- A few questions we might ask:
    - Is there a relationship between advertising budget and sales?
    - How strong is the relationship between advertising budget and sales?
    - Which media contribute to sales?
    - How accurately can we predict future sales?
    - Is the relationship linear?
    - Is there synergy among the advertising media?

## Simple linear regression vs multiple linear regression

- In **simple linear regression**, we assume the model
$$Y = \underbrace{\beta_0}_{\text{intercept}} + \underbrace{\beta_1}_{\text{slope}} X + \epsilon.$$
For example,
$$\text{sales} \approx \beta_0 + \beta_1 \times \text{TV}.$$
The lower left plot in the pair plot (R) displays the fitted line
$$\hat y = \hat \beta_0 + \hat \beta_1 \times \text{TV}.$$

- In **multiple linear regression**, we assume the model
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \epsilon.$$
For example,
$$\text{sales} \approx \beta_0 + \beta_1 \times \text{TV} + \beta_2 \times \text{radio} + \beta_3 \times \text{newspaper}.$$
We review multiple linear regression below.
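Before moving on to the multiple regression machinery, the simple linear regression coefficients can be computed by hand from their closed-form formulas. Below is a minimal sketch using synthetic data (not the Advertising set); the simulation settings and variable names are illustrative assumptions, and the result is checked against NumPy's degree-1 polynomial fit.

```python
import numpy as np

rng = np.random.default_rng(425)
# Synthetic data from a known linear model: y = 3 + 0.05 x + noise
x = rng.uniform(0, 300, size = 200)
y = 3 + 0.05 * x + rng.normal(scale = 1.0, size = 200)

# Closed-form least squares estimates for simple linear regression:
#   beta1_hat = sum((x - xbar) (y - ybar)) / sum((x - xbar)^2)
#   beta0_hat = ybar - beta1_hat * xbar
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
beta0_hat = y.mean() - beta1_hat * x.mean()

# np.polyfit minimizes the same RSS, so it recovers identical estimates
slope, intercept = np.polyfit(x, y, deg = 1)
print(beta0_hat, beta1_hat)  # close to the true values 3 and 0.05
```

The same estimates would come out of `lm(y ~ x)` in R or `smf.ols` in Python; the point is only that simple linear regression has an explicit two-number solution.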
## Estimation in multiple linear regression

- We estimate the regression coefficients $\beta_0, \beta_1, \ldots, \beta_p$ by the **method of least squares**, which minimizes the sum of squared residuals
$$\text{RSS} = \sum_{i=1}^n (y_i - \hat y_i)^2 = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_{i1} - \cdots - \beta_p x_{ip})^2.$$

- The minimizer $\hat \beta_0, \hat \beta_1, \ldots, \hat \beta_p$ admits the analytical solution
$$\hat \beta = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y},$$
and we can make predictions using the formula
$$\hat y = \hat \beta_0 + \hat \beta_1 x_1 + \cdots + \hat \beta_p x_p.$$

::: {.panel-tabset}

#### Python (sklearn)

```{python}
# sklearn does not offer much besides the coefficient estimates
from sklearn.linear_model import LinearRegression

X = Advertising[['TV', 'radio', 'newspaper']]
y = Advertising['sales']
lmod = LinearRegression().fit(X, y)
lmod.intercept_
lmod.coef_
# Score is the R^2
lmod.score(X, y)
```

#### Python (statsmodels)

```{python}
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Fit linear regression
lmod = smf.ols(formula = 'sales ~ TV + radio + newspaper', data = Advertising).fit()
lmod.summary()
```

#### R

```{r}
library(gtsummary)

# Linear regression
lmod <- lm(sales ~ TV + radio + newspaper, data = Advertising)
summary(lmod)
# Better summary table
lmod %>%
  tbl_regression() %>%
  bold_labels() %>%
  bold_p(t = 0.05)
```

#### Julia

:::

## Interpreting regression coefficients

- We interpret $\beta_j$ as the average effect on $Y$ of a one-unit increase in $X_j$, _holding all other predictors fixed_.

> a regression coefficient $\beta_j$ estimates the expected change in $Y$ per unit change in $X_j$, with all other predictors held fixed. But predictors usually change together!
>
> "Data Analysis and Regression", Mosteller and Tukey 1977

- Only when the predictors are uncorrelated (a balanced or orthogonal design) can each coefficient be estimated, interpreted, and tested separately.
In other words, only in a balanced design are the regression coefficients estimated from multiple linear regression the same as those from separate simple linear regressions.

- Claims of **causality** should be avoided for observational data.

## Some important questions

### Goodness of fit

**Question:** Is at least one of the predictors $X_1, X_2, \ldots, X_p$ useful in predicting the response?

- We can use the F-statistic to assess the fit of the overall model
$$F = \frac{(\text{TSS} - \text{RSS})/p}{\text{RSS}/(n-p-1)} \sim F_{p, n-p-1},$$
where
$$\text{TSS} = \sum_{i=1}^n (y_i - \bar y)^2.$$
Formally we are testing the null hypothesis
$$H_0: \beta_1 = \cdots = \beta_p = 0$$
versus the alternative
$$H_a: \text{at least one } \beta_j \text{ is non-zero}.$$

- In general, the F-statistic for testing
$$H_0: \text{a restricted model } \omega$$
versus
$$H_a: \text{a more general model } \Omega$$
takes the form
$$F = \frac{(\text{RSS}_\omega - \text{RSS}_\Omega) / (\text{df}(\Omega) - \text{df}(\omega))}{\text{RSS}_\Omega / (n - \text{df}(\Omega))},$$
where df is the degrees of freedom (roughly speaking, the number of free parameters) of a model.

::: {.panel-tabset}

#### Python (statsmodels)

```{python}
# F test for R \beta = 0
R = [[0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
lmod.f_test(R)
```

#### R

```{r}
lmod_null <- lm(sales ~ 1, data = Advertising)
anova(lmod_null, lmod)
```

#### Julia

:::

### Model selection

**Question:** Do all the predictors help to explain $Y$, or is only a subset of the predictors useful?

- For a naive implementation of best subset, forward selection, and backward selection regressions in Python, interested readers can read this [blog](https://xavierbourretsicotte.github.io/subset_selection.html).
#### Best subset regression

- The most direct approach is called **all subsets** or **best subsets** regression: we compute the least squares fit for all possible subsets and then choose between them based on some criterion that balances training error with model size.

- We try the highly-optimized [abess](https://abess.readthedocs.io/en/latest/index.html) package here.

::: {.panel-tabset}

#### Python

```{python}
from abess.linear import LinearRegression
from patsy import dmatrices

# Create design matrix and response vector using R-like formula
ym, Xm = dmatrices(
  'sales ~ TV + radio + newspaper',
  data = Advertising,
  return_type = 'matrix'
)
lmod_abess = LinearRegression(support_size = range(3))
lmod_abess.fit(Xm, ym)
lmod_abess.coef_
```

#### R

```{r}
library(leaps)

bs_mod <- regsubsets(
  sales ~ ., 
  data = Advertising,
  method = "exhaustive"
  ) %>% summary()
bs_mod

tibble(
  K = 1:3,
  R2 = bs_mod$rsq,
  adjR2 = bs_mod$adjr2,
  BIC = bs_mod$bic,
  CP = bs_mod$cp
) %>% print(width = Inf)
```

#### Julia

:::

- However we often can't examine all possible models, since there are $2^p$ of them. For example, when $p = 30$, there are $2^{30} = 1,073,741,824$ models!

- Recent advances make it possible to solve best subset regression with $p$ on the order of $10^3$ to $10^5$, with a high-probability guarantee of optimality. See this [article](https://doi.org/10.1073/pnas.2014241117).

- For really big $p$, we need heuristic approaches that search through only a subset of the models. We discuss three commonly used approaches next.

#### Forward selection

- Begin with the **null model**, a model that contains an intercept but no predictors.

- Fit $p$ simple linear regressions and add to the null model the variable that results in the lowest RSS (equivalently, the lowest AIC).

- Add to that model the variable that results in the lowest RSS among all two-variable models.

- Continue until some stopping rule is satisfied, for example when the AIC does not decrease anymore.
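The forward selection steps above amount to a short greedy loop. Here is a minimal Python sketch, not a production routine: `forward_select` is a hypothetical helper written for this illustration, and the data frame is simulated rather than the real Advertising file. In practice one would use R's `step()` or sklearn's `SequentialFeatureSelector`.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated stand-in for the Advertising data (TV and radio matter, newspaper is noise)
rng = np.random.default_rng(1)
n = 200
df = pd.DataFrame({
    'TV': rng.uniform(0, 300, n),
    'radio': rng.uniform(0, 50, n),
    'newspaper': rng.uniform(0, 100, n),
})
df['sales'] = 3 + 0.05 * df['TV'] + 0.2 * df['radio'] + rng.normal(0, 1, n)

def forward_select(df, response):
    """Greedy forward selection by AIC (hypothetical helper, illustration only)."""
    remaining = [c for c in df.columns if c != response]
    selected = []
    # AIC of the null (intercept-only) model
    current_aic = smf.ols(f'{response} ~ 1', data = df).fit().aic
    while remaining:
        # AIC of every one-variable extension of the current model
        scores = {
            x: smf.ols(f'{response} ~ ' + ' + '.join(selected + [x]),
                       data = df).fit().aic
            for x in remaining
        }
        best = min(scores, key = scores.get)
        if scores[best] >= current_aic:
            break  # stop: no extension lowers the AIC
        current_aic = scores[best]
        selected.append(best)
        remaining.remove(best)
    return selected

print(forward_select(df, 'sales'))  # TV and radio should enter; newspaper likely not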
::: {.panel-tabset}

#### R

```{r}
step(lmod_null, 
  scope = list(lower = sales ~ 1, upper = sales ~ TV + radio + newspaper), 
  trace = TRUE,
  direction = "forward"
  )
```

#### Julia

:::

#### Backward selection

- Start with all variables in the model.

- Remove the variable with the largest $p$-value, that is, the variable that is the least statistically significant.

- The new $(p-1)$-variable model is fit, and the variable with the largest $p$-value is removed.

- Continue until a stopping rule is reached. For instance, we may stop when the AIC does not decrease anymore, or when all remaining variables have a significant $p$-value defined by some significance threshold.

- Backward selection cannot start with a model with $p > n$.

::: {.panel-tabset}

#### R

```{r}
step(lmod, 
  scope = list(lower = sales ~ 1, upper = sales ~ TV + radio + newspaper), 
  trace = TRUE,
  direction = "backward"
  )
```

#### Julia

:::

#### Mixed selection

- We alternately perform forward and backward steps until all variables in the model have a sufficiently low $p$-value, and all variables outside the model would have a large $p$-value if added to the model.

::: {.panel-tabset}

#### R

```{r}
step(lmod_null, 
  scope = list(lower = sales ~ 1, upper = sales ~ TV + radio + newspaper), 
  trace = TRUE,
  direction = "both"
  )
```

#### Julia

:::

#### Other model selection criteria

Commonly used model selection criteria include Mallow's $C_p$, Akaike information criterion (AIC), Bayesian information criterion (BIC), adjusted $R^2$, and cross-validation (CV).

### Model fit

**Question:** How well does the selected model fit the data?

- **$R^2$**: the fraction of variance explained,
$$R^2 = 1 - \frac{\text{RSS}}{\text{TSS}} = 1 - \frac{\sum_i (y_i - \hat y_i)^2}{\sum_i (y_i - \bar y)^2}.$$
It turns out (HW1 bonus question)
$$R^2 = \operatorname{Cor}(Y, \hat Y)^2.$$
Adding more predictors to a model **always** increases $R^2$.

- From the computer output, we see the fitted linear model using TV, radio, and newspaper has an $R^2 = 0.8972$.
::: {.panel-tabset}

#### Python

```{python}
# R^2
lmod.rsquared
# Cor(Y, \hat Y)^2
np.corrcoef(Advertising['sales'], lmod.predict())
np.corrcoef(Advertising['sales'], lmod.predict())[0, 1]**2
```

#### R

```{r}
# R^2
summary(lmod)$r.squared
# Cor(Y, \hat Y)^2
cor(predict(lmod), Advertising$sales)^2
```

#### Julia

:::

- The **adjusted $R_a^2$** adjusts for the fact that a larger model also uses more parameters:
$$R_a^2 = 1 - \frac{\text{RSS}/(n-p-1)}{\text{TSS}/(n-1)}.$$
Adding a predictor will only increase $R_a^2$ if it has some predictive value.

::: {.panel-tabset}

#### Python

```{python}
# Adjusted R^2
lmod.rsquared_adj
```

#### R

```{r}
# Adjusted R^2
summary(lmod)$adj.r.squared
```

#### Julia

:::

- The **residual standard error (RSE)**
$$\text{RSE} = \sqrt{\frac{\text{RSS}}{n - p - 1}}$$
is an estimator of the noise standard deviation $\sqrt{\operatorname{Var}(\epsilon)}$. The smaller the RSE, the better the model fits.

### Prediction

**Question:** Given a set of predictor values, what response value should we predict, and how accurate is our prediction?

- Given a new set of predictors $x$, we predict $y$ by
$$\hat y = \hat \beta_0 + \hat \beta_1 x_1 + \cdots + \hat \beta_p x_p.$$

- Under the selected model
$$\text{sales} \approx \beta_0 + \beta_1 \times \text{TV} + \beta_2 \times \text{radio},$$
how many units of sales do we expect with a \$100k budget in TV and a \$20k budget in radio?
::: {.panel-tabset}

#### Python

```{python}
# Create design matrix and response vector using R-like formula
y_small, X_small = dmatrices(
  'sales ~ TV + radio',
  data = Advertising,
  return_type = 'dataframe'
)
# Fit linear regression
lmod_small = sm.OLS(y_small, X_small).fit()
lmod_small_pred = lmod_small.get_prediction(
  pd.DataFrame({'Intercept': [1], 'TV': [100], 'radio': [20]})
  ).summary_frame()
lmod_small_pred
```

#### R

```{r}
x_new <- list(TV = 100, radio = 20)
lmod_small <- lm(sales ~ TV + radio, data = Advertising)
predict(lmod_small, x_new, interval = "confidence")
predict(lmod_small, x_new, interval = "prediction")
```

#### Julia

:::

- 95\% **confidence interval**: the probability of covering the true $f(X)$ is 0.95.

- 95\% **prediction interval**: the probability of covering the true $Y$ is 0.95. The prediction interval is always wider than the confidence interval because it incorporates the randomness in both $\hat \beta$ and $\epsilon$.

## Qualitative (or categorical) predictors

- In the Advertising data, all predictors are quantitative (or continuous).

- The Credit data has a mix of quantitative (Balance, Age, Cards, Education, Income, Limit, Rating) and qualitative predictors (Own, Student, Married, Region).

::: {.panel-tabset}

#### Python

```{python}
# Import Credit data
Credit = pd.read_csv("../data/Credit.csv")
Credit
```

```{python}
# Visualization by pair plot
sns.pairplot(data = Credit)
```

#### R

```{r}
library(ISLR2)

# Credit
Credit <- as_tibble(Credit) %>% print(width = Inf)

# Pair plot of continuous predictors
ggpairs(
  data = Credit[, c(1:6, 11)],
  mapping = aes(alpha = 0.25)
) + 
  labs(title = "Credit Data")

# Pair plot of categorical predictors
ggpairs(
  data = Credit[, c(7:10, 11)],
  mapping = aes(alpha = 0.25)
) + 
  labs(title = "Credit Data")
```

#### Julia

:::

- By default, most software packages create **dummy variables**, also called **one-hot coding**, for categorical variables. Both Python (statsmodels) and R use the first level (in alphabetical order) as the base level.
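To see dummy coding in isolation, here is a minimal sketch that one-hot encodes a toy categorical column with pandas (the data are made up, not the Credit set). Setting `drop_first = True` mimics the statsmodels/R default of absorbing the first alphabetical level into the intercept as the base level.

```python
import pandas as pd

# Toy categorical column standing in for Credit's Region variable
df = pd.DataFrame({'Region': ['South', 'West', 'East', 'South', 'East']})

# One-hot coding with the first level (alphabetically, 'East') dropped as the base level
dummies = pd.get_dummies(df['Region'], prefix = 'Region', drop_first = True)
print(dummies)
# Only Region_South and Region_West columns remain; a row of all zeros codes 'East'
```

The coefficient on each remaining dummy is then interpreted relative to the base level, which is why changing the base level changes the coefficients but not the fitted values.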
::: {.panel-tabset}

#### Python

```{python}
# Hack to get all predictors
my_formula = "Balance ~ " + "+".join(Credit.columns.difference(["Balance"]))
# Create design matrix and response vector using R-like formula
y, X = dmatrices(
  my_formula,
  data = Credit,
  return_type = 'dataframe'
)
# One-hot coding for categorical variables
X
```

```{python}
# Fit linear regression
sm.OLS(y, X).fit().summary()
```

#### R

```{r}
lm(Balance ~ ., data = Credit) %>%
  tbl_regression() %>%
  bold_labels() %>%
  bold_p(t = 0.05)
```

#### Julia

:::

- There are many different ways of coding qualitative variables, measuring particular **contrasts**.

## Interactions

- In our previous analysis of the Advertising data, we assumed that the effect on sales of increasing one advertising medium is independent of the amount spent on the other media:
$$\text{sales} \approx \beta_0 + \beta_1 \times \text{TV} + \beta_2 \times \text{radio}.$$

- But suppose that spending money on radio advertising actually increases the effectiveness of TV advertising, so that the slope term for TV should increase as radio increases.

- In this situation, given a fixed budget of \$100,000, spending half on radio and half on TV may increase sales more than allocating the entire amount to either TV or to radio.

- In marketing, this is known as a **synergy** effect, and in statistics it is referred to as an **interaction** effect.
- Assume the interaction model:
$$\text{sales} \approx \beta_0 + \beta_1 \times \text{TV} + \beta_2 \times \text{radio} + \beta_3 \times (\text{TV} \times \text{radio}).$$

::: {.panel-tabset}

#### Python

```{python}
# Design matrix with interaction
y, X_int = dmatrices(
  'sales ~ 1 + TV * radio',
  data = Advertising,
  return_type = 'dataframe'
)
# Fit linear regression with interaction
sm.OLS(y, X_int).fit().summary()
```

#### R

```{r}
lmod_int <- lm(sales ~ 1 + TV * radio, data = Advertising)
summary(lmod_int)
lmod_int %>%
  tbl_regression(estimate_fun = function(x) style_sigfig(x, digits = 4)) %>%
  bold_labels() %>%
  bold_p(t = 0.05)
```

#### Julia

:::

- The results in this table suggest that the TV $\times$ radio interaction is important.

- The $R^2$ for the interaction model is 96.8\%, compared to only 89.7\% for the model that predicts sales using TV and radio without an interaction term.

- This means that $(96.8 - 89.7)/(100 - 89.7) = 69\%$ of the variability in sales that remains after fitting the additive model has been explained by the interaction term.

- The coefficient estimates in the table suggest that an increase in TV advertising of \$1,000 is associated with increased sales of
$$(\hat \beta_1 + \hat \beta_3 \times \text{radio}) \times 1000 = 19 + 1.1 \times \text{radio} \text{ units}.$$

- An increase in radio advertising of \$1,000 will be associated with an increase in sales of
$$(\hat \beta_2 + \hat \beta_3 \times \text{TV}) \times 1000 = 29 + 1.1 \times \text{TV} \text{ units}.$$

- Sometimes it is the case that an interaction term has a very small $p$-value, but the associated main effects (in this case, TV and radio) do not. The **hierarchy principle** states:

> If we include an interaction in a model, we should also include the main effects, even if the $p$-values associated with their coefficients are not significant.

- The rationale for this principle is that interactions are hard to interpret in a model without main effects -- their meaning is changed.
Specifically, if the model omits the main effect terms, the interaction term implicitly absorbs them, so its coefficient no longer measures a pure interaction effect.

## Non-linear effects of predictors

- Auto data.

- Linear fit vs polynomial fit.

::: {.panel-tabset}

#### Python

```{python}
# Import Auto data (missing data represented by '?')
Auto = pd.read_csv("../data/Auto.csv", na_values = '?').dropna()
Auto
```

```{python}
plt.clf()
# Linear fit
ax = sns.regplot(
  data = Auto,
  x = 'horsepower',
  y = 'mpg',
  order = 1
)
# Quadratic fit
sns.regplot(
  data = Auto,
  x = 'horsepower',
  y = 'mpg',
  order = 2,
  ax = ax
)
# 5th-degree fit
sns.regplot(
  data = Auto,
  x = 'horsepower',
  y = 'mpg',
  order = 5,
  ax = ax
)
ax.set(
  xlabel = 'Horsepower',
  ylabel = 'Miles per gallon'
)
```

#### R

```{r}
Auto <- as_tibble(Auto) %>% print(width = Inf)

# Linear fit
lm(mpg ~ horsepower, data = Auto) %>%
  summary()
# Degree 2
lm(mpg ~ poly(horsepower, 2), data = Auto) %>%
  summary()
# Degree 5
lm(mpg ~ poly(horsepower, 5), data = Auto) %>%
  summary()

ggplot(data = Auto, mapping = aes(x = horsepower, y = mpg)) + 
  geom_point() + 
  geom_smooth(method = lm, color = 1) + 
  geom_smooth(method = lm, formula = y ~ poly(x, 2), color = 2) + 
  geom_smooth(method = lm, formula = y ~ poly(x, 5), color = 3) + 
  labs(x = "Horsepower", y = "Miles per gallon")
```

#### Julia

:::

## What we did not cover

- Outliers.

- Non-constant variance of error terms.

- High leverage points.

- Collinearity.

See ISL Section 3.3.3.

## Generalizations of linear models

In much of the rest of this course, we discuss methods that expand the scope of linear models and how they are fit.

- Classification problems: logistic regression, discriminant analysis, support vector machines.

- Non-linearity: kernel smoothing, splines and generalized additive models, nearest neighbor methods.

- Interactions: tree-based methods, bagging, random forests, boosting, neural networks.

- Regularized fitting: ridge regression and lasso.