---
title: "Linear Regression (ISL 3)"
subtitle: "Econ 425T"
author: "Dr. Hua Zhou @ UCLA"
date: "`r format(Sys.time(), '%d %B, %Y')`"
format:
html:
theme: cosmo
number-sections: true
toc: true
toc-depth: 4
toc-location: left
code-fold: false
engine: knitr
knitr:
opts_chunk:
fig.align: 'center'
# fig.width: 6
# fig.height: 4
message: FALSE
cache: false
---
Credit: This note heavily uses material from the books [_An Introduction to Statistical Learning: with Applications in R_](https://www.statlearning.com/) (ISL2) and [_Elements of Statistical Learning: Data Mining, Inference, and Prediction_](https://hastie.su.domains/ElemStatLearn/) (ESL2).
Display system information for reproducibility.
::: {.panel-tabset}
## Python
```{python}
import IPython
print(IPython.sys_info())
```
## R
```{r}
sessionInfo()
```
## Julia
```{julia}
#| eval: false
using InteractiveUtils
versioninfo()
```
:::
## Linear regression
- This lecture serves as a review of your Econ 430+442A material.
- Linear regression is a simple approach to supervised learning. It assumes that the dependence of $Y$ on $X_1, X_2, \ldots, X_p$ is linear.
- True regression functions are never linear!
- Although it may seem overly simplistic, linear regression is extremely useful both conceptually and practically.
> Essentially, all models are wrong, but some are useful.
>
> George Box
## `Advertising` data
- Response variable: `sales` (in thousands of units).
- Predictor variables:
- `TV`: budget for TV advertising (in thousands of dollars).
- `radio`: budget for radio advertising (in thousands of dollars).
- `newspaper`: budget for newspaper advertising (in thousands of dollars).
::: {.panel-tabset}
#### Python
```{python}
# Load the pandas library
import pandas as pd
# Load numpy for array manipulation
import numpy as np
# Load seaborn plotting library
import seaborn as sns
import matplotlib.pyplot as plt
# Set font sizes in plots
sns.set(font_scale = 2)
# Display all columns
pd.set_option('display.max_columns', None)
# Import Advertising data
Advertising = pd.read_csv("../data/Advertising.csv", index_col = 0)
Advertising
```
```{python}
# Visualization by pair plot
sns.pairplot(data = Advertising)
```
#### R
```{r}
library(GGally) # ggpairs function
library(tidyverse)
# Advertising
Advertising <- read_csv("../data/Advertising.csv", col_select = TV:sales) %>%
print(width = Inf)
# Visualization by pair plot
ggpairs(
data = Advertising,
mapping = aes(alpha = 0.25),
lower = list(continuous = "smooth")
) +
labs(title = "Advertising Data")
```
#### Julia
:::
- A few questions we might ask
- Is there a relationship between advertising budget and sales?
- How strong is the relationship between advertising budget and sales?
- Which media contribute to sales?
- How accurately can we predict future sales?
- Is the relationship linear?
- Is there synergy among the advertising media?
## Simple linear regression vs multiple linear regression
- In **simple linear regression**, we assume a model
$$
Y = \underbrace{\beta_0}_{\text{intercept}} + \underbrace{\beta_1}_{\text{slope}} X + \epsilon.
$$
For example
$$
\text{sales} \approx \beta_0 + \beta_1 \times \text{TV}.
$$
The lower left plot in the pair plot (R) displays the fitted line
$$
\hat y = \hat \beta_0 + \hat \beta_1 \times \text{TV}.
$$
- In **multiple linear regression**, we assume the model
$$
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \epsilon.
$$
For example,
$$
\text{sales} \approx \beta_0 + \beta_1 \times \text{TV} + \beta_2 \times \text{radio} + \beta_3 \times \text{newspaper}.
$$
We review multiple linear regression below.
## Estimation in multiple linear regression
- We estimate regression coefficients $\beta_0, \beta_1, \ldots, \beta_p$ by the **method of least squares**, which minimizes the sum of squared residuals
$$
\text{RSS} = \sum_{i=1}^n (y_i - \hat y_i)^2 = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_{i1} - \cdots - \beta_p x_{ip})^2.
$$
- The minimizer $\hat \beta_0, \hat \beta_1, \ldots, \hat \beta_p$ admits the analytical solution
$$
\hat \beta = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}.
$$
and we can make prediction using the formula
$$
\hat y = \hat \beta_0 + \hat \beta_1 x_1 + \cdots + \hat \beta_p x_p.
$$
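As a quick sanity check (not part of the original analysis), the closed-form solution can be verified directly with NumPy on simulated data; all names and numbers below are illustrative:

```{python}
import numpy as np

rng = np.random.default_rng(425)
n = 100
# Design matrix with an intercept column and p = 3 predictors
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
beta_true = np.array([3.0, 1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Closed-form least squares solution (X^T X)^{-1} X^T y via the normal equations
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
# Numerically preferred equivalent via an orthogonal factorization
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)
```

Both routes give the same $\hat \beta$; `lstsq` is preferred in practice because forming $\mathbf{X}^T \mathbf{X}$ squares the condition number.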
::: {.panel-tabset}
#### Python (sklearn)
```{python}
# sklearn does not offer much besides coefficient estimate
from sklearn.linear_model import LinearRegression
X = Advertising[['TV', 'radio', 'newspaper']]
y = Advertising['sales']
lmod = LinearRegression().fit(X, y)
lmod.intercept_
lmod.coef_
# Score is the R^2
lmod.score(X, y)
```
#### Python (statsmodels)
```{python}
import statsmodels.api as sm
import statsmodels.formula.api as smf
# Fit linear regression
lmod = smf.ols(formula = 'sales ~ TV + radio + newspaper', data = Advertising).fit()
lmod.summary()
```
#### R
```{r}
library(gtsummary)
# Linear regression
lmod <- lm(sales ~ TV + radio + newspaper, data = Advertising)
summary(lmod)
# Better summary table
lmod %>%
tbl_regression() %>%
bold_labels() %>%
bold_p(t = 0.05)
```
#### Julia
:::
## Interpreting regression coefficients
- We interpret $\beta_j$ as the average effect on $Y$ of a one-unit increase in $X_j$, _holding all other predictors fixed_.
> a regression coefficient $\beta_j$ estimates the expected change in $Y$ per unit change in $X_j$, with all other predictors held fixed. But predictors usually change together!
>
> "Data Analysis and Regression", Mosteller and Tukey 1977
- Only when the predictors are uncorrelated (a balanced or orthogonal design) can each coefficient be estimated, interpreted, and tested separately.
In other words, only in a balanced design are the regression coefficients estimated by multiple linear regression the same as those from separate simple linear regressions.
- Claims of **causality** should be avoided for observational data.
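The claim about orthogonal designs is easy to check numerically. The sketch below (simulated data, illustrative names) constructs two exactly orthogonal, centered predictors and confirms that the multiple-regression slopes coincide with the simple-regression slopes:

```{python}
import numpy as np

rng = np.random.default_rng(2023)
n = 200
x1 = rng.normal(size=n); x1 -= x1.mean()
x2 = rng.normal(size=n); x2 -= x2.mean()
x2 -= (x1 @ x2) / (x1 @ x1) * x1   # make x2 exactly orthogonal to x1
y = 1.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(scale=0.1, size=n)

def ols(design, y):
    """Least squares coefficients for a design matrix with an intercept column."""
    return np.linalg.lstsq(design, y, rcond=None)[0]

ones = np.ones(n)
multi = ols(np.column_stack([ones, x1, x2]), y)
simple1 = ols(np.column_stack([ones, x1]), y)
simple2 = ols(np.column_stack([ones, x2]), y)
print(multi[1], simple1[1])   # identical slopes for x1
print(multi[2], simple2[1])   # identical slopes for x2
```

With correlated predictors the two sets of slopes generally differ, which is why coefficients from separate simple regressions can be misleading.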
## Some important questions
### Goodness of fit
**Question:** Is at least one of the predictors $X_1, X_2, \ldots, X_p$ useful in predicting the response?
- We can use the F-statistic to assess the fit of the overall model
$$
F = \frac{(\text{TSS} - \text{RSS})/p}{\text{RSS}/(n-p-1)} \sim F_{p, n-p-1} \text{ under } H_0,
$$
where
$$
\text{TSS} = \sum_{i=1}^n (y_i - \bar y)^2.
$$
Formally we are testing the null hypothesis
$$
H_0: \beta_1 = \cdots = \beta_p = 0
$$
versus the alternative
$$
H_a: \text{at least one } \beta_j \text{ is non-zero}.
$$
- In general, the F-statistic for testing
$$
H_0: \text{ a restricted model } \omega
$$
versus
$$
H_a: \text{a more general model } \Omega
$$
takes the form
$$
F = \frac{(\text{RSS}_\omega - \text{RSS}_\Omega) / (\text{df}(\Omega) - \text{df}(\omega))}{\text{RSS}_\Omega / (n - \text{df}(\Omega))},
$$
where $\text{df}$ denotes the degrees of freedom (roughly speaking, the number of free parameters) of a model.
::: {.panel-tabset}
#### Python (statsmodels)
```{python}
# F test for R \beta = 0
R = [[0, 1, 0 , 0], [0, 0, 1, 0], [0, 0, 0, 1]]
lmod.f_test(R)
```
#### R
```{r}
lmod_null <- lm(sales ~ 1, data = Advertising)
anova(lmod_null, lmod)
```
#### Julia
:::
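The general F-statistic above can also be computed by hand from the two RSS values. A minimal sketch on simulated data (illustrative, not the `Advertising` analysis):

```{python}
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 3
X = rng.normal(size=(n, p))
# Only the first predictor truly matters
y = 2.0 + 1.0 * X[:, 0] + rng.normal(size=n)

def rss(design):
    beta = np.linalg.lstsq(design, y, rcond=None)[0]
    resid = y - design @ beta
    return resid @ resid

ones = np.ones((n, 1))
rss_full = rss(np.hstack([ones, X]))   # Omega: all p predictors
rss_null = rss(ones)                   # omega: intercept only, so RSS_omega = TSS
F = ((rss_null - rss_full) / p) / (rss_full / (n - p - 1))
print(F)   # compare with the F_{p, n-p-1} distribution for a p-value
```

Here the restricted model is the intercept-only model, so this reproduces the overall F-test reported by `summary()` in R or `f_test` in statsmodels.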
### Model selection
**Question:** Do all the predictors help to explain $Y$ , or is only a subset of the predictors useful?
- For a naive implementation of best subset, forward selection, and backward selection regression in Python, interested readers can consult this [blog](https://xavierbourretsicotte.github.io/subset_selection.html).
#### Best subset regression
- The most direct approach is called **all subsets** or **best subsets** regression: we compute the least squares fit for all possible subsets and then choose between them based on some criterion that balances training error with model size.
- We try the highly-optimized [`abess`](https://abess.readthedocs.io/en/latest/index.html) package here.
::: {.panel-tabset}
#### Python
```{python}
from abess.linear import LinearRegression
from patsy import dmatrices
# Create design matrix and response vector using R-like formula
ym, Xm = dmatrices(
'sales ~ TV + radio + newspaper',
data = Advertising,
return_type = 'matrix'
)
lmod_abess = LinearRegression(support_size = range(3))
lmod_abess.fit(Xm, ym)
lmod_abess.coef_
```
#### R
```{r}
library(leaps)
bs_mod <- regsubsets(
sales ~ .,
data = Advertising,
method = "exhaustive"
) %>% summary()
bs_mod
tibble(
K = 1:3,
R2 = bs_mod$rsq,
adjR2 = bs_mod$adjr2,
BIC = bs_mod$bic,
CP = bs_mod$cp
) %>% print(width = Inf)
```
#### Julia
:::
- However, we often cannot examine all possible models, since there are $2^p$ of them. For example, when $p = 30$, there are $2^{30} = 1,073,741,824$ models!
- Recent advances make it possible to solve best subset regression with $p$ on the order of $10^3$ to $10^5$, with a high-probability guarantee of optimality. See this [article](https://doi.org/10.1073/pnas.2014241117).
- For really big $p$, we need heuristic approaches to search through a subset of them. We discuss three commonly used approaches next.
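For small $p$, a naive best subset search is only a few lines of code. The sketch below (simulated data; use an optimized package such as `abess` for real problems) enumerates all subsets of each size and records the one with the smallest RSS. Since RSS always decreases with model size, a criterion such as $C_p$, BIC, or CV is still needed to compare across sizes:

```{python}
import numpy as np
from itertools import combinations

rng = np.random.default_rng(7)
n, p = 100, 4
X = rng.normal(size=(n, p))
# True model uses predictors 0 and 2 only
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 2] + rng.normal(scale=0.5, size=n)

def rss(cols):
    design = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta = np.linalg.lstsq(design, y, rcond=None)[0]
    resid = y - design @ beta
    return resid @ resid

# For each model size k, keep the subset with the smallest RSS
best = {}
for k in range(1, p + 1):
    best[k] = min(combinations(range(p), k), key=rss)
    print(k, best[k], round(rss(best[k]), 2))
```

The size-2 winner should recover the true support $\{0, 2\}$, mirroring what `regsubsets` reports for the `Advertising` data.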
#### Forward selection
- Begin with the **null model**, a model that contains an intercept but no predictors.
- Fit $p$ simple linear regressions and add to the null model the variable that results in the lowest RSS (equivalently the lowest AIC).
- Add to that model the variable that results in the lowest RSS among all two-variable models.
- Continue until some stopping rule is satisfied, for example when the AIC does not decrease anymore.
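The steps above amount to a greedy loop. This is a minimal illustration on simulated data, not a replacement for R's `step`; the AIC here is computed only up to an additive constant, which does not affect the comparisons:

```{python}
import numpy as np

rng = np.random.default_rng(42)
n, p = 150, 5
X = rng.normal(size=(n, p))
# True model uses predictors 1 and 3 only
y = 1.0 + 2.0 * X[:, 1] - 1.5 * X[:, 3] + rng.normal(scale=0.5, size=n)

def aic(cols):
    """AIC (up to a constant) of the Gaussian linear model on predictors in cols."""
    design = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta = np.linalg.lstsq(design, y, rcond=None)[0]
    rss = np.sum((y - design @ beta) ** 2)
    return n * np.log(rss / n) + 2 * (len(cols) + 1)

selected, remaining = [], set(range(p))
current_aic = aic(selected)
while remaining:
    # Candidate that most improves the fit when added
    best_j = min(remaining, key=lambda j: aic(selected + [j]))
    new_aic = aic(selected + [best_j])
    if new_aic >= current_aic:   # stop when AIC no longer decreases
        break
    selected.append(best_j)
    remaining.remove(best_j)
    current_aic = new_aic
print(selected)   # greedy order: strongest predictor enters first
```

The strongest true predictor enters first, then the second; noise predictors are usually (though not always) rejected by the AIC stopping rule.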
::: {.panel-tabset}
#### R
```{r}
step(lmod_null,
scope = list(lower = sales ~ 1, upper = sales ~ TV + radio + newspaper),
trace = TRUE,
direction = "forward"
)
```
#### Julia
:::
#### Backward selection
- Start with all variables in the model.
- Remove the variable with the largest $p$-value, that is, the variable that is the least statistically significant.
- The new $(p-1)$-variable model is fit, and again the variable with the largest $p$-value is removed.
- Continue until a stopping rule is reached. For instance, we may stop when AIC does not decrease anymore, or when all remaining variables have a significant $p$-value defined by some significance threshold.
- Backward selection cannot be used when $p > n$, since the full model cannot be fit.
::: {.panel-tabset}
#### R
```{r}
step(lmod,
scope = list(lower = sales ~ 1, upper = sales ~ TV + radio + newspaper),
trace = TRUE,
direction = "backward"
)
```
#### Julia
:::
#### Mixed selection
- We alternately perform forward and backward steps until all variables in the model have a sufficiently low $p$-value, and all variables outside the model would have a large $p$-value if added to the model.
::: {.panel-tabset}
#### R
```{r}
step(lmod_null,
scope = list(lower = sales ~ 1, upper = sales ~ TV + radio + newspaper),
trace = TRUE,
direction = "both"
)
```
#### Julia
:::
#### Other model selection criteria
Commonly used model selection criteria include Mallows' $C_p$, the Akaike information criterion (AIC), the Bayesian information criterion (BIC), adjusted $R^2$, and cross-validation (CV).
### Model fit
**Question:** How well does the selected model fit the data?
- **$R^2$**: the fraction of variance explained
$$
R^2 = 1 - \frac{\text{RSS}}{\text{TSS}} = 1 - \frac{\sum_i (y_i - \hat y_i)^2}{\sum_i (y_i - \bar y)^2}.
$$
It turns out (HW1 bonus question)
$$
R^2 = \operatorname{Cor}(Y, \hat Y)^2.
$$
Adding more predictors into a model **always** increases $R^2$.
- From computer output, we see the fitted linear model by using `TV`, `radio`, and `newspaper` has an $R^2=0.8972$.
::: {.panel-tabset}
#### Python
```{python}
# R^2
lmod.rsquared
# Cor(Y, \hat Y)^2
np.corrcoef(Advertising['sales'], lmod.predict())
np.corrcoef(Advertising['sales'], lmod.predict())[0, 1]**2
```
#### R
```{r}
# R^2
summary(lmod)$r.squared
# Cor(Y, \hat Y)^2
cor(predict(lmod), Advertising$sales)^2
```
#### Julia
:::
- The **adjusted $R^2$**, denoted $R_a^2$, adjusts for the fact that a larger model uses more parameters:
$$
R_a^2 = 1 - \frac{\text{RSS}/(n-p-1)}{\text{TSS}/(n-1)}.
$$
Adding a predictor will only increase $R_a^2$ if it has some predictive value.
::: {.panel-tabset}
#### Python
```{python}
# Adjusted R^2
lmod.rsquared_adj
```
#### R
```{r}
# Adjusted R^2
summary(lmod)$adj.r.squared
```
#### Julia
:::
- The **residual standard error (RSE)**
$$
\text{RSE} = \sqrt{\frac{\text{RSS}}{n - p - 1}}
$$
is an estimate of the noise standard deviation $\sqrt{\operatorname{Var}(\epsilon)}$. The smaller the RSE, the better the model fits the data.
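All three fit statistics follow directly from RSS and TSS. A sketch on simulated data (illustrative; the true noise standard deviation is set to 0.7, so the RSE should land near it):

```{python}
import numpy as np

rng = np.random.default_rng(0)
n, p = 120, 2
X = rng.normal(size=(n, p))
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.7, size=n)

design = np.column_stack([np.ones(n), X])
beta = np.linalg.lstsq(design, y, rcond=None)[0]
y_hat = design @ beta
rss = np.sum((y - y_hat) ** 2)
tss = np.sum((y - y.mean()) ** 2)

r2 = 1 - rss / tss
adj_r2 = 1 - (rss / (n - p - 1)) / (tss / (n - 1))
rse = np.sqrt(rss / (n - p - 1))
print(r2, adj_r2, rse)
# R^2 also equals the squared correlation between y and y_hat
print(np.corrcoef(y, y_hat)[0, 1] ** 2)
```

Note $R_a^2 < R^2$ whenever $p \geq 1$, since the adjustment factor $(n-1)/(n-p-1)$ exceeds 1.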
### Prediction
**Question:** Given a set of predictor values, what response value should we predict, and how accurate is our prediction?
- Given a new set of predictors $x$, we predict $y$ by
$$
\hat y = \hat \beta_0 + \hat \beta_1 x_1 + \cdots + \hat \beta_p x_p.
$$
- Under the selected model
$$
\text{sales} \approx \beta_0 + \beta_1 \times \text{TV} + \beta_2 \times \text{radio},
$$
how many units of sale do we expect with \$100k budget in `TV` and \$20k budget in `radio`?
::: {.panel-tabset}
#### Python
```{python}
# Create design matrix and response vector using R-like formula
y_small, X_small = dmatrices(
'sales ~ TV + radio',
data = Advertising,
return_type = 'dataframe'
)
# Fit linear regression
lmod_small = sm.OLS(y_small, X_small).fit()
lmod_small_pred = lmod_small.get_prediction(pd.DataFrame({'Intercept': [1], 'TV': [100], 'radio': [20]})).summary_frame()
lmod_small_pred
```
#### R
```{r}
x_new <- list(TV = 100, radio = 20)
lmod_small <- lm(sales ~ TV + radio, data = Advertising)
predict(lmod_small, x_new, interval = "confidence")
predict(lmod_small, x_new, interval = "prediction")
```
#### Julia
:::
- A 95\% **confidence interval** covers the true $f(x)$ with probability 0.95.
- A 95\% **prediction interval** covers the true $Y$ with probability 0.95.
The prediction interval is always wider than the confidence interval because it incorporates the randomness in both $\hat \beta$ and $\epsilon$.
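The width difference follows from the variance formulas: at a new point $x_0$, the variance of $\hat f(x_0)$ is $\hat\sigma^2 \, x_0^T (\mathbf{X}^T \mathbf{X})^{-1} x_0$, while predicting a new observation adds the irreducible $\hat\sigma^2$. A sketch on simulated data (illustrative; a normal quantile is used in place of the exact $t$ quantile):

```{python}
import numpy as np

rng = np.random.default_rng(3)
n = 100
x = rng.uniform(0, 10, size=n)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=n)

X = np.column_stack([np.ones(n), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
rss = np.sum((y - X @ beta) ** 2)
sigma2_hat = rss / (n - 2)                 # estimate of Var(epsilon)
XtX_inv = np.linalg.inv(X.T @ X)

x0 = np.array([1.0, 5.0])                  # predict at x = 5
fit = x0 @ beta
se_mean = np.sqrt(sigma2_hat * x0 @ XtX_inv @ x0)        # uncertainty in f(x0) only
se_pred = np.sqrt(sigma2_hat * (1 + x0 @ XtX_inv @ x0))  # adds the noise variance
z = 1.96   # approximate 95% quantile
print("confidence:", (fit - z * se_mean, fit + z * se_mean))
print("prediction:", (fit - z * se_pred, fit + z * se_pred))
```

These match (up to the $t$ vs normal quantile) what `predict(..., interval = "confidence")` and `interval = "prediction"` return in R.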
## Qualitative (or categorical) predictors
- In the `Advertising` data, all predictors are quantitative (or continuous).
- The `Credit` data has a mix of quantitative (`Balance`, `Age`, `Cards`, `Education`, `Income`, `Limit`, `Rating`) and qualitative predictors (`Own`, `Student`, `Married`, `Region`).
::: {.panel-tabset}
#### Python
```{python}
# Import Credit data
Credit = pd.read_csv("../data/Credit.csv")
Credit
```
```{python}
# Visualization by pair plot
sns.pairplot(data = Credit)
```
#### R
```{r}
library(ISLR2)
# Credit
Credit <- as_tibble(Credit) %>% print(width = Inf)
# Pair plot of continuous predictors
ggpairs(
data = Credit[, c(1:6, 11)],
mapping = aes(alpha = 0.25)
) +
labs(title = "Credit Data")
# Pair plot of categorical predictors
ggpairs(
data = Credit[, c(7:10, 11)],
mapping = aes(alpha = 0.25)
) +
labs(title = "Credit Data")
```
#### Julia
:::
- By default, most software creates **dummy variables**, also called **one-hot coding**, for categorical variables. Both Python (statsmodels) and R use the first level (in alphabetical order) as the baseline level.
::: {.panel-tabset}
#### Python
```{python}
# Hack to get all predictors
my_formula = "Balance ~ " + "+".join(Credit.columns.difference(["Balance"]))
# Create design matrix and response vector using R-like formula
y, X = dmatrices(
my_formula,
data = Credit,
return_type = 'dataframe'
)
# One-hot coding for categorical variables
X
```
```{python}
# Fit linear regression
sm.OLS(y, X).fit().summary()
```
#### R
```{r}
lm(Balance ~ ., data = Credit) %>%
tbl_regression() %>%
bold_labels() %>%
bold_p(t = 0.05)
```
#### Julia
:::
- There are many different ways of coding qualitative variables, measuring particular **contrasts**.
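For a quick look at dummy coding outside a model formula, `pandas.get_dummies` produces the same one-hot columns; the toy data frame below is illustrative:

```{python}
import pandas as pd

# Toy data frame with a three-level categorical predictor
df = pd.DataFrame({'Region': ['East', 'West', 'South', 'East', 'South']})

# One-hot coding of all levels (levels sorted alphabetically)
print(pd.get_dummies(df['Region']))
# Treatment (baseline) coding: drop the first level alphabetically,
# matching the defaults of R and patsy
print(pd.get_dummies(df['Region'], drop_first=True))
```

Dropping one level avoids perfect collinearity with the intercept; the dropped level becomes the baseline against which the remaining coefficients are interpreted.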
## Interactions
- In our previous analysis of the `Advertising` data, we assumed that the effect on `sales` of increasing one advertising medium is independent of the amount spent on the other media
$$
\text{sales} \approx \beta_0 + \beta_1 \times \text{TV} + \beta_2 \times \text{radio}.
$$
- But suppose that spending money on radio advertising actually increases the effectiveness of TV advertising, so that the slope term for `TV` should increase as `radio` increases.
- In this situation, given a fixed budget of $100,000, spending half on `radio` and half on `TV` may increase sales more than allocating the entire amount to either `TV` or to `radio`.
- In marketing, this is known as a **synergy** effect, and in statistics it is referred to as an **interaction** effect.
- Assume the interaction model:
$$
\text{sales} \approx \beta_0 + \beta_1 \times \text{TV} + \beta_2 \times \text{radio} + \beta_3 \times (\text{TV} \times \text{radio}).
$$
::: {.panel-tabset}
#### Python
```{python}
# Design matrix with interaction
y, X_int = dmatrices(
'sales ~ 1 + TV * radio',
data = Advertising,
return_type = 'dataframe'
)
# Fit linear regression with interaction
sm.OLS(y, X_int).fit().summary()
```
#### R
```{r}
lmod_int <- lm(sales ~ 1 + TV * radio, data = Advertising)
summary(lmod_int)
lmod_int %>%
tbl_regression(estimate_fun = function(x) style_sigfig(x, digits = 4)) %>%
bold_labels() %>%
bold_p(t = 0.05)
```
#### Julia
:::
- The results in this table suggest that the `TV` $\times$ `radio` interaction is important.
- The $R^2$ for the interaction model is 96.8\%, compared to only 89.7\% for the model that predicts sales using `TV` and `radio` without an interaction term.
- This means that (96.8 - 89.7)/(100 - 89.7) = 69\% of the variability in sales that remains after fitting the additive model has been explained by the interaction term.
- The coefficient estimates in the table suggest that an increase in TV advertising of \$1,000 is associated with increased sales of
$$
(\hat \beta_1 + \hat \beta_3 \times \text{radio}) \times 1000 = 19 + 1.1 \times \text{radio} \text{ units}.
$$
- An increase in radio advertising of \$1,000 will be associated with an increase in sales of
$$
(\hat \beta_2 + \hat \beta_3 \times \text{TV}) \times 1000 = 29 + 1.1 \times \text{TV} \text{ units}.
$$
- Sometimes it is the case that an interaction term has a very small p-value, but the associated main effects (in this case, `TV` and `radio`) do not.
The **hierarchy principle** states
> If we include an interaction in a model, we should also include the main effects, even if the $p$-values associated with their coefficients are not significant.
- The rationale for this principle is that interactions are hard to interpret in a model without main effects: their meaning changes. Specifically, if the main-effect terms are dropped, the interaction term implicitly absorbs them, so its coefficient no longer measures the interaction alone.
## Non-linear effects of predictors
- `Auto` data.
- Linear fit vs polynomial fit.
::: {.panel-tabset}
#### Python
```{python}
# Import Auto data (missing data represented by '?')
Auto = pd.read_csv("../data/Auto.csv", na_values = '?').dropna()
Auto
```
```{python}
plt.clf()
# Linear fit
ax = sns.regplot(
data = Auto,
x = 'horsepower',
y = 'mpg',
order = 1
)
# Quadratic fit
sns.regplot(
data = Auto,
x = 'horsepower',
y = 'mpg',
order = 2,
ax = ax
)
# 5-degree fit
sns.regplot(
data = Auto,
x = 'horsepower',
y = 'mpg',
order = 5,
ax = ax
)
ax.set(
xlabel = 'Horsepower',
ylabel = 'Miles per gallon'
)
```
#### R
```{r}
Auto <- as_tibble(Auto) %>% print(width = Inf)
# Linear fit
lm(mpg ~ horsepower, data = Auto) %>% summary()
# Degree 2
lm(mpg ~ poly(horsepower, 2), data = Auto) %>% summary()
# Degree 5
lm(mpg ~ poly(horsepower, 5), data = Auto) %>% summary()
ggplot(data = Auto, mapping = aes(x = horsepower, y = mpg)) +
geom_point() +
geom_smooth(method = lm, color = 1) +
geom_smooth(method = lm, formula = y ~ poly(x, 2), color = 2) +
geom_smooth(method = lm, formula = y ~ poly(x, 5), color = 3) +
labs(x = "Horsepower", y = "Miles per gallon")
```
#### Julia
:::
## What we did not cover
- Outliers.
- Non-constant variance of error terms.
- High leverage points.
- Collinearity.
See ISL Section 3.3.3.
## Generalizations of linear models
In much of the rest of this course, we discuss methods that expand the scope of linear models and how they are fit.
- Classification problems: logistic regression, discriminant analysis, support vector machines.
- Non-linearity: kernel smoothing, splines and generalized additive models, nearest neighbor methods.
- Interactions: tree-based methods, bagging, random forests, boosting, neural networks.
- Regularized fitting: ridge regression and lasso.