---
title: "Linear Regression (ISL 3)"
subtitle: "Econ 425T"
author: "Dr. Hua Zhou @ UCLA"
date: "`r format(Sys.time(), '%d %B, %Y')`"
format:
html:
theme: cosmo
number-sections: true
toc: true
toc-depth: 4
toc-location: left
code-fold: false
engine: knitr
knitr:
opts_chunk:
fig.align: 'center'
# fig.width: 6
# fig.height: 4
message: FALSE
cache: false
---
Credit: This note heavily uses material from the books [_An Introduction to Statistical Learning: with Applications in R_](https://www.statlearning.com/) (ISL2) and [_Elements of Statistical Learning: Data Mining, Inference, and Prediction_](https://hastie.su.domains/ElemStatLearn/) (ESL2).
Display system information for reproducibility.
::: {.panel-tabset}
## Python
```{python}
import IPython
print(IPython.sys_info())
```
## R
```{r}
sessionInfo()
```
## Julia
```{julia}
#| eval: false
using InteractiveUtils
versioninfo()
```
:::
## Linear regression
- This lecture serves as a review of your Econ 430+442A material.
- Linear regression is a simple approach to supervised learning. It assumes that the dependence of $Y$ on $X_1, X_2, \ldots, X_p$ is linear.
- True regression functions are never linear!
- Although it may seem overly simplistic, linear regression is extremely useful both conceptually and practically.
> Essentially, all models are wrong, but some are useful.
>
> George Box
## `Advertising` data
- Response variable: `sales` (in thousands of units).
- Predictor variables:
- `TV`: budget for TV advertising (in thousands of dollars).
- `radio`: budget for radio advertising (in thousands of dollars).
- `newspaper`: budget for newspaper advertising (in thousands of dollars).
::: {.panel-tabset}
#### Python
```{python}
# Load the pandas library
import pandas as pd
# Load numpy for array manipulation
import numpy as np
# Load seaborn plotting library
import seaborn as sns
import matplotlib.pyplot as plt
# Set font sizes in plots
sns.set(font_scale = 2)
# Display all columns
pd.set_option('display.max_columns', None)
# Import Advertising data
Advertising = pd.read_csv("../data/Advertising.csv", index_col = 0)
Advertising
```
```{python}
# Visualization by pair plot
sns.pairplot(data = Advertising)
```
#### R
```{r}
library(GGally) # ggpairs function
library(tidyverse)
# Advertising
Advertising <- read_csv("../data/Advertising.csv", col_select = TV:sales) %>%
print(width = Inf)
# Visualization by pair plot
ggpairs(
data = Advertising,
mapping = aes(alpha = 0.25),
lower = list(continuous = "smooth")
) +
labs(title = "Advertising Data")
```
#### Julia
:::
- A few questions we might ask
- Is there a relationship between advertising budget and sales?
- How strong is the relationship between advertising budget and sales?
- Which media contribute to sales?
- How accurately can we predict future sales?
- Is the relationship linear?
- Is there synergy among the advertising media?
## Simple linear regression vs multiple linear regression
- In **simple linear regression**, we assume a model
$$
Y = \underbrace{\beta_0}_{\text{intercept}} + \underbrace{\beta_1}_{\text{slope}} X + \epsilon.
$$
For example
$$
\text{sales} \approx \beta_0 + \beta_1 \times \text{TV}.
$$
The lower left plot in the pair plot (R) displays the fitted line
$$
\hat y = \hat \beta_0 + \hat \beta_1 \times \text{TV}.
$$
- In **multiple linear regression**, we assume the model
$$
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \epsilon.
$$
For example,
$$
\text{sales} \approx \beta_0 + \beta_1 \times \text{TV} + \beta_2 \times \text{radio} + \beta_3 \times \text{newspaper}.
$$
We review multiple linear regression below.
## Estimation in multiple linear regression
- We estimate regression coefficients $\beta_0, \beta_1, \ldots, \beta_p$ by the **method of least squares**, which minimizes the sum of squared residuals
$$
\text{RSS} = \sum_{i=1}^n (y_i - \hat y_i)^2 = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_{i1} - \cdots - \beta_p x_{ip})^2.
$$
- The minimizer $\hat \beta_0, \hat \beta_1, \ldots, \hat \beta_p$ admits the analytical solution
$$
\hat \beta = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}.
$$
and we can make prediction using the formula
$$
\hat y = \hat \beta_0 + \hat \beta_1 x_1 + \cdots + \hat \beta_p x_p.
$$
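As a quick sanity check (not part of the original analysis), the closed-form solution can be verified directly with NumPy on simulated data; all names and numbers below are illustrative:

```{python}
import numpy as np

rng = np.random.default_rng(425)
n = 100
# Design matrix with an intercept column and p = 3 predictors
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
beta_true = np.array([3.0, 1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Closed-form least squares solution (X^T X)^{-1} X^T y via the normal equations
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
# Numerically preferred equivalent via an orthogonal factorization
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)
```

Both routes give the same $\hat \beta$; `lstsq` is preferred in practice because forming $\mathbf{X}^T \mathbf{X}$ squares the condition number.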
::: {.panel-tabset}
#### Python (sklearn)
```{python}
# sklearn does not offer much besides coefficient estimate
from sklearn.linear_model import LinearRegression
X = Advertising[['TV', 'radio', 'newspaper']]
y = Advertising['sales']
lmod = LinearRegression().fit(X, y)
lmod.intercept_
lmod.coef_
# Score is the R^2
lmod.score(X, y)
```
#### Python (statsmodels)
```{python}
import statsmodels.api as sm
import statsmodels.formula.api as smf
# Fit linear regression
lmod = smf.ols(formula = 'sales ~ TV + radio + newspaper', data = Advertising).fit()
lmod.summary()
```
#### R
```{r}
library(gtsummary)
# Linear regression
lmod <- lm(sales ~ TV + radio + newspaper, data = Advertising)
summary(lmod)
# Better summary table
lmod %>%
tbl_regression() %>%
bold_labels() %>%
bold_p(t = 0.05)
```
#### Julia
:::
## Interpreting regression coefficients
- We interpret $\beta_j$ as the average effect on $Y$ of a one-unit increase in $X_j$, _holding all other predictors fixed_.
> a regression coefficient $\beta_j$ estimates the expected change in $Y$ per unit change in $X_j$, with all other predictors held fixed. But predictors usually change together!
>
> "Data Analysis and Regression", Mosteller and Tukey 1977
- Only when the predictors are uncorrelated (a balanced or orthogonal design) can each coefficient be estimated, interpreted, and tested separately.
In other words, only in a balanced design are the regression coefficients estimated by multiple linear regression the same as those from separate simple linear regressions.
- Claims of **causality** should be avoided for observational data.
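The claim about orthogonal designs is easy to check numerically. The sketch below (simulated data, illustrative names) constructs two exactly orthogonal, centered predictors and confirms that the multiple-regression slopes coincide with the simple-regression slopes:

```{python}
import numpy as np

rng = np.random.default_rng(2023)
n = 200
x1 = rng.normal(size=n); x1 -= x1.mean()
x2 = rng.normal(size=n); x2 -= x2.mean()
x2 -= (x1 @ x2) / (x1 @ x1) * x1   # make x2 exactly orthogonal to x1
y = 1.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(scale=0.1, size=n)

def ols(design, y):
    """Least squares coefficients for a design matrix with an intercept column."""
    return np.linalg.lstsq(design, y, rcond=None)[0]

ones = np.ones(n)
multi = ols(np.column_stack([ones, x1, x2]), y)
simple1 = ols(np.column_stack([ones, x1]), y)
simple2 = ols(np.column_stack([ones, x2]), y)
print(multi[1], simple1[1])   # identical slopes for x1
print(multi[2], simple2[1])   # identical slopes for x2
```

With correlated predictors the two sets of slopes generally differ, which is why coefficients from separate simple regressions can be misleading.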
## Some important questions
### Goodness of fit
**Question:** Is at least one of the predictors $X_1, X_2, \ldots, X_p$ useful in predicting the response?
- We can use the F-statistic to assess the fit of the overall model
$$
F = \frac{(\text{TSS} - \text{RSS})/p}{\text{RSS}/(n-p-1)} \sim F_{p, n-p-1} \text{ under } H_0,
$$
where
$$
\text{TSS} = \sum_{i=1}^n (y_i - \bar y)^2.
$$
Formally we are testing the null hypothesis
$$
H_0: \beta_1 = \cdots = \beta_p = 0
$$
versus the alternative
$$
H_a: \text{at least one } \beta_j \text{ is non-zero}.
$$
- In general, the F-statistic for testing
$$
H_0: \text{ a restricted model } \omega
$$
versus
$$
H_a: \text{a more general model } \Omega
$$
takes the form
$$
F = \frac{(\text{RSS}_\omega - \text{RSS}_\Omega) / (\text{df}(\Omega) - \text{df}(\omega))}{\text{RSS}_\Omega / (n - \text{df}(\Omega))},
$$
where $\text{df}$ denotes the degrees of freedom (roughly speaking, the number of free parameters) of a model.
::: {.panel-tabset}
#### Python (statsmodels)
```{python}
# F test for R \beta = 0
R = [[0, 1, 0 , 0], [0, 0, 1, 0], [0, 0, 0, 1]]
lmod.f_test(R)
```
#### R
```{r}
lmod_null <- lm(sales ~ 1, data = Advertising)
anova(lmod_null, lmod)
```
#### Julia
:::
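The general F-statistic above can also be computed by hand from the two RSS values. A minimal sketch on simulated data (illustrative, not the `Advertising` analysis):

```{python}
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 3
X = rng.normal(size=(n, p))
# Only the first predictor truly matters
y = 2.0 + 1.0 * X[:, 0] + rng.normal(size=n)

def rss(design):
    beta = np.linalg.lstsq(design, y, rcond=None)[0]
    resid = y - design @ beta
    return resid @ resid

ones = np.ones((n, 1))
rss_full = rss(np.hstack([ones, X]))   # Omega: all p predictors
rss_null = rss(ones)                   # omega: intercept only, so RSS_omega = TSS
F = ((rss_null - rss_full) / p) / (rss_full / (n - p - 1))
print(F)   # compare with the F_{p, n-p-1} distribution for a p-value
```

Here the restricted model is the intercept-only model, so this reproduces the overall F-test reported by `summary()` in R or `f_test` in statsmodels.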
### Model selection
**Question:** Do all the predictors help to explain $Y$ , or is only a subset of the predictors useful?
- For a naive implementation of best subset, forward selection, and backward selection regression in Python, interested readers can consult this [blog](https://xavierbourretsicotte.github.io/subset_selection.html).
#### Best subset regression
- The most direct approach is called **all subsets** or **best subsets** regression: we compute the least squares fit for all possible subsets and then choose between them based on some criterion that balances training error with model size.
- We try the highly-optimized [`abess`](https://abess.readthedocs.io/en/latest/index.html) package here.
::: {.panel-tabset}
#### Python
```{python}
from abess.linear import LinearRegression
from patsy import dmatrices
# Create design matrix and response vector using R-like formula
ym, Xm = dmatrices(
'sales ~ TV + radio + newspaper',
data = Advertising,
return_type = 'matrix'
)
lmod_abess = LinearRegression(support_size = range(3))
lmod_abess.fit(Xm, ym)
lmod_abess.coef_
```
#### R
```{r}
library(leaps)
bs_mod <- regsubsets(
sales ~ .,
data = Advertising,
method = "exhaustive"
) %>% summary()
bs_mod
tibble(
K = 1:3,
R2 = bs_mod$rsq,
adjR2 = bs_mod$adjr2,
BIC = bs_mod$bic,
CP = bs_mod$cp
) %>% print(width = Inf)
```
#### Julia
:::
- However, we often cannot examine all possible models, since there are $2^p$ of them. For example, when $p = 30$, there are $2^{30} = 1,073,741,824$ models!
- Recent advances make it possible to solve best subset regression with $p$ on the order of $10^3$ to $10^5$, with a high-probability guarantee of optimality. See this [article](https://doi.org/10.1073/pnas.2014241117).
- For really big $p$, we need heuristic approaches to search through a subset of them. We discuss three commonly used approaches next.
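For small $p$, a naive best subset search is only a few lines of code. The sketch below (simulated data; use an optimized package such as `abess` for real problems) enumerates all subsets of each size and records the one with the smallest RSS. Since RSS always decreases with model size, a criterion such as $C_p$, BIC, or CV is still needed to compare across sizes:

```{python}
import numpy as np
from itertools import combinations

rng = np.random.default_rng(7)
n, p = 100, 4
X = rng.normal(size=(n, p))
# True model uses predictors 0 and 2 only
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 2] + rng.normal(scale=0.5, size=n)

def rss(cols):
    design = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta = np.linalg.lstsq(design, y, rcond=None)[0]
    resid = y - design @ beta
    return resid @ resid

# For each model size k, keep the subset with the smallest RSS
best = {}
for k in range(1, p + 1):
    best[k] = min(combinations(range(p), k), key=rss)
    print(k, best[k], round(rss(best[k]), 2))
```

The size-2 winner should recover the true support $\{0, 2\}$, mirroring what `regsubsets` reports for the `Advertising` data.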
#### Forward selection
- Begin with the **null model**, a model that contains an intercept but no predictors.
- Fit $p$ simple linear regressions and add to the null model the variable that results in the lowest RSS (equivalently the lowest AIC).
- Add to that model the variable that results in the lowest RSS among all two-variable models.
- Continue until some stopping rule is satisfied, for example when the AIC does not decrease anymore.
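The steps above amount to a greedy loop. This is a minimal illustration on simulated data, not a replacement for R's `step`; the AIC here is computed only up to an additive constant, which does not affect the comparisons:

```{python}
import numpy as np

rng = np.random.default_rng(42)
n, p = 150, 5
X = rng.normal(size=(n, p))
# True model uses predictors 1 and 3 only
y = 1.0 + 2.0 * X[:, 1] - 1.5 * X[:, 3] + rng.normal(scale=0.5, size=n)

def aic(cols):
    """AIC (up to a constant) of the Gaussian linear model on predictors in cols."""
    design = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta = np.linalg.lstsq(design, y, rcond=None)[0]
    rss = np.sum((y - design @ beta) ** 2)
    return n * np.log(rss / n) + 2 * (len(cols) + 1)

selected, remaining = [], set(range(p))
current_aic = aic(selected)
while remaining:
    # Candidate that most improves the fit when added
    best_j = min(remaining, key=lambda j: aic(selected + [j]))
    new_aic = aic(selected + [best_j])
    if new_aic >= current_aic:   # stop when AIC no longer decreases
        break
    selected.append(best_j)
    remaining.remove(best_j)
    current_aic = new_aic
print(selected)   # greedy order: strongest predictor enters first
```

The strongest true predictor enters first, then the second; noise predictors are usually (though not always) rejected by the AIC stopping rule.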
::: {.panel-tabset}
#### R
```{r}
step(lmod_null,
scope = list(lower = sales ~ 1, upper = sales ~ TV + radio + newspaper),
trace = TRUE,
direction = "forward"
)
```
#### Julia
:::
#### Backward selection
- Start with all variables in the model.
- Remove the variable with the largest $p$-value, that is, the variable that is the least statistically significant.
- The new $(p-1)$-variable model is fit, and again the variable with the largest $p$-value is removed.
- Continue until a stopping rule is reached. For instance, we may stop when AIC does not decrease anymore, or when all remaining variables have a significant $p$-value defined by some significance threshold.
- Backward selection cannot be used when $p > n$, since the full model cannot be fit.
::: {.panel-tabset}
#### R
```{r}
step(lmod,
scope = list(lower = sales ~ 1, upper = sales ~ TV + radio + newspaper),
trace = TRUE,
direction = "backward"
)
```
#### Julia
:::
#### Mixed selection
- We alternately perform forward and backward steps until all variables in the model have a sufficiently low $p$-value, and all variables outside the model would have a large $p$-value if added to the model.
::: {.panel-tabset}
#### R
```{r}
step(lmod_null,
scope = list(lower = sales ~ 1, upper = sales ~ TV + radio + newspaper),
trace = TRUE,
direction = "both"
)
```
#### Julia
:::
#### Other model selection criteria
Commonly used model selection criteria include Mallows' $C_p$, the Akaike information criterion (AIC), the Bayesian information criterion (BIC), adjusted $R^2$, and cross-validation (CV).
### Model fit
**Question:** How well does the selected model fit the data?
- **$R^2$**: the fraction of variance explained
$$
R^2 = 1 - \frac{\text{RSS}}{\text{TSS}} = 1 - \frac{\sum_i (y_i - \hat y_i)^2}{\sum_i (y_i - \bar y)^2}.
$$
It turns out (HW1 bonus question)
$$
R^2 = \operatorname{Cor}(Y, \hat Y)^2.
$$
Adding more predictors into a model **always** increases $R^2$.
- From computer output, we see the fitted linear model by using `TV`, `radio`, and `newspaper` has an $R^2=0.8972$.
::: {.panel-tabset}
#### Python
```{python}
# R^2
lmod.rsquared
# Cor(Y, \hat Y)^2
np.corrcoef(Advertising['sales'], lmod.predict())
np.corrcoef(Advertising['sales'], lmod.predict())[0, 1]**2
```
#### R
```{r}
# R^2
summary(lmod)$r.squared
# Cor(Y, \hat Y)^2
cor(predict(lmod), Advertising$sales)^2
```
#### Julia
:::
- The **adjusted $R^2$**, denoted $R_a^2$, adjusts for the fact that a larger model uses more parameters:
$$
R_a^2 = 1 - \frac{\text{RSS}/(n-p-1)}{\text{TSS}/(n-1)}.
$$
Adding a predictor will only increase $R_a^2$ if it has some predictive value.
::: {.panel-tabset}
#### Python
```{python}
# Adjusted R^2
lmod.rsquared_adj
```
#### R
```{r}
# Adjusted R^2
summary(lmod)$adj.r.squared
```
#### Julia
:::
- The **residual standard error (RSE)**
$$
\text{RSE} = \sqrt{\frac{\text{RSS}}{n - p - 1}}
$$
is an estimate of the noise standard deviation $\sqrt{\operatorname{Var}(\epsilon)}$. The smaller the RSE, the better the model fits the data.
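All three fit statistics follow directly from RSS and TSS. A sketch on simulated data (illustrative; the true noise standard deviation is set to 0.7, so the RSE should land near it):

```{python}
import numpy as np

rng = np.random.default_rng(0)
n, p = 120, 2
X = rng.normal(size=(n, p))
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.7, size=n)

design = np.column_stack([np.ones(n), X])
beta = np.linalg.lstsq(design, y, rcond=None)[0]
y_hat = design @ beta
rss = np.sum((y - y_hat) ** 2)
tss = np.sum((y - y.mean()) ** 2)

r2 = 1 - rss / tss
adj_r2 = 1 - (rss / (n - p - 1)) / (tss / (n - 1))
rse = np.sqrt(rss / (n - p - 1))
print(r2, adj_r2, rse)
# R^2 also equals the squared correlation between y and y_hat
print(np.corrcoef(y, y_hat)[0, 1] ** 2)
```

Note $R_a^2 < R^2$ whenever $p \geq 1$, since the adjustment factor $(n-1)/(n-p-1)$ exceeds 1.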
### Prediction
**Question:** Given a set of predictor values, what response value should we predict, and how accurate is our prediction?
- Given a new set of predictors $x$, we predict $y$ by
$$
\hat y = \hat \beta_0 + \hat \beta_1 x_1 + \cdots + \hat \beta_p x_p.
$$
- Under the selected model
$$
\text{sales} \approx \beta_0 + \beta_1 \times \text{TV} + \beta_2 \times \text{radio},
$$
how many units of sale do we expect with \$100k budget in `TV` and \$20k budget in `radio`?
::: {.panel-tabset}
#### Python
```{python}
# Create design matrix and response vector using R-like formula
y_small, X_small = dmatrices(
'sales ~ TV + radio',
data = Advertising,
return_type = 'dataframe'
)
# Fit linear regression
lmod_small = sm.OLS(y_small, X_small).fit()
lmod_small_pred = lmod_small.get_prediction(pd.DataFrame({'Intercept': [1], 'TV': [100], 'radio': [20]})).summary_frame()
lmod_small_pred
```
#### R
```{r}
x_new <- list(TV = 100, radio = 20)
lmod_small <- lm(sales ~ TV + radio, data = Advertising)
predict(lmod_small, x_new, interval = "confidence")
predict(lmod_small, x_new, interval = "prediction")
```
#### Julia
:::
- A 95\% **confidence interval** covers the true $f(x)$ with probability 0.95.
- A 95\% **prediction interval** covers the true $Y$ with probability 0.95.
The prediction interval is always wider than the confidence interval because it incorporates the randomness in both $\hat \beta$ and $\epsilon$.
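The width difference follows from the variance formulas: at a new point $x_0$, the variance of $\hat f(x_0)$ is $\hat\sigma^2 \, x_0^T (\mathbf{X}^T \mathbf{X})^{-1} x_0$, while predicting a new observation adds the irreducible $\hat\sigma^2$. A sketch on simulated data (illustrative; a normal quantile is used in place of the exact $t$ quantile):

```{python}
import numpy as np

rng = np.random.default_rng(3)
n = 100
x = rng.uniform(0, 10, size=n)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=n)

X = np.column_stack([np.ones(n), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
rss = np.sum((y - X @ beta) ** 2)
sigma2_hat = rss / (n - 2)                 # estimate of Var(epsilon)
XtX_inv = np.linalg.inv(X.T @ X)

x0 = np.array([1.0, 5.0])                  # predict at x = 5
fit = x0 @ beta
se_mean = np.sqrt(sigma2_hat * x0 @ XtX_inv @ x0)        # uncertainty in f(x0) only
se_pred = np.sqrt(sigma2_hat * (1 + x0 @ XtX_inv @ x0))  # adds the noise variance
z = 1.96   # approximate 95% quantile
print("confidence:", (fit - z * se_mean, fit + z * se_mean))
print("prediction:", (fit - z * se_pred, fit + z * se_pred))
```

These match (up to the $t$ vs normal quantile) what `predict(..., interval = "confidence")` and `interval = "prediction"` return in R.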
## Qualitative (or categorical) predictors
- In the `Advertising` data, all predictors are quantitative (or continuous).
- The `Credit` data has a mix of quantitative (`Balance`, `Age`, `Cards`, `Education`, `Income`, `Limit`, `Rating`) and qualitative predictors (`Own`, `Student`, `Married`, `Region`).
::: {.panel-tabset}
#### Python
```{python}
# Import Credit data
Credit = pd.read_csv("../data/Credit.csv")
Credit
```
```{python}
# Visualization by pair plot
sns.pairplot(data = Credit)
```
#### R
```{r}
library(ISLR2)
# Credit
Credit <- as_tibble(Credit) %>% print(width = Inf)
# Pair plot of continuous predictors
ggpairs(
data = Credit[, c(1:6, 11)],
mapping = aes(alpha = 0.25)
) +
labs(title = "Credit Data")
# Pair plot of categorical predictors
ggpairs(
data = Credit[, c(7:10, 11)],
mapping = aes(alpha = 0.25)
) +
labs(title = "Credit Data")
```
#### Julia
:::
- By default, most software creates **dummy variables**, also called **one-hot coding**, for categorical variables. Both Python (statsmodels) and R use the first level (in alphabetical order) as the baseline level.
::: {.panel-tabset}
#### Python
```{python}
# Hack to get all predictors
my_formula = "Balance ~ " + "+".join(Credit.columns.difference(["Balance"]))
# Create design matrix and response vector using R-like formula
y, X = dmatrices(
my_formula,
data = Credit,
return_type = 'dataframe'
)
# One-hot coding for categorical variables
X
```
```{python}
# Fit linear regression
sm.OLS(y, X).fit().summary()
```
#### R
```{r}
lm(Balance ~ ., data = Credit) %>%
tbl_regression() %>%
bold_labels() %>%
bold_p(t = 0.05)
```
#### Julia
:::
- There are many different ways of coding qualitative variables, measuring particular **contrasts**.
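For a quick look at dummy coding outside a model formula, `pandas.get_dummies` produces the same one-hot columns; the toy data frame below is illustrative:

```{python}
import pandas as pd

# Toy data frame with a three-level categorical predictor
df = pd.DataFrame({'Region': ['East', 'West', 'South', 'East', 'South']})

# One-hot coding of all levels (levels sorted alphabetically)
print(pd.get_dummies(df['Region']))
# Treatment (baseline) coding: drop the first level alphabetically,
# matching the defaults of R and patsy
print(pd.get_dummies(df['Region'], drop_first=True))
```

Dropping one level avoids perfect collinearity with the intercept; the dropped level becomes the baseline against which the remaining coefficients are interpreted.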
## Interactions
- In our previous analysis of the `Advertising` data, we assumed that the effect on `sales` of increasing one advertising medium is independent of the amount spent on the other media
$$
\text{sales} \approx \beta_0 + \beta_1 \times \text{TV} + \beta_2 \times \text{radio}.
$$
- But suppose that spending money on radio advertising actually increases the effectiveness of TV advertising, so that the slope term for `TV` should increase as `radio` increases.
- In this situation, given a fixed budget of $100,000, spending half on `radio` and half on `TV` may increase sales more than allocating the entire amount to either `TV` or to `radio`.
- In marketing, this is known as a **synergy** effect, and in statistics it is referred to as an **interaction** effect.
- Assume the interaction model:
$$
\text{sales} \approx \beta_0 + \beta_1 \times \text{TV} + \beta_2 \times \text{radio} + \beta_3 \times (\text{TV} \times \text{radio}).
$$
::: {.panel-tabset}
#### Python
```{python}
# Design matrix with interaction
y, X_int = dmatrices(
'sales ~ 1 + TV * radio',
data = Advertising,
return_type = 'dataframe'
)
# Fit linear regression with interaction
sm.OLS(y, X_int).fit().summary()
```
#### R
```{r}
lmod_int <- lm(sales ~ 1 + TV * radio, data = Advertising)
summary(lmod_int)
lmod_int %>%
tbl_regression(estimate_fun = function(x) style_sigfig(x, digits = 4)) %>%
bold_labels() %>%
bold_p(t = 0.05)
```
#### Julia
:::
- The results in this table suggest that the `TV` $\times$ `radio` interaction is important.
- The $R^2$ for the interaction model is 96.8\%, compared to only 89.7\% for the model that predicts sales using `TV` and `radio` without an interaction term.
- This means that (96.8 - 89.7)/(100 - 89.7) = 69\% of the variability in sales that remains after fitting the additive model has been explained by the interaction term.
- The coefficient estimates in the table suggest that an increase in TV advertising of \$1,000 is associated with increased sales of
$$
(\hat \beta_1 + \hat \beta_3 \times \text{radio}) \times 1000 = 19 + 1.1 \times \text{radio} \text{ units}.
$$
- An increase in radio advertising of \$1,000 will be associated with an increase in sales of
$$
(\hat \beta_2 + \hat \beta_3 \times \text{TV}) \times 1000 = 29 + 1.1 \times \text{TV} \text{ units}.
$$
- Sometimes it is the case that an interaction term has a very small p-value, but the associated main effects (in this case, `TV` and `radio`) do not.
The **hierarchy principle** states
> If we include an interaction in a model, we should also include the main effects, even if the $p$-values associated with their coefficients are not significant.
- The rationale for this principle is that interactions are hard to interpret in a model without main effects: their meaning changes. Specifically, if the main-effect terms are dropped, the interaction term implicitly absorbs them, so its coefficient no longer measures the interaction alone.
## Non-linear effects of predictors
- `Auto` data.
- Linear fit vs polynomial fit.
::: {.panel-tabset}
#### Python
```{python}
# Import Auto data (missing data represented by '?')
Auto = pd.read_csv("../data/Auto.csv", na_values = '?').dropna()
Auto
```
```{python}
plt.clf()
# Linear fit
ax = sns.regplot(
data = Auto,
x = 'horsepower',
y = 'mpg',
order = 1
)
# Quadratic fit
sns.regplot(
data = Auto,
x = 'horsepower',
y = 'mpg',
order = 2,
ax = ax
)
# 5-degree fit
sns.regplot(
data = Auto,
x = 'horsepower',
y = 'mpg',
order = 5,
ax = ax
)
ax.set(
xlabel = 'Horsepower',
ylabel = 'Miles per gallon'
)
```
#### R
```{r}
Auto <- as_tibble(Auto) %>% print(width = Inf)
# Linear fit
lm(mpg ~ horsepower, data = Auto) %>% summary()
# Degree 2
lm(mpg ~ poly(horsepower, 2), data = Auto) %>% summary()
# Degree 5
lm(mpg ~ poly(horsepower, 5), data = Auto) %>% summary()
ggplot(data = Auto, mapping = aes(x = horsepower, y = mpg)) +
geom_point() +
geom_smooth(method = lm, color = 1) +
geom_smooth(method = lm, formula = y ~ poly(x, 2), color = 2) +
geom_smooth(method = lm, formula = y ~ poly(x, 5), color = 3) +
labs(x = "Horsepower", y = "Miles per gallon")
```
#### Julia
:::
## What we did not cover
- Outliers.
- Non-constant variance of error terms.
- High leverage points.
- Collinearity.
See ISL Section 3.3.3.
## Generalizations of linear models
In much of the rest of this course, we discuss methods that expand the scope of linear models and how they are fit.
- Classification problems: logistic regression, discriminant analysis, support vector machines.
- Non-linearity: kernel smoothing, splines and generalized additive models, nearest neighbor methods.
- Interactions: tree-based methods, bagging, random forests, boosting, neural networks.
- Regularized fitting: ridge regression and lasso.