6.4 OLS Assumptions in Multiple Regression
In the multiple regression model we extend the three least squares assumptions of the simple regression model (see Chapter 4) and add a fourth assumption. These assumptions are presented in Key Concept 6.4. We will not go into the details of assumptions 1-3 since their ideas generalize easy to the case of multiple regressors. We will focus on the fourth assumption. This assumption rules out perfect correlation between regressors.
Key Concept 6.4
The Least Squares Assumptions in the Multiple Regression Model
The multiple regression model is given by
\[ Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_k X_{ki} + u_i \ , \ i=1,\dots,n. \]
The OLS assumptions in the multiple regression model are an extension of the ones made for the simple regression model:
- Regressors \((X_{1i}, X_{2i}, \dots, X_{ki}, Y_i) \ , \ i=1,\dots,n\), are drawn such that the i.i.d. assumption holds.
- \(u_i\) is an error term with conditional mean zero given the regressors, i.e., \[ E(u_i\vert X_{1i}, X_{2i}, \dots, X_{ki}) = 0. \]
- Large outliers are unlikely, formally \(X_{1i},\dots,X_{ki}\) and \(Y_i\) have finite fourth moments.
- No perfect multicollinearity.
Multicollinearity
Multicollinearity means that two or more regressors in a multiple regression model are strongly correlated. If the correlation between two or more regressors is perfect, that is, one regressor can be written as a linear combination of the other(s), we have perfect multicollinearity. While strong multicollinearity in general is unpleasant as it causes the variance of the OLS estimator to be large (we will discuss this in more detail later), the presence of perfect multicollinearity makes it impossible to solve for the OLS estimator, i.e., the model cannot be estimated in the first place.
The next section presents some examples of perfect multicollinearity and demonstrates how lm() deals with them.
Examples of Perfect Multicollinearity
How does R react if we try to estimate a model with perfectly correlated regressors?
lm will produce a warning in the first line of the coefficient section of the output (1 not defined because of singularities) and ignore the regressor(s) which is (are) assumed to be a linear combination of the other(s). Consider the following example where we add another variable FracEL, the fraction of English learners, to CASchools where observations are scaled values of the observations for english and use it as a regressor together with STR and english in a multiple regression model. In this example english and FracEL are perfectly collinear. The R code is as follows.
# define the fraction of English learners
CASchools$FracEL <- CASchools$english / 100
# estimate the model
mult.mod <- lm(score ~ STR + english + FracEL, data = CASchools)
# obtain a summary of the model
summary(mult.mod)
#>
#> Call:
#> lm(formula = score ~ STR + english + FracEL, data = CASchools)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -48.845 -10.240 -0.308 9.815 43.461
#>
#> Coefficients: (1 not defined because of singularities)
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 686.03224 7.41131 92.566 < 2e-16 ***
#> STR -1.10130 0.38028 -2.896 0.00398 **
#> english -0.64978 0.03934 -16.516 < 2e-16 ***
#> FracEL NA NA NA NA
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 14.46 on 417 degrees of freedom
#> Multiple R-squared: 0.4264, Adjusted R-squared: 0.4237
#> F-statistic: 155 on 2 and 417 DF, p-value: < 2.2e-16
The row FracEL in the coefficients section of the output consists of NA entries since FracEL was excluded from the model.
If we were to compute OLS by hand, we would run into the same problem but no one would be helping us out! The computation simply fails. Why is this? Take the following example:
Assume you want to estimate a simple linear regression model with a constant and a single regressor \(X\). As mentioned above, for perfect multicollinearity to be present \(X\) has to be a linear combination of the other regressors. Since the only other regressor is a constant (think of the right hand side of the model equation as \(\beta_0 \times 1 + \beta_1 X_i + u_i\) so that \(\beta_1\) is always multiplied by \(1\) for every observation), \(X\) has to be constant as well. For \(\hat\beta_1\) we have
\[ \hat\beta_1 = \frac{\sum_{i = 1}^n (X_i - \bar{X})(Y_i - \bar{Y})} { \sum_{i=1}^n (X_i - \bar{X})^2} = \frac{\widehat{Cov}(X,Y)}{\widehat{Var}(X)}. \tag{6.7} \]
The variance of the regressor \(X\) is in the denominator. Since the variance of a constant is zero, we are not able to compute this fraction and \(\hat{\beta}_1\) is undefined.
Note: In this special case the denominator in (6.7) equals zero, too. Can you show that?
Let us consider two further examples where our selection of regressors induces perfect multicollinearity. First, assume that we intend to analyze the effect of class size on test score by using a dummy variable that identifies classes which are not small (\(NS\)). We define that a school has the \(NS\) attribute when the school’s average student-teacher ratio is at least \(12\),
\[ NS = \begin{cases} 0, \ \ \ \text{if STR < 12} \\ 1 \ \ \ \text{otherwise.} \end{cases} \]
We add the corresponding column to CASchools and estimate a multiple regression model with covariates computer and english.
# if STR smaller 12, NS = 0, else NS = 1
CASchools$NS <- ifelse(CASchools$STR < 12, 0, 1)
# estimate the model
mult.mod <- lm(score ~ computer + english + NS, data = CASchools)
# obtain a model summary
summary(mult.mod)
#>
#> Call:
#> lm(formula = score ~ computer + english + NS, data = CASchools)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -49.492 -9.976 -0.778 8.761 43.798
#>
#> Coefficients: (1 not defined because of singularities)
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 663.704837 0.984259 674.319 < 2e-16 ***
#> computer 0.005374 0.001670 3.218 0.00139 **
#> english -0.708947 0.040303 -17.591 < 2e-16 ***
#> NS NA NA NA NA
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 14.43 on 417 degrees of freedom
#> Multiple R-squared: 0.4291, Adjusted R-squared: 0.4263
#> F-statistic: 156.7 on 2 and 417 DF, p-value: < 2.2e-16
Again, the output of summary(mult.mod) tells us that inclusion of NS in the regression would render the estimation infeasible. What happened here? This is an example where we made a logical mistake when defining the regressor NS: taking a closer look at \(NS\), the redefined measure for class size, reveals that there is not a single school with \(STR<12\) hence \(NS\) equals one for all observations. We can check this by printing the contents of CASchools$NS or by using the function table(), see ?table
.
CASchools$NS is a vector of \(420\) ones and our data set includes \(420\) observations. This obviously violates assumption 4 of Key Concept 6.4: the observations for the intercept are always \(1\),
\[\begin{align*} intercept = \, & \lambda \cdot NS. \end{align*}\]
\[\begin{align*} \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix} = \, & \lambda \cdot \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix} \\ \Leftrightarrow \, & \lambda = 1. \end{align*}\]
Since the regressors can be written as a linear combination of each other, we face perfect multicollinearity and R excludes NS from the model. Thus the take-away message is: think carefully about how the regressors in your models relate to each other!
Another example of perfect multicollinearity is known as the dummy variable trap. This may occur when multiple dummy variables are used as regressors. A common case for this is when dummies are used to sort the data into mutually exclusive categories. For example, suppose we have spatial information that indicates whether a school is located in the North, West, South or East of the U.S. This allows us to create the dummy variables
\[\begin{align*} North_i =& \begin{cases} 1 \ \ \text{if located in the north} \\ 0 \ \ \text{otherwise} \end{cases} \\ West_i =& \begin{cases} 1 \ \ \text{if located in the west} \\ 0 \ \ \text{otherwise} \end{cases} \\ South_i =& \begin{cases} 1 \ \ \text{if located in the south} \\ 0 \ \ \text{otherwise} \end{cases} \\ East_i =& \begin{cases} 1 \ \ \text{if located in the east} \\ 0 \ \ \text{otherwise}. \end{cases} \end{align*}\]
Since the regions are mutually exclusive, for every school \(i=1,\dots,n\) we have \[ North_i + West_i + South_i + East_i = 1. \]
We run into problems when trying to estimate a model that includes a constant and all four direction dummies in the model, e.g., \[TestScore = \beta_0 + \beta_1 \times STR + \beta_2 \times english + \beta_3 \times North_i + \beta_4 \times West_i + \beta_5 \times South_i +\\ \beta_6 \times East_i + u_i \tag{6.8}\] since then for all observations \(i=1,\dots,n\) the constant term is a linear combination of the dummies:
\[\begin{align} intercept = \, & \lambda_1 \cdot (North + West + South + East) \\ \begin{pmatrix} 1 \\ \vdots \\ 1\end{pmatrix} = \, & \lambda_1 \cdot \begin{pmatrix} 1 \\ \vdots \\ 1\end{pmatrix} \\ \Leftrightarrow \, & \lambda_1 = 1 \end{align}\]
and we have perfect multicollinearity. Thus the “dummy variable trap” means not paying attention and falsely including exhaustive dummies and a constant in a regression model.
How does lm() handle a regression like (6.8)? Let us first generate some artificial categorical data and append a new column named directions to CASchools and see how lm() behaves when asked to estimate the model.
# set seed for reproducibility
set.seed(1)
# generate artificial data on location
CASchools$direction <- sample(c("West", "North", "South", "East"),
420,
replace = T)
# estimate the model
mult.mod <- lm(score ~ STR + english + direction, data = CASchools)
# obtain a model summary
summary(mult.mod)
#>
#> Call:
#> lm(formula = score ~ STR + english + direction, data = CASchools)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -49.603 -10.175 -0.484 9.524 42.830
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 684.80477 7.54130 90.807 < 2e-16 ***
#> STR -1.08873 0.38153 -2.854 0.00454 **
#> english -0.65597 0.04018 -16.325 < 2e-16 ***
#> directionNorth 1.66314 2.05870 0.808 0.41964
#> directionSouth 0.71619 2.06321 0.347 0.72867
#> directionWest 1.79351 1.98174 0.905 0.36598
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 14.5 on 414 degrees of freedom
#> Multiple R-squared: 0.4279, Adjusted R-squared: 0.421
#> F-statistic: 61.92 on 5 and 414 DF, p-value: < 2.2e-16
Notice that R solves the problem on its own by generating and including the dummies directionNorth, directionSouth and directionWest but omitting directionEast. Of course, the omission of every other dummy instead would achieve the same. Another solution would be to exclude the constant and to include all dummies instead.
Does this mean that the information on schools located in the East is lost? Fortunately, this is not the case: exclusion of directEast just alters the interpretation of coefficient estimates on the remaining dummies from absolute to relative. For example, the coefficient estimate on directionNorth states that, on average, test scores in the North are about \(1.61\) points higher than in the East.
A last example considers the case where a perfect linear relationship arises from redundant regressors. Suppose we have a regressor \(PctES\), the percentage of English speakers in the school, which is given by
\[ PctES = 100 - PctEL\]
and both \(PctES\) and \(PctEL\) are included in a regression model. One regressor is redundant since the other one conveys the same information. Since this obviously is a case where the regressors can be written as linear combination, we end up with perfect multicollinearity, again.
Let us do this in R.
# Percentage of english speakers
CASchools$PctES <- 100 - CASchools$english
# estimate the model
mult.mod <- lm(score ~ STR + english + PctES, data = CASchools)
# obtain a model summary
summary(mult.mod)
#>
#> Call:
#> lm(formula = score ~ STR + english + PctES, data = CASchools)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -48.845 -10.240 -0.308 9.815 43.461
#>
#> Coefficients: (1 not defined because of singularities)
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 686.03224 7.41131 92.566 < 2e-16 ***
#> STR -1.10130 0.38028 -2.896 0.00398 **
#> english -0.64978 0.03934 -16.516 < 2e-16 ***
#> PctES NA NA NA NA
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 14.46 on 417 degrees of freedom
#> Multiple R-squared: 0.4264, Adjusted R-squared: 0.4237
#> F-statistic: 155 on 2 and 417 DF, p-value: < 2.2e-16
Once more, lm() refuses to estimate the full model using OLS and excludes PctES.
For more explanation of perfect multicollinearity and its impact on the OLS estimator in general multiple regression models using matrix notation, refer to Chapter 18.1 of the book.
Imperfect Multicollinearity
As opposed to perfect multicollinearity, imperfect multicollinearity is — to a certain extent — less of a problem. In fact, imperfect multicollinearity is the reason why we are interested in estimating multiple regression models in the first place: the OLS estimator allows us to isolate influences of correlated regressors on the dependent variable. If it was not for these dependencies, there would not be a reason to resort to a multiple regression approach and we could simply work with a single-regressor model. However, this is rarely the case in applications. We already know that ignoring dependencies among regressors which influence the outcome variable has an adverse effect on estimation results.
So when and why is imperfect multicollinearity a problem? Suppose you have the regression model
\[ Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + u_i \tag{6.9} \]
and you are interested in estimating \(\beta_1\), the effect on \(Y_i\) of a one unit change in \(X_{1i}\), while holding \(X_{2i}\) constant. You do not know that the true model indeed includes \(X_2\). You follow some reasoning and add \(X_2\) as a covariate to the model in order to address a potential omitted variable bias. You are confident that \(E(u_i\vert X_{1i}, X_{2i})=0\) and that there is no reason to suspect a violation of the assumptions 2 and 3 made in Key Concept 6.4. If \(X_1\) and \(X_2\) are highly correlated, OLS struggles to precisely estimate \(\beta_1\). That means that although \(\hat\beta_1\) is a consistent and unbiased estimator for \(\beta_1\), it has a large variance due to \(X_2\) being included in the model. If the errors are homoskedastic, this issue can be better understood from the formula for the variance of \(\hat\beta_1\) in the model (6.9) (see Appendix 6.2 of the book):
\[ \sigma^2_{\hat\beta_1} = \frac{1}{n} \left( \frac{1}{1-\rho^2_{X_1,X_2}} \right) \frac{\sigma^2_u}{\sigma^2_{X_1}}. \tag{6.10} \]
First, if \(\rho_{X_1,X_2}=0\), i.e., if there is no correlation between both regressors, including \(X_2\) in the model has no influence on the variance of \(\hat\beta_1\). Secondly, if \(X_1\) and \(X_2\) are correlated, \(\sigma^2_{\hat\beta_1}\) is inversely proportional to \(1-\rho^2_{X_1,X_2}\) so the stronger the correlation between \(X_1\) and \(X_2\), the smaller is \(1-\rho^2_{X_1,X_2}\) and thus the bigger is the variance of \(\hat\beta_1\). Thirdly, increasing the sample size helps to reduce the variance of \(\hat\beta_1\). Of course, this is not limited to the case with two regressors: in multiple regressions, imperfect multicollinearity inflates the variance of one or more coefficient estimators. It is an empirical question which coefficient estimates are severely affected by this and which are not. When the sample size is small, one often faces the decision whether to accept the consequence of adding a large number of covariates (higher variance) or to use a model with only few regressors (possible omitted variable bias). This is called bias-variance trade-off.
In sum, undesirable consequences of imperfect multicollinearity are generally not the result of a logical error made by the researcher (as is often the case for perfect multicollinearity) but are rather a problem that is linked to the data used, the model to be estimated and the research question at hand.
Simulation Study: Imperfect Multicollinearity
Let us conduct a simulation study to illustrate the issues sketched above.
- We use (6.9) as the data generating process and choose \(\beta_0 = 5\), \(\beta_1 = 2.5\) and \(\beta_2 = 3\) and \(u_i\) is an error term distributed as \(\mathcal{N}(0,5)\). In the first step, we sample the regressor data from a bivariate normal distribution: \[ X_i = (X_{1i}, X_{2i}) \overset{i.i.d.}{\sim} \mathcal{N} \left[\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 10 & 2.5 \\ 2.5 & 10 \end{pmatrix} \right].\] It is straightforward to see that the correlation between \(X_1\) and \(X_2\) in the population is rather low:
\[ \rho_{X_1,X_2} = \frac{Cov(X_1,X_2)}{\sqrt{Var(X_1)}\sqrt{Var{(X_2)}}} = \frac{2.5}{10} = 0.25. \]
Next, we estimate the model (6.9) and save the estimates for \(\beta_1\) and \(\beta_2\). This is repeated \(10000\) times with a
for
loop so we end up with a large number of estimates that allow us to describe the distributions of \(\hat\beta_1\) and \(\hat\beta_2\).We repeat steps 1 and 2 but increase the covariance between \(X_1\) and \(X_2\) from \(2.5\) to \(8.5\) such that the correlation between the regressors is high: \[ \rho_{X_1,X_2} = \frac{Cov(X_1,X_2)}{\sqrt{Var(X_1)}\sqrt{Var{(X_2)}}} = \frac{8.5}{10} = 0.85. \]
In order to assess the effect on the precision of the estimators of increasing the collinearity between \(X_1\) and \(X_2\) we estimate the variances of \(\hat\beta_1\) and \(\hat\beta_2\) and compare.
# load packages
library(MASS)
library(mvtnorm)
# set number of observations
n <- 50
# initialize vectors of coefficients
coefs1 <- cbind("hat_beta_1" = numeric(10000), "hat_beta_2" = numeric(10000))
coefs2 <- coefs1
# set seed
set.seed(1)
# loop sampling and estimation
for (i in 1:10000) {
# for cov(X_1,X_2) = 0.25
X <- rmvnorm(n, c(0, 0), sigma = cbind(c(10, 2.5), c(2.5, 10)))
u <- rnorm(n, sd = 5)
Y <- 5 + 2.5 * X[, 1] + 3 * X[, 2] + u
coefs1[i, ] <- lm(Y ~ X[, 1] + X[, 2])$coefficients[-1]
# for cov(X_1,X_2) = 0.85
X <- rmvnorm(n, c(0, 0), sigma = cbind(c(10, 8.5), c(8.5, 10)))
Y <- 5 + 2.5 * X[, 1] + 3 * X[, 2] + u
coefs2[i, ] <- lm(Y ~ X[, 1] + X[, 2])$coefficients[-1]
}
# obtain variance estimates
diag(var(coefs1))
#> hat_beta_1 hat_beta_2
#> 0.05674375 0.05712459
diag(var(coefs2))
#> hat_beta_1 hat_beta_2
#> 0.1904949 0.1909056
We are interested in the variances which are the diagonal elements. We see that due to the high collinearity, the variances of \(\hat\beta_1\) and \(\hat\beta_2\) have more than tripled, meaning it is more difficult to precisely estimate the true coefficients.