8.4 Comparing multiple regression models

With more than one variable, we now have many potential models that we could consider. We could include only one of the predictors, all of them, or combinations of sets of the variables. For example, maybe the model that includes Elevation does not “need” both Min.Temp and Max.Temp? Or maybe the model isn’t improved over an SLR with just Elevation as a predictor. Or maybe none of the predictors are “useful”? In this section, we discuss some general model comparison issues and a metric that can be used to pick among a suite of different models (often called a set of candidate models to reflect that they are all potentially interesting and we need to compare them and possibly pick one).

It is certainly possible that researchers have an a priori reason to consider only a single model. For example, in a designed experiment where combinations of, say, three different predictors are randomly assigned, the initial model with all three predictors may be sufficient to address all the research questions of interest. One advantage in these situations is that the variable combinations can be created to prevent multicollinearity among the predictors and avoid that complication in interpretations. However, this is more the exception than the rule. Usually, there are competing predictors or questions about whether some predictors matter more than others. This type of research always introduces the potential for multicollinearity to complicate the interpretation of each predictor in the presence of others. Because of this, multiple models are often considered, where “unimportant” variables are dropped from the model. The assessment of “importance” using p-values will be discussed in Section 8.6, but for now we will consider other reasons to pick one model over another.

There are some general reasons to choose a particular model:

  1. Diagnostics are better with one model compared to others.

  2. One model predicts/explains the responses better than the others (R2).

  3. There are a priori reasons to “use” a particular model, for example in a designed experiment, or because it includes variable(s) whose estimated slopes directly address the research question(s), even if the variables are not “important” in the model.

  4. Model selection “criteria” suggest one model is better than the others (also see Section 8.13 for another method of picking among different models).

It is OK to consider multiple reasons to select a model, but it is dangerous to “shop” for a model across many possible models – a practice sometimes called data-dredging that leads to a high chance of spurious results in the single model that is typically reported from this type of exploration. Just as in earlier discussions of multiple testing issues, exploring many versions of a model and keeping only the best ones is very different from picking one model (and its tests) a priori and simply exploring that result.

As in SLR, we can use R2 (the coefficient of determination) to measure the percentage of the variation in the response variable that the model explains. In MLR, it is important to remember that R2 is now an overall measure for the model and not specific to a single variable; it is comparable across models, including those fit with only a single predictor (SLR). So to meet criterion (2), we could simply find the model with the largest R2 value, that is, the model that explains the most variation in the responses. Unfortunately for this idea, when you add more “stuff” to a regression model (even “unimportant” predictors), the R2 will essentially always go up. This can be seen by considering

\[R^2 = \frac{\text{SS}_{\text{regression}}}{\text{SS}_{\text{total}}}\ \text{ where }\ \text{SS}_{\text{regression}} = \text{SS}_{\text{total}} - \text{SS}_{\text{error}}\ \text{ and }\ \text{SS}_{\text{error}} = \Sigma(y-\hat{y})^2\ .\]

Because adding extra variables to a linear model can only move the fitted values closer to the observed responses, not further away, the \(\text{SS}_{\text{error}}\) will never increase (and essentially always decreases) when more predictors are added to the model. If \(\text{SS}_{\text{error}}\) goes down and \(\text{SS}_{\text{total}}\) is fixed, then adding extra variables will increase \(\text{SS}_{\text{regression}}\) and, thus, increase R2. This means that R2 is only useful for selecting among models of the same size (same number of predictors). So we mainly use it as a summary of model quality once we pick a model, not as a method of picking among a set of candidate models. Remember that R2 continues to have the property of being between 0 and 1 (or 0% and 100%) and that value refers to the proportion (percentage) of variation in the response explained by the model, whether we are using it for SLR or MLR.
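To see this behavior concretely, here is a minimal simulation sketch (the seed, data, and variable names are made up purely for illustration and are not part of the Snow Depth analysis) that adds a pure-noise predictor to a model and compares the two R2 values:

set.seed(307)
n <- 50
x1 <- rnorm(n)                      # a "real" predictor
junk <- rnorm(n)                    # pure noise, unrelated to the response
y <- 2 + 3 * x1 + rnorm(n)

m_small <- lm(y ~ x1)               # model with only the real predictor
m_big <- lm(y ~ x1 + junk)          # same model plus the useless predictor

summary(m_small)$r.squared          # baseline R-squared
summary(m_big)$r.squared            # never smaller, usually slightly larger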

However, there is an adjustment to the R2 measure that makes it useful for selecting among models. The measure is called the adjusted R2. The \(\boldsymbol{R}^2_{\text{adjusted}}\) measure adds a penalty for adding more variables to the model, providing the potential for this measure to decrease if the extra variables do not really benefit the model. The measure is calculated as

\[R^2_{\text{adjusted}} = 1 - \frac{\text{SS}_{\text{error}}/df_{\text{error}}}{\text{SS}_{\text{total}}/(N-1)} = 1 - \frac{\text{MS}_{\text{error}}}{\text{MS}_{\text{total}}},\]

which incorporates the degrees of freedom for the model via the error degrees of freedom (\(df_{\text{error}} = N - K - 1\) for a model with \(K\) predictors), which go down as the model complexity increases. This adjustment means that just adding extra useless variables (variables that do not explain very much extra variation) does not necessarily increase this measure. That makes the measure useful for model selection since it can help us stop adding unimportant variables and find a “good” model among a set of candidates. Like the regular R2, larger values are better. The downside to \(\boldsymbol{R}^2_{\text{adjusted}}\) is that it is no longer a percentage of variation in the response that is explained by the model; it can even be less than 0, so it has no interpretable scale. It is just “larger is better”. It provides one method for building a model (different from using p-values to drop unimportant variables as discussed below): fit a set of candidate models containing different variables and then pick the model with the largest \(\boldsymbol{R}^2_{\text{adjusted}}\). You will be tempted to interpret this new measure on a percentage scale, but do not do that. It is just a measure to help you pick a model, and that is all!
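As a sketch of how the penalty plays out (again using simulated data, so the numbers are purely illustrative), the adjusted R2 can be pulled from a model summary or reproduced directly from the formula above:

set.seed(307)
n <- 50
x1 <- rnorm(n)
junk <- rnorm(n)                    # pure noise predictor again
y <- 2 + 3 * x1 + rnorm(n)

m_small <- lm(y ~ x1)
m_big <- lm(y ~ x1 + junk)

summary(m_small)$adj.r.squared      # adjusted R-squared for the simpler model
summary(m_big)$adj.r.squared        # typically smaller: the noise predictor is penalized

# reproducing 1 - (SS_error/df_error)/(SS_total/(N-1)) for the bigger model
SSE <- sum(resid(m_big)^2)
SST <- sum((y - mean(y))^2)
1 - (SSE / df.residual(m_big)) / (SST / (n - 1))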

One other caveat in model comparison is worth mentioning: make sure you are comparing models for the same responses. That may sound trivial, and usually it is. But when there are missing values in the data set, especially on some explanatory variables and not others, it is important to be careful that the \(y\text{'s}\) do not change between the models you are comparing. This relates to our Snow Depth modeling because responses were removed due to their influential nature. We can’t compare R2 or \(\boldsymbol{R}^2_{\text{adjusted}}\) for \(n=25\) to a model with \(n=23\) – it isn’t a fair comparison on either measure since both are based on the total variability, which changes as the set of responses used changes.
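Because lm() silently drops any row with a missing value in the variables it uses, two candidate models can quietly end up fit to different sets of responses. One defensive sketch (the data frame and variable names below are hypothetical) is to build a single complete-case data set covering every variable under consideration before fitting any of the candidates:

# keep only rows complete for the response and all candidate predictors so that
# every candidate model is fit to exactly the same responses
vars <- c("y", "x1", "x2", "x3")               # hypothetical names
dat_cc <- na.omit(mydata[, vars])              # one shared complete-case data set

mA <- lm(y ~ x1, data = dat_cc)
mB <- lm(y ~ x1 + x2 + x3, data = dat_cc)

c(nobs(mA), nobs(mB))                          # both models should report the same n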

In the MLR (or SLR) model summaries, both the R2 and \(\boldsymbol{R}^2_{\text{adjusted}}\) are available. Make sure you are able to pick out the correct one. For the reduced data set (\(n=23\)) Snow Depth models, the pertinent part of the model summary for the model with all three predictors is in the last three lines:

## 
## Call:
## lm(formula = Snow.Depth ~ Elevation + Min.Temp + Max.Temp, data = snotel2[-c(9, 
##     22), ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.878  -4.486   0.024   3.996  20.728 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.133e+02  7.458e+01  -2.859   0.0100
## Elevation    2.686e-02  4.997e-03   5.374 3.47e-05
## Min.Temp     9.843e-01  1.359e+00   0.724   0.4776
## Max.Temp     1.243e+00  5.452e-01   2.280   0.0343
## 
## Residual standard error: 8.832 on 19 degrees of freedom
## Multiple R-squared:  0.8535, Adjusted R-squared:  0.8304 
## F-statistic:  36.9 on 3 and 19 DF,  p-value: 4.003e-08
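
The two R-squared measures reported in the last lines of this output can also be extracted directly from the fitted model object. A short sketch, assuming the snotel2 data frame used earlier in the chapter with the two influential observations (rows 9 and 22) removed as in the Call above (the object name is arbitrary):

m_full <- lm(Snow.Depth ~ Elevation + Min.Temp + Max.Temp, data = snotel2[-c(9, 22), ])
summary(m_full)$r.squared        # Multiple R-squared (0.8535 here)
summary(m_full)$adj.r.squared    # Adjusted R-squared (0.8304 here)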

There is a value for \(\large{\textbf{Multiple R-Squared}}\) of 0.8535; this is the R2 value and suggests that the model with Elevation, Min, and Max temperatures explains 85.4% of the variation in Snow Depth. The \(\boldsymbol{R}^2_{\text{adjusted}}\) is 0.8304 and is available further to the right, labeled as \(\large{\textbf{Adjusted R-Squared}}\). We repeated this for a suite of different models for this same \(n=23\) data set and the results are in Table 2.13. The top \(\boldsymbol{R}^2_{\text{adjusted}}\) model is the model with Elevation and Max.Temp, which beats out the model with all three variables on \(\boldsymbol{R}^2_{\text{adjusted}}\). Note that the top R2 model is the model with all three predictors, but the most complicated model will always have that characteristic.

Table 2.13: Model comparisons for Snow Depth data, sorted by model complexity.
Model         \(\boldsymbol{K}\) \(\boldsymbol{R^2}\) \(\boldsymbol{R^2_{\text{adjusted}}}\) \(\boldsymbol{R^2_{\text{adjusted}}}\) Rank
SD \(\sim\) Elevation 1 0.8087 0.7996 3
SD \(\sim\) Min.Temp 1 0.6283 0.6106 5
SD \(\sim\) Max.Temp 1 0.4131 0.3852 7
SD \(\sim\) Elevation + Min.Temp 2 0.8134 0.7948 4
SD \(\sim\) Elevation + Max.Temp 2 0.8495 0.8344 1
SD \(\sim\) Min.Temp + Max.Temp 2 0.6308 0.5939 6
SD \(\sim\) Elevation + Min.Temp + Max.Temp 3 0.8535 0.8304 2

The top adjusted R2 model contained Elevation and Max.Temp and has an R2 of 0.8495, so we can say that the model with Elevation and Maximum Temperature explains 84.95% of the variation in Snow Depth and also that this model was selected based on the \(\boldsymbol{R}^2_{\text{adjusted}}\). One of the important features of \(\boldsymbol{R}^2_{\text{adjusted}}\) is on display in this example – adding variables does not always increase its value, even though R2 does increase with any addition. In Section 8.13 we consider a competitor for this model selection criterion that may “work” a bit better and be extendable into more complicated modeling situations; that measure is called the AIC.
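
A sketch of how a comparison table like Table 2.13 could be assembled, by looping over the candidate model formulas and extracting both measures from each fit (again assuming the snotel2 data frame with the two influential observations removed; the object names are arbitrary):

# candidate model formulas, matching the rows of Table 2.13
forms <- c("Snow.Depth ~ Elevation",
           "Snow.Depth ~ Min.Temp",
           "Snow.Depth ~ Max.Temp",
           "Snow.Depth ~ Elevation + Min.Temp",
           "Snow.Depth ~ Elevation + Max.Temp",
           "Snow.Depth ~ Min.Temp + Max.Temp",
           "Snow.Depth ~ Elevation + Min.Temp + Max.Temp")

snotel_r <- snotel2[-c(9, 22), ]    # the same reduced (n = 23) data set for every fit

results <- t(sapply(forms, function(f) {
  s <- summary(lm(as.formula(f), data = snotel_r))
  c(R2 = s$r.squared, adjR2 = s$adj.r.squared)
}))

results[order(-results[, "adjR2"]), ]   # top adjusted R-squared model listed first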

