9.3 Internal and External Validity when the Regression is used for Forecasting
Recall the regression of test scores on the student-teacher ratio (\(STR\)) performed in Chapter 4:
linear_model <- lm(score ~ STR, data = CASchools)
linear_model
#>
#> Call:
#> lm(formula = score ~ STR, data = CASchools)
#>
#> Coefficients:
#> (Intercept)          STR
#>      698.93        -2.28
The estimated regression function was
\[ \widehat{TestScore} = 698.9 - 2.28 \times STR.\]
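As a quick illustration, consider a hypothetical district with a student-teacher ratio of 20 (a value chosen here only to show the arithmetic). Plugging it into the estimated regression function gives
\[ \widehat{TestScore} = 698.9 - 2.28 \times 20 = 698.9 - 45.6 = 653.3.\]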
The book discusses the example of a parent moving to a metropolitan area who plans to choose where to live based on the quality of local schools, taking a school district’s average test score as an adequate measure of that quality. However, the parent only has information on the student-teacher ratio, so test scores need to be predicted. Although we have established that this model suffers from omitted variable bias because it leaves out variables such as students’ learning opportunities outside school and the share of English learners, linear_model may in fact be useful for the parent:
The parent need not care whether the coefficient on \(STR\) has a causal interpretation; she only wants \(STR\) to explain as much of the variation in test scores as possible. Therefore, although linear_model cannot be used to estimate the causal effect of a change in \(STR\) on test scores, it can still be considered a reliable predictor of test scores, provided the estimated relationship also applies to the districts she is considering.
Thus, the threats to internal validity summarized in Key Concept 9.7 are negligible for the parent. As the book points out, this is different for a superintendent who has been tasked with taking measures that increase test scores: she requires a more reliable model that does not suffer from the threats listed in Key Concept 9.7.
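To make the parent's use of the model concrete, here is a minimal sketch of how predictions could be obtained with predict(). It assumes the AER package is installed and rebuilds STR and score in the same way as in Chapter 4. The three student-teacher ratios in new_districts are hypothetical values chosen for illustration, not data from the book.

```r
# Assumption: the AER package is installed; it ships the CASchools data
library(AER)
data(CASchools)

# Rebuild the two variables used in the regression
CASchools$STR <- CASchools$students / CASchools$teachers
CASchools$score <- (CASchools$read + CASchools$math) / 2

linear_model <- lm(score ~ STR, data = CASchools)

# Hypothetical student-teacher ratios of districts the parent is comparing
new_districts <- data.frame(STR = c(18, 20, 22))

# Point predictions together with 95% prediction intervals
pred <- predict(linear_model,
                newdata = new_districts,
                interval = "prediction",
                level = 0.95)
cbind(new_districts, pred)

# Share of the variation in test scores that STR accounts for
summary(linear_model)$r.squared
```

The intervals returned by predict() rely on homoskedastic, normally distributed errors, so they should be read as rough guidance only. The \(R^2\) in the last line shows how much of the variation in test scores \(STR\) alone explains, which is the quantity the parent cares about.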
Consult Chapter 9.3 of the book (Stock and Watson) for the corresponding discussion.