6.6 Exercises
1. The Boston Housing Data Set
Throughout this section, you will work with Boston, the Boston Housing data set, which contains 506 observations on housing values in suburbs of Boston. The Boston data set comes with the package MASS. Both the packages MASS and AER are required for the interactive R exercises below, and they are already installed.
Instructions:
Load both packages and the data set.
Get an overview of the data using function(s) known from the previous chapters.
Estimate a simple linear regression model that explains the median house value of districts (medv) by the percent of households with low socioeconomic status, lstat, and a constant. Save the model to bh_mod.
Print a coefficient summary to the console that reports robust standard errors.
Hint:
You only need basic R functions here: library(), data(), lm() and coeftest().
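Putting the hint together, a minimal sketch of one possible solution follows. The choice of robust covariance estimator is an assumption here; the interactive exercise may expect a specific type such as "HC1".

library(MASS)   # provides the Boston data set
library(AER)    # provides coeftest() and vcovHC()

# load the data set
data(Boston)

# overview of the data
summary(Boston)

# simple linear regression of medv on lstat (and a constant)
bh_mod <- lm(medv ~ lstat, data = Boston)

# coefficient summary with heteroskedasticity-robust standard errors
coeftest(bh_mod, vcov. = vcovHC, type = "HC1")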
2. A Multiple Regression Model of Housing Prices I
Now, let us expand the approach from the previous exercise by adding additional regressors to the model and estimating it again.
As discussed in Chapter 6.3, adding regressors to the model improves the fit, so the \(SER\) typically decreases and the \(R^2\) increases (it can never decrease when a regressor is added).
The packages AER and MASS have been loaded. The model object bh_mod is available in the environment.
Instructions:
Regress the median housing value in a district, medv, on the average age of the buildings, age, the per-capita crime rate, crim, the percentage of individuals with low socioeconomic status, lstat, and a constant. Put differently, estimate the model \[medv_i = \beta_0 + \beta_1 lstat_i + \beta_2 age_i + \beta_3 crim_i + u_i.\]
Print a coefficient summary to the console that reports robust standard errors for the augmented model.
The \(R^2\) of the simple regression model is stored in R2_res. Save the multiple regression model’s \(R^2\) to R2_unres and check whether the augmented model yields a higher \(R^2\). Use < or > for the comparison.
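A sketch of one possible solution, assuming the setup from Exercise 1 and that R2_res is provided by the exercise environment as stated:

# estimate the multiple regression model
bh_mult_mod <- lm(medv ~ lstat + age + crim, data = Boston)

# robust coefficient summary for the augmented model
coeftest(bh_mult_mod, vcov. = vcovHC, type = "HC1")

# save the multiple model's R^2 and compare it to the simple model's
R2_unres <- summary(bh_mult_mod)$r.squared
R2_res < R2_unres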
3. A Multiple Regression Model of Housing Prices II
The equation below describes the estimated model from Exercise 2 (heteroskedasticity-robust standard errors in parentheses).
\[ \widehat{medv}_i = \underset{(0.74)}{32.828} - \underset{(0.08)}{0.994} \times lstat_i - \underset{(0.03)}{0.083} \times crim_i + \underset{(0.02)}{0.038} \times age_i \]
This model is saved in bh_mult_mod which is available in the working environment.
Instructions:
As stressed in Chapter 6.3, it is not meaningful to use \(R^2\) when comparing regression models with different numbers of regressors. Instead, the \(\bar{R}^2\) should be used, since it adjusts for the fact that the \(SSR\) decreases whenever a regressor is added to the model.
Use the model object to compute the correction factor \(CF = \frac{n-1}{n-k-1}\) where \(n\) is the number of observations and \(k\) is the number of regressors, excluding the intercept. Save it to CF.
Use summary() to obtain \(R^2\) and \(\bar{R}^2\) for bh_mult_mod. It is sufficient if you print both values to the console.
Check that \[\bar{R}^2 = 1 - (1-R^2) \cdot CF.\] Use the == operator.
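A sketch under the assumption that bh_mult_mod is the model from Exercise 2:

# number of observations and number of regressors (excluding the intercept)
n <- nobs(bh_mult_mod)
k <- length(coef(bh_mult_mod)) - 1

# correction factor
CF <- (n - 1) / (n - k - 1)

# R^2 and adjusted R^2
summary(bh_mult_mod)$r.squared
summary(bh_mult_mod)$adj.r.squared

# check the relation with == as instructed (summary() uses the same arithmetic,
# so the comparison holds here; in general, all.equal() is safer for floats)
summary(bh_mult_mod)$adj.r.squared == 1 - (1 - summary(bh_mult_mod)$r.squared) * CF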
4. A Fully-Fledged Model for Housing Values?
Have a look at the description of the variables contained in the Boston data set. Which variable would you expect to have the lowest \(p\)-value in a multiple regression model which uses all variables as regressors to explain medv?
Instructions:
Regress medv on all remaining variables that you find in the Boston data set.
Obtain a heteroskedasticity-robust summary of the coefficients.
The \(\bar{R}^2\) for the model in Exercise 3 is \(0.5533\). What can you say about the \(\bar{R}^2\) of the large regression model? Does it improve on the previous model (no code submission needed)?
The packages AER and MASS as well as the data set Boston are loaded into the working environment.
Hints:
For brevity, use the regression formula medv ~ . in your call of lm(). This is a shortcut that specifies a regression of medv on all other variables in the data set supplied to the argument data.
Use summary() on both models to compare their \(\bar{R}^2\)s.
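Following the hints, a minimal sketch (the model name full_mod anticipates Exercise 5):

# regress medv on all remaining variables in the data set
full_mod <- lm(medv ~ ., data = Boston)

# heteroskedasticity-robust coefficient summary
coeftest(full_mod, vcov. = vcovHC, type = "HC1")

# adjusted R^2 of the full model, for comparison with 0.5533
summary(full_mod)$adj.r.squared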
5. Model Selection
Maybe we can improve the model by dropping a variable?
In this exercise, you will estimate several models, each time dropping one of the explanatory variables used in the large regression model of Exercise 4, and compare their \(\bar{R}^2\) values.
The full regression model from the previous exercise, full_mod, is available in your environment.
Instructions:
You are completely free in solving this exercise. We recommend the following approach:
Start by estimating a model, say mod_new, where one of the explanatory variables, e.g., lstat, is excluded. Next, access the \(\bar{R}^2\) of this model.
Compare the \(\bar{R}^2\) of this model to the \(\bar{R}^2\) of the full model (this was about \(0.7338\)).
Repeat the previous two steps for all explanatory variables used in the full regression model. Save the model with the largest improvement in \(\bar{R}^2\) to better_mod (a sketch of this approach follows below).
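One way to automate the recommended approach is a loop over the regressors. This is a sketch, not the only valid solution; which variable wins the comparison is left to the reader.

# drop one variable, e.g., lstat, and inspect the adjusted R^2
mod_new <- lm(medv ~ . - lstat, data = Boston)
summary(mod_new)$adj.r.squared

# repeat for every explanatory variable in the full model
regressors <- setdiff(names(Boston), "medv")
adj_R2 <- sapply(regressors, function(v) {
  summary(lm(as.formula(paste("medv ~ . -", v)), data = Boston))$adj.r.squared
})
adj_R2

# save the model with the largest adjusted R^2 to better_mod
best_drop <- names(which.max(adj_R2))
better_mod <- lm(as.formula(paste("medv ~ . -", best_drop)), data = Boston)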