12.1 The IV Estimator with a Single Regressor and a Single Instrument
Consider the simple regression model
\[\begin{align} Y_i = \beta_0 + \beta_1 X_i + u_i \ \ , \ \ i=1,\dots,n \tag{12.1} \end{align}\]
where the error term \(u_i\) is correlated with the regressor \(X_i\) (\(X\) is endogenous) such that OLS is inconsistent for the true \(\beta_1\). In the simplest case, IV regression uses a single instrumental variable \(Z\) to obtain a consistent estimator for \(\beta_1\).
\(Z\) must satisfy two conditions to be a valid instrument:
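1. Instrument relevance condition: \(X\) and its instrument \(Z\) must be correlated: \(\rho_{Z_i,X_i} \neq 0\).
2. Instrument exogeneity condition: the instrument \(Z\) must not be correlated with the error term \(u\): \(\rho_{Z_i,u_i} = 0\).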
The Two-Stage Least Squares Estimator
As can be guessed from its name, TSLS proceeds in two stages. In the first stage, the variation in the endogenous regressor \(X\) is decomposed into a problem-free component that is explained by the instrument \(Z\) and a problematic component that is correlated with the error \(u_i\). The second stage uses the problem-free component of the variation in \(X\) to estimate \(\beta_1\).
The first stage regression model is \[X_i = \pi_0 + \pi_1 Z_i + \nu_i,\] where \(\pi_0 + \pi_1 Z_i\) is the component of \(X_i\) that is explained by \(Z_i\) while \(\nu_i\) is the component that cannot be explained by \(Z_i\) and exhibits correlation with \(u_i\).
Using the OLS estimates \(\widehat{\pi}_0\) and \(\widehat{\pi}_1\) we obtain predicted values \(\widehat{X}_i, \ \ i=1,\dots,n\). If \(Z\) is a valid instrument, the \(\widehat{X}_i\) are problem-free in the sense that \(\widehat{X}\) is exogenous in a regression of \(Y\) on \(\widehat{X}\) which is done in the second stage regression. The second stage produces \(\widehat{\beta}_0^{TSLS}\) and \(\widehat{\beta}_1^{TSLS}\), the TSLS estimates of \(\beta_0\) and \(\beta_1\).
For the case of a single instrument one can show that the TSLS estimator of \(\beta_1\) is
\[\begin{align} \widehat{\beta}_1^{TSLS} = \frac{s_{ZY}}{s_{ZX}} = \frac{\frac{1}{n-1}\sum_{i=1}^n(Y_i - \overline{Y})(Z_i - \overline{Z})}{\frac{1}{n-1}\sum_{i=1}^n(X_i - \overline{X})(Z_i - \overline{Z})}, \tag{12.2} \end{align}\]
which is nothing but the ratio of the sample covariance between \(Z\) and \(Y\) to the sample covariance between \(Z\) and \(X\).
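The equivalence between the two-stage procedure and the covariance ratio in (12.2) can be checked numerically. The following is a minimal sketch on simulated data, not part of the book's application: the sample size, the coefficients and the variable names are illustrative assumptions chosen so that \(X\) is endogenous and \(Z\) is a valid instrument.

```r
# simulated data; all parameter values are illustrative assumptions
set.seed(123)
n <- 1000
Z <- rnorm(n)                           # instrument
u <- rnorm(n)                           # structural error
X <- 1 + 0.5 * Z + 0.8 * u + rnorm(n)   # endogenous regressor: depends on u
Y <- 2 + 1.5 * X + u                    # true beta_1 is 1.5

# OLS slope (inconsistent here because X and u are correlated)
beta_ols <- unname(coef(lm(Y ~ X))[2])

# TSLS by hand: first stage, then regress Y on the fitted values
X_hat <- fitted(lm(X ~ Z))
beta_2stage <- unname(coef(lm(Y ~ X_hat))[2])

# ratio of sample covariances as in (12.2)
beta_ratio <- cov(Z, Y) / cov(Z, X)

c(OLS = beta_ols, two_stage = beta_2stage, cov_ratio = beta_ratio)

# the two TSLS computations agree up to numerical precision
all.equal(beta_2stage, beta_ratio)
```

In this setup the OLS slope should lie clearly above the true value of \(1.5\), while both TSLS computations return the same number close to it.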
As shown in Appendix 12.3 of the book, (12.2) is a consistent estimator for \(\beta_1\) in (12.1) under the assumption that \(Z\) is a valid instrument. Just as for the OLS estimators we have considered so far, the CLT implies that the distribution of \(\widehat{\beta}_1^{TSLS}\) can be approximated by a normal distribution if the sample size is large. This allows us to use \(t\)-statistics and confidence intervals, which R functions such as coeftest() compute. A more detailed argument on the large-sample distribution of the TSLS estimator is sketched in Appendix 12.3 of the book.
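The consistency argument can be sketched in one line. This is only an outline that assumes (12.1), instrument relevance (\(\text{Cov}(Z_i, X_i) \neq 0\)) and instrument exogeneity (\(\text{Cov}(Z_i, u_i) = 0\)); the full proof is the one in the book's appendix. Taking the covariance of both sides of (12.1) with \(Z_i\) gives
\[\text{Cov}(Z_i, Y_i) = \beta_1 \, \text{Cov}(Z_i, X_i) + \text{Cov}(Z_i, u_i) = \beta_1 \, \text{Cov}(Z_i, X_i),\]
so that \(\beta_1 = \text{Cov}(Z_i, Y_i) / \text{Cov}(Z_i, X_i)\). Since sample covariances are consistent for population covariances, it follows that
\[\widehat{\beta}_1^{TSLS} = \frac{s_{ZY}}{s_{ZX}} \xrightarrow{p} \frac{\text{Cov}(Z_i, Y_i)}{\text{Cov}(Z_i, X_i)} = \beta_1.\]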
Application to the Demand For Cigarettes
The relation between the demand for and the price of commodities is a simple yet widespread problem in economics. Health economics is concerned with the study of how health-affecting behavior of individuals is influenced by the health-care system and regulation policy. Probably the most prominent example in public policy debates is smoking as it is related to many illnesses and negative externalities.
It is plausible that cigarette consumption can be reduced by taxing cigarettes more heavily. The question is by how much taxes must be increased to reach a certain reduction in cigarette consumption. Economists use elasticities to answer this kind of question. Since the price elasticity of the demand for cigarettes is unknown, it must be estimated. As discussed in the box Who Invented Instrumental Variables Regression? presented in Chapter 12.1 of the book, an OLS regression of log quantity on log price cannot be used to estimate the effect of interest since there is simultaneous causality between demand and supply. Instead, IV regression can be used.
We use the data set CigarettesSW which comes with the package AER. It is a panel data set that contains observations on cigarette consumption and several economic indicators for the 48 contiguous U.S. states in the years 1985 and 1995. Following the book, we consider data for the cross section of states in 1995 only.
We start by loading the package and the data set and getting an overview.
# load the data set and get an overview
library(AER)
data("CigarettesSW")
summary(CigarettesSW)
#> state year cpi population packs
#> AL : 2 1985:48 Min. :1.076 Min. : 478447 Min. : 49.27
#> AR : 2 1995:48 1st Qu.:1.076 1st Qu.: 1622606 1st Qu.: 92.45
#> AZ : 2 Median :1.300 Median : 3697472 Median :110.16
#> CA : 2 Mean :1.300 Mean : 5168866 Mean :109.18
#> CO : 2 3rd Qu.:1.524 3rd Qu.: 5901500 3rd Qu.:123.52
#> CT : 2 Max. :1.524 Max. :31493524 Max. :197.99
#> (Other):84
#> income tax price taxs
#> Min. : 6887097 Min. :18.00 Min. : 84.97 Min. : 21.27
#> 1st Qu.: 25520384 1st Qu.:31.00 1st Qu.:102.71 1st Qu.: 34.77
#> Median : 61661644 Median :37.00 Median :137.72 Median : 41.05
#> Mean : 99878736 Mean :42.68 Mean :143.45 Mean : 48.33
#> 3rd Qu.:127313964 3rd Qu.:50.88 3rd Qu.:176.15 3rd Qu.: 59.48
#> Max. :771470144 Max. :99.00 Max. :240.85 Max. :112.63
#>
Use ?CigarettesSW for a detailed description of the variables.
We are interested in estimating \(\beta_1\) in
\[\begin{align} \log(Q_i^{cigarettes}) = \beta_0 + \beta_1 \log(P_i^{cigarettes}) + u_i, \tag{12.3} \end{align}\]
where \(Q_i^{cigarettes}\) is the number of cigarette packs per capita sold and \(P_i^{cigarettes}\) is the after-tax average real price per pack of cigarettes in state \(i\).
The instrumental variable we are going to use for instrumenting the endogenous regressor \(\log(P_i^{cigarettes})\) is \(SalesTax\), the portion of taxes on cigarettes arising from the general sales tax. \(SalesTax\) is measured in dollars per pack. The idea is that \(SalesTax\) is a relevant instrument as it is included in the after-tax average price per pack. Also, it is plausible that \(SalesTax\) is exogenous since the sales tax does not influence quantity sold directly but indirectly through the price.
We perform some transformations in order to obtain deflated cross section data for the year 1995.
We also compute the sample correlation between the sales tax and price per pack. The sample correlation is a consistent estimator of the population correlation. The estimate of approximately \(0.614\) indicates that \(SalesTax\) and \(P_i^{cigarettes}\) exhibit positive correlation which meets our expectations: higher sales taxes lead to higher prices. However, a correlation analysis like this is not sufficient for checking whether the instrument is relevant. We will later come back to the issue of checking whether an instrument is relevant and exogenous.
# compute real per capita prices
CigarettesSW$rprice <- with(CigarettesSW, price / cpi)
# compute the sales tax
CigarettesSW$salestax <- with(CigarettesSW, (taxs - tax) / cpi)
# check the correlation between sales tax and price
cor(CigarettesSW$salestax, CigarettesSW$price)
#> [1] 0.6141228
# generate a subset for the year 1995
c1995 <- subset(CigarettesSW, year == "1995")
The first stage regression is \[\log(P_i^{cigarettes}) = \pi_0 + \pi_1 SalesTax_i + \nu_i.\] We estimate this model in R using lm(). In the second stage we run a regression of \(\log(Q_i^{cigarettes})\) on \(\widehat{\log(P_i^{cigarettes})}\) to obtain \(\widehat{\beta}_0^{TSLS}\) and \(\widehat{\beta}_1^{TSLS}\).
# perform the first stage regression
cig_s1 <- lm(log(rprice) ~ salestax, data = c1995)
coeftest(cig_s1, vcov = vcovHC, type = "HC1")
#>
#> t test of coefficients:
#>
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 4.6165463 0.0289177 159.6444 < 2.2e-16 ***
#> salestax 0.0307289 0.0048354 6.3549 8.489e-08 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The estimated first stage regression is \[\widehat{\log(P_i^{cigarettes})} = \underset{(0.03)}{4.62} + \underset{(0.005)}{0.031} SalesTax_i,\] which predicts a positive relation between the sales tax and the price per pack of cigarettes. How much of the observed variation in \(\log(P^{cigarettes})\) is explained by the instrument \(SalesTax\)? This can be answered by looking at the regression’s \(R^2\), which states that about \(47\%\) of the variation in after-tax prices is explained by the variation of the sales tax across states.
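The \(R^2\) of the first stage can be read off the summary of the fitted lm object. The snippet below is a self-contained sketch that repeats the data preparation from above so it runs on its own; the name r2_s1 is introduced here only for illustration.

```r
library(AER)            # provides the CigarettesSW data set
data("CigarettesSW")

# same transformations as in the text
CigarettesSW$rprice <- with(CigarettesSW, price / cpi)
CigarettesSW$salestax <- with(CigarettesSW, (taxs - tax) / cpi)
c1995 <- subset(CigarettesSW, year == "1995")

# first stage regression and its R^2
cig_s1 <- lm(log(rprice) ~ salestax, data = c1995)
r2_s1 <- summary(cig_s1)$r.squared   # r2_s1 is an illustrative name
r2_s1
```

The value should be close to the \(47\%\) quoted in the text.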
We next store \(\widehat{\log(P_i^{cigarettes})}\), the fitted values obtained by the first stage regression cig_s1, in the variable lcigp_pred.
Next, we run the second stage regression which gives us the TSLS estimates we seek.
# store the fitted values of the first stage regression
lcigp_pred <- cig_s1$fitted.values
# run the stage 2 regression
cig_s2 <- lm(log(c1995$packs) ~ lcigp_pred)
coeftest(cig_s2, vcov = vcovHC)
#>
#> t test of coefficients:
#>
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 9.71988 1.70304 5.7074 7.932e-07 ***
#> lcigp_pred -1.08359 0.35563 -3.0469 0.003822 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Thus estimating the model (12.3) using TSLS yields
\[\begin{align} \widehat{\log(Q_i^{cigarettes})} = \underset{(1.70)}{9.72} - \underset{(0.36)}{1.08} \log(P_i^{cigarettes}), \tag{12.4} \end{align}\]
where we write \(\log(P_i^{cigarettes})\) instead of \(\widehat{\log(P_i^{cigarettes})}\) for consistency with the book.
The function ivreg() from the package AER carries out the TSLS procedure automatically. It is used much like lm(): instruments are added to the usual regression formula after a vertical bar that separates the model equation from the instruments. Thus, for the regression at hand the correct formula is log(packs) ~ log(rprice) | salestax.
# perform TSLS using 'ivreg()'
cig_ivreg <- ivreg(log(packs) ~ log(rprice) | salestax, data = c1995)
coeftest(cig_ivreg, vcov = vcovHC, type = "HC1")
#>
#> t test of coefficients:
#>
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 9.71988 1.52832 6.3598 8.346e-08 ***
#> log(rprice) -1.08359 0.31892 -3.3977 0.001411 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We find that the coefficient estimates coincide for both approaches.
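The agreement can also be checked programmatically, together with the covariance-ratio formula (12.2) applied to the cigarette data. This is a self-contained sketch that rebuilds the objects from above; beta_cov is a name introduced here for illustration.

```r
library(AER)            # provides CigarettesSW and ivreg()
data("CigarettesSW")

# data preparation as in the text
CigarettesSW$rprice <- with(CigarettesSW, price / cpi)
CigarettesSW$salestax <- with(CigarettesSW, (taxs - tax) / cpi)
c1995 <- subset(CigarettesSW, year == "1995")

# manual two-stage estimation
cig_s1 <- lm(log(rprice) ~ salestax, data = c1995)
lcigp_pred <- fitted(cig_s1)
cig_s2 <- lm(log(c1995$packs) ~ lcigp_pred)

# TSLS via ivreg()
cig_ivreg <- ivreg(log(packs) ~ log(rprice) | salestax, data = c1995)

# slope from the ratio of sample covariances, see (12.2)
beta_cov <- with(c1995, cov(salestax, log(packs)) / cov(salestax, log(rprice)))

# all three routes should give the same slope estimate
all.equal(unname(coef(cig_s2)), unname(coef(cig_ivreg)))
all.equal(beta_cov, unname(coef(cig_ivreg)[2]))
```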
Two notes on the computation of TSLS standard errors
1. We have demonstrated that running the individual regressions for each stage of TSLS using lm() leads to the same coefficient estimates as when using ivreg(). However, the standard errors reported for the second-stage regression (e.g., by coeftest() or summary()) are invalid because they do not account for the use of predictions from the first-stage regression as regressors in the second-stage regression. Fortunately, ivreg() performs the necessary adjustment automatically. This is another advantage over the manual step-by-step estimation, which we have carried out above to demonstrate the mechanics of the procedure.
2. Just like in multiple regression, it is important to compute heteroskedasticity-robust standard errors as we have done above using vcovHC().
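To see where the second-stage standard errors go wrong, it helps to look at the residuals. A naive second stage computes residuals from the fitted values \(\widehat{X}_i\), whereas the TSLS residuals are \(Y_i - \widehat{\beta}_0^{TSLS} - \widehat{\beta}_1^{TSLS} X_i\), based on the observed \(X_i\). The sketch below illustrates this on simulated data with illustrative parameter values and, for simplicity, uses the homoskedasticity-only formula rather than the robust one used in the text; the helper se_slope() is hypothetical and defined only for this example.

```r
# simulated data; all parameter values are illustrative assumptions
set.seed(1)
n <- 500
Z <- rnorm(n)
u <- rnorm(n)
X <- 0.8 * Z + 0.6 * u + rnorm(n)   # endogenous regressor
Y <- 1 + 2 * X + u

# manual two-stage estimation
X_hat <- fitted(lm(X ~ Z))
s2 <- lm(Y ~ X_hat)
b <- unname(coef(s2))

# residuals: naive version uses X_hat, TSLS version uses the observed X
res_naive <- Y - b[1] - b[2] * X_hat
res_tsls <- Y - b[1] - b[2] * X

# homoskedasticity-only standard error of the slope (hypothetical helper)
se_slope <- function(res) {
  sqrt(sum(res^2) / (n - 2) / sum((X_hat - mean(X_hat))^2))
}

se_naive <- se_slope(res_naive)   # what lm() reports for the second stage
se_tsls <- se_slope(res_tsls)     # based on the TSLS residuals
c(naive = se_naive, tsls = se_tsls)
```

The naive value reproduces the standard error printed by summary(s2), which shows that lm() has no way of knowing that its regressor was itself estimated.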
The TSLS estimate for \(\beta_1\) in (12.4) suggests that an increase in cigarette prices by one percent reduces cigarette consumption by roughly \(1.08\) percent, which is fairly elastic. However, we should keep in mind that this estimate might not be trustworthy even though we used IV estimation: there still might be a bias due to omitted variables. Thus a multiple IV regression approach is needed.
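Because (12.3) is a log-log model, the slope is an elasticity, and reading it as "percent per percent" is a first-order approximation. As a small illustrative calculation, the exact change in quantity implied by the estimated model for a one percent price increase can be computed as follows; the variable names are introduced here only for this example.

```r
beta_1 <- -1.08359       # TSLS slope estimate reported above
price_increase <- 0.01   # a one percent increase in price

# exact percentage change in quantity implied by the log-log model
pct_change_q <- (exp(beta_1 * log(1 + price_increase)) - 1) * 100
pct_change_q
```

For small price changes the exact figure is very close to the elasticity itself.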