{ "cells": [ { "cell_type": "code", "execution_count": 3, "metadata": { "nbsphinx": "hidden" }, "outputs": [], "source": [ "import statsmodels.formula.api as smf\n", "import pandas as pd\n", "import numpy as np\n", "\n", "from auxiliary import get_treatment_probability\n", "from auxiliary import get_plot_probability\n", "from auxiliary import plot_outcomes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Regression discontinuity design" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following material is mostly based on this review:\n", "\n", "* Lee, D. S., and Lemieux, T. (2010). [Regression discontinuity designs in economics](https://www.aeaweb.org/articles?id=10.1257/jel.48.2.281). *Journal of Economic Literature, 48*(2), 281–355.\n", "\n", "The authors contrast RDD with its alternatives throughout. They initially mention only selected features in the introduction but later devote a whole section to the comparison. This is clearly a core strength of the article, and I hope to maintain this focus in my lecture. Their main selling point for RDD as a close cousin of the standard randomized controlled trial is that the behavioral assumption of imprecise control over the assignment variable translates\n", "into the statistical assumptions of a randomized experiment.\n", "\n", "**Original application**\n", "\n", "In the initial application of RD designs, Thistlethwaite & Campbell (1960) analyzed the impact of merit awards on future academic outcomes. The awards were allocated based on the observed test score.
The main idea behind the research design was that individuals with scores just below the cutoff (who did not get the award) were good comparisons to those just above the cutoff (who did receive the award).\n", "\n", "\n", "**Causal graph**\n", "\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Intuition" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Key points of RD design**\n", "\n", "- RD designs can be invalid if individuals can precisely manipulate the assignment variable - discontinuity rules might generate incentives to do so.\n", "\n", "- If individuals - even while having some influence - are unable to precisely manipulate the assignment variable, then the variation in treatment near the threshold is as good as randomized, as in a randomized experiment - in contrast to the IV assumption.\n", "\n", "- RD designs can be analyzed - and tested - like randomized experiments.\n", "\n", "- Graphical representation of an RD design is helpful and informative, but the visual presentation should not be tilted toward either finding an effect or finding no effect.\n", "\n", "- Nonparametric estimation does not represent a \"solution\" to functional form issues raised by RD designs. It is therefore helpful to view it as a complement to - rather than a substitute for - parametric estimation.\n", "\n", "- Goodness-of-fit and other statistical tests can help rule out overly restrictive specifications." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Baseline**\n", "\n", "A simple way to estimate the treatment effect $\\tau$ is to run the following linear regression.\n", "\n", "\\begin{align*}\n", "Y = \\alpha + D \\tau + X \\beta + \\epsilon,\n", "\\end{align*}\n", "\n", "where $D \\in \\{0, 1\\}$ and we have $D = 1$ if $X \\geq c$ and $D=0$ otherwise."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Baseline setup**\n", "\n", "\n", "\n", "* \"all other factors\" determining $Y$ must be evolving \"smoothly\" (continuously) with respect to $X$.\n", "\n", "* the estimate will depend on the functional form" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Potential outcome framework**\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Potential outcome framework**\n", "\n", "\n", "Suppose $D = 1$ if $X \\geq c$, and $D=0$ otherwise.\n", "\\begin{align*}\n", "\\Rightarrow\\begin{cases}\n", "E(Y \\mid X = x) = E(Y_0 \\mid X = x) & \\text{for}\\quad x < c \\\\\n", "E(Y \\mid X = x) = E(Y_1 \\mid X = x) & \\text{for}\\quad x \\geq c\n", "\\end{cases}\n", "\\end{align*}\n", "\n", "Suppose $E(Y_1\\mid X = x), E(Y_0\\mid X = x)$ are continuous in $x$.\n", "\\begin{align*}\n", "\\Rightarrow\\begin{cases}\n", "\\lim_{\\epsilon \\searrow 0} E(Y_0\\mid X = c - \\epsilon) = E(Y_0\\mid X = c) \\\\\n", "\\lim_{\\epsilon \\searrow 0} E(Y_1\\mid X = c + \\epsilon) = E(Y_1\\mid X = c) \\\\\n", "\\end{cases}\n", "\\end{align*}\n", "\n", "\\begin{align*}\n", "&\\lim_{\\epsilon \\searrow 0} E(Y\\mid X = c + \\epsilon) - \\lim_{\\epsilon \\searrow 0} E(Y\\mid X = c - \\epsilon) \\\\\n", "&\\qquad= \\lim_{\\epsilon \\searrow 0} E(Y_1\\mid X = c + \\epsilon) - \\lim_{\\epsilon \\searrow 0}E(Y_0\\mid X = c - \\epsilon) \\\\\n", "&\\qquad=E(Y_1\\mid X = c) - E(Y_0\\mid X = c) \\\\\n", "&\\qquad=E(Y_1 - Y_0\\mid X = c)\n", "\\end{align*}\n", "\n", "$\\Rightarrow$ average treatment effect at the cutoff" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sharp and Fuzzy design" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "grid = np.linspace(0, 1.0, num=1000)\n", "for version in [\"sharp\", \"fuzzy\"]:\n", " probs = get_treatment_probability(version, grid)\n", " get_plot_probability(version, grid, probs)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "for version in [\"sharp\", \"fuzzy\"]:\n", " plot_outcomes(version, grid)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Alternatives**\n", "\n", "Consider the standard assumptions for matching:\n", "\n", "- ignorability - trivially satisfied by research design as there is no variation left in $D$ conditional on $X$\n", "- common support - cannot be satisfied and replaced by continuity\n", "\n", "Lee and Lemieux (2010) emphasize the close connection of RDD to randomized experiments.\n", "- How does the graph in the potential outcome framework change?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Continuity, the key assumption of RDD, is a consequence of the research design (e.g. randomization) and not simply imposed." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Identification\n", "\n", "Ad-hoc $\\times$ vs. thoughtful answers $\\checkmark$. Both are true, but only thoughtful consideration clarifies the strength of the regression discontinuity design as opposed to, for example, an instrumental variables approach." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question**\n", "\n", "How do I know whether an RD design is appropriate for my context? When are the identification assumptions plausable or implausable?" 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Answers**\n", "\n", "$\\times$ An RD design will be appropriate if it is plausible that all other unobservable factors are \"continuously\" related to the assignment variable.\n", "\n", "$\\checkmark$ When there is a continuously distributed stochastic error component to the assignment variable - which can occur when optimizing agents do not have *precise* control over the assignment variable - then the variation in the treatment will be as good as randomized in a neighborhood around the discontinuity threshold." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question**\n", "\n", "Is there any way I can test those assumptions?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Answers**\n", "\n", "$\\times$ No, the continuity assumption is necessary so there are no tests for the validity of the design.\n", "\n", "$\\checkmark$ Yes. As in a randomized experiment, the distribution of observed baseline covariates should not change discontinuously around the threshold." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Simplified setup**\n", "\n", "\\begin{align*}\n", "Y & = D \\tau + W \\delta_1 + U \\\\\n", "D & = I [X \\geq c] \\\\\n", "X & = W \\delta_2 + V\n", "\\end{align*}\n", "\n", "- $W$ is the vector of all predetermined and observable characteristics.\n", "\n", "What are the sources of heterogeneity in the outcome and assignment variable?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The setup for an RD design is more flexible than other estimation strategies.\n", "- We allow for $W$ to be endogenously determined as long as it is determined prior to $V$. This ensures some random variation around the threshold.\n", "- We take no stance as to whether some elements of $\\delta_1$ and $\\delta_2$ are zero (exclusion restrictions).\n", "- We make no assumptions about the correlations between $W$, $U$, and $V$."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Local randomization**\n", "\n", "We say individuals have imprecise control over $X$ when conditional on $W = w$ and $U = u$ the density of $V$ (and hence $X$) is continuous." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Applying Bayes' rule**\n", "\n", "\\begin{align*}\n", "& \\Pr[W = w, U = u \\mid X = x] \\\\\n", "&\\qquad\\qquad = f(x \\mid W = w, U = u) \\quad\\frac{\\Pr[W = w, U = u]}{f(x)}\n", "\\end{align*}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Local randomization:** If individuals have imprecise control over $X$ as defined above, then $\\Pr[W =w, U = u \\mid X = x]$ is continuous in $x$: the treatment is \"as good as\" randomly assigned around the cutoff.\n", "\n", "$\\Rightarrow$ the behavioral assumption of imprecise control over $X$ around the threshold implies that treatment is locally randomized." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Consequences**\n", "\n", "- testing the prediction that $\\Pr[W =w, U = u \\mid X = x]$ is continuous in $x$ by at least looking at $\\Pr[W =w\\mid X = x]$\n", "- irrelevance of including baseline covariates" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Interpretation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Questions**\n", "\n", "To what extent are results from RD designs generalizable?"
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Answers**\n", "\n", "$\\times$ The RD estimate of the treatment effect is only applicable to the subpopulation of individuals at the discontinuity threshold and uninformative about the effect everywhere else.\n", "\n", "$\\checkmark$ The RD estimand can be interpreted as a weighted average treatment effect, where the weights are the relative ex ante probabilities that the value of an individual's assignment variable will be in the neighborhood of the threshold." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Alternative evaluation strategies\n", "\n", "- randomized experiment\n", "- regression discontinuity design\n", "- matching on observables\n", "- instrumental variables\n", "\n", "How do the (assumed) relationships between treatment, observables, and unobservables differ across research designs?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Endogenous dummy variable**\n", "\n", "\\begin{align*}\n", "Y & = D \\tau + W \\delta_1 + U \\\\\n", "D & = I[X \\geq c] \\\\\n", "X & = W \\delta_2 + V\n", "\\end{align*}\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "* By construction $X$ is not related to any other observable or unobservable characteristic." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "* $W$ and $D$ might be systematically related to $X$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "* The crucial assumption is that the two lines in the left graph are actually superimposed on each other.\n", "\n", "* The plot in the middle is missing because all variables are used for estimation and are therefore not available to test the validity of the identifying assumptions." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "* The instrument must affect the treatment probability.\n", "* A proper instrument requires the line in the right graph to be flat."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Nonlinear expectation\n", "\n", "A nonlinear conditional expectation can easily lead to misleading results if the estimated model is based on a linear regression. The example below, including the simulation code, is adapted from Cunningham (2021). The example is closely aligned with the potential outcome framework." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "df = pd.DataFrame(columns=[\"Y\", \"Y1\", \"Y0\", \"X\", \"X2\"], dtype=float)\n", "\n", "# We simulate a running variable, truncate it at\n", "# zero and restrict it below 280.\n", "df[\"X\"] = np.random.normal(100, 50, 1000)\n", "df.loc[df[\"X\"] < 0, \"X\"] = 0\n", "df = df[df[\"X\"] < 280]\n", "\n", "df[\"X2\"] = df[\"X\"] ** 2\n", "\n", "# Treatment is assigned to observations above the cutoff of 140.\n", "df[\"D\"] = 0\n", "df.loc[df[\"X\"] > 140, \"D\"] = 1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now simulate the potential outcomes and record the observed outcome. Note that there is no effect of treatment." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "def get_outcomes(x, d):\n", "\n", "    # Both potential outcomes share the same quadratic conditional\n", "    # mean, so they differ only by noise.\n", "    level = 10000 - 100 * x + x ** 2\n", "    eps = np.random.normal(0, 1000, 2)\n", "    y1, y0 = level + eps\n", "    y = d * y1 + (1 - d) * y0\n", "\n", "    return y, y1, y0\n", "\n", "\n", "for idx, row in df.iterrows():\n", "    df.loc[idx, [\"Y\", \"Y1\", \"Y0\"]] = get_outcomes(row[\"X\"], row[\"D\"])\n", "\n", "df = df.astype(float)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What about the difference in average outcomes by treatment status? Where does the difference come from?"
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "D\n", "0.0 9836.848389\n", "1.0 21643.244614\n", "Name: Y, dtype: float64" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby(\"D\")[\"Y\"].mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we are ready for a proper RDD setup." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " OLS Regression Results \n", "==============================================================================\n", "Dep. Variable: Y R-squared: 0.783\n", "Model: OLS Adj. R-squared: 0.782\n", "Method: Least Squares F-statistic: 1795.\n", "Date: Wed, 07 Jul 2021 Prob (F-statistic): 0.00\n", "Time: 08:59:13 Log-Likelihood: -9407.4\n", "No. Observations: 1000 AIC: 1.882e+04\n", "Df Residuals: 997 BIC: 1.884e+04\n", "Df Model: 2 \n", "Covariance Type: nonrobust \n", "==============================================================================\n", " coef std err t P>|t| [0.025 0.975]\n", "------------------------------------------------------------------------------\n", "Intercept 4377.3226 241.423 18.131 0.000 3903.568 4851.077\n", "D 5889.2320 319.611 18.426 0.000 5262.044 6516.420\n", "X 68.1328 2.699 25.247 0.000 62.837 73.429\n", "==============================================================================\n", "Omnibus: 799.728 Durbin-Watson: 2.046\n", "Prob(Omnibus): 0.000 Jarque-Bera (JB): 27622.480\n", "Skew: 3.372 Prob(JB): 0.00\n", "Kurtosis: 27.849 Cond. No. 430.\n", "==============================================================================\n", "\n", "Notes:\n", "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n", " OLS Regression Results \n", "==============================================================================\n", "Dep. 
Variable: Y R-squared: 0.973\n", "Model: OLS Adj. R-squared: 0.973\n", "Method: Least Squares F-statistic: 1.215e+04\n", "Date: Wed, 07 Jul 2021 Prob (F-statistic): 0.00\n", "Time: 08:59:13 Log-Likelihood: -8357.0\n", "No. Observations: 1000 AIC: 1.672e+04\n", "Df Residuals: 996 BIC: 1.674e+04\n", "Df Model: 3 \n", "Covariance Type: nonrobust \n", "==============================================================================\n", " coef std err t P>|t| [0.025 0.975]\n", "------------------------------------------------------------------------------\n", "Intercept 1.005e+04 107.870 93.125 0.000 9833.768 1.03e+04\n", "D 2.2077 131.768 0.017 0.987 -256.368 260.783\n", "X -99.9678 2.202 -45.405 0.000 -104.288 -95.647\n", "X2 0.9959 0.012 84.522 0.000 0.973 1.019\n", "==============================================================================\n", "Omnibus: 0.261 Durbin-Watson: 2.002\n", "Prob(Omnibus): 0.878 Jarque-Bera (JB): 0.164\n", "Skew: -0.004 Prob(JB): 0.921\n", "Kurtosis: 3.062 Cond. No. 6.81e+04\n", "==============================================================================\n", "\n", "Notes:\n", "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n", "[2] The condition number is large, 6.81e+04. This might indicate that there are\n", "strong multicollinearity or other numerical problems.\n" ] } ], "source": [ "for ext_ in [\"X\", \"X + X2\"]:\n", "    rslt = smf.ols(formula=f\"Y ~ D + {ext_}\", data=df).fit()\n", "    print(rslt.summary())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In a nutshell, misspecifying the conditional mean function results in flawed inference: the linear specification attributes the curvature to a large spurious treatment effect, while the estimate is close to zero once the quadratic term is included." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Estimation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Lee (2008)\n", "\n", "The author studies the \"incumbency advantage\", i.e.
the overall causal impact of being the current incumbent party in a district on the votes obtained in the district's election.\n", "\n", "* Lee, David S. (2008). Randomized experiments from non-random selection in U.S. House elections. Journal of Econometrics." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
vote_lastvote_next
00.10490.5810
10.13930.4611
2-0.07360.5434
30.08680.5846
40.39940.5803
\n", "
" ], "text/plain": [ " vote_last vote_next\n", "0 0.1049 0.5810\n", "1 0.1393 0.4611\n", "2 -0.0736 0.5434\n", "3 0.0868 0.5846\n", "4 0.3994 0.5803" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_base = pd.read_csv(\"../../datasets/processed/msc/house.csv\")\n", "df_base.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's put in some effort to ease the flow of our coming analysis." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "df_base.rename(columns={\"vote_last\": \"last\", \"vote_next\": \"next\"}, inplace=True)\n", "\n", "df_base[\"incumbent_last\"] = np.where(df_base[\"last\"] > 0.0, \"democratic\", \"republican\")\n", "df_base[\"incumbent_next\"] = np.where(df_base[\"next\"] > 0.5, \"democratic\", \"republican\")\n", "\n", "df_base[\"D\"] = df_base[\"last\"] > 0\n", "\n", "for level in range(2, 5):\n", " label = \"last_{:}\".format(level)\n", " df_base.loc[:, label] = df_base[\"last\"] ** level" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The column `vote_last` refers to the Democrat's winning margin and is thus bounded between $-1$ and $1$. So a positive number indicates a Democrat as the incumbent." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### What are the basic characteristics of the dataset?" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "df_base.plot.scatter(x=\"last\", y=\"next\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What is going on at the boundary? What is the re-election rate?" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Re-election rate: 90.93%\n" ] } ], "source": [ "info = pd.crosstab(df_base[\"incumbent_last\"], df_base[\"incumbent_next\"], normalize=True)\n", "stat = info.to_numpy().diagonal().sum() * 100\n", "print(f\"Re-election rate: {stat:5.2f}%\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Regression discontinuity design" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How does the average vote in the next election look like as we move along last year's election." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "df_base[\"bin\"] = pd.cut(df_base[\"last\"], 200, labels=False) / 100 - 1\n", "df_base.groupby(\"bin\")[\"next\"].mean().plot(xlabel=\"last\", ylabel=\"next\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now compute the difference at the cutoffs to get an estimate for the treatment effect." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Treatment Effect: 0.096\n" ] } ], "source": [ "h = 0.05\n", "df_subset = df_base[df_base[\"last\"].between(-h, h)]\n", "stat = np.abs(df_subset.groupby(\"incumbent_last\")[\"next\"].mean().diff()[1])\n", "print(f\"Treatment Effect: {stat:5.3f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How does the effect depend on the size subset under consideration?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Regression approach" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we turn to an explicit model of the conditional mean. We first set up explicit models on both sides of the cutoff and then aggreagte the model into single regression estimations." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", " Republican\n", " OLS Regression Results \n", "==============================================================================\n", "Dep. Variable: next R-squared: 0.271\n", "Model: OLS Adj. R-squared: 0.270\n", "Method: Least Squares F-statistic: 339.2\n", "Date: Wed, 30 Jun 2021 Prob (F-statistic): 3.05e-187\n", "Time: 12:05:34 Log-Likelihood: 1749.4\n", "No. 
Observations: 2740 AIC: -3491.\n", "Df Residuals: 2736 BIC: -3467.\n", "Df Model: 3 \n", "Covariance Type: nonrobust \n", "==============================================================================\n", " coef std err t P>|t| [0.025 0.975]\n", "------------------------------------------------------------------------------\n", "Intercept 0.4278 0.007 57.880 0.000 0.413 0.442\n", "last -0.0971 0.077 -1.264 0.206 -0.248 0.054\n", "last_2 -1.7177 0.205 -8.359 0.000 -2.121 -1.315\n", "last_3 -1.4636 0.142 -10.338 0.000 -1.741 -1.186\n", "==============================================================================\n", "Omnibus: 203.681 Durbin-Watson: 1.866\n", "Prob(Omnibus): 0.000 Jarque-Bera (JB): 1087.416\n", "Skew: -0.022 Prob(JB): 7.42e-237\n", "Kurtosis: 6.086 Cond. No. 113.\n", "==============================================================================\n", "\n", "Notes:\n", "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n", "\n", "\n", " Democratic\n", " OLS Regression Results \n", "==============================================================================\n", "Dep. Variable: next R-squared: 0.379\n", "Model: OLS Adj. R-squared: 0.379\n", "Method: Least Squares F-statistic: 776.5\n", "Date: Wed, 30 Jun 2021 Prob (F-statistic): 0.00\n", "Time: 12:05:34 Log-Likelihood: 2055.2\n", "No. 
Observations: 3818 AIC: -4102.\n", "Df Residuals: 3814 BIC: -4077.\n", "Df Model: 3 \n", "Covariance Type: nonrobust \n", "==============================================================================\n", " coef std err t P>|t| [0.025 0.975]\n", "------------------------------------------------------------------------------\n", "Intercept 0.5393 0.007 71.995 0.000 0.525 0.554\n", "last 0.3553 0.071 4.998 0.000 0.216 0.495\n", "last_2 0.1932 0.174 1.107 0.268 -0.149 0.535\n", "last_3 -0.2111 0.114 -1.856 0.064 -0.434 0.012\n", "==============================================================================\n", "Omnibus: 439.976 Durbin-Watson: 2.136\n", "Prob(Omnibus): 0.000 Jarque-Bera (JB): 1993.314\n", "Skew: -0.477 Prob(JB): 0.00\n", "Kurtosis: 6.409 Cond. No. 114.\n", "==============================================================================\n", "\n", "Notes:\n", "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n" ] } ], "source": [ "def fit_regression(incumbent, df, level=4):\n", "\n", "    df_incumbent = df[df[\"incumbent_last\"] == incumbent].copy()\n", "\n", "    # Build a polynomial in the running variable up to order `level`.\n", "    formula = \"next ~ last\"\n", "    for power in range(2, level + 1):\n", "        label = \"last_{:}\".format(power)\n", "        formula += f\" + {label}\"\n", "\n", "    rslt = smf.ols(formula=formula, data=df_incumbent).fit()\n", "    return rslt\n", "\n", "\n", "for incumbent in [\"republican\", \"democratic\"]:\n", "    rslt = fit_regression(incumbent, df_base, level=3)\n", "    title = \"\\n\\n {:}\\n\".format(incumbent.capitalize())\n", "    print(title, rslt.summary())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What do the predictions look like?"
] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "dfs = list()\n", "\n", "for incumbent in [\"republican\", \"democratic\"]:\n", "    rslt = fit_regression(incumbent, df_base, level=4)\n", "\n", "    # For our predictions, we need to set up a grid for the evaluation.\n", "    if incumbent == \"republican\":\n", "        grid = np.linspace(-0.5, 0.0, 51)\n", "    else:\n", "        grid = np.linspace(+0.0, 0.5, 51)\n", "\n", "    df_grid = pd.DataFrame(grid, columns=[\"last\"])\n", "\n", "    for level in range(2, 5):\n", "        label = \"last_{:}\".format(level)\n", "        df_grid.loc[:, label] = df_grid[\"last\"] ** level\n", "\n", "    tmp = pd.DataFrame(rslt.predict(df_grid), columns=[\"prediction\"])\n", "    tmp.index = df_grid[\"last\"]\n", "    dfs.append(tmp)\n", "\n", "rslts = pd.concat(dfs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's have a look at the estimated conditional mean functions." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "rslts.plot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Regression " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are several alternatives to estimate the conditional mean functions.\n", "\n", "* pooled regressions\n", "\n", "* local linear regressions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Pooled regression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We estimate the conditinal mean using the whole function.\n", "\n", "\\begin{align*}\n", "Y = \\alpha + \\tau D + \\beta X + \\epsilon\n", "\\end{align*}\n", "\n", "This allows for a difference in levels but not slope." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
OLS Regression Results
Dep. Variable: next R-squared: 0.670
Model: OLS Adj. R-squared: 0.670
Method: Least Squares F-statistic: 6658.
Date: Wed, 30 Jun 2021 Prob (F-statistic): 0.00
Time: 12:05:34 Log-Likelihood: 3661.9
No. Observations: 6558 AIC: -7318.
Df Residuals: 6555 BIC: -7298.
Df Model: 2
Covariance Type: nonrobust
\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
coef std err t P>|t| [0.025 0.975]
Intercept 0.4427 0.003 139.745 0.000 0.437 0.449
D[T.True] 0.1137 0.006 20.572 0.000 0.103 0.125
last 0.3305 0.006 55.186 0.000 0.319 0.342
\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
Omnibus: 595.910 Durbin-Watson: 2.143
Prob(Omnibus): 0.000 Jarque-Bera (JB): 3444.243
Skew: -0.225 Prob(JB): 0.00
Kurtosis: 6.522 Cond. No. 5.69


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified." ], "text/plain": [ "\n", "\"\"\"\n", " OLS Regression Results \n", "==============================================================================\n", "Dep. Variable: next R-squared: 0.670\n", "Model: OLS Adj. R-squared: 0.670\n", "Method: Least Squares F-statistic: 6658.\n", "Date: Wed, 30 Jun 2021 Prob (F-statistic): 0.00\n", "Time: 12:05:34 Log-Likelihood: 3661.9\n", "No. Observations: 6558 AIC: -7318.\n", "Df Residuals: 6555 BIC: -7298.\n", "Df Model: 2 \n", "Covariance Type: nonrobust \n", "==============================================================================\n", " coef std err t P>|t| [0.025 0.975]\n", "------------------------------------------------------------------------------\n", "Intercept 0.4427 0.003 139.745 0.000 0.437 0.449\n", "D[T.True] 0.1137 0.006 20.572 0.000 0.103 0.125\n", "last 0.3305 0.006 55.186 0.000 0.319 0.342\n", "==============================================================================\n", "Omnibus: 595.910 Durbin-Watson: 2.143\n", "Prob(Omnibus): 0.000 Jarque-Bera (JB): 3444.243\n", "Skew: -0.225 Prob(JB): 0.00\n", "Kurtosis: 6.522 Cond. No. 5.69\n", "==============================================================================\n", "\n", "Notes:\n", "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n", "\"\"\"" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "smf.ols(formula=\"next ~ last + D\", data=df_base).fit().summary()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Local linear regression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now turn to local regressions by restricting the estimation to observations close to the cutoff.\n", "\n", "\\begin{align*}\n", "Y = \\alpha + \\tau D + \\beta X + \\gamma X D + \\epsilon,\n", "\\end{align*}\n", "\n", "where $-h \\leq X \\leq h$.
This allows for a difference in levels and slope." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Bandwidth: 0.3 Effect 8.318% pvalue 0.000\n", " Bandwidth: 0.2 Effect 7.818% pvalue 0.000\n", " Bandwidth: 0.1 Effect 6.058% pvalue 0.000\n", " Bandwidth: 0.05 Effect 4.870% pvalue 0.010\n", " Bandwidth: 0.01 Effect 9.585% pvalue 0.001\n" ] } ], "source": [ "for h in [0.3, 0.2, 0.1, 0.05, 0.01]:\n", "    # We restrict the sample to observations close\n", "    # to the cutoff.\n", "    df = df_base[df_base[\"last\"].between(-h, h)]\n", "\n", "    formula = \"next ~ D + last + D * last\"\n", "    rslt = smf.ols(formula=formula, data=df).fit()\n", "    # The treatment effect is the coefficient on the boolean indicator D.\n", "    info = [h, rslt.params[\"D[T.True]\"] * 100, rslt.pvalues[\"D[T.True]\"]]\n", "    print(\" Bandwidth: {:>4} Effect {:5.3f}% pvalue {:5.3f}\".format(*info))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There exists some work that can guide the choice of the bandwidth. Now, let's summarize the key issues and review some best practices." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Checklist" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Recommendations:**\n", "- To assess the possibility of manipulation of the assignment variable, show its distribution.\n", "- Present the main RD graph using binned local averages.\n", "- Graph a benchmark polynomial specification.\n", "- Explore the sensitivity of the results to a range of bandwidths and a range of orders of the polynomial.\n", "- Conduct a parallel RD analysis on the baseline covariates.\n", "- Explore the sensitivity of the results to the inclusion of baseline covariates." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## References\n", "\n", "* **Cattaneo, M.D., Idrobo, N., & Titiunik, R. (2019)**.
[A Practical Introduction to Regression Discontinuity Designs: Foundations](https://www.amazon.com/Introduction-Regression-Discontinuity-Quantitative-Computational/dp/1108710204?asin=1108710204&revisionId=&format=4&depth=1), Cambridge University Press.\n", "\n", "\n", "* **Cunningham, S. (2021)**. [Causal Inference: The Mixtape](https://www.scunning.com/mixtape.html#:~:text=Causal%20Inference%3A%20The%20Mixtape.%20An%20accessible%2C%20contemporary%20introduction,allow%20social%20scientists%20to%20determine%20what%20causes%20what.), Yale University Press.\n", "\n", "\n", "* **Hahn, J., Todd, P. E., and van der Klaauw, W. (2001)**. [Identification and estimation of treatment effects with a regression-discontinuity design](https://www.jstor.org/stable/2692190). *Econometrica*, 69(1), 201–209.\n", "\n", "\n", "* **Imbens, G., & Lemieux, T. (2008)**. [Regression discontinuity designs: A guide to practice](https://scholar.harvard.edu/files/imbens/files/regression_discontinuity_designs_a_guide_to_practice.pdf). *Journal of Econometrics*, 142(2), 615–635.\n", "\n", "\n", "* **Lee, D. S. (2008)**. [Randomized experiments from nonrandom selection in US House elections](https://www.sciencedirect.com/science/article/abs/pii/S0304407607001121). *Journal of Econometrics*, 142(2), 675–697.\n", "\n", "\n", "* **Lee, D. S., and Lemieux, T. (2010)**. [Regression discontinuity designs in economics](https://www.aeaweb.org/articles?id=10.1257/jel.48.2.281). *Journal of Economic Literature*, 48(2), 281–355.\n", "\n", "\n", "* **Thistlethwaite, D. L., and Campbell, D. T. (1960)**. [Regression-discontinuity analysis: An alternative to the ex-post facto experiment](https://psycnet.apa.org/record/1962-00061-001).
*Journal of Educational Psychology*, 51(6), 309–317.\n", "\n", "\n" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.8" } }, "nbformat": 4, "nbformat_minor": 4 }