{ "cells": [ { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# RESEARCH IN PYTHON: INSTRUMENTAL VARIABLES ESTIMATION\n", "# by J. NATHAN MATIAS March 18, 2015\n", "\n", "# THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\n", "# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\n", "# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\n", "# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\n", "# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\n", "# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN\n", "# THE SOFTWARE." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Instrumental Variables Estimation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This section is taken from [Chapter 10](http://www.ats.ucla.edu/stat/stata/examples/methods_matter/chapter10/default.htm) of [Methods Matter](http://www.ats.ucla.edu/stat/examples/methods_matter/) by Richard Murnane and John Willett. 
The descriptions are taken from Wikipedia, for copyright reasons.\n", "\n", "In statistics, econometrics, epidemiology and related disciplines, the method of instrumental variables (IV) is used to estimate causal relationships when controlled experiments are not feasible or when a treatment is not successfully delivered to every unit in a randomized experiment.\n", "\n", "In linear models, there are two main requirements for using an IV:\n", "\n", "* The instrument *must* be correlated with the *endogenous explanatory variables*, conditional on the other covariates.\n", "* The instrument *cannot* be correlated with the *error term* in the explanatory equation (conditional on the other covariates), that is, the instrument cannot suffer from the same problem as the original predicting variable.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Example: Predicting Civic Engagement from College Attainment" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Can we use college attainment (COLLEGE) to predict the probability of civic engagement (REGISTER)? College attainment is not randomized, and the arrow of causality may point in the opposite direction, so all a standard regression can do is establish a correlation.\n", "\n", "In this example, we use the distance between the student's school and a community college (DISTANCE) as an _instrumental variable_ to estimate a causal relationship. This is possible only if DISTANCE is related to college attainment and NOT related to the residuals of regressing REGISTER on COLLEGE.\n", "\n", "The Python code listed here is roughly parallel to [the code listed in the textbook example](http://www.ats.ucla.edu/stat/stata/examples/methods_matter/chapter10/default.htm) for Methods Matter Chapter 10. 
If you're curious about how to do a similar example in R, check out \"[A Simple Instrumental Variables Problem](http://www.r-bloggers.com/a-simple-instrumental-variables-problem/)\" by Adam Hyland on R-Bloggers or Ani Katchova's \"[Instrumental Variables in R](https://www.youtube.com/watch?v=OwM3BgWEgUg)\" video on YouTube." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# THINGS TO IMPORT\n", "# This is a baseline set of libraries I import by default if I'm rushed for time.\n", "\n", "import codecs # load UTF-8 Content\n", "import json # load JSON files\n", "import pandas as pd # Pandas handles dataframes\n", "import numpy as np # Numpy handles lots of basic maths operations\n", "import matplotlib.pyplot as plt # Matplotlib for plotting\n", "import seaborn as sns # Seaborn for beautiful plots\n", "from dateutil import * # I prefer dateutil for parsing dates\n", "import math # transformations\n", "import statsmodels.formula.api as smf # for doing statistical regression\n", "import statsmodels.api as sm # access to the wider statsmodels library, including R datasets\n", "from collections import Counter # Counter is useful for grouping and counting\n", "import scipy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Acquire Dee Dataset from Methods Matter" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import os.path\n", "try:\n", "    from urllib.request import urlopen # Python 3\n", "except ImportError:\n", "    from urllib2 import urlopen # Python 2\n", "\n", "# Download the Dee dataset once and cache it locally\n", "if not os.path.isfile(\"dee.dta\"):\n", "    response = urlopen(\"http://www.ats.ucla.edu/stat/stata/examples/methods_matter/chapter10/dee.dta\")\n", "    if response.getcode() == 200:\n", "        # Stata .dta files are binary, so write in binary mode\n", "        with open(\"dee.dta\", \"wb\") as f:\n", "            f.write(response.read())\n", "dee_df = pd.read_stata(\"dee.dta\")" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "# Summary Statistics" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": 
false }, "outputs": [ { "data": { "text/html": [ "
| | register | college | distance |
|---|---|---|---|
| count | 9227.000000 | 9227.000000 | 9227.000000 |
| mean | 0.670857 | 0.547090 | 9.735992 |
| std | 0.469927 | 0.497805 | 8.702286 |
| min | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.000000 | 0.000000 | 3.000000 |
| 50% | 1.000000 | 1.000000 | 7.000000 |
| 75% | 1.000000 | 1.000000 | 15.000001 |
| max | 1.000000 | 1.000000 | 35.000000 |