{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# RESEARCH IN PYTHON: USING IVE TO RECOVER THE TREATMENT EFFECT\n",
    "# by J. NATHAN MATIAS March 18, 2015\n",
    "\n",
    "# THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\n",
    "# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\n",
    "# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\n",
    "# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\n",
    "# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\n",
    "# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN\n",
    "# THE SOFTWARE."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Using Instrumental-Variables Estimation to Recover the Treatment Effect in Quasi-Experiments "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This section is taken from [Chapter 11](http://www.ats.ucla.edu/stat/stata/examples/methods_matter/chapter11/default.htm) of [Methods Matter](http://www.ats.ucla.edu/stat/examples/methods_matter/) by Richard Murnane and John Willett. \n",
    "\n",
    "In Chapter 10, Murnane and Willett introduce instrumental variables estimation(IVE) as a method for carving out causal claims from observational data ([chapter summary](http://acawiki.org/Introducing_Instrumental-Variables_Estimation)) ([example code](http://nbviewer.ipython.org/github/natematias/research_in_python/blob/master/instrumental_variables_estimation/Instrumental-Variables%20Estimation.ipynb)). \n",
    "\n",
    "In Chapter 11, the authors explain how IVE can be used to \"recover\" the treatment effect in cases where random assignment is applied to an offer to participate, where not everyone takes the offer, and where other people participate through some other means. They use the example of research on the effectiveness of a financial aid offer on the likelihood of a student to finish 8th grade, using a subset of data from Bogotá from a study on \"[Vouchers for Private Schooling in Columbia](http://www.nber.org/papers/w8343)\" (2002) by Joshua Angrist, Eric Bettinger, Erik Bloom, Elizabeth King, and Michael Kremer ([full data here](http://economics.mit.edu/faculty/angrist/data1/data/angetal02), [subset data here](http://www.ats.ucla.edu/stat/stata/examples/methods_matter/chapter11/default.htm)).\n",
    "\n",
    "The dataset includes the following variables:\n",
    "* *finish8th*: did the student finish 8th grade or not (outcome variable)\n",
    "* *won_lottry*: won the lottery to receive offer of financial aid\n",
    "* *use_fin_aid*: did the student use financial aid of any kind (not exclusive to the lottery) or not\n",
    "* *base_age*: student age\n",
    "* *male*: is the student male or not"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# THINGS TO IMPORT\n",
    "# This is a baseline set of libraries I import by default if I'm rushed for time.\n",
    "\n",
    "import codecs                     # load UTF-8 Content\n",
    "import json                       # load JSON files\n",
    "import pandas as pd               # Pandas handles dataframes\n",
    "import numpy as np                # Numpy handles lots of basic maths operations\n",
    "import matplotlib.pyplot as plt   # Matplotlib for plotting\n",
    "import seaborn as sns             # Seaborn for beautiful plots\n",
    "from dateutil import *            # I prefer dateutil for parsing dates\n",
    "import math                       # transformations\n",
    "import statsmodels.formula.api as smf  # for doing statistical regression\n",
    "import statsmodels.api as sm      # access to the wider statsmodels library, including R datasets\n",
    "from collections import Counter   # Counter is useful for grouping and counting\n",
    "import scipy"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Acquire Dataset from Methods Matter"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import urllib2\n",
    "import os.path\n",
    "if(os.path.isfile(\"colombia_voucher.dta\")!=True):\n",
    "    response = urllib2.urlopen(\"http://www.ats.ucla.edu/stat/stata/examples/methods_matter/chapter11/colombia_voucher.dta\")\n",
    "    if(response.getcode()==200):\n",
    "        f = open(\"colombia_voucher.dta\",\"w\")\n",
    "        f.write(response.read())\n",
    "        f.close()\n",
    "voucher_df = pd.read_stata(\"colombia_voucher.dta\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Summary Statistics"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "==============================================================================\n",
      "                              OVERALL SUMMARY\n",
      "==============================================================================\n",
      "                id   won_lottry         male     base_age    finish8th  \\\n",
      "count  1171.000000  1171.000000  1171.000000  1171.000000  1171.000000   \n",
      "mean   1357.010248     0.505551     0.504697    12.004270     0.681469   \n",
      "std     890.711584     0.500183     0.500192     1.347038     0.466106   \n",
      "min       3.000000     0.000000     0.000000     7.000000     0.000000   \n",
      "25%     616.000000     0.000000     0.000000    11.000000     0.000000   \n",
      "50%    1280.000000     1.000000     1.000000    12.000000     1.000000   \n",
      "75%    1982.500000     1.000000     1.000000    13.000000     1.000000   \n",
      "max    4030.000000     1.000000     1.000000    17.000000     1.000000   \n",
      "\n",
      "       use_fin_aid  \n",
      "count  1171.000000  \n",
      "mean      0.581554  \n",
      "std       0.493515  \n",
      "min       0.000000  \n",
      "25%       0.000000  \n",
      "50%       1.000000  \n",
      "75%       1.000000  \n",
      "max       1.000000  \n",
      "==============================================================================\n",
      "                         LOTTERY = 0\n",
      "==============================================================================\n",
      "                id  won_lottry        male    base_age   finish8th  \\\n",
      "count   579.000000         579  579.000000  579.000000  579.000000   \n",
      "mean   1460.998273           0    0.504318   12.036269    0.625216   \n",
      "std     960.839468           0    0.500414    1.351814    0.484486   \n",
      "min       4.000000           0    0.000000    7.000000    0.000000   \n",
      "25%     650.500000           0    0.000000   11.000000    0.000000   \n",
      "50%    1392.000000           0    1.000000   12.000000    1.000000   \n",
      "75%    2122.500000           0    1.000000   13.000000    1.000000   \n",
      "max    4030.000000           0    1.000000   16.000000    1.000000   \n",
      "\n",
      "       use_fin_aid  \n",
      "count   579.000000  \n",
      "mean      0.240069  \n",
      "std       0.427495  \n",
      "min       0.000000  \n",
      "25%       0.000000  \n",
      "50%       0.000000  \n",
      "75%       0.000000  \n",
      "max       1.000000  \n",
      "==============================================================================\n",
      "                         LOTTERY = 1\n",
      "==============================================================================\n",
      "                id  won_lottry        male    base_age   finish8th  \\\n",
      "count   592.000000         592  592.000000  592.000000  592.000000   \n",
      "mean   1255.305743           1    0.505068   11.972973    0.736486   \n",
      "std     804.217066           0    0.500397    1.342755    0.440911   \n",
      "min       3.000000           1    0.000000    9.000000    0.000000   \n",
      "25%     578.750000           1    0.000000   11.000000    0.000000   \n",
      "50%    1210.000000           1    1.000000   12.000000    1.000000   \n",
      "75%    1707.250000           1    1.000000   13.000000    1.000000   \n",
      "max    4006.000000           1    1.000000   17.000000    1.000000   \n",
      "\n",
      "       use_fin_aid  \n",
      "count   592.000000  \n",
      "mean      0.915541  \n",
      "std       0.278311  \n",
      "min       0.000000  \n",
      "25%       1.000000  \n",
      "50%       1.000000  \n",
      "75%       1.000000  \n",
      "max       1.000000  \n"
     ]
    }
   ],
   "source": [
    "print \"==============================================================================\"\n",
    "print \"                              OVERALL SUMMARY\"\n",
    "print \"==============================================================================\"\n",
    "\n",
    "print voucher_df.describe()\n",
    "\n",
    "for i in range(2):\n",
    "    print \"==============================================================================\"\n",
    "    print \"                         LOTTERY = %(i)d\" % {\"i\":i}\n",
    "    print \"==============================================================================\"\n",
    "    print voucher_df[voucher_df['won_lottry']==i].describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Two-stage Least Squares Logistic Regression"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    " If you're interested to learn more on the rationale and process for doing this kind of analysis, Murnane and Willett introduce instrumental variables estimation(IVE) as a method for carving out causal claims from observational data ([chapter summary](http://acawiki.org/Introducing_Instrumental-Variables_Estimation)) ([example code](http://nbviewer.ipython.org/github/natematias/research_in_python/blob/master/instrumental_variables_estimation/Instrumental-Variables%20Estimation.ipynb)). \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "==============================================================================\n",
      "                                  FIRST STAGE\n",
      "==============================================================================\n",
      "                 Generalized Linear Model Regression Results                  \n",
      "==============================================================================\n",
      "Dep. Variable:            use_fin_aid   No. Observations:                 1171\n",
      "Model:                            GLM   Df Residuals:                     1167\n",
      "Model Family:                Binomial   Df Model:                            3\n",
      "Link Function:                  logit   Scale:                             1.0\n",
      "Method:                          IRLS   Log-Likelihood:                -488.00\n",
      "Date:                Thu, 19 Mar 2015   Deviance:                       975.99\n",
      "Time:                        23:08:46   Pearson chi2:                 1.16e+03\n",
      "No. Iterations:                     7                                         \n",
      "==============================================================================\n",
      "                 coef    std err          z      P>|z|      [95.0% Conf. Int.]\n",
      "------------------------------------------------------------------------------\n",
      "Intercept      0.3455      0.731      0.472      0.637        -1.088     1.779\n",
      "won_lottry     3.5514      0.178     19.934      0.000         3.202     3.901\n",
      "male          -0.1622      0.164     -0.992      0.321        -0.483     0.158\n",
      "base_age      -0.1184      0.061     -1.946      0.052        -0.238     0.001\n",
      "==============================================================================\n",
      "\n",
      "\n",
      "==============================================================================\n",
      "                                  SECOND STAGE\n",
      "==============================================================================\n",
      "                 Generalized Linear Model Regression Results                  \n",
      "==============================================================================\n",
      "Dep. Variable:              finish8th   No. Observations:                 1171\n",
      "Model:                            GLM   Df Residuals:                     1167\n",
      "Model Family:                Binomial   Df Model:                            3\n",
      "Link Function:                  logit   Scale:                             1.0\n",
      "Method:                          IRLS   Log-Likelihood:                -696.65\n",
      "Date:                Thu, 19 Mar 2015   Deviance:                       1393.3\n",
      "Time:                        23:08:46   Pearson chi2:                 1.17e+03\n",
      "No. Iterations:                     6                                         \n",
      "======================================================================================\n",
      "                         coef    std err          z      P>|z|      [95.0% Conf. Int.]\n",
      "--------------------------------------------------------------------------------------\n",
      "Intercept              4.0756      0.604      6.753      0.000         2.893     5.258\n",
      "use_fin_aid_fitted     0.7743      0.192      4.036      0.000         0.398     1.150\n",
      "male                  -0.4175      0.130     -3.208      0.001        -0.673    -0.162\n",
      "base_age              -0.2919      0.048     -6.077      0.000        -0.386    -0.198\n",
      "======================================================================================\n"
     ]
    }
   ],
   "source": [
    "print \"==============================================================================\"\n",
    "print \"                                  FIRST STAGE\"\n",
    "print \"==============================================================================\"\n",
    "result = smf.glm(formula = \"use_fin_aid ~ won_lottry + male + base_age\", \n",
    "                 data=voucher_df,\n",
    "                 family=sm.families.Binomial()).fit()\n",
    "voucher_df['use_fin_aid_fitted']= result.predict()\n",
    "print result.summary()\n",
    "\n",
    "print\n",
    "print\n",
    "print \"==============================================================================\"\n",
    "print \"                                  SECOND STAGE\"\n",
    "print \"==============================================================================\"#\n",
    "result = smf.glm(formula = \" finish8th ~ use_fin_aid_fitted + male + base_age\", \n",
    "                 data=voucher_df,\n",
    "                 family=sm.families.Binomial()).fit()\n",
    "print result.summary()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Interpreting the Local Average Treatment Effect "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When we use IVE to \"recover\" the treatment effect, how can we actually describe the results? According to Murnane and Willett, \"an estimate of a treatment effect obtained by IV methods should be regarded as an estimated *local average treatment effect* (LATE). The chapter walks readers through the kinds of groups involved:\n",
    "\n",
    "<table>\n",
    "<thead>\n",
    "<tr style=\"border:2px solid #666;\"><td>**won_lottery = 1**</td><td>**won_lottery = 0**</td><td></td></tr>\n",
    "</thead>\n",
    "<tr style=\"background:#bbb\"><td>use_fin_aid=1<br>(*used financial aid form some source*)</td><td>use_fin_aid=0<br>(*did not use financial aid from any source*)</td><td style=\"background:#fff\">\"**Compliers**\"</td></tr>\n",
    "<tr><td>use_fin_aid=1<br>(used financial aid from some source)</td><td>use_fin_aid=1<br>(used financial aid from some source)</td><td style=\"background:#fff\">\"**Always-Takers**\"</td></tr>\n",
    "<tr style=\"background:#ddd\"><td>use_fin_aid=0<br>(did not use financial aid from any source)</td><td>use_fin_aid=0<br>(did not use financial aid from any source)</td><td style=\"background:#fff\">\"**Never-Takers**\"</td></tr>\n",
    "</table>\n",
    "\n",
    "Murnane and Willett offer a model that distinguishes among groups based on their compliance with \"the intent of the lottery\" (277), based on a paper by Angrist, Imbens and Rubin on \"[Identiﬁcation of Causal Effects Using Instrumental Variables](http://business.baylor.edu/scott_cunningham/teaching/angrist-imbens-and-rubin.pdf)\" (1996):\n",
    "* *Compliers* \"are willing to have their behavior determined by the outcomes of the lottery, regardless of the particular experimental conditions to which they were assigned\" (278).\n",
    "* *Always-Takers* \"are families who will find and make use of financial aid to pay private-school fees\" regardless of the lottery. They may find aid outside the lottery\n",
    "* *Never-takers* are the mirror image of always-takers: \"they will not make use of financial aid to pay childrens' fees at a private secondary school under any circumstances\" (278)\n",
    "* (there are other possible groups, like \"defiers\" (Gennetian et all, 2005) who always do the opposite of what investigators ask them to do, but we make the assumption of \"no defiers\" in this dataset)\n",
    "\n",
    "In this context, IV estimates of the **local average treatment effect** (LATE) for this quasi-experiment only applies to \"compliers\"--and not to never-takers or always-takers. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}