{
 "metadata": {
  "name": "",
  "signature": "sha256:6cf3ee9e9432c59a53fa4242f34c0275b9073842a6f526c20c0f9b078d7f251f"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "# Homework 4: Can you predict the Midterm Elections?\n",
      "\n",
      "Due: Monday, November 3, 2014 11:59 PM\n",
      "\n",
      "<a href=https://raw.githubusercontent.com/cs109/2014/master/homework/HW4.ipynb download=HW4.ipynb> Download this assignment</a>\n",
      "\n",
      "#### Submission Instructions\n",
      "To submit your homework, create a folder named lastname_firstinitial_hw# and place your IPython notebooks, data files, and any other files in this folder. Your IPython Notebooks should be completely executed with the results visible in the notebook. We should not have to run any code. Compress the folder (please use .zip compression) and submit to the CS109 dropbox in the appropriate folder. If we cannot access your work because these directions are not followed correctly, we will not grade your work. For the competition (problem 4), we will post a link on Piazza to a Google Form for you to submit your predictions. \n",
      "\n",
      "\n",
      "---\n"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Introduction\n",
      "\n",
      "**Add Introduction**\n",
      "\n",
      "You will use the [HuffPost Pollster API](http://elections.huffingtonpost.com/pollster/api) to extract the polls for the current 2014 Senate Midterm Elections and provide a final prediction of the result of each state.\n",
      "\n",
      "#### Data\n",
      "\n",
      "We will use the polls from the [2014 Senate Midterm Elections](http://elections.huffingtonpost.com/pollster) from the [HuffPost Pollster API](http://elections.huffingtonpost.com/pollster/api). \n",
      "\n",
      "---"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Problem 1: Data Wrangling\n",
      "\n",
      "We will read in the polls from the [2014 Senate Midterm Elections](http://elections.huffingtonpost.com/pollster) from the [HuffPost Pollster API](http://elections.huffingtonpost.com/pollster/api) and create a dictionary of DataFrames as well a master table information for each race."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#### Problem 1(a)\n",
      "\n",
      "Read in [this JSON object](http://elections.huffingtonpost.com/pollster/api/charts/?topic=2014-senate) containing the polls for the 2014 Senate Elections using the HuffPost API. Call this JSON object `info`.  This JSON object is imported as a list in Python where each element contains the information for one race.  Use the function `type` to confirm the that `info` is a list. "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "### Your code here ###"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 1
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#### Problem 1(b)\n",
      "\n",
      "For each element of the list in `info` extract the state. We should have one poll per state, but we do not. Why?\n",
      "\n",
      "**Hint**: Use the internet to find out information on the races in each state that has more than one entry. Eliminate entries of the list that represent races that are not happening."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "### Your code here ###"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 40
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "** Your answer here: **"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#### Problem 1(c)\n",
      "\n",
      "Create a dictionary of pandas DataFrames called `polls` keyed by the name of the election (a string). Each value in the dictionary should contain the polls for one of the races."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "### Your code here ###"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 41
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#### Problem 1(d)\n",
      "\n",
      "Now create a master table information containing information about each race. Create a pandas DataFrame called `candidates` with rows containing information about each race. The `candidates` DataFrame should have the following columns: \n",
      "\n",
      "1. `State` = the state where the race is being held\n",
      "2. `R` = name of republican candidate\n",
      "3. `D` = name of non-republican candidate (Democrate or Independent) \n",
      "4. `incumbent` = R, D or NA\n",
      "\n",
      "**Hint**: You will need a considerable amount of data wrangling for this."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "### Your code here ###"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 2
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Problem 2: Confidence Intervals\n",
      "\n",
      "Compute a 99% confidence interval for each state. "
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#### Problem 2(a)\n",
      "\n",
      "Assume you have $M$ polls with sample sizes $n_1, \\dots, n_M$. If the polls are independent, what is the average of the variances of each poll if the true proportion is $p$?"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "** Your answer here: **"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#### Problem 2(b)\n",
      "\n",
      "Compute the square root of these values in Problem 2(a) for the republican candidates in each race. Then, compute the standard deviations of the observed poll results for each race. "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "### Your code here ###"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#### Problem 2(c) \n",
      "\n",
      "Plot observed versus theoretical (average of the theoretical SDs) with the area of the point proportional to number of polls. How do these compare?"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "### Your code here ###"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "** Your answer here: **"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#### Problem 2(d)\n",
      "\n",
      "Repeat Problem 2(c) but include only the most recent polls from the last two months. Do they match better or worse or the same? Can we just trust the theoretical values?"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "### Your code here ###"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "** Your answer here: **"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#### Problem 2(e)\n",
      "\n",
      "Create a scatter plot with each point representing one state. Is there one or more races that are outlier in that it they have much larger variabilities than expected ? Explore the original poll data and explain why the discrepancy?"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "### Your code here ###"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "** Your answer here: **"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#### Problem 2(f)\n",
      "\n",
      "Construct confidence intervals for the difference in each race. Use either theoretical or data driven estimates of the standard error depending on your answer to this question. Use the results in Problem 2(e), to justify your choice.\n"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "### Your code here ###"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "# Problem 3: Prediction and Posterior Probabilities\n",
      "\n",
      "Perform a Bayesian analysis to predict the probability of Republicans winning in each state then provide a posterior distribution of the number of republicans in the senate."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#### Problem 3(a)\n",
      "\n",
      "First, we define a Bayesian model for each race. The prior for the difference $\\theta$ between republicans and democtrats will be $N(\\mu,\\tau^2)$. Say before seeing poll data you have no idea who is going to win, what should $\\mu$ be? How about $\\tau$, should it be large or small? "
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "** Your answer here: **"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#### Problem 3(b)\n",
      "\n",
      "What is the distribution of $d$ conditioned on $\\theta$. What is the posterior distribution of $\\theta | d$? \n",
      "\n",
      "**Hint**: Use normal approximation. "
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**Your answer here:**"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#### Problem 3(c)\n",
      "\n",
      "The prior represents what we think before hand. We do not know who is expected to win, so we assume $\\mu=0$. For this problem estimate $\\tau$ using the observed differences across states (Hint: $\\tau$ represents the standard deviation of a typical difference). Compute the posterior mean for each state and plot it against original average. Is there much change? Why or why not? "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "### Your code here ###"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**Your answer here:**"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#### Problem 3(d)\n",
      "\n",
      "For each state, report a probabilty of Republicans winning. How does your answer here compare to the other aggregators?"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "### Your code here ###"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**Your answer here:**"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#### Problem 3(e)\n",
      "\n",
      "Use the posterior distributions in a Monte Carlo simulation to generate election results. In each simulation compute the total number of seats the Republican control. Show a histogram of these results."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "### Your code here ###"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Problem 4\n",
      "\n",
      "Predict the results for the 2014 Midterm Elections. We will have a three competitions with the terms for scoring entries described above. For both questions below, **explain** or provide commentary on how you arrived at your predictions including code. \n",
      "\n",
      "**Hint**: Use election results from 2010, 2012 to build and test models."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#### Problem 4(a)\n",
      "\n",
      "Predict the number of Republican senators. You may provide an interval. Smallest interval that includes the election day result wins. \n",
      "\n",
      "**Note**: we want the total so add the numbers of those that are not up for election."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "### Your code here ###"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**Provide an explanation of methodology here**:"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#### Problem 4(b)\n",
      "\n",
      "Predict the R-D difference in each state. The predictions that minimize the residual sum of squares between predicted and observed differences wins."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "### Your code here ###"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**Provide an explanation of methodology here**:"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#### Problem 4(c)\n",
      "\n",
      "Report a confidence interval for the R-D difference in each state. If the election day result falls outside your confidence interval in more than two states you are eliminated. For those surviving this cutoff, we will add up the size of all confidence intervals and sum. The smallest total length of confidence interval wins. \n",
      "\n",
      "**Note**: you can use Bayesian credible intervals or whatever else you want. "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "### Your code here ###"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**Provide an explanation of methodology here**:"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "# Submission Instructions\n",
      "\n",
      "To submit your homework, create a folder named **lastname_firstinitial_hw#** and place your IPython notebooks, data files, and any other files in this folder. Your IPython Notebooks should be completely executed with the results visible in the notebook. We should not have to run any code.  Compress the folder (please use .zip compression) and submit to the CS109 dropbox in the appropriate folder. *If we cannot access your work because these directions are not followed correctly, we will not grade your work.*\n"
     ]
    }
   ],
   "metadata": {}
  }
 ]
}