{ "metadata": { "name": "linear_models" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook introduces the use of pandas and the formula framework in statsmodels in the context of linear modeling." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**It is based heavily on Jonathan Taylor's [class notes that use R](http://www.stanford.edu/class/stats191/interactions.html)**" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import matplotlib.pyplot as plt\n", "import pandas\n", "import numpy as np\n", "\n", "from statsmodels.formula.api import ols\n", "from statsmodels.graphics.api import interaction_plot, abline_plot, qqplot\n", "from statsmodels.stats.api import anova_lm" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Example 1: IT salary data" ] }, { "cell_type": "raw", "metadata": {}, "source": [ "Outcome: S, salaries for IT staff in a corporation\n", "Predictors: X, experience in years\n", " M, managment, 2 levels, 0=non-management, 1=management\n", " E, education, 3 levels, 1=Bachelor's, 2=Master's, 3=Ph.D" ] }, { "cell_type": "code", "collapsed": false, "input": [ "url = 'http://stats191.stanford.edu/data/salary.table'\n", "salary_table = pandas.read_table(url) # needs pandas 0.7.3\n", "salary_table.to_csv('salary.table', index=False)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "print salary_table.head(10)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "E = salary_table.E # Education\n", "M = salary_table.M # Management\n", "X = salary_table.X # Experience\n", "S = salary_table.S # Salary" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's explore the data" ] }, { "cell_type": "code", "collapsed": false, "input": [ "fig = plt.figure(figsize=(10,8))\n", "ax = fig.add_subplot(111, xlabel='Experience', ylabel='Salary',\n", " xlim=(0, 27), ylim=(9600, 28800))\n", "symbols = ['D', '^']\n", "man_label = [\"Non-Mgmt\", \"Mgmt\"]\n", "educ_label = [\"Bachelors\", \"Masters\", \"PhD\"]\n", "colors = ['r', 'g', 'blue']\n", "factor_groups = salary_table.groupby(['E','M'])\n", "for values, group in factor_groups:\n", " i,j = values\n", " label = \"%s - %s\" % (man_label[j], educ_label[i-1])\n", " ax.scatter(group['X'], group['S'], marker=symbols[j], color=colors[i-1],\n", " s=350, label=label)\n", "ax.legend(scatterpoints=1, markerscale=.7, labelspacing=1);" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Fit a linear model\n", "\n", "$$S_i = \\beta_0 + \\beta_1X_i + \\beta_2E_{i2} + \\beta_3E_{i3} + \\beta_4M_i + \\epsilon_i$$\n", "\n", "where\n", "\n", "$$ E_{i2}=\\cases{1,&if $E_i=2$;\\cr 0,&otherwise. \\cr}$$ \n", "$$ E_{i3}=\\cases{1,&if $E_i=3$;\\cr 0,&otherwise. \\cr}$$ \n" ] }, { "cell_type": "code", "collapsed": false, "input": [ "formula = 'S ~ C(E) + C(M) + X'\n", "lm = ols(formula, salary_table).fit()\n", "print lm.summary()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Aside: Contrasts (see contrasts notebook)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Look at the design matrix created for us. 
Every results instance has a reference to the model." ] }, { "cell_type": "code", "collapsed": false, "input": [ "lm.model.exog[:10]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since we initially passed in a DataFrame, we have a transformed DataFrame available." ] }, { "cell_type": "code", "collapsed": false, "input": [ "print lm.model.data.orig_exog.head(10)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is a reference to the original untouched data in" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print lm.model.data.frame.head(10)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you use the formula interface, statsmodels remembers this transformation. Say you want to know the predicted salary for someone with 12 years experience and a Master's degree who is in a management position" ] }, { "cell_type": "code", "collapsed": false, "input": [ "lm.predict({'X' : [12], 'M' : [1], 'E' : [2]})" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So far we've assumed that the effect of experience is the same for each level of education and professional role.\n", "Perhaps this assumption isn't merited. We can formally test this using some interactions." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can start by seeing if our model assumptions are met. Let's look at a residuals plot." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And some formal tests" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plot the residuals within the groups separately." ] }, { "cell_type": "code", "collapsed": false, "input": [ "resid = lm.resid" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "fig = plt.figure(figsize=(12,8))\n", "xticks = []\n", "ax = fig.add_subplot(111, xlabel='Group (E, M)', ylabel='Residuals')\n", "for values, group in factor_groups:\n", " i,j = values\n", " xticks.append(str((i, j)))\n", " group_num = i*2 + j - 1 # for plotting purposes\n", " x = [group_num] * len(group)\n", " ax.scatter(x, resid[group.index], marker=symbols[j], color=colors[i-1],\n", " s=144, edgecolors='black')\n", "ax.set_xticks([1,2,3,4,5,6])\n", "ax.set_xticklabels(xticks)\n", "ax.axis('tight');\n" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Add an interaction between salary and experience, allowing different intercepts for level of experience.\n", "\n", "$$S_i = \\beta_0+\\beta_1X_i+\\beta_2E_{i2}+\\beta_3E_{i3}+\\beta_4M_i+\\beta_5E_{i2}X_i+\\beta_6E_{i3}X_i+\\epsilon_i$$" ] }, { "cell_type": "code", "collapsed": false, "input": [ "interX_lm = ols('S ~ C(E)*X + C(M)', salary_table).fit()\n", "print interX_lm.summary()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Test that $\\beta_5 = \\beta_6 = 0$. We can use anova_lm or we can use an F-test." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "print anova_lm(lm, interX_lm)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "print interX_lm.f_test('C(E)[T.2]:X = C(E)[T.3]:X = 0')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "print interX_lm.f_test([[0,0,0,0,0,1,-1],[0,0,0,0,0,0,1]])" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The contrasts are created here under the hood by patsy." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Recall that F-tests are of the form $R\\beta = q$" ] }, { "cell_type": "code", "collapsed": false, "input": [ "LC = interX_lm.model.data.orig_exog.design_info.linear_constraint('C(E)[T.2]:X = C(E)[T.3]:X = 0')\n", "print LC.coefs\n", "print LC.constants" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Interact education with management" ] }, { "cell_type": "code", "collapsed": false, "input": [ "interM_lm = ols('S ~ X + C(E)*C(M)', salary_table).fit()\n", "print interM_lm.summary()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "print anova_lm(lm, interM_lm)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "infl = interM_lm.get_influence()\n", "resid = infl.resid_studentized_internal" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "fig = plt.figure(figsize=(12,8))\n", "ax = fig.add_subplot(111, xlabel='X', ylabel='standardized resids')\n", "\n", "for values, group in factor_groups:\n", " i,j = values\n", " idx = group.index\n", " ax.scatter(X[idx], resid[idx], marker=symbols[j], color=colors[i-1],\n", " s=144, edgecolors='black')\n", "ax.axis('tight');" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There looks to be an outlier." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "outl = interM_lm.outlier_test('fdr_bh')\n", "outl.sort('unadj_p', inplace=True)\n", "print outl" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "idx = salary_table.index.drop(32)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "print idx" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "lm32 = ols('S ~ C(E) + X + C(M)', data=salary_table, subset=idx).fit()\n", "print lm32.summary()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "interX_lm32 = ols('S ~ C(E) * X + C(M)', data=salary_table, subset=idx).fit()\n", "print interX_lm32.summary()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "table3 = anova_lm(lm32, interX_lm32)\n", "print table3" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "interM_lm32 = ols('S ~ X + C(E) * C(M)', data=salary_table, subset=idx).fit()\n", "print anova_lm(lm32, interM_lm32)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Re-plotting the residuals" ] }, { "cell_type": "code", "collapsed": false, "input": [ "resid = interM_lm32.get_influence().summary_frame()['standard_resid']\n", "fig = plt.figure(figsize=(12,8))\n", "ax = fig.add_subplot(111, xlabel='X[~[32]]', ylabel='standardized resids')\n", "\n", "for values, group in factor_groups:\n", " i,j = values\n", " idx = group.index\n", " ax.scatter(X[idx], resid[idx], marker=symbols[j], color=colors[i-1],\n", " s=144, edgecolors='black')\n", "ax.axis('tight');" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A final plot of the fitted values" ] }, { "cell_type": "code", "collapsed": false, "input": [ "lm_final = ols('S ~ X + C(E)*C(M)', data=salary_table.drop([32])).fit()\n", "mf = lm_final.model.data.orig_exog\n", "lstyle = ['-','--']\n", "\n", "fig = plt.figure(figsize=(12,8))\n", "ax = fig.add_subplot(111, xlabel='Experience', ylabel='Salary')\n", "\n", "for values, group in factor_groups:\n", " i,j = values\n", " idx = group.index\n", " ax.scatter(X[idx], S[idx], marker=symbols[j], color=colors[i-1],\n", " s=144, edgecolors='black')\n", " # drop NA because there is no idx 32 in the final model\n", " ax.plot(mf.X[idx].dropna(), lm_final.fittedvalues[idx].dropna(),\n", " ls=lstyle[j], color=colors[i-1])\n", "ax.axis('tight');" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "raw", "metadata": {}, "source": [ "From our first look at the data, the difference between Master's and PhD in the management group is different than in the non-management group. This is an interaction between the two qualitative variables management, M and education, E. We can visualize this by first removing the effect of experience, then plotting the means within each of the 6 groups using interaction.plot." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "U = S - X * interX_lm32.params['X']\n", "U.name = 'Salary|X'\n", "\n", "fig = plt.figure(figsize=(12,8))\n", "ax = fig.add_subplot(111)\n", "ax = interaction_plot(E, M, U, colors=['red','blue'], markers=['^','D'],\n", " markersize=10, ax=ax)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Minority Employment Data - ABLine plotting" ] }, { "cell_type": "raw", "metadata": {}, "source": [ "TEST - Job Aptitude Test Score\n", "ETHN - 1 if minority, 0 otherwise\n", "JPERF - Job performance evaluation" ] }, { "cell_type": "code", "collapsed": false, "input": [ "try:\n", " minority_table = pandas.read_table('minority.table')\n", "except: # don't have data already\n", " url = 'http://stats191.stanford.edu/data/minority.table'\n", " minority_table = pandas.read_table(url)\n", " minority_table.to_csv('minority.table', sep=\"\\t\", index=False)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "factor_group = minority_table.groupby(['ETHN'])\n", "\n", "fig = plt.figure(figsize=(12,8))\n", "ax = fig.add_subplot(111, xlabel='TEST', ylabel='JPERF')\n", "colors = ['purple', 'green']\n", "markers = ['o', 'v']\n", "for factor, group in factor_group:\n", " ax.scatter(group['TEST'], group['JPERF'], color=colors[factor],\n", " marker=markers[factor], s=12**2)\n", "ax.legend(['ETHN == 1', 'ETHN == 0'], scatterpoints=1)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "min_lm = ols('JPERF ~ TEST', data=minority_table).fit()\n", "print min_lm.summary()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "fig = plt.figure(figsize=(12,8))\n", "ax = fig.add_subplot(111, xlabel='TEST', ylabel='JPERF')\n", "for factor, group in factor_group:\n", " ax.scatter(group['TEST'], group['JPERF'], color=colors[factor],\n", " marker=markers[factor], s=12**2)\n", "ax.legend(['ETHN == 1', 'ETHN == 0'], scatterpoints=1, loc='upper left')\n", "fig = abline_plot(model_results = min_lm, ax=ax)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "min_lm2 = ols('JPERF ~ TEST + TEST:ETHN', data=minority_table).fit()\n", "print min_lm2.summary()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "fig = plt.figure(figsize=(12,8))\n", "ax = fig.add_subplot(111, xlabel='TEST', ylabel='JPERF')\n", "for factor, group in factor_group:\n", " ax.scatter(group['TEST'], group['JPERF'], color=colors[factor],\n", " marker=markers[factor], s=12**2)\n", "\n", "fig = abline_plot(intercept = min_lm2.params['Intercept'],\n", " slope = min_lm2.params['TEST'], ax=ax, color='purple')\n", "ax = fig.axes[0]\n", "fig = abline_plot(intercept = min_lm2.params['Intercept'],\n", " slope = min_lm2.params['TEST'] + min_lm2.params['TEST:ETHN'],\n", " ax=ax, color='green')\n", "ax.legend(['ETHN == 1', 'ETHN == 0'], scatterpoints=1, loc='upper left');" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "min_lm3 = ols('JPERF ~ TEST + ETHN', data=minority_table).fit()\n", "print min_lm3.summary()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "fig = plt.figure(figsize=(12,8))\n", "ax = fig.add_subplot(111, 
xlabel='TEST', ylabel='JPERF')\n", "for factor, group in factor_group:\n", " ax.scatter(group['TEST'], group['JPERF'], color=colors[factor],\n", " marker=markers[factor], s=12**2)\n", "\n", "fig = abline_plot(intercept = min_lm3.params['Intercept'],\n", " slope = min_lm3.params['TEST'], ax=ax, color='purple')\n", "\n", "ax = fig.axes[0]\n", "fig = abline_plot(intercept = min_lm3.params['Intercept'] + min_lm3.params['ETHN'],\n", " slope = min_lm3.params['TEST'], ax=ax, color='green')\n", "ax.legend(['ETHN == 1', 'ETHN == 0'], scatterpoints=1, loc='upper left');" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "min_lm4 = ols('JPERF ~ TEST * ETHN', data=minority_table).fit()\n", "print min_lm4.summary()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "fig = plt.figure(figsize=(12,8))\n", "ax = fig.add_subplot(111, ylabel='JPERF', xlabel='TEST')\n", "for factor, group in factor_group:\n", " ax.scatter(group['TEST'], group['JPERF'], color=colors[factor],\n", " marker=markers[factor], s=12**2)\n", "\n", "fig = abline_plot(intercept = min_lm4.params['Intercept'],\n", " slope = min_lm4.params['TEST'], ax=ax, color='purple')\n", "ax = fig.axes[0]\n", "fig = abline_plot(intercept = min_lm4.params['Intercept'] + min_lm4.params['ETHN'],\n", " slope = min_lm4.params['TEST'] + min_lm4.params['TEST:ETHN'],\n", " ax=ax, color='green')\n", "ax.legend(['ETHN == 1', 'ETHN == 0'], scatterpoints=1, loc='upper left');" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Is there any effect of ETHN on slope or intercept?\n", "
<br />\n", "Y ~ TEST vs. Y ~ TEST + ETHN + ETHN:TEST" ] }, { "cell_type": "code", "collapsed": false, "input": [ "table5 = anova_lm(min_lm, min_lm4)\n", "print table5" ], "language": "python", "metadata": {}, "outputs": [] }
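, { "cell_type": "markdown", "metadata": {}, "source": [ "The comparison above can also be phrased as a joint F-test on the interaction model. This is a sketch, not part of the original notebook; it relies on the default patsy parameter names ETHN and TEST:ETHN, which also appear in the plotting cells above." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Joint test that the ETHN intercept shift and slope shift are both zero;\n", "# the F statistic should match the anova_lm comparison of min_lm and min_lm4.\n", "print min_lm4.f_test('ETHN = TEST:ETHN = 0')" ], "language": "python", "metadata": {}, "outputs": [] }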
, { "cell_type": "markdown", "metadata": {}, "source": [ "Is there any effect of ETHN on intercept?\n", "<br />\n", "Y ~ TEST vs. Y ~ TEST + ETHN" ] }, { "cell_type": "code", "collapsed": false, "input": [ "table6 = anova_lm(min_lm, min_lm3)\n", "print table6" ], "language": "python", "metadata": {}, "outputs": [] }
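, { "cell_type": "markdown", "metadata": {}, "source": [ "Since this comparison adds a single parameter, the F statistic from anova_lm should equal the squared t statistic on ETHN in min_lm3. A quick sanity check (not part of the original notebook):" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# One-degree-of-freedom comparison: F equals t squared for the added ETHN term\n", "print min_lm3.tvalues['ETHN']**2" ], "language": "python", "metadata": {}, "outputs": [] }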
, { "cell_type": "markdown", "metadata": {}, "source": [ "Is there any effect of ETHN on slope?\n", "<br />\n", "Y ~ TEST vs. Y ~ TEST + ETHN:TEST" ] }, { "cell_type": "code", "collapsed": false, "input": [ "table7 = anova_lm(min_lm, min_lm2)\n", "print table7" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Is it just the slope or both?\n", "
<br />\n", "Y ~ TEST + ETHN:TEST vs Y ~ TEST + ETHN + ETHN:TEST" ] }, { "cell_type": "code", "collapsed": false, "input": [ "table8 = anova_lm(min_lm2, min_lm4)\n", "print table8" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Two Way ANOVA - Kidney failure data" ] }, { "cell_type": "raw", "metadata": {}, "source": [ "Weight - (1,2,3) - Level of weight gain between treatments\n", "Duration - (1,2) - Level of duration of treatment\n", "Days - Time of stay in hospital" ] }, { "cell_type": "code", "collapsed": false, "input": [ "try:\n", "    kidney_table = pandas.read_table('kidney.table')\n", "except:  # don't have the data locally yet\n", "    url = 'http://stats191.stanford.edu/data/kidney.table'\n", "    kidney_table = pandas.read_table(url, delimiter=\" *\")\n", "    kidney_table.to_csv(\"kidney.table\", sep=\"\\t\", index=False)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# Explore the dataset; it's a balanced design\n", "print kidney_table.groupby(['Weight', 'Duration']).size()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "kt = kidney_table\n", "fig = plt.figure(figsize=(10,8))\n", "ax = fig.add_subplot(111)\n", "fig = interaction_plot(kt['Weight'], kt['Duration'], np.log(kt['Days']+1),\n", "                       colors=['red', 'blue'], markers=['D','^'], ms=10, ax=ax)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$Y_{ijk} = \\mu + \\alpha_i + \\beta_j + \\left(\\alpha\\beta\\right)_{ij}+\\epsilon_{ijk}$$\n", "\n", "with \n", "\n", "$$\\epsilon_{ijk}\\sim N\\left(0,\\sigma^2\\right)$$" ] }, { "cell_type": "code", "collapsed": false, "input": [ "help(anova_lm)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Anything available in the calling namespace is also available in the formula evaluation namespace." ] }, { "cell_type": "code", "collapsed": false, "input": [ "kidney_lm = ols('np.log(Days+1) ~ C(Duration) * C(Weight)', data=kt).fit()" ], "language": "python", "metadata": {}, "outputs": [] }
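, { "cell_type": "markdown", "metadata": {}, "source": [ "For example, a function defined in the notebook namespace can be used directly inside a formula. The helper below (log_plus_one) is illustrative, not part of the original analysis; it simply reproduces the np.log(Days+1) transform used above." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# A user-defined function from the calling namespace, used inside the formula\n", "def log_plus_one(x):\n", "    return np.log(x + 1)\n", "\n", "print ols('log_plus_one(Days) ~ C(Duration) * C(Weight)', data=kt).fit().params" ], "language": "python", "metadata": {}, "outputs": [] }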
, { "cell_type": "markdown", "metadata": {}, "source": [ "ANOVA Type-I Sum of Squares\n", "<br /><br />\n", "SS(A) for factor A. <br />\n", "SS(B|A) for factor B. <br />\n", "SS(AB|B, A) for interaction AB. <br />
" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print anova_lm(kidney_lm)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "ANOVA Type-II Sum of Squares\n", "
<br /><br />\n", "SS(A|B) for factor A. <br />\n", "SS(B|A) for factor B. <br />
" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print anova_lm(kidney_lm, typ=2)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "raw", "metadata": {}, "source": [ "ANOVA Type-III Sum of Squares\n", "
<br /><br />\n", "SS(A|B, AB) for factor A. <br />\n", "SS(B|A, AB) for factor B. <br />" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print anova_lm(ols('np.log(Days+1) ~ C(Duration, Sum) * C(Weight, Poly)',\n", "               data=kt).fit(), typ=3)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "Exercise: Find the 'best' model for the kidney failure dataset" ] } ], "metadata": {} } ] }