{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Analysis of categorical data\n",
    "\n",
    "- Analysis of one proportion\n",
    "- Chi-square test\n",
    "- Fisher exact test\n",
    "- Cochran's Q test\n",
    "- McNemar test\n",
    "\n",
    "Author:  Thomas Haslwanter, Date:    Feb-2017"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import scipy.stats as stats"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Analysis of one proportion\n",
    "\n",
    "Calculate the confidence intervals of the population, based on a given data sample.\n",
    "\n",
    "*Suppose a general practitioner chooses a random sample of 215 women from\n",
    "the patient register for her general practice, and finds that 39 of them\n",
    "have a history of suffering from asthma. What is the confidence interval\n",
    "for the prevalence of asthma?*  (The data are taken from Altman, chapter 10.2.1:)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "ONE PROPORTION\n",
      "The confidence interval for the given sample is 0.130 to 0.233\n"
     ]
    }
   ],
   "source": [
    "# Get the data\n",
    "numTotal = 215\n",
    "numPositive = 39\n",
    "\n",
    "# Calculate the confidence intervals\n",
    "p = float(numPositive)/numTotal\n",
    "se = np.sqrt(p*(1-p)/numTotal)\n",
    "td = stats.t(numTotal-1)\n",
    "ci = p + np.array([-1,1])*td.isf(0.025)*se\n",
    "\n",
    "# Print them\n",
    "print('ONE PROPORTION')\n",
    "print('The confidence interval for the given sample is {0:5.3f} to {1:5.3f}'.format(\n",
    "    ci[0], ci[1]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Chi-square test to a 2x2 table"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "Data are taken from Altman, Table 10.10:\n",
    "\n",
    "*Comparison of number of hours swimming by swimmers with or without erosion of dental enamel:*\n",
    "    >= 6h: 32 yes, 118 no\n",
    "    <  6h: 17 yes, 127 no\n",
    "    \n",
    "The calculations are done with and without Yate's continuity\n",
    "correction."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CHI SQUARE\n",
      "The corrected chi2 value is 4.141, with p=0.042\n",
      "The uncorrected chi2 value is 4.802, with p=0.028\n"
     ]
    }
   ],
   "source": [
    "# Enter the data\n",
    "obs = np.array([[32, 118], [17, 127]])\n",
    "\n",
    "# Calculate the chi-square test\n",
    "chi2_corrected = stats.chi2_contingency(obs, correction=True)\n",
    "chi2_uncorrected = stats.chi2_contingency(obs, correction=False)\n",
    "\n",
    "# Print the result\n",
    "print('CHI SQUARE')\n",
    "print('The corrected chi2 value is {0:5.3f}, with p={1:5.3f}'.format(chi2_corrected[0], chi2_corrected[1]))\n",
    "print('The uncorrected chi2 value is {0:5.3f}, with p={1:5.3f}'.format(chi2_uncorrected[0], chi2_uncorrected[1]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Fisher's Exact Test"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "*Spectacle wearing among juvenile delinquensts and non-delinquents who failed a vision test*\n",
    "\n",
    "- Spectecle wearers: 1 delinquent, 5 non-delinquents\n",
    "- non-spectacle wearers: 8 delinquents, 2 non-delinquents'''\n",
    "\n",
    "(Data are taken from Altman, Table 10.14)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The probability of obtaining a distribution at least as extreme as the one that was actually observed, assuming that the null hypothesis is true, is: 0.035.\n"
     ]
    }
   ],
   "source": [
    "# Enter the data\n",
    "obs = np.array([[1,5], [8,2]])\n",
    "\n",
    "# Calculate the Fisher Exact Test\n",
    "fisher_result = stats.fisher_exact(obs)\n",
    "\n",
    "# Print the result\n",
    "print('The probability of obtaining a distribution at least as extreme '\n",
    "+ 'as the one that was actually observed, assuming that the null ' +\n",
    "    'hypothesis is true, is: {0:5.3f}.'.format(fisher_result[1]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Cochran's Q test"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "*12 subjects are asked to perform 3 tasks. The outcome of each task is \"success\" or \n",
    "\"failure\". The results are coded 0 for failure and 1 for success. In the example, subject 1 was successful\n",
    "in task 2, but failed tasks 1 and 3. Is there a difference between the performance on the three tasks?*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Q = 8.667, p = 0.013\n",
      "There is a significant difference between the three tasks.\n"
     ]
    }
   ],
   "source": [
    "from statsmodels.sandbox.stats.runs import cochrans_q\n",
    "import pandas as pd\n",
    "\n",
    "tasks = np.array([[0,1,1,0,1,0,0,1,0,0,0,0],\n",
    "                  [1,1,1,0,0,1,0,1,1,1,1,1],\n",
    "                  [0,0,1,0,0,1,0,0,0,0,0,0]])\n",
    "\n",
    "# I prefer a DataFrame here, as it indicates directly what the values mean\n",
    "df = pd.DataFrame(tasks.T, columns = ['Task1', 'Task2', 'Task3'])\n",
    "\n",
    "# --- >>> START stats <<< ---\n",
    "(Q, pVal) = cochrans_q(df)\n",
    "# --- >>> STOP stats <<< ---\n",
    "\n",
    "print('Q = {0:5.3f}, p = {1:5.3f}'.format(Q, pVal))\n",
    "if pVal < 0.05:\n",
    "    print(\"There is a significant difference between the three tasks.\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## McNemar test"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "*McNemars Test should be run in the \"exact\" version, even though approximate formulas are\n",
    "typically given in the lecture scripts. Just ignore the statistic that is returned, because\n",
    "it is different for the two options.*\n",
    "\n",
    "*In the following example, a researcher attempts to determine if a drug has an effect on a\n",
    "particular disease. Counts of individuals are given in the table, with the diagnosis\n",
    "(disease: present or absent) before treatment given in the rows, and the diagnosis\n",
    "after treatment in the columns. The test requires the same subjects to be included in\n",
    "the before-and-after measurements (matched pairs).*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "p = 4.434e-06\n",
      "There was a significant change in the disease by the treatment.\n"
     ]
    }
   ],
   "source": [
    "from statsmodels.sandbox.stats.runs import mcnemar\n",
    "\n",
    "f_obs = np.array([[101, 121],[59, 33]])\n",
    "(statistic, pVal) = mcnemar(f_obs)\n",
    "\n",
    "print('p = {0:5.3e}'.format(pVal))\n",
    "if pVal < 0.05:\n",
    "    print(\"There was a significant change in the disease by the treatment.\") "
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Slideshow",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}