{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<p class=\"title\">Confidence bounds for the mean of a bounded population: Binomial and Hypergeometric</p>\n",
    "\n",
    "## We will use basic combinatorics to find upper and lower confidence bounds for the mean of a bounded population.\n",
    "\n",
    "+ Elementary derivation for {0, 1} populations\n",
    "+ Obvious extension to 2-valued populations\n",
    "+ Thresholding for general bounded populations\n",
    "+ Trinomial and Multinomial bounds"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Comparison with the normal approximation\n",
    "Let's start by constructing two-sided confidence intervals for the same {0, 1} cases in which the Normal approximation did so badly.\n",
    "\n",
    "To find two-sided intervals, we find lower and upper confidence bounds at half the value of $\\alpha$. This is equivalent to constructing acceptance regions that are \"balanced\" in the sense that, under the null hypothesis, the chance of rejecting the hypothesis because $X$ is too big is equal to the chance of rejecting the hypothesis because $X$ is too small.\n",
    "\n",
    "The code tells the story:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# This is the first cell with code: set up the Python environment\n",
    "%matplotlib inline\n",
    "import matplotlib.pyplot as plt\n",
    "import math\n",
    "import numpy as np\n",
    "import scipy as sp\n",
    "import scipy.stats\n",
    "from scipy.stats import binom\n",
    "import pandas as pd\n",
    "from ipywidgets import interact, interactive, fixed\n",
    "import ipywidgets as widgets\n",
    "from IPython.display import clear_output, display, HTML"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "def binoLowerCL(n, x, cl = 0.975, inc=0.000001, p = None):\n",
    "    \"Lower confidence level cl confidence interval for Binomial p, for x successes in n trials\"\n",
    "    if p is None:\n",
    "            p = float(x)/float(n)\n",
    "    lo = 0.0\n",
    "    if (x > 0):\n",
    "            f = lambda q: cl - scipy.stats.binom.cdf(x-1, n, q)\n",
    "            lo = sp.optimize.brentq(f, 0.0, p, xtol=inc)\n",
    "    return lo\n",
    "\n",
    "def binoUpperCL(n, x, cl = 0.975, inc=0.000001, p = None):\n",
    "    \"Upper confidence level cl confidence interval for Binomial p, for x successes in n trials\"\n",
    "    if p is None:\n",
    "            p = float(x)/float(n)\n",
    "    hi = 1.0\n",
    "    if (x < n):\n",
    "            f = lambda q: scipy.stats.binom.cdf(x, n, q) - (1-cl)\n",
    "            hi = sp.optimize.brentq(f, p, 1.0, xtol=inc) \n",
    "    return hi"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<h3>Simulated coverage probability and expected length of Student-t and Binomial confidence intervals for a {0, 1} population</h3><strong>Nominal coverage probability 95.0%</strong>.<br /><strong>Estimated from 1000 replications.</strong>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>fraction of 1s</th>\n",
       "      <th>sample size</th>\n",
       "      <th>Student-t cov</th>\n",
       "      <th>Binom cov</th>\n",
       "      <th>Student-t len</th>\n",
       "      <th>Binom len</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.001</td>\n",
       "      <td>25.0</td>\n",
       "      <td>2.8%</td>\n",
       "      <td>97.2%</td>\n",
       "      <td>0.0047</td>\n",
       "      <td>0.1391</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0.001</td>\n",
       "      <td>50.0</td>\n",
       "      <td>5.1%</td>\n",
       "      <td>99.9%</td>\n",
       "      <td>0.0041</td>\n",
       "      <td>0.0729</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.001</td>\n",
       "      <td>100.0</td>\n",
       "      <td>9.5%</td>\n",
       "      <td>99.6%</td>\n",
       "      <td>0.0038</td>\n",
       "      <td>0.038</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0.001</td>\n",
       "      <td>400.0</td>\n",
       "      <td>34.6%</td>\n",
       "      <td>99.0%</td>\n",
       "      <td>0.0037</td>\n",
       "      <td>0.011</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0.010</td>\n",
       "      <td>25.0</td>\n",
       "      <td>20.8%</td>\n",
       "      <td>99.8%</td>\n",
       "      <td>0.0357</td>\n",
       "      <td>0.1518</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>0.010</td>\n",
       "      <td>50.0</td>\n",
       "      <td>40.7%</td>\n",
       "      <td>98.6%</td>\n",
       "      <td>0.0361</td>\n",
       "      <td>0.0881</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>0.010</td>\n",
       "      <td>100.0</td>\n",
       "      <td>60.9%</td>\n",
       "      <td>97.5%</td>\n",
       "      <td>0.0297</td>\n",
       "      <td>0.052</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>0.010</td>\n",
       "      <td>400.0</td>\n",
       "      <td>90.8%</td>\n",
       "      <td>95.8%</td>\n",
       "      <td>0.0188</td>\n",
       "      <td>0.0221</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>0.100</td>\n",
       "      <td>25.0</td>\n",
       "      <td>93.4%</td>\n",
       "      <td>99.3%</td>\n",
       "      <td>0.2377</td>\n",
       "      <td>0.2631</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>0.100</td>\n",
       "      <td>50.0</td>\n",
       "      <td>87.9%</td>\n",
       "      <td>96.3%</td>\n",
       "      <td>0.1675</td>\n",
       "      <td>0.1812</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>0.100</td>\n",
       "      <td>100.0</td>\n",
       "      <td>93.4%</td>\n",
       "      <td>95.2%</td>\n",
       "      <td>0.1183</td>\n",
       "      <td>0.126</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>0.100</td>\n",
       "      <td>400.0</td>\n",
       "      <td>93.9%</td>\n",
       "      <td>95.1%</td>\n",
       "      <td>0.0588</td>\n",
       "      <td>0.061</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    fraction of 1s  sample size Student-t cov Binom cov Student-t len  \\\n",
       "0            0.001         25.0          2.8%     97.2%        0.0047   \n",
       "1            0.001         50.0          5.1%     99.9%        0.0041   \n",
       "2            0.001        100.0          9.5%     99.6%        0.0038   \n",
       "3            0.001        400.0         34.6%     99.0%        0.0037   \n",
       "4            0.010         25.0         20.8%     99.8%        0.0357   \n",
       "5            0.010         50.0         40.7%     98.6%        0.0361   \n",
       "6            0.010        100.0         60.9%     97.5%        0.0297   \n",
       "7            0.010        400.0         90.8%     95.8%        0.0188   \n",
       "8            0.100         25.0         93.4%     99.3%        0.2377   \n",
       "9            0.100         50.0         87.9%     96.3%        0.1675   \n",
       "10           0.100        100.0         93.4%     95.2%        0.1183   \n",
       "11           0.100        400.0         93.9%     95.1%        0.0588   \n",
       "\n",
       "   Binom len  \n",
       "0     0.1391  \n",
       "1     0.0729  \n",
       "2      0.038  \n",
       "3      0.011  \n",
       "4     0.1518  \n",
       "5     0.0881  \n",
       "6      0.052  \n",
       "7     0.0221  \n",
       "8     0.2631  \n",
       "9     0.1812  \n",
       "10     0.126  \n",
       "11     0.061  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Population of two values, {0, 1}, in various proportions.  Amounts to Binomial random variable\n",
    "ns = np.array([25, 50, 100, 400])  # sample sizes\n",
    "ps = np.array([.001, .01, 0.1])    # mixture fractions, proportion of 1s in the population\n",
    "alpha = 0.05  # 1- (confidence level)\n",
    "reps = int(1.0e3)   # just for demonstration\n",
    "vals = [0, 1]\n",
    "\n",
    "simTable = pd.DataFrame(columns=('fraction of 1s', 'sample size', 'Student-t cov',\\\n",
    "                                 'Binom cov', 'Student-t len', 'Binom len'))\n",
    "for p in ps:\n",
    "    popMean = p\n",
    "    for n in ns:\n",
    "        tCrit = sp.stats.t.ppf(q=1.0-alpha/2, df=n-1)\n",
    "        samMean = np.zeros(reps)\n",
    "        sam = sp.stats.binom.rvs(n, p, size=reps)\n",
    "        samMean = sam/float(n)\n",
    "        samSD = np.sqrt(samMean*(1-samMean)/(n-1))\n",
    "        coverT = (np.fabs(samMean-popMean) < tCrit*samSD).sum()\n",
    "        aveLenT = 2*(tCrit*samSD).mean()\n",
    "        coverB = 0\n",
    "        totLenB = 0.0\n",
    "        for r in range(int(reps)):  \n",
    "            lo = binoLowerCL(n, sam[r], cl=1.0-alpha/2)\n",
    "            hi = binoUpperCL(n, sam[r], cl=1.0-alpha/2)\n",
    "            coverB += ( p >= lo) & (p <= hi)\n",
    "            totLenB += hi-lo\n",
    "        simTable.loc[len(simTable)] =  p, n, str(100*float(coverT)/float(reps)) + '%', \\\n",
    "            str(100*float(coverB)/float(reps)) + '%',\\\n",
    "            str(round(aveLenT,4)),\\\n",
    "            str(round(totLenB/float(reps),4))\n",
    "#\n",
    "ansStr =  '<h3>Simulated coverage probability and expected length of Student-t and Binomial confidence intervals for a {0, 1} population</h3>' +\\\n",
    "          '<strong>Nominal coverage probability ' + str(100*(1-alpha)) +\\\n",
    "          '%</strong>.<br /><strong>Estimated from ' + str(int(reps)) + ' replications.</strong>'\n",
    "display(HTML(ansStr))\n",
    "display(simTable)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The empirical coverage rates are gnerally much *higher* than 95% when $n$ is small, because of the discreteness of the Binomial distribution: to ensure that a test rejects _at most_ 5% of the time might require rejecting far less than 5% of the time. That is, the tests are conservative rather than exact.\n",
    "\n",
    "We could construct _randomized_ tests that have exactly level $\\alpha$, but we won't go there..."
   ]
  },
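  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The conservatism can also be computed exactly rather than estimated by simulation. The following is a minimal sketch, not the method used above: it assumes the standard result that the bounds obtained by inverting Binomial tests can be written as Beta quantiles (the Clopper-Pearson form), and it sums the Binomial probabilities of the outcomes $x$ whose interval contains $p$. The names `clopper_pearson` and `exact_coverage` are hypothetical.\n",
    "\n",
    "```python\n",
    "from scipy.stats import beta, binom\n",
    "\n",
    "def clopper_pearson(n, x, alpha=0.05):\n",
    "    # two-sided 1-alpha interval for Binomial p, written with Beta quantiles\n",
    "    # (assumed equivalent to solving the tail equations by root-finding)\n",
    "    lo = 0.0 if x == 0 else beta.ppf(alpha/2, x, n - x + 1)\n",
    "    hi = 1.0 if x == n else beta.ppf(1 - alpha/2, x + 1, n - x)\n",
    "    return lo, hi\n",
    "\n",
    "def exact_coverage(n, p, alpha=0.05):\n",
    "    # chance the interval covers p: add P(X = x) over every x whose interval contains p\n",
    "    cover = 0.0\n",
    "    for x in range(n + 1):\n",
    "        lo, hi = clopper_pearson(n, x, alpha)\n",
    "        if lo <= p <= hi:\n",
    "            cover += binom.pmf(x, n, p)\n",
    "    return cover\n",
    "\n",
    "for n in (25, 50, 100, 400):\n",
    "    print(n, exact_coverage(n, 0.01))\n",
    "```\n",
    "\n",
    "Under these assumptions the computed coverage should never fall below $1-\\alpha$; the excess over $1-\\alpha$ reflects the discreteness described above."
   ]
  },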
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Hypergeometric confidence bounds\n",
    "\n",
    "Consider $n$ uniform draws _without_ replacement from a $\\{0, 1\\}$ population of known size $N$ that contains an unknown number $G$ of 1s&mdash;i.e., a _simple random sample_ of size $n$.\n",
    "Let $X$ denote the number of 1s in the sample.\n",
    "Then $X$ has a Hypergeometric distribution with parameters $N$, $G$, and $n$.\n",
    "\n",
    "In particular,\n",
    "$$  \\mathbb P_{N, G, n} (X = k) = \\frac{{G \\choose k}{N-G \\choose n-k}}{{N \\choose n}}$$\n",
    "for $\\max(0, n − (N−G)) \\le k \\le  \\min(n, G)$.\n",
    "\n",
    "We can get sharper confidence bounds in this case by inverting hypergeometric tests instead of Binomial tests.\n",
    "\n",
    "The strategy is the same: construct a family of acceptance regions for each possible value of the parameter, then let the confidence set be the parameter values for which, given the observed data, the test would not reject.\n",
    "\n",
    "To find a one-sided lower confidence bound for $G$ in a Hypergeometric$(N, G, n)$ distribution, with $N$ and $n$ known, we would invert one-sided upper tests, that is, tests that reject the hypothesis $G = g$ when $X$ is large.  (The corresponding confidence bound for the population mean is the confidence bound for \n",
    "$G$, divided by $N$.)\n",
    "\n",
    "The form of the acceptance region for the test is:\n",
    "$$ A_G \\equiv \\ [0, a_G],$$\n",
    "where \n",
    "$$ a_G \\equiv \\min \\left \\{ k: \\sum_{i=k+1}^n \\frac{{G \\choose i}{N-G \\choose n-i}}{{N \\choose n}} \\le \\alpha \\right \\}.$$"
   ]
  },
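  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a hedged illustration of the inversion, the sketch below finds a lower confidence bound for $G$ by direct search: it returns the smallest $g$ for which observing $x$ or more 1s has probability greater than $\\alpha$ under the Hypergeometric$(N, g, n)$ distribution. The function name `hypergeo_lower_cl` is hypothetical, and the linear search is chosen for clarity rather than speed. Note that `scipy.stats.hypergeom` takes its parameters in the order (population size, number of 1s, sample size).\n",
    "\n",
    "```python\n",
    "from scipy.stats import hypergeom\n",
    "\n",
    "def hypergeo_lower_cl(N, n, x, cl=0.95):\n",
    "    # lower confidence bound for G, the number of 1s in a population of size N,\n",
    "    # from x 1s in a simple random sample of size n\n",
    "    alpha = 1.0 - cl\n",
    "    g_max = N - (n - x)   # the sample shows at least n-x zeros in the population\n",
    "    # G is at least x; search upward for the first g the upper-tail test does not reject\n",
    "    for g in range(x, g_max + 1):\n",
    "        tail = hypergeom.sf(x - 1, N, g, n)   # P(X >= x) when the population holds g ones\n",
    "        if tail > alpha:\n",
    "            return g\n",
    "    return g_max\n",
    "\n",
    "g_lo = hypergeo_lower_cl(N=1000, n=50, x=30)\n",
    "print(g_lo, g_lo/1000)   # bound for G, and the corresponding bound for the population mean\n",
    "```\n",
    "\n",
    "Dividing the result by $N$ gives the corresponding lower confidence bound for the population mean."
   ]
  },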
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Two-valued populations\n",
    "Clearly, if the population is known to contain only the values $\\{a, b\\}$ instead of the values $\\{0, 1\\}$, it's trivial to re-scale binomial or hypergeometric confidence bounds.\n",
    "\n",
    "If we observe the sum $Y$ of $n$ iid uniform draws from the population, \n",
    "- let $X \\equiv (Y - na)/(b-a)$\n",
    "- find the Binomial confidence bound for $p$ based on $X$ with $n$ trials\n",
    "- transform each endpoint $q$ of that interval to $(1-q)a + qb$ to get the corresponding endpoint of the interval for the mean of the original population.\n",
    "\n",
    "For example, suppose the population was known to contain only the values 1 and 10, but in unknown proportions. \n",
    "\n",
    "We draw an iid uniform sample of 50 items; the sample sum is 320 (corresponding to drawing \"1\" twenty times and \"10\" thirty times).\n",
    "\n",
    "Then a 95% upper confidence bound for the fraction of 10s in the population is <span class=\"code\">binoUpperCL(50, 30, cl=0.95)</span> = 0.7169, so a 95% upper confidence bound for the population mean is \n",
    "\n",
    "$$(1-0.7169)\\times 1 + 0.7169\\times 10) = 7.45182.$$\n",
    "\n"
   ]
  },
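  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The worked example can be checked numerically. The sketch below is self-contained, so it does not call `binoUpperCL`; instead it assumes the standard Beta-quantile form of the one-sided Binomial upper bound, which should agree with the root-finding version up to its tolerance. The helper name `two_value_upper_cl` is hypothetical.\n",
    "\n",
    "```python\n",
    "from scipy.stats import beta\n",
    "\n",
    "def two_value_upper_cl(n, y_sum, a, b, cl=0.95):\n",
    "    # upper confidence bound for the mean of a population containing only the values a < b,\n",
    "    # from the sum y_sum of n iid uniform draws\n",
    "    x = int(round((y_sum - n*a)/(b - a)))   # number of draws equal to b\n",
    "    # upper bound for the population fraction of b's (Beta-quantile form, an assumption)\n",
    "    p_plus = 1.0 if x == n else beta.ppf(cl, x + 1, n - x)\n",
    "    return (1 - p_plus)*a + p_plus*b\n",
    "\n",
    "print(two_value_upper_cl(50, 320, 1, 10))   # roughly 7.45, as in the example above\n",
    "```\n",
    "\n",
    "The same rescaling applies to hypergeometric bounds when the sample is drawn without replacement."
   ]
  },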
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Thresholding for bounded populations\n",
    "\n",
    "The applications that motivate this inquiry generally call for __one-sided confidence bounds__ rather than confidence intervals. For instance:\n",
    "\n",
    "+ A plaintiff may want to show that, with high confidence, the damages are _at least_ \\$ $x$\n",
    "+ A prosecutor may want to show that, with high confidence, the accused committed fraud worth _at least_ \\$ $x$\n",
    "+ A financial auditor might want to certify that, with high confidence, a company's assets are overstated by _at most_ \\$$x$ and its liabilities are understated by _at most_ \\$$y$.\n",
    "+ A local election official might want determine whether, with high confidence, the error in tallying the votes is _not as large as_ the margin\n",
    "+ An online marketer might want to show that, with high confidence, its technology increases sales by _at least_ $x$%.\n",
    "+ A pharmaceutical company might want to show that, with high confidence, its headache remedy decreases the average duration of headaches by _at least_ $x$ hours, or that its Ebola vaccine increases the rate of survival by _at least_ $x$%.\n",
    "\n",
    "\n",
    "In general, _some_ constraint is needed to get even a one-sided (conservative or exact) confidence bound.\n",
    "\n",
    "In these problems, there are natural constraints, for instance:  \n",
    "\n",
    "+ Damages cannot be negative (the plaintiff doesn't owe the defendant&mdash;unless the defendant sues the plaintiff, becoming the plaintiff in a new action)\n",
    "+ Fraudulent billing is non-negative; the frauduent portion of a bill does not exceed total bill\n",
    "\n",
    "<p class=\"gap01\">\n",
    "   Consider the second example in the Normal Approximation notebook, a mixture of a uniform and a point mass at    zero.\n",
    "   Suppose for the moment that we are interested in an upper confidence bound, e.g., an upper bound on the    \n",
    "   overstatement of a collection of accounts, for the purpose of establishing that a company's books are fairly \n",
    "   represented.\n",
    "</p>\n",
    "\n",
    "<p class=\"gap01\">\n",
    "   We have an unknown population $\\{ x_j \\}_{j=1}^N$.\n",
    "   We want an upper confidence bound for $\\mu_x \\equiv \\frac{1}{N} \\sum_{j=1}^N x_j$.\n",
    "</p>\n",
    "   \n",
    "<p class=\"gap01\">\n",
    "   Suppose we have an a priori upper bound $u_j$ on the value $x_j$ of item $j$, $\\forall j$.\n",
    "   For simplicity, let's take $u_j = 1$.\n",
    "   Pick a threshold $t < 1$.\n",
    "   Imagine a new population $\\{ y_j \\}_{j=1}^N$, where \n",
    "</p>\n",
    "\n",
    "  $$y_j \\equiv \\left \\{ \\begin{array}{ll} t, & x_j \\le t \\cr 1, & x_j > t. \\end{array} \\right . $$\n",
    "\n",
    "<p class=\"gap01\">\n",
    "  Clearly $\\mu_y \\equiv \\frac{1}{N} \\sum_{j=1}^N y_j \\ge \\mu_x$, so an upper confidence bound for $\\mu_y$\n",
    "  is also an upper confidence bound for $\\mu_x$.\n",
    "  But we can find an upper confidence bound for $\\mu_y$ using a random sample with replacement and\n",
    "  the general two-value transformation of the Binomial,\n",
    "  as sketched above.\n",
    "</p>\n",
    "\n",
    "+ draw a random sample with replacement of size $n$ from the population\n",
    "+ let $X$ denote the number of items in the sample with value greater than $t$. Then $X$ has a Binomial distribution with parameters $n$ and $p$, where $p$ is the population fraction of items with value greater than $t$\n",
    "+ find an upper $1-\\alpha$ confidence bound $p^+$ for $p$ by inverting Binomial tests\n",
    "+ an upper $1-\\alpha$ confidence bound for $\\mu$ is $(1-p^+)t + p^+$.\n",
    "\n",
    "Let's see how this does compared to a one-sided Student-t interval.  We will estimate the coverage for a variety of thresholds $t$.  We will also estimate the average length of the intervals, to get an idea of the tradeoff between coverage and precision.\n",
    "\n",
    "It is perhaps discomfiting to have a tuning parameter $t$ in the method, but regardless of the choice of $t$, the intervals are guaranteed to be _conservative_ (true converage probability at least $1-\\alpha$.\n",
    "However, the expected _length_ of the interval depends on $t$ and on the underlying population of values. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<h3>Simulated coverage probability and expected lengths of one-sided Student-t confidence intervals and threshold Binomial intervals for mixture of U[0,1] and pointmass at 0</h3><strong>Nominal coverage probability 95.0%</strong>. <br /><strong>Estimated from 1000 replications.</strong>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>mass at 0</th>\n",
       "      <th>sample size</th>\n",
       "      <th>Student-t cov</th>\n",
       "      <th>Bin t=0.2 cov</th>\n",
       "      <th>Bin t=0.1 cov</th>\n",
       "      <th>Bin t=0.01 cov</th>\n",
       "      <th>Bin t=0.001 cov</th>\n",
       "      <th>Student-t len</th>\n",
       "      <th>Bin t=0.2 len</th>\n",
       "      <th>Bin t=0.1 len</th>\n",
       "      <th>Bin t=0.01 len</th>\n",
       "      <th>Bin t=0.001 len</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.900</td>\n",
       "      <td>25.0</td>\n",
       "      <td>89.5%</td>\n",
       "      <td>100.0%</td>\n",
       "      <td>100.0%</td>\n",
       "      <td>100.0%</td>\n",
       "      <td>100.0%</td>\n",
       "      <td>0.316</td>\n",
       "      <td>0.383</td>\n",
       "      <td>0.318</td>\n",
       "      <td>0.262</td>\n",
       "      <td>0.257</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0.900</td>\n",
       "      <td>50.0</td>\n",
       "      <td>98.8%</td>\n",
       "      <td>100.0%</td>\n",
       "      <td>100.0%</td>\n",
       "      <td>100.0%</td>\n",
       "      <td>100.0%</td>\n",
       "      <td>0.332</td>\n",
       "      <td>0.337</td>\n",
       "      <td>0.266</td>\n",
       "      <td>0.204</td>\n",
       "      <td>0.198</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.900</td>\n",
       "      <td>100.0</td>\n",
       "      <td>99.9%</td>\n",
       "      <td>100.0%</td>\n",
       "      <td>100.0%</td>\n",
       "      <td>100.0%</td>\n",
       "      <td>100.0%</td>\n",
       "      <td>0.334</td>\n",
       "      <td>0.311</td>\n",
       "      <td>0.236</td>\n",
       "      <td>0.17</td>\n",
       "      <td>0.163</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0.900</td>\n",
       "      <td>400.0</td>\n",
       "      <td>100.0%</td>\n",
       "      <td>100.0%</td>\n",
       "      <td>100.0%</td>\n",
       "      <td>100.0%</td>\n",
       "      <td>100.0%</td>\n",
       "      <td>0.339</td>\n",
       "      <td>0.285</td>\n",
       "      <td>0.206</td>\n",
       "      <td>0.136</td>\n",
       "      <td>0.129</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0.990</td>\n",
       "      <td>25.0</td>\n",
       "      <td>22.9%</td>\n",
       "      <td>100.0%</td>\n",
       "      <td>100.0%</td>\n",
       "      <td>100.0%</td>\n",
       "      <td>100.0%</td>\n",
       "      <td>0.047</td>\n",
       "      <td>0.301</td>\n",
       "      <td>0.214</td>\n",
       "      <td>0.137</td>\n",
       "      <td>0.129</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>0.990</td>\n",
       "      <td>50.0</td>\n",
       "      <td>38.0%</td>\n",
       "      <td>100.0%</td>\n",
       "      <td>100.0%</td>\n",
       "      <td>100.0%</td>\n",
       "      <td>100.0%</td>\n",
       "      <td>0.058</td>\n",
       "      <td>0.257</td>\n",
       "      <td>0.165</td>\n",
       "      <td>0.083</td>\n",
       "      <td>0.075</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>0.990</td>\n",
       "      <td>100.0</td>\n",
       "      <td>60.2%</td>\n",
       "      <td>100.0%</td>\n",
       "      <td>100.0%</td>\n",
       "      <td>100.0%</td>\n",
       "      <td>100.0%</td>\n",
       "      <td>0.069</td>\n",
       "      <td>0.234</td>\n",
       "      <td>0.139</td>\n",
       "      <td>0.055</td>\n",
       "      <td>0.046</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>0.990</td>\n",
       "      <td>400.0</td>\n",
       "      <td>96.7%</td>\n",
       "      <td>100.0%</td>\n",
       "      <td>100.0%</td>\n",
       "      <td>100.0%</td>\n",
       "      <td>100.0%</td>\n",
       "      <td>0.093</td>\n",
       "      <td>0.216</td>\n",
       "      <td>0.119</td>\n",
       "      <td>0.032</td>\n",
       "      <td>0.023</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>0.999</td>\n",
       "      <td>25.0</td>\n",
       "      <td>2.3%</td>\n",
       "      <td>100.0%</td>\n",
       "      <td>100.0%</td>\n",
       "      <td>100.0%</td>\n",
       "      <td>100.0%</td>\n",
       "      <td>0.004</td>\n",
       "      <td>0.291</td>\n",
       "      <td>0.203</td>\n",
       "      <td>0.123</td>\n",
       "      <td>0.115</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>0.999</td>\n",
       "      <td>50.0</td>\n",
       "      <td>6.3%</td>\n",
       "      <td>100.0%</td>\n",
       "      <td>100.0%</td>\n",
       "      <td>100.0%</td>\n",
       "      <td>100.0%</td>\n",
       "      <td>0.008</td>\n",
       "      <td>0.248</td>\n",
       "      <td>0.154</td>\n",
       "      <td>0.07</td>\n",
       "      <td>0.061</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>0.999</td>\n",
       "      <td>100.0</td>\n",
       "      <td>9.8%</td>\n",
       "      <td>100.0%</td>\n",
       "      <td>100.0%</td>\n",
       "      <td>100.0%</td>\n",
       "      <td>100.0%</td>\n",
       "      <td>0.008</td>\n",
       "      <td>0.225</td>\n",
       "      <td>0.128</td>\n",
       "      <td>0.041</td>\n",
       "      <td>0.032</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>0.999</td>\n",
       "      <td>400.0</td>\n",
       "      <td>32.5%</td>\n",
       "      <td>100.0%</td>\n",
       "      <td>100.0%</td>\n",
       "      <td>100.0%</td>\n",
       "      <td>100.0%</td>\n",
       "      <td>0.015</td>\n",
       "      <td>0.207</td>\n",
       "      <td>0.108</td>\n",
       "      <td>0.019</td>\n",
       "      <td>0.01</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    mass at 0  sample size Student-t cov Bin t=0.2 cov Bin t=0.1 cov  \\\n",
       "0       0.900         25.0         89.5%        100.0%        100.0%   \n",
       "1       0.900         50.0         98.8%        100.0%        100.0%   \n",
       "2       0.900        100.0         99.9%        100.0%        100.0%   \n",
       "3       0.900        400.0        100.0%        100.0%        100.0%   \n",
       "4       0.990         25.0         22.9%        100.0%        100.0%   \n",
       "5       0.990         50.0         38.0%        100.0%        100.0%   \n",
       "6       0.990        100.0         60.2%        100.0%        100.0%   \n",
       "7       0.990        400.0         96.7%        100.0%        100.0%   \n",
       "8       0.999         25.0          2.3%        100.0%        100.0%   \n",
       "9       0.999         50.0          6.3%        100.0%        100.0%   \n",
       "10      0.999        100.0          9.8%        100.0%        100.0%   \n",
       "11      0.999        400.0         32.5%        100.0%        100.0%   \n",
       "\n",
       "   Bin t=0.01 cov Bin t=0.001 cov Student-t len Bin t=0.2 len Bin t=0.1 len  \\\n",
       "0          100.0%          100.0%         0.316         0.383         0.318   \n",
       "1          100.0%          100.0%         0.332         0.337         0.266   \n",
       "2          100.0%          100.0%         0.334         0.311         0.236   \n",
       "3          100.0%          100.0%         0.339         0.285         0.206   \n",
       "4          100.0%          100.0%         0.047         0.301         0.214   \n",
       "5          100.0%          100.0%         0.058         0.257         0.165   \n",
       "6          100.0%          100.0%         0.069         0.234         0.139   \n",
       "7          100.0%          100.0%         0.093         0.216         0.119   \n",
       "8          100.0%          100.0%         0.004         0.291         0.203   \n",
       "9          100.0%          100.0%         0.008         0.248         0.154   \n",
       "10         100.0%          100.0%         0.008         0.225         0.128   \n",
       "11         100.0%          100.0%         0.015         0.207         0.108   \n",
       "\n",
       "   Bin t=0.01 len Bin t=0.001 len  \n",
       "0           0.262           0.257  \n",
       "1           0.204           0.198  \n",
       "2            0.17           0.163  \n",
       "3           0.136           0.129  \n",
       "4           0.137           0.129  \n",
       "5           0.083           0.075  \n",
       "6           0.055           0.046  \n",
       "7           0.032           0.023  \n",
       "8           0.123           0.115  \n",
       "9            0.07           0.061  \n",
       "10          0.041           0.032  \n",
       "11          0.019            0.01  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Nonstandard mixture: a pointmass at zero and a uniform[0,1]\n",
    "ns = np.array([25, 50, 100, 400])  # sample sizes\n",
    "ps = np.array([0.9, 0.99, 0.999])    # mixture fraction, weight of pointmass\n",
    "thresh = [0.2, 0.1, 0.01, .001]\n",
    "alpha = 0.05  # 1- (confidence level)\n",
    "reps = 1.0e3   # just for demonstration\n",
    "\n",
    "cols = ['mass at 0', 'sample size', 'Student-t cov']\n",
    "for i in range(len(thresh)):\n",
    "    cols.append('Bin t=' + str(thresh[i]) + ' cov')\n",
    "cols.append('Student-t len')\n",
    "for i in range(len(thresh)):\n",
    "    cols.append('Bin t=' + str(thresh[i]) + ' len')\n",
    "\n",
    "\n",
    "simTable = pd.DataFrame(columns=cols)\n",
    "\n",
    "for p in ps:\n",
    "    popMean = (1-p)*0.5  #  p*0 + (1-p)*.5\n",
    "    for n in ns:\n",
    "        tCrit = sp.stats.t.ppf(q=1-alpha, df=n-1)\n",
    "        coverT = 0    # coverage of t intervals\n",
    "        tUp = 0       # mean upper bound of t intervals\n",
    "        coverB = np.zeros(len(thresh))  # coverage of binomial threshold intervals\n",
    "        bUp = np.zeros(len(thresh))     # mean upper bound of binomial threshold intervals\n",
    "        for rep in range(int(reps)):\n",
    "            sam = np.random.uniform(size=n)\n",
    "            ptMass = np.random.uniform(size=n)\n",
    "            sam[ptMass < p] = 0.0\n",
    "            samMean = np.mean(sam)\n",
    "            samSD = np.std(sam, ddof=1)\n",
    "            tlim = samMean + tCrit*samSD\n",
    "            coverT += (popMean <= tlim)  # one-sided Student-t\n",
    "            tUp += tlim\n",
    "            for i in range(len(thresh)):\n",
    "                x = (sam > thresh[i]).sum()  # number of binomial \"successes\"\n",
    "                pPlus = binoUpperCL(n, x, cl=1-alpha)\n",
    "                blim = thresh[i]*(1.0-pPlus) + pPlus\n",
    "                coverB[i] += (popMean <= blim)\n",
    "                bUp[i] += blim\n",
    "        theRow = [p, n, str(100*float(coverT)/float(reps)) + '%']\n",
    "        for i in range(len(thresh)):\n",
    "            theRow.append(str(100*float(coverB[i])/float(reps)) + '%')\n",
    "        theRow.append(str(round(tUp/float(reps), 3)))\n",
    "        for i in range(len(thresh)):\n",
    "            theRow.append(str(round(bUp[i]/float(reps), 3)))\n",
    "        simTable.loc[len(simTable)] =  theRow\n",
    "#\n",
    "ansStr =  '<h3>Simulated coverage probability and expected lengths of one-sided Student-t confidence intervals and threshold ' +\\\n",
    "          'Binomial intervals for mixture of U[0,1] and pointmass at 0</h3>' +\\\n",
    "          '<strong>Nominal coverage probability ' + str(100*(1-alpha)) +\\\n",
    "          '%</strong>. <br /><strong>Estimated from ' + str(int(reps)) + ' replications.</strong>'\n",
    "\n",
    "display(HTML(ansStr))\n",
    "display(simTable)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice that in many cases the expected length of the Student-t interval is _greater_ than the length of the binomial interval, even though the coverage probability of the Student-t interval is smaller."
   ]
  },
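  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A short derivation may clarify why the thresholded bound computed in the simulation (`blim = thresh[i]*(1.0-pPlus) + pPlus`) is valid. The notation is introduced here only for illustration: $t$ is the threshold, $q$ is the fraction of the population greater than $t$, and $q^+$ is an upper $1-\\alpha$ confidence bound for $q$, playing the role of `pPlus`.\n",
    "\n",
    "Every value in the population is in $[0, 1]$, so values at or below $t$ are at most $t$ and values above $t$ are at most 1. The population mean $\\mu$ therefore satisfies\n",
    "\n",
    "$$ \\mu \\le t(1-q) + 1 \\cdot q = t + (1-t)q. $$\n",
    "\n",
    "Because $1-t \\ge 0$, the right-hand side is nondecreasing in $q$, so whenever $q \\le q^+$,\n",
    "\n",
    "$$ \\mu \\le t(1-q^+) + q^+. $$\n",
    "\n",
    "The event $q \\le q^+$ has probability at least $1-\\alpha$, so this upper bound for $\\mu$ has coverage probability at least $1-\\alpha$ for any threshold fixed before the data are examined."
   ]
  },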
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Trinomial Bounds, Multinomial Bounds, and the Stringer Bound\n",
    "Instead of thresholding at one value (yielding a binomial variable: \"success\" means \"above threshold\"), we could discretize the range of possible values of each observation into three or more ranges, and make inferences based on the trinomial or multinomial distribution that the discretization induces.\n",
    "\n",
    "Specifying acceptance regions so that the tests are easy to invert is not straightforward (and there are issues with published methods).\n",
    "\n",
    "The Stringer Bound is well known in financial auditing. It amounts to combining confidence limits for data-dependent ranges (adaptive thresholds) into an overall confidence limit.\n",
    "Empirically, it is conservative, and Bickel has shown that it is asymptotically conservative, but as far as I know, there is no proof that it is conservative for finite samples.\n",
    "\n",
    "Moreover, computational evidence suggests that methods we will explore in later lectures perform better than multinomial methods and the Stringer bound in practice, so we will not dwell on them."
   ]
  },
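  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough sketch of how a trinomial discretization extends the single-threshold bound, consider the following. The notation is assumed for illustration and is not taken from a particular published method: $t_1 < t_2$ are two fixed thresholds in $[0, 1]$, and $q_1$, $q_2$, $q_3$ are the fractions of the population in $[0, t_1]$, $(t_1, t_2]$, and $(t_2, 1]$.\n",
    "\n",
    "Bounding each value by the top of its range gives\n",
    "\n",
    "$$ \\mu \\le t_1 q_1 + t_2 q_2 + q_3, \\qquad q_1 + q_2 + q_3 = 1. $$\n",
    "\n",
    "The counts of observations in the three ranges in an IID sample of size $n$ are trinomial with parameters $n$ and $(q_1, q_2, q_3)$. If $C$ is a $1-\\alpha$ confidence set for $(q_1, q_2, q_3)$ based on those counts, then\n",
    "\n",
    "$$ \\sup_{(q_1, q_2, q_3) \\in C} \\left ( t_1 q_1 + t_2 q_2 + q_3 \\right ) $$\n",
    "\n",
    "is an upper $1-\\alpha$ confidence bound for $\\mu$. The difficulty lies in choosing acceptance regions for which $C$, and the supremum over it, are easy to compute."
   ]
  },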
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## What's next?\n",
    "Now we will consider some conservative methods for constructing lower confidence bounds for the mean of nonnegative populations.\n",
    "\n",
    "- [Next: Confidence bounds from the Chebychev and Hoeffding Inequalities](hoeffding.ipynb)\n",
    "- [Previous: Duality between confidence sets and hypothesis tests](duality.ipynb)\n",
    "- [Index](index.ipynb)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<link href='http://fonts.googleapis.com/css?family=Lora|Open+Sans' rel='stylesheet' type='text/css'>\n",
       "<style>\n",
       "\n",
       "font-family: 'Lora', serif;\n",
       "\n",
       ".MathJax_Display {\n",
       "  padding: 10px;\n",
       "}\n",
       "\n",
       "div.callout {\n",
       "  color: #000000;\n",
       "  background-color: #DDDDDD;\n",
       "  margin: 20px 20px 20px 20px;\n",
       "  border-style: solid;\n",
       "  border-width: 2px;\n",
       "  padding: 10px 10px;\n",
       "}\n",
       "\n",
       ".rendered_html {\n",
       "  color: #2C5494;\n",
       "  font-family: 'Lora', serif;\n",
       "  font-size: 140%;\n",
       "  line-height: 1.1;\n",
       "  margin: 0.5em 0;\n",
       "}\n",
       "\n",
       "div.cell {\n",
       "    width:900px;\n",
       "    margin-left:auto;\n",
       "    margin-right:auto;\n",
       "}\n",
       "\n",
       "div.text_cell_render {\n",
       "    width:800px;\n",
       "    margin-left:auto;\n",
       "    margin-right:auto;\n",
       "}\n",
       "\n",
       "\n",
       ".title {\n",
       "  font-family: 'Open Sans', sans-serif;\n",
       "  color: #4773D1;\n",
       "  font-size: 250%;\n",
       "  font-weight:bold;\n",
       "  line-height: 1.2;\n",
       "  margin: 0.5em 0;\n",
       "}\n",
       "\n",
       ".subtitle {\n",
       "  font-family: 'Open Sans', sans-serif;\n",
       "  color: #386BBC;\n",
       "  font-size: 180%;\n",
       "  font-weight:bold;\n",
       "  line-height: 1.2;\n",
       "  margin: 0.5em 0;\n",
       "}\n",
       "\n",
       ".slide-header, p.slide-header {\n",
       "  color: #4773D1;\n",
       "  font-size: 200%;\n",
       "  font-weight:bold;\n",
       "  margin: 0px 20px 10px;\n",
       "  page-break-before: always;\n",
       "  text-align: center;\n",
       "}\n",
       "\n",
       ".rendered_html .code {\n",
       "  background-color: #999999;\n",
       "}\n",
       "\n",
       ".rendered_html h1 {\n",
       "  color: #7898DD;\n",
       "  font-family: 'Open Sans', sans-serif;\n",
       "  line-height: 1.2;\n",
       "  margin: 0.15em 0em 0.5em;\n",
       "  page-break-before: always;\n",
       "}\n",
       "\n",
       "\n",
       ".rendered_html h2 {\n",
       "  color: #4773D1;\n",
       "  line-height: 1.2;\n",
       "  margin: 1.1em 0em 0.5em;\n",
       "}\n",
       "\n",
       ".rendered_html h3 {\n",
       "  font-size: 100%;\n",
       "  line-height: 1.2;\n",
       "  margin: 1.1em 0em 0.5em;\n",
       "}\n",
       "\n",
       ".rendered_html .definition, .proposition, .proof, .theorem {\n",
       "    padding-top: 20px;\n",
       "    color: #222299;\n",
       "    font-size: 120%;\n",
       "    font-style: italic;\n",
       "}\n",
       "\n",
       ".definition, .proposition, .theorem {\n",
       "  background-color: #EEEEEE;\n",
       "  border-style: solid;\n",
       "  border-width: 2px;\n",
       "  padding-left: 20px;\n",
       "  padding-right: 20px;\n",
       "}\n",
       "\n",
       ".rendered_html .definition::before{\n",
       "    content: \"Definition:\";\n",
       "    background-color: #DDDDDD;\n",
       "    color: #222299;\n",
       "    font-size: 110%;\n",
       "    font-weight: bold;\n",
       "    font-style: normal;\n",
       "}\n",
       "\n",
       ".rendered_html .proposition::before{\n",
       "    content: \"Proposition:\";\n",
       "    background-color: #DDDDDD;\n",
       "    color: #222299;\n",
       "    font-size: 110%;\n",
       "    font-weight: bold;\n",
       "    font-style: normal;\n",
       "}\n",
       "\n",
       ".rendered_html .proof::before{\n",
       "    content: \"Proof:\";\n",
       "    background-color: #DDDDDD;\n",
       "    color: #222299;\n",
       "    font-size: 110%;\n",
       "    font-weight: bold;\n",
       "    font-style: normal;\n",
       "}\n",
       "\n",
       ".rendered_html .theorem::before{\n",
       "    content: \"Theorem:\";\n",
       "    background-color: #DDDDDD;\n",
       "    color: #222299;\n",
       "    font-size: 110%;\n",
       "    font-weight: bold;\n",
       "    font-style: normal;\n",
       "}\n",
       "\n",
       "\n",
       ".rendered_html ol li {\n",
       "  padding-top: 20px;\n",
       "  padding-bottom: -20px;\n",
       "  line-height: 1.5;\n",
       "}\n",
       "\n",
       ".rendered_html ul li {\n",
       "  padding-top: 0px;\n",
       "  padding-bottom: 0px;\n",
       "  line-height: 1.2;\n",
       "}\n",
       "\n",
       "li:first-of-type {\n",
       "  padding-top: -100px;\n",
       "}\n",
       "\n",
       ".input_prompt, .CodeMirror-lines, .output_area {\n",
       "  font-family: Consolas;\n",
       "  font-size: 120%;\n",
       "}\n",
       "\n",
       ".gap-above {\n",
       "  padding-top: 200px;\n",
       "}\n",
       "\n",
       ".gap01 {\n",
       "  padding-top: 10px;\n",
       "}\n",
       "\n",
       ".gap05 {\n",
       "  padding-top: 50px;\n",
       "}\n",
       "\n",
       ".gap1 {\n",
       "  padding-top: 100px;\n",
       "}\n",
       "\n",
       ".gap2 {\n",
       "  padding-top: 200px;\n",
       "}\n",
       "\n",
       ".gap3 {\n",
       "  padding-top: 300px;\n",
       "}\n",
       "\n",
       ".emph {\n",
       "  color: #386BBC;\n",
       "}\n",
       "\n",
       ".warn {\n",
       "  color: red;\n",
       "}\n",
       "\n",
       ".center {\n",
       "  text-align: center;\n",
       "}\n",
       "\n",
       ".nb_link {\n",
       "    padding-bottom: 0.5em;\n",
       "}\n",
       "\n",
       "</style>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "%run talkTools.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}