{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# The Kaplan-Wald Confidence Bound for a Nonnegative Mean"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This section presents an approach to finding lower confidence bounds for the mean of a nonnegative random variable described in H.M. Kaplan's (no longer online) website, http://printmacroj.com/martMean.htm. (An Internet Archive version is available at http://web.archive.org/web/20150220073515/http://printmacroj.com/martMean.htm.) That work fleshes out an idea due to Wald (1945, 2004), and is closely related to a technique presented in Kaplan (1987) based on martingale theory.\n",
    "\n",
    "We start with a version for sampling with replacement, then develop finite-population (sampling without replacement) versions.\n",
    "\n",
    "For another derivation of the IID version, see Stark & Teague (2014) https://www.usenix.org/system/files/jets/issues/0301/overview/jets_0301-stark.pdf\n",
    "\n",
    "\n",
    "We have an IID sequence of random variables $X_1, X_2, \\ldots$ such that $\\mathbb P \\{X_j \\ge 0 \\} = 1$. Let $F$ be their common distribution function. We seek a lower confidence bound for their common expectation $\\mu := \\mathbb{E} X_1 = \\int_0^\\infty x dF$. \n",
    "\n",
    "Under the hypothesis that $\\mu = t$, \n",
    "\n",
    "\\begin{equation*} \n",
    "   t^{-1} \\mu =  t^{-1} \\int_0^\\infty  xdF(x) = \\int_0^\\infty xt^{-1} dF(x) = 1.\n",
    "\\end{equation*}\n",
    "Fix $\\gamma \\in [0, 1]$.\n",
    "Because $\\int_0^\\infty dF = 1$, it follows that if $\\mu  = t$,\n",
    "\n",
    "\\begin{equation*}   \n",
    "    \\mathbb{E} \\left ( (\\gamma/t) X_j + (1-\\gamma) \\right ) = (\\gamma/t) \\mathbb{E} X_j + (1-\\gamma) =  \n",
    "     (\\gamma/t)t + (1-\\gamma) = 1.\n",
    "\\end{equation*}\n",
    "Now,\n",
    "\\begin{equation*} \n",
    "   \\mathbb{E} \\left ((\\gamma/t) X_j + (1-\\gamma) \\right ) := \n",
    "     \\int_0^\\infty \\left (x \\gamma/t + (1-\\gamma) \\right ) dF(x).\n",
    "\\end{equation*}\n",
    "Since for $x \\ge 0$, $(x \\gamma/t + (1-\\gamma)) \\ge 0$, it follows that if we define\n",
    "\n",
    "\\begin{equation*} \n",
    "   dG := (x \\gamma/t + (1-\\gamma))dF\n",
    "\\end{equation*}\n",
    "\n",
    "then $G$ is the cdf of a nonnegative random variable.\n",
    "\n",
    "Let $Y$ be a random variable with cdf $G$ and let $X$ be a random variable with cdf $F$.\n",
    "By Jensen's inequality, $\\mathbb{E} X^2 \\ge (\\mathbb{E} X)^2 = t \\cdot \\mathbb{E} X$ (under the null that $\\mathbb{E} X = t$).\n",
    "Since $\\mathbb{E} X = t \\ge 0$,\n",
    "\\begin{align}\n",
    "    \\mathbb{E} Y  &= \\int_0^\\infty x dG(x) \\\\\n",
    "                 &= \\int_0^\\infty x (x \\gamma/t + (1-\\gamma)) dF(x) \\\\\n",
    "                 &= \\gamma/t \\int_0^\\infty x^2 dF(x) + (1-\\gamma) \\int_0^\\infty x dF(x) \\\\\n",
    "                 &= \\gamma/t \\cdot \\mathbb{E} X^2 + (1-\\gamma) \\cdot \\mathbb{E} X \\\\\n",
    "                 &\\ge \\gamma \\cdot \\mathbb{E} X + (1-\\gamma) \\cdot \\mathbb{E} X  \\mbox{ (Jensen's inequality and $\\mathbb{E} X = t$)}\\\\\n",
    "                 &= \\mathbb{E} X = t\n",
    "\\end{align}\n",
    "If the data allow us to reject the hypothesis $H_0$ that $\\{ X_j\\}$ are IID with cdf $F$\n",
    "in favor of the alternative hypothesis $H_1$ that they are iid with cdf $G$,\n",
    "we have strong statistical evidence that $\\mu = \\mathbb{E} X_j > t$."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Using the SPRT to test $H_1$ versus $H_0$\n",
    "\n",
    "Now a bit of magic occurs.  For a given observation $X_j = x_j$, what is the ratio of its \n",
    "probability if $H_1$ is true to its probability if $H_0$ is true?\n",
    "\\begin{equation*} \n",
    "    \\mathrm{PR}  = \\frac{dG(x_j)}{dF(x_j)} = \\frac{(x_j\\gamma/t+(1−\\gamma))dF(x_j)}{dF(x_j)} = (x_j\\gamma/t+(1−\\gamma)).\n",
    "\\end{equation*}\n",
    "This doesn't depend on $F$ or $G$!\n",
    "\n",
    "The $\\mathrm{PR} $ for observations $X_1, \\ldots, X_m$ is\n",
    "\\begin{equation*} \n",
    "      \\mathrm{PR}  = \\prod_{i=1}^m \\left ( (\\gamma/t) X_i + 1 - \\gamma \\right ).\n",
    "\\end{equation*}\n",
    "This expression shows why $\\gamma$ was introduced in the first place:\n",
    "for $\\gamma = 1$, if even a single observation turned out to be zero, $\\mathrm{PR} $ would forever be\n",
    "zero no matter how large subsequent observations turned out to be.\n",
    "Taking $\\gamma < 1$ hedges against that possibility.\n",
    "Any value of $\\gamma \\in [0, 1]$ gives a conservative test, but smaller values provide more \"protection\"\n",
    "against small values of $X_j$ (but incur a loss of power when all $X_j$ are large).\n",
    "\n",
    "Recall that the $1/\\mathrm{PR} $ is the $P$-value of $H_0: \\mu = t$ based on the observations $\\{X_j\\}_{j=1}^m$.\n",
    "We will use this to find a lower confidence bound for $\\mu$.\n",
    "\n",
    "### \"Look back\" and the SPRT\n",
    "There's more: recall that the SPRT says the chance that $\\mathrm{PR} $ _ever_ gets larger than $1/\\alpha$ is at most $\\alpha$ if $H_0$ is true.\n",
    "That means that we can use the observations sequentially, maximizing over the partial products.\n",
    "If any of the partial products exceeds $1/\\alpha$, we can reject $H_0$--even if the ratio becomes small again.\n",
    "\n",
    "That is, our level-$\\alpha$ test based on a sample of size $n$ is\n",
    "\\begin{equation*} \n",
    "    \\mbox{ Reject } H_0 \\mbox{ if } \\max_{m=1}^n \\prod_{i=1}^m \\left ( \\gamma X_i/t + 1 - \\gamma \\right ) \\ge 1/\\alpha.\n",
    "\\end{equation*}\n",
    "\n",
    "It is _only_ legitimate to do this maximization if the data are in random order.  \n",
    "For instance, it's impermissible to sort the data from largest to smallest. \n",
    "And if you maximize over partial products, the result will, in general, depend on the order of the data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Confidence bounds from the Kaplan-Wald test\n",
    "\n",
    "To find a lower confidence bound, we can invert hypothesis tests: the lower endpoint of a one-sided confidence bound for $\\mu$ is the largest value \n",
    "of $t$ for which we would not reject the hypothesis $\\mu = t$ at significance level $\\alpha$.\n",
    "\n",
    "For confidence levels above 50%, this lower confidence bound will certainly be between zero and the sample mean. However, for $t=0$, we have a divide-by-zero issue. Hence, for numerical implementation, we search the interval $[\\epsilon, \\bar{X}]$ for the smallest $t$ for which we can reject the hypothesis $\\mu = t$ at significance level $\\alpha$. If that smallest $t$ is $\\epsilon$, we set the confidence bound to zero. \n",
    "\n",
    "The following code implements this idea, working with the log of the test statistic to improve numerical stability."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# This is the first cell with code: set up the Python environment\n",
    "%matplotlib inline\n",
    "import matplotlib.pyplot as plt\n",
    "import math\n",
    "\n",
    "import numpy as np\n",
    "from numpy.polynomial import polynomial as P\n",
    "\n",
    "import scipy as sp\n",
    "import scipy.stats\n",
    "from scipy.stats import binom\n",
    "from scipy.optimize import brentq\n",
    "\n",
    "import pandas as pd\n",
    "\n",
    "from ipywidgets import interact, interactive, fixed\n",
    "import ipywidgets as widgets\n",
    "from IPython.display import clear_output, display, HTML"
   ]
  },
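  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before the confidence-bound routine below, here is a toy illustration of the partial-product statistic $\\mathrm{PR}$ from the SPRT section. The sample values, the hypothesized mean $t$, and $\\gamma$ in the next cell are arbitrary choices, for illustration only."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Toy illustration of the Kaplan-Wald partial products (illustrative values only)\n",
    "toy_x = np.array([0.8, 1.2, 0.0, 2.5, 1.1])   # hypothetical observations, in the order drawn\n",
    "t_0 = 0.5                                     # hypothesized mean under the null\n",
    "gamma_toy = 0.99                              # hedging parameter\n",
    "\n",
    "terms = gamma_toy*toy_x/t_0 + 1 - gamma_toy   # likelihood-ratio factors gamma*X_i/t + 1 - gamma\n",
    "partial_products = np.cumprod(terms)          # PR after 1, 2, ..., m observations\n",
    "p_value = min(1, 1/np.max(partial_products))  # reject at level alpha if some partial product >= 1/alpha\n",
    "\n",
    "print('partial products:', partial_products)\n",
    "print('P-value for the hypothesis mu =', t_0, ':', p_value)"
   ]
  },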
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "def kaplanWaldLowerCI(x, cl = 0.95, gamma = 0.99, xtol=1.e-12, logf=True):\n",
    "    '''\n",
    "       Calculates the Kaplan-Wald lower 1-alpha confidence bound for the mean of a nonnegative random\n",
    "       variable.\n",
    "    '''\n",
    "    alpha = 1.0-cl\n",
    "    if any(x < 0):\n",
    "        raise ValueError('Data x must be nonnegative.')\n",
    "    elif all(x <= 0):\n",
    "        lo = 0.0\n",
    "    else:\n",
    "        if logf:\n",
    "            f = lambda t: (np.max(np.cumsum(np.log(gamma*x/t + 1 - gamma))) + np.log(alpha))\n",
    "        else:\n",
    "            f = lambda t: (np.max(np.cumprod(gamma*x/t + 1 - gamma)) - 1/alpha)\n",
    "        xm = np.mean(x)\n",
    "        if f(xtol)*f(xm) > 0.0:\n",
    "            lo = 0.0\n",
    "        else:\n",
    "            lo = sp.optimize.brentq(f, xtol, np.mean(x), xtol=xtol) \n",
    "    return lo"
   ]
  },
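  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A quick check of the routine on a synthetic sample (the distribution, sample size, and seed below are arbitrary choices, for illustration only):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Example call on synthetic nonnegative data (illustrative values only)\n",
    "np.random.seed(12345678)\n",
    "sample = np.random.uniform(low=0, high=2, size=50)  # IID nonnegative observations\n",
    "lcb = kaplanWaldLowerCI(sample, cl=0.95, gamma=0.99)\n",
    "print('sample mean:', np.mean(sample))\n",
    "print('95% Kaplan-Wald lower confidence bound:', lcb)"
   ]
  },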
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## How well does it work?\n",
    "Let's test the method on our standard test problems: binomial and a mixture of pointmass and uniform.  We will fix $\\gamma$; you might experiment using different values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<h3>Simulated coverage probability and expected lower endpoint of one-sided Student-t and Kaplan-Wald confidence intervals for mixture of U[0,1] and pointmass at 1 population</h3><strong>Nominal coverage probability 95.0%</strong>. Gamma=0.99.<br /><strong>Estimated from 10000 replications.</strong>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>mass at 1</th>\n",
       "      <th>sample size</th>\n",
       "      <th>Student-t cov</th>\n",
       "      <th>Kaplan-Wald cov</th>\n",
       "      <th>Student-t low</th>\n",
       "      <th>Kaplan-Wald low</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.900</td>\n",
       "      <td>25</td>\n",
       "      <td>89.99%</td>\n",
       "      <td>100.0%</td>\n",
       "      <td>0.6842</td>\n",
       "      <td>0.8203</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0.900</td>\n",
       "      <td>50</td>\n",
       "      <td>98.81%</td>\n",
       "      <td>100.0%</td>\n",
       "      <td>0.6702</td>\n",
       "      <td>0.8708</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.900</td>\n",
       "      <td>100</td>\n",
       "      <td>99.98%</td>\n",
       "      <td>97.59%</td>\n",
       "      <td>0.6641</td>\n",
       "      <td>0.8961</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0.900</td>\n",
       "      <td>400</td>\n",
       "      <td>100.0%</td>\n",
       "      <td>96.77%</td>\n",
       "      <td>0.6612</td>\n",
       "      <td>0.8988</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0.990</td>\n",
       "      <td>25</td>\n",
       "      <td>21.91%</td>\n",
       "      <td>100.0%</td>\n",
       "      <td>0.9535</td>\n",
       "      <td>0.8792</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>0.990</td>\n",
       "      <td>50</td>\n",
       "      <td>39.32%</td>\n",
       "      <td>100.0%</td>\n",
       "      <td>0.9413</td>\n",
       "      <td>0.9341</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>0.990</td>\n",
       "      <td>100</td>\n",
       "      <td>61.75%</td>\n",
       "      <td>100.0%</td>\n",
       "      <td>0.9272</td>\n",
       "      <td>0.9627</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>0.990</td>\n",
       "      <td>400</td>\n",
       "      <td>97.78%</td>\n",
       "      <td>100.0%</td>\n",
       "      <td>0.9069</td>\n",
       "      <td>0.9847</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>0.999</td>\n",
       "      <td>25</td>\n",
       "      <td>2.48%</td>\n",
       "      <td>100.0%</td>\n",
       "      <td>0.9952</td>\n",
       "      <td>0.8854</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>0.999</td>\n",
       "      <td>50</td>\n",
       "      <td>4.69%</td>\n",
       "      <td>100.0%</td>\n",
       "      <td>0.9937</td>\n",
       "      <td>0.9406</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>0.999</td>\n",
       "      <td>100</td>\n",
       "      <td>9.15%</td>\n",
       "      <td>100.0%</td>\n",
       "      <td>0.9915</td>\n",
       "      <td>0.9695</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>0.999</td>\n",
       "      <td>400</td>\n",
       "      <td>32.84%</td>\n",
       "      <td>100.0%</td>\n",
       "      <td>0.9845</td>\n",
       "      <td>0.9917</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    mass at 1 sample size Student-t cov Kaplan-Wald cov Student-t low  \\\n",
       "0       0.900          25        89.99%          100.0%        0.6842   \n",
       "1       0.900          50        98.81%          100.0%        0.6702   \n",
       "2       0.900         100        99.98%          97.59%        0.6641   \n",
       "3       0.900         400        100.0%          96.77%        0.6612   \n",
       "4       0.990          25        21.91%          100.0%        0.9535   \n",
       "5       0.990          50        39.32%          100.0%        0.9413   \n",
       "6       0.990         100        61.75%          100.0%        0.9272   \n",
       "7       0.990         400        97.78%          100.0%        0.9069   \n",
       "8       0.999          25         2.48%          100.0%        0.9952   \n",
       "9       0.999          50         4.69%          100.0%        0.9937   \n",
       "10      0.999         100         9.15%          100.0%        0.9915   \n",
       "11      0.999         400        32.84%          100.0%        0.9845   \n",
       "\n",
       "   Kaplan-Wald low  \n",
       "0           0.8203  \n",
       "1           0.8708  \n",
       "2           0.8961  \n",
       "3           0.8988  \n",
       "4           0.8792  \n",
       "5           0.9341  \n",
       "6           0.9627  \n",
       "7           0.9847  \n",
       "8           0.8854  \n",
       "9           0.9406  \n",
       "10          0.9695  \n",
       "11          0.9917  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Nonstandard mixture: a pointmass at 1 and a uniform[0,1]\n",
    "ns = np.array([25, 50, 100, 400])  # sample sizes\n",
    "ps = np.array([0.9, 0.99, 0.999])    # mixture fraction, weight of pointmass\n",
    "alpha = 0.05  # 1- (confidence level)\n",
    "reps = int(10**4)   # just for demonstration\n",
    "gamma = 0.99\n",
    "xtol = 1.e-12\n",
    "\n",
    "simTable = pd.DataFrame(columns=('mass at 1', 'sample size', 'Student-t cov',\\\n",
    "                                 'Kaplan-Wald cov', 'Student-t low', 'Kaplan-Wald low')\n",
    "                       )\n",
    "\n",
    "for p in ps:\n",
    "    popMean = (1-p)*0.5 + p  # population is a mixture of uniform with mass (1-p) and\n",
    "                             # pointmass at 1 with mass p\n",
    "    \n",
    "    for n in ns:\n",
    "        tCrit = sp.stats.t.ppf(q=1-alpha, df=n-1)\n",
    "        coverT = 0\n",
    "        coverK = 0\n",
    "        lowT = 0.0\n",
    "        lowK = 0.0\n",
    "        \n",
    "        for rep in range(int(reps)):\n",
    "            sam = np.random.uniform(size=n)  # the uniform part of the sample\n",
    "            ptMass = np.random.uniform(size=n)  # for a bunch of p-coin tosses\n",
    "            sam[ptMass < p] = 1.0   # mix in pointmass at 1, with probability p\n",
    "            # t interval\n",
    "            samMean = np.mean(sam)\n",
    "            samSD = np.std(sam, ddof=1)\n",
    "            tLo = samMean - tCrit*samSD  # lower endpoint of the t interval\n",
    "            coverT += ( tLo <= popMean )\n",
    "            lowT += tLo\n",
    "            #  Kaplan-Wald interval\n",
    "            kLo = kaplanWaldLowerCI(sam, cl=1-alpha, gamma=gamma, xtol=xtol) # lower endpoint of the Kaplan-Wald interval\n",
    "            coverK += (kLo <= popMean )\n",
    "            lowK += kLo\n",
    "\n",
    "        simTable.loc[len(simTable)] =  p, n,\\\n",
    "            str(100*float(coverT)/float(reps)) + '%',\\\n",
    "            str(100*float(coverK)/float(reps)) + '%',\\\n",
    "            str(round(lowT/float(reps),4)),\\\n",
    "            str(round(lowK/float(reps),4))\n",
    "#\n",
    "ansStr =  '<h3>Simulated coverage probability and expected lower endpoint of ' +\\\n",
    "          'one-sided Student-t and Kaplan-Wald confidence intervals for ' +\\\n",
    "          'mixture of U[0,1] and pointmass at 1 population</h3>' +\\\n",
    "          '<strong>Nominal coverage probability ' + str(100*(1-alpha)) +\\\n",
    "          '%</strong>. Gamma=' + str(gamma) +\\\n",
    "          '.<br /><strong>Estimated from ' + str(int(reps)) +\\\n",
    "          ' replications.</strong>'\n",
    "\n",
    "display(HTML(ansStr))\n",
    "display(simTable)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Looks pretty good!  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Upper confidence bounds and two-sided bounds\n",
    "\n",
    "If every $X_j$ has the same finite, a priori upper bound $u$, we can transform the problem by defining $\\tilde{X_j}= u - X_j$.  Then $\\tilde{X_j}$ is a nonnegative random variable, and a lower confidence bound on its mean translated to can be subtracted from $u$ to make an upper confidence bound on $\\mathbb{E} X_j$.\n",
    "\n",
    "If every $X_j$ has the finite, a priori upper and lower bound, we can find two-sided confidence intervals in the analogous way, applying the Kaplan Wald method to the original data to find lower confidence bounds and to the data subtracted from the a priori upper bound to find upper confidence bounds."
   ]
  },
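  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The next cell sketches the transformation, assuming a known a priori upper bound $u$ (here $u=1$) and a synthetic sample chosen only for illustration. Combining the 95% lower confidence bound for $\\mathbb{E} X_j$ with the 95% upper confidence bound obtained from $u - X_j$ gives a two-sided interval with coverage probability at least 90%, by Bonferroni's inequality."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Upper confidence bound for E(X) when X <= u a priori, via the lower bound for u - X\n",
    "# (u and the synthetic sample are assumptions, for illustration only)\n",
    "u = 1.0                                # a priori upper bound on the observations\n",
    "np.random.seed(2718281)\n",
    "samp = np.random.beta(5, 2, size=100)  # synthetic data in [0, 1]\n",
    "\n",
    "lower_bound = kaplanWaldLowerCI(samp, cl=0.95, gamma=0.99)          # lower bound for E(X)\n",
    "upper_bound = u - kaplanWaldLowerCI(u - samp, cl=0.95, gamma=0.99)  # upper bound for E(X)\n",
    "\n",
    "print('sample mean:', np.mean(samp))\n",
    "print('95% lower confidence bound:', lower_bound)\n",
    "print('95% upper confidence bound (via u - X):', upper_bound)"
   ]
  },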
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Lower confidence bounds on the mean for sampling without replacement: Ville's inequality\n",
    "\n",
    "This is an alternative derivation of the Kaplan-Wald method, also from [Harold Kaplan's website](http://web.archive.org/web/20131209044835/http://printmacroj.com/martMean.htm), involving the fact that a suitably constructed sequence is a martingale.\n",
    "It relies on Ville's inequality for optionally stopped closed martingales.\n",
    "\n",
    "Suppose we are sampling without replacement from a finite population of $N$ \n",
    "non-negative items,  $\\{x_1, \\ldots, x_N\\}$, with $x_j \\ge 0$ $\\forall j$.\n",
    "The population mean is $\\mu = \\frac{1}{N} \\sum_{j=1}^N x_j \\ge 0$ and the population total\n",
    "is $N\\mu \\ge 0$.\n",
    "We draw $\\{X_1, \\ldots, X_n \\}$ sequentially without replacement.\n",
    "On the hypothesis $\\mu = t$, $\\mathbb{E}X_1 = t$, so $\\mathbb{E}(X_1/t) = 1$.\n",
    "Conditional on $X_1, \\ldots, X_n$, the total of the remaining $N-n$ items is\n",
    "$N\\mu - \\sum_{j=1}^n X_j$, so the mean of the remaining items is\n",
    "\n",
    "\\begin{equation*} \n",
    "    \\frac{Nt-\\sum_{j=1}^n X_j}{N-n} = \\frac{t - \\frac{1}{N} \\sum_{j=1}^n X_j}{1-n/N}.\n",
    "\\end{equation*}\n",
    "Thus, the expected value of $X_{n+1}$ given $X_1, \\ldots, X_n$ is \n",
    "$\\frac{t - \\frac{1}{N} \\sum_{j=1}^n X_j}{1-n/N}$.\n",
    "\n",
    "Define \n",
    "\n",
    "\\begin{equation*} \n",
    "Y_1(t) :=\n",
    "\\begin{cases}\n",
    "X_1/t,& Nt > 0, \\\\\n",
    "1,&   Nt = 0, \\\\\n",
    "\\end{cases}\n",
    "\\end{equation*}\n",
    "\n",
    "and for $1 \\le n \\le N-1$,\n",
    "\\begin{equation*} \n",
    "Y_{n+1}(t) \n",
    ":=\n",
    "\\begin{cases}\n",
    "X_{n+1} \n",
    "\\frac\n",
    "{1 - \\frac{n}{N}}\n",
    "{t - \\frac{1}{N} \\sum_{j=1}^n X_j},& \\sum_{j=1}^n X_j < Nt, \\\\\n",
    "1,& \\sum_{j=1}^n X_j \\ge Nt. \\\\\n",
    "\\end{cases}\n",
    "\\end{equation*} \n",
    "\n",
    "Then $\\mathbb{E}(Y_{n+1}(t) | Y_1, \\ldots Y_n) = 1$.\n",
    "Let $Z_{n}(t) := \\prod_{j=1}^n Y_j(t)$.\n",
    "Note that $Y_k(t)$ can be recovered from $\\{Z_j(t), j \\le k\\}$, \n",
    "since $Y_k(t) = Z_k(t)/Z_{k-1}(t)$.\n",
    "Now $\\mathbb{E}|Z_k| \\le \\max_j x_j < \\infty$ and\n",
    "\n",
    "\\begin{equation*} \n",
    "   \\mathbb{E}\\left ( Z_{n+1}(t) | Z_1(t), \\ldots Z_n(t) \\right ) = \n",
    "   \\mathbb{E} \\left (Y_{n+1}(t)Z_n(t) | Z_1(t), \\ldots Z_n(t) \\right ) = Z_n(t).\n",
    "\\end{equation*}\n",
    "\n",
    "Thus \n",
    "\n",
    "\\begin{equation*} \n",
    "    \\left ( Z_1(t), Z_2(t), \\;\\ldots , Z_N(t) \\right )\n",
    "\\end{equation*}\n",
    "\n",
    "is a non-negative closed martingale.\n",
    "\n",
    "By Ville's inequality (sometimes called Kolmogorov's inequality), an application of Markov's inequality to \n",
    "nonnegative martingales\n",
    "(Feller V2, p.242), for any $p > 0$ and any $J \\in \\{1, \\ldots, N \\}$,\n",
    "\n",
    "\\begin{equation*} \n",
    "     \\mathbb{P} \\left ( \\max_{1 \\le j \\le J} Z_j(t) > 1/p \\right ) \\le p \\; \\mathbb{E}|Z_J|.\n",
    "\\end{equation*}\n",
    "\n",
    "Since $(Z_j)$ is a non-negative martingale, $\\mathbb{E}|Z_J| = \\mathbb{E}Z_J = \\mathbb{E}Z_1 = 1$.\n",
    "\n",
    "Thus a $p$-value for the hypothesis $\\mu = t$ based on data $X_1, \\ldots X_J$ is \n",
    "$\\left (\\max_{1 \\le j \\le J} Z_j(t) \\right )^{-1} \\wedge 1$.\n",
    "If $X_j = 0$ for some $j$, then $Z_k = 0$ for all $k \\ge j$.\n",
    "\n",
    "To avoid that problem, we can shift everything to the right: pick $\\gamma > 0$,\n",
    "find a lower confidence bound for $\\delta = \\mu+\\gamma > \\mu > 0$ from data $\\{X_j+\\gamma\\}$, then subtract $\\gamma$ from the lower \n",
    "confidence bound to get a lower confidence bound for $\\mu$.\n",
    "There are tradeoffs involved in picking $\\gamma$: if many $X_j$ turn out to be \n",
    "small, especially for small $j$, it helps to have $\\gamma$ large, and vice versa.\n",
    "\n",
    "Unpacking the math yields the $p$ value\n",
    "\n",
    "\\begin{equation*} \n",
    "p_{\\mathrm{KK}} := 1 \\wedge \\left ( \\max_{1 \\le j \\le J} \\prod_{k=1}^j (X_{k}+\\gamma) \\frac{1-(k-1)/N}{t - \\frac{1}{N} \\sum_{\\ell=1}^{k-1} (X_\\ell+\\gamma)} \\right )^{-1}\n",
    "\\end{equation*}\n",
    "for the hypothesis that $\\mu \\le t - \\gamma$.\n",
    "\n",
    "The corresponding $1-\\alpha$ lower confidence bound is\n",
    "\n",
    "\\begin{equation*} \n",
    "   \\sup \\left \\{t \\ge 0: \\max_{1 \\le j \\le J}  \\prod_{k=1}^j (X_{k}+\\gamma) \\frac{1-(k-1)/N}{t - \\frac{1}{N} \\sum_{\\ell=1}^{k-1} (X_\\ell+\\gamma)} \\le 1/\\alpha \\right \\} - \\gamma.\n",
    "\\end{equation*}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Application: marbles in a jar\n",
    "\n",
    "A certain jar contains $N$ (even) marbles, $G$ green and $N-G$ non-green.\n",
    "We sample marbles sequentially without replacement. We wish to test the hypothesis\n",
    "$G \\le N/2$ against the alternative.\n",
    "For instance, $G$ might represent the number of votes for the reported winner of an election\n",
    "in which $N$ votes were cast.\n",
    "If the social choice function requires a majority, or if there are only two candidates, then\n",
    "green wins if $G > N/2$ and the outcome is a tie or a loss if $G \\le N/2$.\n",
    "If we can reject the hypothesis that $G \\le N/2$, we can conclude that the outcome is correct.\n",
    "\n",
    "Use the previous method, taking $t = 1/2$ and shift $\\delta > 0$.\n",
    "A pleasing choice for symmetry would be $\\delta = 1/2$, but the operating characteristics\n",
    "depend on the true value of $G$."
   ]
  },
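  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The next cell is a minimal, self-contained sketch of this calculation for a hypothetical jar: it draws a sample without replacement, shifts the 0/1 data, and tracks the running product $Z_j(t)$ for the shifted hypothesis. The jar composition, sample size, and shift are arbitrary assumptions. General-purpose implementations for arbitrary nonnegative populations appear in the next sections."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Minimal sketch: marbles in a jar, testing G <= N/2 with a shifted running product.\n",
    "# The jar composition, sample size, and shift are hypothetical choices.\n",
    "np.random.seed(31415)\n",
    "N_jar = 10000                      # number of marbles in the jar\n",
    "G_true = 8000                      # true number of green marbles (unknown in practice)\n",
    "shift = 0.5                        # the shift, chosen before looking at the data\n",
    "t_shifted = 0.5 + shift            # hypothesized mean of the shifted population\n",
    "\n",
    "jar = np.array([1]*G_true + [0]*(N_jar - G_true))\n",
    "sample = np.random.permutation(jar)[:200] + shift  # 200 draws without replacement, shifted\n",
    "\n",
    "mart = sample[0]/t_shifted         # Z_1\n",
    "mart_max = mart\n",
    "running_total = sample[0]          # S_j, the total of the first j shifted draws\n",
    "for j in range(1, len(sample)):\n",
    "    denom = t_shifted - running_total/N_jar       # t - S_j/N\n",
    "    if denom <= 0:                                # sample total exceeds N*t: conclusive evidence\n",
    "        mart_max = np.inf\n",
    "        break\n",
    "    mart *= sample[j]*(1 - j/N_jar)/denom         # Z_{j+1} = Z_j * Y_{j+1}\n",
    "    running_total += sample[j]\n",
    "    mart_max = max(mart_max, mart)\n",
    "\n",
    "print('P-value for the hypothesis G <= N/2:', min(1, 1/mart_max))"
   ]
  },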
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Another martingale test for the mean for sampling without replacement\n",
    "\n",
    "Let's define a finite-sample version of the Kaplan-Wald approach, to which we can apply Ville's Inequality.\n",
    "\n",
    "Let $S_j := \\sum_{k=1}^j X_k$, $\\tilde{S}_j := S_j/N$, and $\\tilde{j} := 1 - (j-1)/N$.\n",
    "Define\n",
    "\n",
    "\\begin{equation*} \n",
    "   Y_n := \\int_0^1 \\prod_{j=1}^n \\left ( \\gamma \\left [ X_j \\frac{\\tilde{j}}{t - \\tilde{S}_{j-1}} -1 \\right ] + 1 \\right ) d\\gamma.\n",
    "\\end{equation*}\n",
    "\n",
    "This is a polynomial in $\\gamma$ of degree at most $n$, with constant term $1$.\n",
    "Each $X_j$ appears linearly.\n",
    "Under the null hypothesis that the population total is $Nt$, $\\mathbb{E} X_1 = t$,\n",
    "and \n",
    "\n",
    "\\begin{equation*} \n",
    "  \\mathbb{E} \\left ( X_j \\mid X_1, \\ldots, X_{j-1} \\right ) = \n",
    "  \\frac{Nt - S_{j-1}}{N-j+1} = \\frac{t - \\tilde{S}_{j-1}}{\\tilde{j}}.\n",
    "\\end{equation*}\n",
    "\n",
    "Now \n",
    "\n",
    "\\begin{equation*} \n",
    "   Y_1 = \\int_0^1 \\left ( \\gamma [ X_1/t - 1] + 1 \\right ) d\\gamma = \n",
    "   \\left [ (\\gamma^2/2) [X_1/t - 1] + \\gamma \\right ]_{\\gamma=0}^1 = [X_1/t - 1]/2 + 1 = \\frac{X_1}{2t} + 1/2.\n",
    "\\end{equation*}\n",
    "\n",
    "Thus, under the null, \n",
    "\\begin{equation*} \n",
    "   \\mathbb{E}Y_1 = \\frac{\\mathbb{E}X_1}{2t} + 1/2 = 1.\n",
    "\\end{equation*}\n",
    "\n",
    "Also,\n",
    "\n",
    "\\begin{equation*} \n",
    "   \\mathbb{E}(Y_n | X_1, \\ldots, X_{n-1}) =\n",
    "   \\mathbb{E} \\left . \\left [ \\int_0^1 \\prod_{j=1}^n \\left (\\gamma \\left [ X_j \\frac{\\tilde{j}}{t - \\tilde{S}_{j-1}} -1 \\right ] + 1 \\right ) d\\gamma \\right | X_1, \\ldots, X_{n-1} \\right ]\n",
    "\\end{equation*}\n",
    "\\begin{equation*} \n",
    "  = \\int_0^1  \\left (\\gamma \\left [ \\mathbb{E}(X_n | X_1, \\ldots, X_{n-1}) \\frac{\\tilde{n}}{t - \\tilde{S}_{n-1}} -1 \\right ] + 1 \\right ) \\prod_{j=1}^{n-1} \\left ( \\gamma \\left [ X_j \\frac{\\tilde{j}}{t - \\tilde{S}_{j-1}} -1 \\right ] + 1 \\right ) d\\gamma \n",
    "\\end{equation*}\n",
    "\\begin{equation*} \n",
    "  = \\int_0^1  \\left (\\gamma \\left [ \\frac{t - \\tilde{S}_{n-1}}{\\tilde{n}} \\frac{\\tilde{n}}{t - \\tilde{S}_{n-1}} -1 \\right ] + 1 \\right ) \\prod_{j=1}^{n-1} \\left ( \\gamma \\left [ X_j \\frac{\\tilde{j}}{t - \\tilde{S}_{j-1}} -1 \\right ] + 1 \\right ) d\\gamma \n",
    "\\end{equation*}\n",
    "\\begin{equation*} \n",
    "  = \\int_0^1  \\prod_{j=1}^{n-1} \\left ( \\gamma \\left [ X_j \\frac{\\tilde{j}}{t - \\tilde{S}_{j-1}} -1 \\right ] + 1 \\right ) d\\gamma = Y_{n-1}.\n",
    "\\end{equation*}\n",
    "\n",
    "Thus, under the null hypothesis, $(Y_j )_{j=1}^N$ is a nonnegative closed martingale with expected value 1, and Ville's inequality implies that for any $J \\in \\{1, \\ldots, N\\}$,\n",
    "\n",
    "\\begin{equation*} \n",
    "   \\mathbb{P} \\left ( \\max_{1 \\le j \\le J} Y_j(t) > 1/p \\right ) \\le p.\n",
    "\\end{equation*}\n",
    "\n",
    "Set \n",
    "\n",
    "\\begin{equation*} \n",
    "    c_j := X_j \\frac{\\tilde{j}}{t - \\tilde{S}_{j-1}} -1,\n",
    "\\end{equation*}\n",
    "\n",
    "and re-write the polynomial\n",
    "\n",
    "\\begin{equation*} \n",
    " \\prod_{j=1}^n \\left (\\gamma \\left [ X_j \\frac{\\tilde{j}}{t - \\tilde{S}_{j-1}} -1 \\right ] + 1 \\right ) = \\prod_{j: c_j \\ne 0} c_j(\\gamma + c_j^{-1}).\n",
    "\\end{equation*}\n",
    "This expresses the polynomial in terms of its roots, facilitating computations in Python.\n",
    "Using this notation,\n",
    "\n",
    "\\begin{equation*} \n",
    "   Y_j = \\int_0^1 \\prod_{j: c_j \\ne 0} c_j(\\gamma + c_j^{-1}) d\\gamma =\n",
    "   \\left ( \\prod_{j: c_j \\ne 0} c_j \\right ) \\int_0^1 \\prod_{j: c_j \\ne 0} (\\gamma + c_j^{-1}) d\\gamma.\n",
    "\\end{equation*}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "def KK_p(x, N, t, random_order = True):\n",
    "    '''\n",
    "    p-value for the hypothesis that the mean of a nonnegative population with N\n",
    "    elements is t. The alternative is that the mean is greater than t.\n",
    "    If the random sample x is in the order in which the sample was drawn, it is\n",
    "    legitimate to set random_order = True. \n",
    "    If not, set random_order = False. \n",
    "    '''\n",
    "    x = np.array(x)\n",
    "    assert all(x >=0),  'Negative value in a nonnegative population!'\n",
    "    assert len(x) <= N, 'Sample size is larger than the population!'\n",
    "    assert N > 0,       'Population size not positive!'\n",
    "    assert N == int(N), 'Non-integer population size!'\n",
    "    sample_total = 0.0\n",
    "    mart = x[0]/t if t > 0 else 1\n",
    "    mart_max = mart\n",
    "    for j in range(1, len(x)):\n",
    "        mart *= x[j]*(1-j/N)/(t - (1/N)*sample_total)\n",
    "        if mart < 0:\n",
    "            mart = np.inf\n",
    "            break\n",
    "        else:\n",
    "            sample_total += x[j]\n",
    "        mart_max = max(mart, mart_max)\n",
    "    p = min((1/mart_max if random_order else 1/mart),1)\n",
    "    return p   \n",
    "\n",
    "def HK_ps_p(x, N, t, random_order = True):\n",
    "    '''\n",
    "    Harold Kaplan p-value for the hypothesis that the mean of a nonnegative \n",
    "    population with N elements is t. \n",
    "    The alternative is that the mean is larger than t.\n",
    "    If the random sample x is in the order in which the sample was drawn, it is\n",
    "    legitimate to set random_order = True. \n",
    "    If not, set random_order = False. \n",
    "    '''\n",
    "    x = np.array(x)\n",
    "    assert all(x >=0),  'Negative value in a nonnegative population!'\n",
    "    assert len(x) <= N, 'Sample size is larger than the population!'\n",
    "    assert N > 0,       'Population size not positive!'\n",
    "    assert N == int(N), 'Non-integer population size!'\n",
    "    Stilde = (np.insert(np.cumsum(x),0,0)/N)[0:len(x)] # \\tilde{S}_{j-1}\n",
    "    t_minus_Stilde = t - Stilde\n",
    "    mart_max = 1\n",
    "    if any(t_minus_Stilde < 0): # sample total exceeds hypothesized population total\n",
    "        mart_max = np.inf\n",
    "    else:  # TO FIX: need to deal with zeros in t_minus_stilde\n",
    "        jtilde = 1 - np.array(list(range(len(x))))/N\n",
    "        # c_j = X_j*\\tilde{j}/(t-\\tilde{S}_{j-1}) - 1\n",
    "        c = np.multiply(x, np.divide(jtilde, t_minus_Stilde))-1  # need to test for zeros!\n",
    "        for j in range(len(x)):\n",
    "            r = -np.array([1/cc for cc in c[0:j+1] if cc != 0]) # roots\n",
    "            if r.size > 0:\n",
    "                Y_norm = np.prod(np.array([cc for cc in c[0:j+1] if cc != 0])) # multiplicative constant\n",
    "                poly = P.polyfromroots(r)\n",
    "                poly_int = P.polyint(poly)\n",
    "                poly_int_values = P.polyval([1], poly_int)\n",
    "                mart = Y_norm*poly_int_values[0]\n",
    "                mart_max = max(mart_max, mart) if random_order else mart\n",
    "            else:\n",
    "                mart_max = max(mart_max, 1) if random_order else 1\n",
    "    p = min(1/mart_max,1)\n",
    "    return p   "
   ]
  },
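  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick sanity check of KK_p, the next cell applies it to two small synthetic samples: one whose sample mean equals the hypothesized mean $t$ and one whose sample mean is well above $t$. The first should give a large $p$-value and the second a very small one. The sample values, $N$, and $t$ are arbitrary choices for illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sanity check of KK_p on synthetic samples (values chosen only for illustration)\n",
    "N_demo = 1000\n",
    "t_demo = 1.0\n",
    "consistent = np.full(50, 1.0)      # sample mean equal to the hypothesized mean\n",
    "inconsistent = np.full(50, 3.0)    # sample mean three times the hypothesized mean\n",
    "print('P-value when the sample mean equals t:', KK_p(consistent, N_demo, t_demo))\n",
    "print('P-value when the sample mean is 3t:   ', KK_p(inconsistent, N_demo, t_demo))"
   ]
  },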
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "## Recursive integral from 0 to 1 of a polynomial in terms of its roots\n",
    "\n",
    "def integral_from_roots_1(c, maximal = True):\n",
    "    '''\n",
    "    Integrate the polynomial \\prod_{k=1}^n (x-c_j) from 0 to 1, i.e.,\n",
    "       \\int_0^1 \\prod_{k=1}^n (x-c_j) dx\n",
    "    using a recursive algorithm.\n",
    "    \n",
    "    If maximal == True, finds the maximum of the integrals over lower degrees:\n",
    "       \\max_{1 \\le k \\le n} \\int_0^1 \\prod_{j=1}^k (x-c_j) dx\n",
    "    \n",
    "    Input\n",
    "    ------\n",
    "    c : array of roots\n",
    "    \n",
    "    Returns\n",
    "    ------\n",
    "    the integral or maximum integral\n",
    "    '''\n",
    "    n = len(c)\n",
    "    b = np.zeros((n+1,n+1))\n",
    "    b[0,0]=1\n",
    "    for k in np.arange(1,n+1):\n",
    "        b[k,0] = -c[k-1]*b[k-1,0]\n",
    "        for j in np.arange(1,n+1):\n",
    "            b[k,j] = (j/(j+1))*b[k-1,j-1] - c[k-1]*b[k-1,j]\n",
    "    if maximal:\n",
    "        integrals = np.zeros(n+1)\n",
    "        for k in np.arange(n+1):\n",
    "            integrals[k] = np.sum(b[k,:])\n",
    "        integral = np.max(integrals[1:])\n",
    "    else:\n",
    "        integral = np.sum(b[n,:])\n",
    "    return integral   \n",
    "\n",
    "def integral_from_roots_2(c, maximal = True):\n",
    "    '''\n",
    "    Integrate the polynomial \\prod_{k=1}^n (x-c_j) from 0 to 1, i.e.,\n",
    "       \\int_0^1 \\prod_{k=1}^n (x-c_j) dx\n",
    "    using a recursive algorithm.\n",
    "    \n",
    "    If maximal == True, finds the maximum of the integrals over lower degrees:\n",
    "       \\max_{1 \\le k \\le n} \\int_0^1 \\prod_{j=1}^k (x-c_j) dx\n",
    "    \n",
    "    Input\n",
    "    ------\n",
    "    c : array of roots\n",
    "    \n",
    "    Returns\n",
    "    ------\n",
    "    the integral or maximum integral\n",
    "    '''\n",
    "    n = len(c)\n",
    "    a = np.zeros((n+1,n+1))\n",
    "    a[0,0]=1\n",
    "    for k in np.arange(n):\n",
    "        for j in np.arange(n+1):\n",
    "            a[k+1,j] = -c[k]*((k+1-j)/(k+1))*a[k,j]\n",
    "            a[k+1,j] += 0 if j==0 else (1-c[k])*(j/(k+1))*a[k,j-1]\n",
    "    if maximal:\n",
    "        integrals = np.zeros(n+1)\n",
    "        for k in np.arange(1,n+1):\n",
    "            integrals[k] = np.sum(a[k,:])/(k+1)\n",
    "        integral = np.max(integrals[1:])\n",
    "    else:\n",
    "        integral = np.sum(a[n,:])/(n+1)\n",
    "    return integral   "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.000999000999000999\n",
      "-8.349948460690676e+281\n"
     ]
    }
   ],
   "source": [
    "c = np.repeat(1, 1000)\n",
    "print(integral_from_roots_2(c, maximal = False))\n",
    "print(integral_from_roots_1(c, maximal = False))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Examples for testing the recursive algorithm\n",
    "\n",
    "\\begin{equation*} \n",
    "   \\int_0^1 (x-1) dx = -1/2.\n",
    "\\end{equation*}\n",
    "\n",
    "\\begin{equation*} \n",
    "   \\int_0^1 (x-1)^2 dx = 1/3.\n",
    "\\end{equation*}\n",
    "\n",
    "\\begin{equation*} \n",
    "   \\int_0^1 (x-1)^3 dx = -1/4.\n",
    "\\end{equation*}\n",
    "\n",
    "\\begin{equation*} \n",
    "   \\int_0^1 (x-1)^3(x+1) dx = -3/10.\n",
    "\\end{equation*}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "-0.5\n",
      "0.3333333333333333\n",
      "-0.25\n",
      "-0.3\n",
      "0.3333333333333333\n"
     ]
    }
   ],
   "source": [
    "for i in range(1,4,1):\n",
    "    c = np.repeat(1, i)   # root 1 occurs with multiplicity i\n",
    "    print(integral_from_roots_2(c, maximal = False))\n",
    "\n",
    "c4 = [1, 1, 1, -1]                 \n",
    "print(integral_from_roots_2(c4, maximal = False))\n",
    "print(integral_from_roots_2(c4, maximal = True))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "def HK_ps_se_p(x, N, t, random_order = True):\n",
    "    '''\n",
    "    Harold Kaplan p-value for the hypothesis that the mean of a nonnegative \n",
    "    population with N elements is t. \n",
    "    The alternative is that the mean is larger than t.\n",
    "    If the random sample x is in the order in which the sample was drawn, it is\n",
    "    legitimate to set random_order = True. \n",
    "    If not, set random_order = False. \n",
    "    '''\n",
    "    x = np.array(x)\n",
    "    assert all(x >=0),  'Negative value in a nonnegative population!'\n",
    "    assert len(x) <= N, 'Sample size is larger than the population!'\n",
    "    assert N > 0,       'Population size not positive!'\n",
    "    assert N == int(N), 'Non-integer population size!'\n",
    "    Stilde = (np.insert(np.cumsum(x),0,0)/N)[0:len(x)] # \\tilde{S}_{j-1}\n",
    "    t_minus_Stilde = t - Stilde\n",
    "    mart_max = 1\n",
    "    if any(t_minus_Stilde < 0): # sample total exceeds hypothesized population total\n",
    "        mart_max = np.inf\n",
    "    elif np.mean(x) <= t: # sample mean does not exceed hypothesized population mean\n",
    "        mart_max = 1\n",
    "    else:  # TO FIX: need to deal with zeros in t_minus_stilde\n",
    "        jtilde = 1 - np.array(list(range(len(x))))/N\n",
    "        # c_j = X_j*\\tilde{j}/(t-\\tilde{S}_{j-1}) - 1\n",
    "        c = np.multiply(x, np.divide(jtilde, t_minus_Stilde))-1  # need to test for zeros!\n",
    "        for j in range(len(x)):\n",
    "            r = -np.array([1/cc for cc in c[0:j+1] if cc != 0]) # roots\n",
    "            if r.size > 0:\n",
    "                Y_norm = np.prod(np.array([cc for cc in c[0:j+1] if cc != 0])) # multiplicative constant\n",
    "                mart = Y_norm*integral_from_roots_2(r, maximal = False)\n",
    "                mart_max = max(mart_max, mart) if random_order else mart\n",
    "            else:\n",
    "                mart_max = max(mart_max, 1) if random_order else 1\n",
    "    p = min(1/mart_max,1)\n",
    "    return p   "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "## Harold Kaplan's javascript routine asStated() translated into python\n",
    "\n",
    "def HK_p(x, N, t, random_order = True):\n",
    "    x = np.array(x)\n",
    "    assert all(x >=0),  'Negative value in a nonnegative population!'\n",
    "    assert len(x) <= N, 'Sample size is larger than the population!'\n",
    "    assert N > 0,       'Population size not positive!'\n",
    "    assert N == int(N), 'Non-integer population size!'\n",
    "    big = 0.\n",
    "    denominator = 1\n",
    "    sumOfPreviousXValues = 0.\n",
    "    n = len(x)\n",
    "    w = np.zeros(n+1)\n",
    "    w[0] = 1\n",
    "    for j in range(1,n+1):\n",
    "        reduced_t = t - sumOfPreviousXValues/N\n",
    "        if reduced_t < 0:\n",
    "            return 0\n",
    "        expectedX = reduced_t/( 1-(j-1)/N )\n",
    "        if expectedX > 0:\n",
    "            quotient = x[j-1]/expectedX\n",
    "            for k in range(j,0,-1):\n",
    "                w[k] = quotient*w[k-1]*k/j + w[k]*(j-k)/j\n",
    "            denominator += 1\n",
    "        tot = np.sum(w[0:(j+1)])\n",
    "        candidate = tot/denominator\n",
    "        big = max(big, candidate) if random_order else candidate\n",
    "        sumOfPreviousXValues += x[j-1]\n",
    "        reduced_t = t - sumOfPreviousXValues/N\n",
    "        if reduced_t < 0:\n",
    "            return 0\n",
    "    return min(1/big,1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.00013854893672071193 0.00013854893672071196\n"
     ]
    }
   ],
   "source": [
    "# Compare the python implementation with the translation of Kaplan's javascript into python\n",
    "x = np.array(range(10)) \n",
    "t = 1\n",
    "N = 1000\n",
    "print(HK_p(x, N, t), HK_ps_se_p(x, N, t))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "ename": "KeyboardInterrupt",
     "evalue": "",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mKeyboardInterrupt\u001b[0m                         Traceback (most recent call last)",
      "\u001b[0;32m<ipython-input-11-377c2acb08b6>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[1;32m      4\u001b[0m \u001b[0mx\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0marray\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0;36m100\u001b[0m \u001b[0;34m+\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;36m500\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0;36m400\u001b[0m \u001b[0;34m+\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;36m100\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0;36m150\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      5\u001b[0m \u001b[0mt\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;36m300\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 6\u001b[0;31m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'Kaplan:'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mHK_p\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mx\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mN\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mt\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m' term-by-term:'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mHK_ps_se_p\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mx\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mN\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mt\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
      "\u001b[0;32m<ipython-input-8-31daa2d31738>\u001b[0m in \u001b[0;36mHK_ps_se_p\u001b[0;34m(x, N, t, random_order)\u001b[0m\n\u001b[1;32m     26\u001b[0m             \u001b[0;32mif\u001b[0m \u001b[0mr\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msize\u001b[0m \u001b[0;34m>\u001b[0m \u001b[0;36m0\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     27\u001b[0m                 \u001b[0mY_norm\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mprod\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0marray\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mcc\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mcc\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0mj\u001b[0m\u001b[0;34m+\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mcc\u001b[0m \u001b[0;34m!=\u001b[0m \u001b[0;36m0\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;31m# multiplicative constant\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 28\u001b[0;31m                 \u001b[0mmart\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mY_norm\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0mintegral_from_roots_2\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mr\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmaximal\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mFalse\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m     29\u001b[0m                 \u001b[0mmart_max\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mmax\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmart_max\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmart\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mrandom_order\u001b[0m \u001b[0;32melse\u001b[0m \u001b[0mmart\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     30\u001b[0m             \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;32m<ipython-input-5-7b0a799b4932>\u001b[0m in \u001b[0;36mintegral_from_roots_2\u001b[0;34m(c, maximal)\u001b[0m\n\u001b[1;32m     56\u001b[0m     \u001b[0;32mfor\u001b[0m \u001b[0mk\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0marange\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mn\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     57\u001b[0m         \u001b[0;32mfor\u001b[0m \u001b[0mj\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0marange\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mn\u001b[0m\u001b[0;34m+\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 58\u001b[0;31m             \u001b[0ma\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mk\u001b[0m\u001b[0;34m+\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0mj\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m-\u001b[0m\u001b[0mc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mk\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mk\u001b[0m\u001b[0;34m+\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m-\u001b[0m\u001b[0mj\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m/\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mk\u001b[0m\u001b[0;34m+\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0ma\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mk\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0mj\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m     59\u001b[0m             \u001b[0ma\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mk\u001b[0m\u001b[0;34m+\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0mj\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m+=\u001b[0m \u001b[0;36m0\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mj\u001b[0m\u001b[0;34m==\u001b[0m\u001b[0;36m0\u001b[0m \u001b[0;32melse\u001b[0m \u001b[0;34m(\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m-\u001b[0m\u001b[0mc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mk\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mj\u001b[0m\u001b[0;34m/\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mk\u001b[0m\u001b[0;34m+\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0ma\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mk\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0mj\u001b[0m\u001b[0;34m-\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     60\u001b[0m     \u001b[0;32mif\u001b[0m \u001b[0mmaximal\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;31mKeyboardInterrupt\u001b[0m: "
     ]
    }
   ],
   "source": [
    "# term-by-term integration of the polynomial is unstable when the sample size is large:\n",
    "\n",
    "N =200000\n",
    "x = np.array([0]*100 + [500]*400 + [100]*150)\n",
    "t = 300\n",
    "print('Kaplan:', HK_p(x, N, t), ' term-by-term:', HK_ps_se_p(x, N, t))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "def CI_from_test(x, N, cl=0.95, d = 0, random_order = True,\n",
    "                         p_value = HK_p, **kwargs):\n",
    "    '''\n",
    "    Lower confidence bound for the mean of a finite, nonnegative population.\n",
    "    \n",
    "    x: data, a random sample without replacement\n",
    "    N: population size\n",
    "    cl: desired confidence level\n",
    "    d: a shift. If d!=0, finds a confidence interval for the mean of x+d, then subtracts d.\n",
    "       d must be selected before looking at the data\n",
    "    random_order: if True, the data x must be in the (random) order in which the sample was\n",
    "        drawn. If x is not known to be in random order, set random_order = False\n",
    "    '''\n",
    "    \n",
    "    assert cl > 0.5,    'Confidence level must be at least 50%.'\n",
    "    assert all(x >=0),  'Negative value in a nonnegative population!'\n",
    "    assert len(x) <= N, 'Sample size is larger than the population!'\n",
    "    assert N > 0,       'Population size not positive!'\n",
    "    assert N == int(N), 'Non-integer population size!'\n",
    "    \n",
    "    x = np.array(x)\n",
    "    \n",
    "    # a lower confidence bound at cl > 0.5 should not be larger than the sample mean\n",
    "    if random_order:\n",
    "        u = np.amax(np.array([np.mean(x[0:j+1]) for j in range(len(x))]))\n",
    "    else:\n",
    "        u = np.mean(x) \n",
    "    f = lambda t: p_value(x+d, N, t, random_order = random_order) - (1-cl)\n",
    "    lcl = 0.0\n",
    "    if f(0) < 0.0:\n",
    "        lcl = brentq(f, 0, u+d, *kwargs) - d\n",
    "    return max(lcl,0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.9199431415951652 0.6017412738014762\n"
     ]
    }
   ],
   "source": [
    "cl = 1-.05\n",
    "\n",
    "x = [0]*30 + [2]*50\n",
    "x = np.array(x)\n",
    "reps = int(10)\n",
    "\n",
    "sims_kk = np.zeros(reps)\n",
    "sims = np.zeros(reps)\n",
    "for i in range(reps):\n",
    "    xp = np.random.permutation(x)\n",
    "    sims[i] = CI_from_test(x, N, d=0, cl=cl, random_order = True, \n",
    "                                  p_value = HK_p)\n",
    "    sims_kk[i] = CI_from_test(x, N, d=10, cl=cl, random_order = True, \n",
    "                                  p_value = HK_ps_se_p)\n",
    "print(np.mean(sims), np.mean(sims_kk))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.15235675574875873\n",
      "0.15235675574875873\n"
     ]
    }
   ],
   "source": [
    "N = 36666\n",
    "x = [0]*30 + [2]*53\n",
    "x = np.array(x)\n",
    "t = 0\n",
    "print(HK_p(x, N, t+1, random_order=False))\n",
    "print(HK_ps_se_p(x, N, t+1, random_order=False))\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "n = 10\n",
    "reps = 10**4\n",
    "p = np.zeros(reps)\n",
    "for i in range(int(reps)):\n",
    "    p[i] = HK_ps_se_p(2*np.random.random(n), N, 1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pct = np.percentile(p, range(100))\n",
    "plt.plot(pct, range(100))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Nonstandard mixture: a pointmass at 1 and a uniform[0,1]\n",
    "ns = np.array([25, 50, 100, 400])  # sample sizes\n",
    "ps = np.array([0.9, 0.99, 0.999])    # mixture fraction, weight of pointmass\n",
    "alpha = 0.05  # 1- (confidence level)\n",
    "reps = int(10**2)   # just for demonstration\n",
    "d = 1\n",
    "xtol = 1.e-12\n",
    "N = 20000 # hypothetical population size\n",
    "\n",
    "simTable = pd.DataFrame(columns=('mass at 1', 'sample size', 'Kaplan-Ville ES cov',\\\n",
    "                                 'Kaplan-Ville cov', 'KK ES low', 'KK low')\n",
    "                       )\n",
    "\n",
    "for p in ps:\n",
    "    popMean = (1-p)*0.5 + p  # population is a mixture of uniform with mass (1-p) and\n",
    "                             # pointmass at 1 with mass p\n",
    "    \n",
    "    for n in ns:\n",
    "        tCrit = sp.stats.t.ppf(q=1-alpha, df=n-1)\n",
    "        coverT = 0\n",
    "        coverK = 0\n",
    "        lowT = 0.0\n",
    "        lowK = 0.0\n",
    "        \n",
    "        for rep in range(int(reps)):\n",
    "            sam = np.random.uniform(size=n)  # the uniform part of the sample\n",
    "            ptMass = np.random.uniform(size=n)  # for a bunch of p-coin tosses\n",
    "            sam[ptMass < p] = 1.0   # mix in pointmass at 1, with probability p\n",
    "            # Kaplan's mixture\n",
    "            tLo = CI_from_test(sam, N, d=0, cl=cl, random_order = True, \n",
    "                                  p_value = HK_p)\n",
    "            coverT += ( tLo <= popMean )\n",
    "            lowT += tLo\n",
    "            #  Kaplan-Wald interval\n",
    "            kLo = CI_from_test(sam, N, d=d, cl=cl, random_order = True, \n",
    "                                  p_value = HK_ps_se_p) # lower endpoint of the Kaplan-Wald interval\n",
    "            coverK += (kLo <= popMean )\n",
    "            lowK += kLo\n",
    "\n",
    "        simTable.loc[len(simTable)] =  p, n,\\\n",
    "            str(100*float(coverT)/float(reps)) + '%',\\\n",
    "            str(100*float(coverK)/float(reps)) + '%',\\\n",
    "            str(round(lowT/float(reps),4)),\\\n",
    "            str(round(lowK/float(reps),4))\n",
    "#\n",
    "ansStr =  '<h3>Simulated coverage probability and expected lower endpoint of ' +\\\n",
    "          'one-sided Kaplan-Ville and Kaplan-Wald confidence intervals for ' +\\\n",
    "          'mixture of U[0,1] and pointmass at 1 population of' + str(N) + 'items</h3>' +\\\n",
    "          '<strong>Nominal coverage probability ' + str(100*(1-alpha)) +\\\n",
    "          '%</strong>. d=' + str(d) +\\\n",
    "          '.<br /><strong>Estimated from ' + str(int(reps)) +\\\n",
    "          ' replications.</strong>'\n",
    "\n",
    "display(HTML(ansStr))\n",
    "display(simTable)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Nonstandard mixture: a pointmass at 0 and a uniform[0,1]\n",
    "ns = np.array([25, 50, 100, 400])  # sample sizes\n",
    "ps = np.array([0.9, 0.99, 0.999])    # mixture fraction, weight of pointmass\n",
    "alpha = 0.05  # 1- (confidence level)\n",
    "reps = int(10**2)   # just for demonstration\n",
    "d = 1\n",
    "xtol = 1.e-12\n",
    "N = 20000 # hypothetical population size\n",
    "\n",
    "simTable = pd.DataFrame(columns=('mass at 0', 'sample size', 'HK cov',\\\n",
    "                                 'HK PS SE cov', 'HK low', 'HK PS SE low')\n",
    "                       )\n",
    "\n",
    "for p in ps:\n",
    "    popMean = (1-p)*0.5 + p*0  # population is a mixture of uniform with mass (1-p) and\n",
    "                               # pointmass at 0 with mass p\n",
    "    \n",
    "    for n in ns:\n",
    "        tCrit = sp.stats.t.ppf(q=1-alpha, df=n-1)\n",
    "        coverT = 0\n",
    "        coverK = 0\n",
    "        lowT = 0.0\n",
    "        lowK = 0.0\n",
    "        \n",
    "        for rep in range(int(reps)):\n",
    "            sam = np.random.uniform(size=n)  # the uniform part of the sample\n",
    "            ptMass = np.random.uniform(size=n)  # for a bunch of p-coin tosses\n",
    "            sam[ptMass < p] = 0.0   # mix in pointmass at 0, with probability p\n",
    "            # Kaplan's mixture\n",
    "            tLo = CI_from_test(sam, N, d=0, cl=cl, random_order = True, \n",
    "                                  p_value = HK_p)\n",
    "            coverT += ( tLo <= popMean )\n",
    "            lowT += tLo\n",
    "            #  Kaplan-Wald interval\n",
    "            kLo = CI_from_test(sam, N, d=d, cl=cl, random_order = True, \n",
    "                                  p_value = HK_ps_se_p) # lower endpoint of the Kaplan-Wald interval\n",
    "            coverK += (kLo <= popMean )\n",
    "            lowK += kLo\n",
    "\n",
    "        simTable.loc[len(simTable)] =  p, n,\\\n",
    "            str(100*float(coverT)/float(reps)) + '%',\\\n",
    "            str(100*float(coverK)/float(reps)) + '%',\\\n",
    "            str(round(lowT/float(reps),4)),\\\n",
    "            str(round(lowK/float(reps),4))\n",
    "#\n",
    "ansStr =  '<h3>Simulated coverage probability and expected lower endpoint of ' +\\\n",
    "          'one-sided Kaplan-Ville and Kaplan-Wald confidence intervals for ' +\\\n",
    "          'mixture of U[0,1] and pointmass at 0 population of' + str(N) + 'items</h3>' +\\\n",
    "          '<strong>Nominal coverage probability ' + str(100*(1-alpha)) +\\\n",
    "          '%</strong>. d=' + str(d) +\\\n",
    "          '.<br /><strong>Estimated from ' + str(int(reps)) +\\\n",
    "          ' replications.</strong>'\n",
    "\n",
    "display(HTML(ansStr))\n",
    "display(simTable)"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}