{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Finite-population SPRT for Population Percentage" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example of the SPRT without replacement\n", "\n", "This notebook develops a sequential probability ratio test for the fraction of items labeled \"1\" in a population of $N$ items of which $Np$ are labeled $1$ and $N(1-p)$ are labeled \"0.\"\n", "\n", "This is a special case of the result derived in the notebook [Wald's Sequential Probability Ratio Test](sprt.ipynb)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is a population of $N$ items. \n", "Item $j$ has \"value\" $a_j \\in \\{0, 1\\}$. \n", "\n", "Define $p = \\frac{1}{N}\\sum_j a_j$ to be the population percentage.\n", "\n", "We want to test the hypothesis $H_0$ that $p = p_0$ against the\n", "alternative hypothesis $H_1$ that $p = p_1 $, for some\n", "fixed $p_1 > p_0$.\n", "\n", "We will draw items sequentially, without replacement, such that the chance that item $i$ is selected in draw $j$, assuming it has not been selected already, is $1/(N-j+1)$.\n", "Let ${\\mathcal B_{j-1}}$ be the indices of the items selected up to and including the $j-1$st draw,\n", "and ${\\mathcal B_0} \\equiv \\emptyset$. \n", "\n", "Let $\\mathbb B_j$ denote the index of the item selected at random in the $j$th draw.\n", "\n", "The chance that the first draw ${\\mathbb B_1}$ gives an item with value 1, i.e., \n", "$\\Pr \\{a_{\\mathbb B_1} = 1\\}$, is $\\frac{1}{N}\\sum_b a_b$.\n", "Under $H_0$, this chance is $p_{01} = p_0$; under $H_1$, this chance is \n", "$p_{11} = p_1$.\n", "\n", "Given the values of $\\{a_{\\mathbb B_k}\\}_{k=1}^i$, the conditional\n", "probability that the $i$th draw gives an item with value 1 is\n", "\n", "$$\n", " \\Pr \\{a_{\\mathbb B_i} = 1 | {\\mathcal B_{i-1}} \\} = \\frac{ \\sum_{b \\notin {\\mathcal B_{i-1}}} a_b}{N-i+1}.\n", "$$\n", "\n", "Under $H_0$, this chance is\n", "\n", "$$\n", " p_{0i} = \\frac{Np_0 - \\sum_{b \\in {\\mathcal B_{i-1}}} a_b}{N - i + 1}.\n", "$$\n", "\n", "Under $H_1$, this chance is\n", "\n", "$$\n", " p_{1i} = \\frac{Np_1 - \\sum_{b \\in {\\mathcal B_{i-1}}} a_b}{N - i + 1}.\n", "$$\n", "\n", "Let $X_i$ be the indicator of the event that the $i$th draw gives an item with\n", "value $1$, i.e., the indicator of the event $a_{\\mathbb B_i} = 1$.\n", "The likelihood ratio for a given sequence $\\{X_k\\}_{k=1}^i$ is\n", "\n", "$$\n", " \\mbox{LR} = \\frac{\\prod_{k=1}^i p_{1k}^{X_k}(1-p_{1k})^{1-X_k}}\n", " {\\prod_{k=1}^i p_{0k}^{X_k}(1-p_{0k})^{1-X_k}}.\n", "$$\n", "\n", "This can be simplified: \n", "$p_{0k}$ and $p_{1k}$ have the same denominator,\n", "$N - k + 1$, and their numerators share a term.\n", "Define $A(k) \\equiv \\sum_{b \\in {\\mathcal B_{k-1}}}$.\n", "Then\n", "\n", "$$\n", " \\mbox{LR} = \\prod_{k=1}^i \n", " \\left ( \\frac{Np_1 - A(k)}{Np_0 - A(k)} \\right )^{X_k}\n", " \\left ( \\frac{N(1-p_1) - (k-1-A(k))}{N(1-p_0) - (k-1 - A(k))} \\right )^{1-X_k}.\n", "$$\n", "where the products are defined to be infinite if any denominator vanishes (or is negative).\n", "\n", "If $H_0$ is true, the chance that $\\mbox{LR}$ is ever greater than $1/\\alpha$\n", "is at most $\\alpha$.\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# This is the first cell with code: set up the Python environment\n", "%matplotlib inline\n", "from __future__ import division, print_function\n", "import matplotlib.pyplot as plt\n", "import math\n", "import numpy as np\n", "import numpy.random\n", "import scipy as sp\n", "import scipy.stats\n", "# For interactive widgets\n", "from ipywidgets import interact, interactive, fixed\n", "import ipywidgets as widgets\n", "from IPython.display import clear_output, display, HTML" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "np.random.seed(1234567890) # set seed for reproducibility" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def LRFromTrials(trials, N, p0, p1):\n", " '''\n", " Finds the sequence of likelihood ratios for the hypothesis that the population \n", " percentage is p1 to the hypothesis that it is p0, for sampling without replacement\n", " from a population of size N.\n", " '''\n", " A = np.cumsum(np.insert(trials, 0, 0)) # so that cumsum does the right thing\n", " terms = np.ones(N)\n", " for k in range(len(trials)):\n", " if trials[k] == 1.0:\n", " if (N*p0 - A[k]) > 0:\n", " terms[k] = np.max([N*p1 - A[k], 0])/(N*p0 - A[k])\n", " else:\n", " terms[k] = np.inf\n", " else:\n", " if (N*(1-p0) - k + 1 + A[k]) > 0:\n", " terms[k] = np.max([(N*(1-p1) - k + 1 + A[k]), 0])/(N*(1-p0) - k + 1 + A[k])\n", " else:\n", " terms[k] = np.inf\n", " return(np.cumprod(terms))" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "99382cb24ae440fa95cb27ea9a69e667" } }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def plotBernoulliSPRT(N, p, p0, p1, alpha):\n", " '''\n", " Plots the progress of a one-sided SPRT for N dependent Bernoulli trials \n", " in sampling without replacement from a population of size N with a \n", " fraction p of items labeled \"1,\" for testing the hypothesis that p=p0 \n", " against the hypothesis p=p1 at significance level alpha\n", " '''\n", " plt.clf()\n", " fig, ax = plt.subplots(nrows=2, ncols=1, sharex=True)\n", " trials = np.zeros(N)\n", " nOnes = int(math.floor(N*p))\n", " trials[0:nOnes] = np.ones(nOnes)\n", " np.random.shuffle(trials) # items are in random order\n", "\n", " LRs = np.nan_to_num(LRFromTrials(trials, N, p0, p1))\n", " logLRs = np.nan_to_num(np.log(LRs))\n", " \n", " LRs[LRs > 10**6] = 10**6 # avoid plot overflow\n", " logLRs[logLRs > 10**6] = 10**6 # avoid plot overflow\n", " \n", " #\n", " ax[0].plot(range(N),LRs, color='b')\n", " ax[0].axhline(y=1/alpha, xmin=0, xmax=N, color='g', label=r'$1/\\alpha$')\n", " ax[0].set_title('Simulation of Wald\\'s SPRT for population percentage, w/o replacement')\n", " ax[0].set_ylabel('LR')\n", " ax[0].legend(loc='best')\n", " #\n", " ax[1].plot(range(N),logLRs, color='b', linestyle='--')\n", " ax[1].axhline(y=math.log(1/alpha), xmin=0, xmax=N, color='g', label=r'$log(1/\\alpha)$')\n", " ax[1].set_ylabel('log(LR)')\n", " ax[1].set_xlabel('trials')\n", " ax[1].legend(loc='best')\n", " plt.show()\n", "\n", "\n", "interact(plotBernoulliSPRT,\\\n", " N=widgets.IntSlider(min=500, max=50000, step=500, value=5000),\\\n", " p=widgets.FloatSlider(min=0.001, max=1, step=0.01, value=.51),\\\n", " p0=widgets.FloatSlider(min=0.001, max=1, step=0.01, value=.5),\\\n", " p1=widgets.FloatSlider(min=0.001, max=1, step=0.01, value=.51),\\\n", " alpha=widgets.FloatSlider(min=0.001, max=1, step=0.01, value=.05)\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Simulate the distribution of sample sizes needed to reject" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true }, "outputs": [], "source": [ "alpha = 0.05 # significance level\n", "reps = int(10**4) # number of replications\n", "p, p0, p1 = [0.525, 0.5, 0.525] # need p > p0 or might never reject\n", "N = 10000 # population size\n", "dist = np.zeros(reps) # allocate space for the results\n", "\n", "trials = np.zeros(N)\n", "nOnes = int(math.floor(N*p))\n", "trials[0:nOnes] = np.ones(nOnes) # trials now contains math.floor(n*p) ones\n", "\n", "for i in np.arange(reps):\n", " np.random.shuffle(trials) # items are in random order\n", " LRs = LRFromTrials(trials, N, p0, p1) # likelihood ratios for this realization\n", " dist[i] = np.min(np.where(LRs >= 1/alpha)) # trials at which threshold is crossed" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1784.1438 [ 1526. 3265.]\n" ] } ], "source": [ "# report mean, median, and 90th percentile\n", "print(np.mean(dist), np.percentile(dist, [50, 90]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What's next?\n", "\n", "+ [Next: The Kaplan-Wald Confidence Bound for a Nonnegative Mean](kaplanWald.ipynb)\n", "+ [Previous: Wald's Sequential Probability Ratio Test](sprt.ipynb)\n", "+ [Index](index.ipynb)" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.1" } }, "nbformat": 4, "nbformat_minor": 1 }