{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Penny sampling and Conditional Expectation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Penny Sampling\n", "\n", "\"Penny sampling\" is a sampling and estimation approach closely related to DUS, but that allows particularly simple analysis. It was introduced by Edwards et al. (2013). \n", "\n", "We shall derive a continuous version of Penny Sampling that turns the problem of making confidence bounds for the mean of a bounded random variable into the problem of making confidence bounds for a Binomial $p$, by introducing auxiliary independent randomization. \n", "\n", "\n", "### Basic Penny Sampling \n", "We shall derive basic Penny Sampling in the context of financial auditing.\n", "\n", "Again, we have a population of $N$ items; item $j$ has unknown value $x_j$, but the value is known to be nonnegative and less than $u_j$.\n", "\n", "We divide the upper bound on into \"pennies.\" Item $j$ contains $u_j$ pennies, of which $x_j$ are \"good\" and $(u_j - x_j)$ are \"bad.\"\n", "\n", "Let $u \\equiv \\sum_{j=1}^N u_j$ and $x \\equiv \\sum_{j=1}^N x_j$.\n", "\n", "We want to estimate the population fraction of \"good\" pennies: \n", "\n", "$$\\mu \\equiv \\frac{x}{u}.$$\n", "\n", "We then sample \"pennies\" at random, without replacement, from the entire population.\n", "\n", "If we draw a penny associated with item $j$, we inspect that item to determine the value $x_j$.\n", "\n", "If the index of the penny within item $j$ is less than or equal to $x_j$, we consider the penny\n", "that was drawn to be \"good.\" Otherwise the penny is \"bad.\"\n", "\n", "The number of \"good\" pennies in the sample has a hypergeometric distribution with parameters \n", "$u$, $x = u \\mu$, and $n$; the expected fraction of \"good\" pennies in the sample is $\\mu = x/u$.\n", "\n", "We invert hypergeometric tests to find confidence bounds for $x$, which we can translate into confidence bounds\n", "for $\\mu$ by dividing by $u$.\n", "\n", "Again, we are imagining situations in which the sample is very small compared to the number of items in the population—much less the number of pennies in those items—so we will act as if we are sampling with replacement, which gives a conservative result in any case.\n", "\n", "Thus we treat the distribution of the number of \"good\" pennies in the sample as Binomial with parameters $n$ and $p = \\mu = x/u$, rather than hypergeometric.\n", "\n", "### Relation between penny sampling and PPS/DUS sampling\n", "\n", "While the chance of selecting each \"penny\" is equal, the chance of selecting item $j$ is proportional to $u_j$, just as in dollar-unit sampling. The difference is in how the data are analyzed: using $X_i$, the value of $x_j$ for the item selected on the $i$th draw, or using information about whether just the single \"penny\" is good or bad.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Continuous Penny Sampling and the Conditional Expectation\n", "\n", "### The Two-Step\n", "One can think of penny sampling as taking place in two steps: first, select an item $J$ with PPS sampling, then select a penny uniformly at random from that item. If the penny's index is less than $X_J$, the penny is good. In other words, pennies are selected conditionally uniformly, given the item they are selected from, and compared to the true value of the item. \n", "\n", "In the limit as pennies shrink in value, this amounts to comparing $X_J$ to a random variable $Z_J$ that is (continuously) conditionally uniformly distributed on $[0, U_J]$, where $U_J$ is the upper bound on the item selected. Let $W$ be the event that $Z_J \\le X_J$. Then\n", "$$\n", "\\begin{align}\n", " \\mathbb P \\{ W = 1 \\} & = \\mathbb P \\{ Z_J \\le X_J \\} \\\\\n", " &= \\sum_{j=1}^N \\mathbb P\\{Z_J \\le X_J | J = j\\} \\mathbb P \\{J=j\\} \\\\\n", " &= \\sum_{j=1}^N u_j/u \\mathbb P\\{Z_j \\le x_j | J = j\\} \\\\\n", " &= \\sum_{j=1}^N u_j/u \\cdot x_j/u_j \\\\\n", " &= x/u.\n", "\\end{align}\n", "$$\n", "In $n$ iid draws, the distribution of the sum of the corresponding values of $W$ is Binomial with parameters $n$ and $p=\\mu = x/u$.\n", "\n", "Alternatively, we can think of the process as creating a new set of $n$ random variables $\\{W_i\\}$ where $W_i \\sim \\mbox{Bernoulli}(X_i/U_i)$.\n", "It is clear that unconditionally, $\\{W_i\\}$ are iid Benoulli($x/u$), so $\\sum_{i=1}^n W_i \\sim \\mbox{Binomial}(n, \\mu)$, where $\\mu \\equiv x/u$.\n", "\n", "### Sequences of iid variables \n", "\n", "This kind of \"continuous penny sampling\" lets us make confidence bounds for iid samples from an arbitrary bounded distribution, such as the mixture of a uniform and a point-mass at 0. The lower and upper bounds on each draw are 0 and 1. For each $X_i$ in the sample, we generate a uniformly distributed $Z_i$, with all $\\{X_i\\}$ and $\\{Z_i\\}$ independent. Let $W_i = 1_{Z_i \\le X_i}$. \n", "Then $\\{W_i \\}_{i=1}^n$ are iid Bernoulli random variables with $\\mathbb P \\{W_i = 1\\} = \\mathbb E X_i = \\mu$:\n", "$$\n", "\\begin{align}\n", " \\mathbb P \\{ W_i = 1 \\} & = \\mathbb P \\{ Z_i \\le X_i \\} \\\\\n", " &= \\int_0^1 \\mathbb P \\{ Z_i \\le X_i | Z_i = z\\} dz \\\\\n", " &= \\int_0^1 \\mathbb P \\{ X_i > z \\} dz \\\\\n", " &= \\mathbb E X_i,\n", "\\end{align}\n", "$$\n", "by the tail-integral formula for expectation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Implementation\n", "Let's implement continuous penny sampling in Python." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# This is the first cell with code: set up the Python environment\n", "%matplotlib inline\n", "from __future__ import division, print_function\n", "import matplotlib.pyplot as plt\n", "import math\n", "import numpy as np\n", "import scipy as sp\n", "import scipy.stats\n", "from scipy.stats import binom\n", "import pandas as pd\n", "from ipywidgets import interact, interactive, fixed\n", "import ipywidgets as widgets\n", "from IPython.display import clear_output, display, HTML" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def binoLowerCL(n, x, cl = 0.975, inc=0.000001, p = None):\n", " \"Lower confidence level cl confidence interval for Binomial p, for x successes in n trials\"\n", " if p is None:\n", " p = float(x)/float(n)\n", " lo = 0.0\n", " if (x > 0):\n", " f = lambda q: cl - scipy.stats.binom.cdf(x-1, n, q)\n", " lo = sp.optimize.brentq(f, 0.0, p, xtol=inc)\n", " return lo\n", "\n", "def binoUpperCL(n, x, cl = 0.975, inc=0.000001, p = None):\n", " \"Upper confidence level cl confidence interval for Binomial p, for x successes in n trials\"\n", " if p is None:\n", " p = float(x)/float(n)\n", " hi = 1.0\n", " if (x < n):\n", " f = lambda q: scipy.stats.binom.cdf(x, n, q) - (1-cl)\n", " hi = sp.optimize.brentq(f, p, 1.0, xtol=inc) \n", " return hi\n", "\n", "def pennySampleReplacement(weights, n):\n", " '''\n", " Weighted random sample of size n drawn with replacement.\n", " Returns indices of the selected items, the \"remainder pennies\" (the conditionally uniform\n", " auxiliary randomization within items) and the raw uniform values used to select the sample\n", " '''\n", " if any(weights < 0):\n", " print('negative weight in weightedRandomSample')\n", " return float('NaN')\n", " else:\n", " totWt = np.sum(weights, dtype=float)\n", " wc = np.cumsum(weights, dtype=float)/totWt # ensure weights sum to 1\n", " theSam = np.random.random_sample((n,))\n", " inx = np.array(wc).searchsorted(theSam)\n", " penny = [(wc[inx[i]]-theSam[i])*totWt for i in range(n)]\n", " return inx, penny, theSam\n", "\n", "def pennyBinomialLowerBound(x, inx, pennies, cl=0.95):\n", " '''\n", " Penny sampling lower (one-sided) 1-alpha confidence bound on the mean, for sampling with replacement.\n", " x is the vector of observed values\n", " pennies is the vector of _which_ \"penny\" in each sampled item is to be adjudicated as \"good\" or \"bad\"\n", " The first x_j pennies in item j are deemed \"good,\" the remaining (u_j - x_j) are \"bad.\"\n", " Returns the lower bound and the number of \"good\" pennies in the sample.\n", " '''\n", " s = sum([pennies[i] <= x[inx[i]] for i in range(len(pennies))])\n", " n = len(inx)\n", " return binoLowerCL(n, s, cl=cl), s\n", "\n", "def pennyBinomialBounds(x, inx, pennies, cl=0.95):\n", " '''\n", " Penny sampling 2-sided confidence interval for the mean, for sampling with replacement.\n", " x is the vector of observed values\n", " pennies is the vector of _which_ \"penny\" in each sampled item is to be adjudicated as \"good\" or \"bad\"\n", " The first x_j pennies in item j are deemed \"good,\" the remaining (u_j - x_j) are \"bad.\"\n", " Returns the lower bound, the upper bound and the number of \"good\" pennies in the sample.\n", " '''\n", " s = sum([pennies[i] <= x[inx[i]] for i in range(len(pennies))])\n", " n = len(inx)\n", " return binoLowerCL(n, s, cl=1-(1-cl)/2), binoUpperCL(n, s, cl=1-(1-cl)/2), s" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Illustration\n", "Let's imagine a population of $N$ items bounded between 0 and 1.\n", "\n", "Let's test this:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " | mass at 0 | \n", "sample size | \n", "Student-t cov | \n", "Penny cov | \n", "Student-t len | \n", "Penny len | \n", "
---|---|---|---|---|---|---|
0 | \n", "0.900 | \n", "25.0 | \n", "90.37% | \n", "99.35% | \n", "0.6419 | \n", "0.207 | \n", "
1 | \n", "0.900 | \n", "50.0 | \n", "98.89% | \n", "98.7% | \n", "0.6724 | \n", "0.1382 | \n", "
2 | \n", "0.900 | \n", "100.0 | \n", "99.99% | \n", "98.41% | \n", "0.6817 | \n", "0.0942 | \n", "
3 | \n", "0.900 | \n", "400.0 | \n", "100.0% | \n", "95.9% | \n", "0.6868 | \n", "0.0451 | \n", "
4 | \n", "0.990 | \n", "25.0 | \n", "21.96% | \n", "99.32% | \n", "0.0975 | \n", "0.1451 | \n", "
5 | \n", "0.990 | \n", "50.0 | \n", "38.61% | \n", "99.82% | \n", "0.1253 | \n", "0.0795 | \n", "
6 | \n", "0.990 | \n", "100.0 | \n", "62.16% | \n", "98.49% | \n", "0.1604 | \n", "0.0447 | \n", "
7 | \n", "0.990 | \n", "400.0 | \n", "97.98% | \n", "98.31% | \n", "0.2098 | \n", "0.0167 | \n", "
8 | \n", "0.999 | \n", "25.0 | \n", "2.57% | \n", "98.67% | \n", "0.0106 | \n", "0.1381 | \n", "
9 | \n", "0.999 | \n", "50.0 | \n", "4.44% | \n", "97.87% | \n", "0.0127 | \n", "0.0719 | \n", "
10 | \n", "0.999 | \n", "100.0 | \n", "8.94% | \n", "99.93% | \n", "0.0178 | \n", "0.0371 | \n", "
11 | \n", "0.999 | \n", "400.0 | \n", "32.85% | \n", "98.16% | \n", "0.0357 | \n", "0.0101 | \n", "