{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Confidence intervals based on the normal approximation\n",
"\n",
"One of the first things most statisticians would consider is to use confidence intervals based on the normal approximation, especially if it were possible to draw a moderately large sample (e.g., 100 or more units).\n",
"\n",
"The procedure would be:\n",
"+ choose the desired coverage probability $1-\\alpha$ and the sample size $n$\n",
"+ draw a simple random sample or a random sample with replacement of size $n$ from the population\n",
"+ compute the sample mean $\\bar{X}$ and sample standard deviation $s$\n",
"+ find $\\hat{SE} = f \\times s/\\sqrt{n}$, the estimated standard error of the sample mean, where $f=1$ for sampling with replacement, and $f = \\sqrt{\\frac{N-n}{N-1}}$ for sampling without replacement ($N$ is the\n",
"population size)\n",
"+ find $t$, the appropriate critical value of Student's t distribution with $n-1$ degrees of freedom\n",
"+ the approximate 2-sided confidence interval is $[\\bar{X} - t \\hat{SE}, \\bar{X} + t \\hat{SE}]$.\n",
"\n",
"### _Formally_, this gives an approximate confidence interval, but there's no guarantee that the actual coverage probability is close to the nominal coverage probability\n",
"\n",
"+ Normal approximation is common, but can be grossly inaccurate, especially with:\n",
"    - small sample sizes\n",
"    - skewed populations\n",
" \n",
"In auditing situations, it is typical for the distributions to be very skewed: a point mass at zero, mixed with something else.\n",
"\n",
"+ Examine scenarios by simulation.\n",
"+ The first scenario is a population of zeros and ones, in various proportions\n",
"+ The second scenario is a \"nonstandard\" mixture of a point mass at 0 and a uniform distribution\n",
"\n",
"We are estimating the coverage by simulation. \n",
"\n",
"For simulations:\n",
"\n",
"+ it's important to know which pseudo-random number generator (PRNG) you are using (Python uses the Mersenne Twister by default)\n",
"+ it's important to assess whether that PRNG is adequate (the Mersenne Twister is pretty good for statistical simulations, but not good enough for cryptographic applications)\n",
"+ it's important to remember that your estimates will have sampling error\n",
"+ if the result matters, you should quantify the sampling error, for instance, by a confidence interval for the estimated result.\n",
"\n",
"Here, we will use 1000 or 10,000 iterations in the estimates. For estimating probabilities, the standard error of the result is no larger than $1/(2\\sqrt{n})$: 1.58% for $n=1000$ and 0.5% for $n=10,000$.\n",
"\n",
"For estimating things like expected length, guarantees of accuracy are harder, but the kinds of techniques we will see over this short course can be applied to _that_ problem too."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# This is the first cell with code: set up the Python environment\n",
"%matplotlib inline\n",
"import matplotlib.pyplot as plt\n",
"import math\n",
"import numpy as np\n",
"import scipy as sp\n",
"import scipy.stats\n",
"from scipy.stats import binom\n",
"import pandas as pd\n",
"# For interactive widgets\n",
"from ipywidgets import interact, interactive, fixed\n",
"import ipywidgets as widgets\n",
"from IPython.display import clear_output, display, HTML"
]
},
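{
"cell_type": "markdown",
"metadata": {},
"source": [
"The procedure above can be sketched as a short function. This is an illustration, not part of the simulations below; the sample `x` and the population size `N` are placeholders:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def t_ci(x, alpha=0.05, N=None):\n",
"    '''Approximate 2-sided Student-t confidence interval for the population mean.\n",
"\n",
"    If the population size N is given, the sample is assumed drawn without\n",
"    replacement and the finite-population correction f is applied; otherwise f = 1.\n",
"    '''\n",
"    x = np.asarray(x)\n",
"    n = len(x)\n",
"    f = 1.0 if N is None else math.sqrt((N-n)/(N-1))\n",
"    se = f*np.std(x, ddof=1)/math.sqrt(n)  # estimated SE of the sample mean\n",
"    tCrit = sp.stats.t.ppf(q=1-alpha/2, df=n-1)\n",
"    return np.mean(x) - tCrit*se, np.mean(x) + tCrit*se\n",
"\n",
"t_ci(np.random.uniform(size=100))"
]
},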
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"Simulated coverage probability and expected length of Student-t confidence intervals for a {0, 1} population<br />Nominal coverage probability 95.0%.<br />Estimated from 10000 replications."
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
" fraction of 1s sample size Student-t cov Student-t len\n",
"0 0.001 25 2.59% 0.0043\n",
"1 0.001 50 4.51% 0.0037\n",
"2 0.001 100 9.43% 0.0038\n",
"3 0.001 400 33.38% 0.0035\n",
"4 0.010 25 22.21% 0.0384\n",
"5 0.010 50 39.9% 0.0352\n",
"6 0.010 100 62.94% 0.0303\n",
"7 0.010 400 91.09% 0.0188\n",
"8 0.100 25 92.34% 0.232\n",
"9 0.100 50 87.83% 0.1665\n",
"10 0.100 100 93.26% 0.1177\n",
"11 0.100 400 95.31% 0.0589"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Population of two values, {0, 1}, in various proportions. The sample sum is a binomial random variable.\n",
"ns = np.array([25, 50, 100, 400]) # sample sizes\n",
"ps = np.array([.001, .01, 0.1]) # mixture fractions, proportion of 1s in the population\n",
"alpha = 0.05 # 1- (confidence level)\n",
"reps = int(1.0e4) # just for demonstration\n",
"vals = [0, 1]\n",
"\n",
"simTable = pd.DataFrame(columns=('fraction of 1s', 'sample size', 'Student-t cov', 'Student-t len'))\n",
"for p in ps:\n",
"    popMean = p\n",
"    for n in ns:\n",
"        tCrit = sp.stats.t.ppf(q=1-alpha/2, df=n-1)\n",
"        samMean = sp.stats.binom.rvs(n, p, size=reps)/float(n)\n",
"        samSD = np.sqrt(samMean*(1-samMean)/(n-1))  # for {0,1} data, s/sqrt(n) reduces to sqrt(pHat*(1-pHat)/(n-1)): the estimated SE of the sample mean\n",
"        cover = ( np.fabs(samMean-popMean) < tCrit*samSD).sum()\n",
"        aveLen = 2*(tCrit*samSD).mean()\n",
"        simTable.loc[len(simTable)] = p, n, str(100*float(cover)/float(reps)) + '%', str(round(aveLen,4))\n",
"#\n",
"ansStr = 'Simulated coverage probability and expected length of Student-t confidence intervals for a {0, 1} population<br />' +\\\n",
"    'Nominal coverage probability ' + str(100*(1-alpha)) +\\\n",
"    '%.<br />Estimated from ' + str(reps) + ' replications.'\n",
"\n",
"display(HTML(ansStr))\n",
"display(simTable)"
]
},
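{
"cell_type": "markdown",
"metadata": {},
"source": [
"As noted above, these estimates have sampling error of their own. Each estimated coverage probability is a binomial proportion based on `reps` replications, so its standard error is at most $1/(2\\sqrt{n})$ for $n$ replications. A quick sketch (illustrative; `pHat` is the estimated coverage for $p=0.01$, $n=400$ from the table above):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"reps = int(1.0e4)\n",
"pHat = 0.9109  # estimated coverage for p=0.01, n=400 above\n",
"seBound = 1/(2*math.sqrt(reps))        # conservative bound on the SE\n",
"seHat = math.sqrt(pHat*(1-pHat)/reps)  # plug-in estimate of the SE\n",
"seBound, seHat"
]
},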
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Why/When is the coverage bad?\n",
"\n",
"+ Because of the high population mass at 0, the sample will often contain _only_ 0s. Then the SD will be zero, the length of the interval will be 0, and the interval won't cover the mean—which is bigger than zero.\n",
"+ In general, coverage will be bad when the sampling distribution of the sample mean is strongly skewed\n",
"\n",
"E.g., Binomial with very small or very large $p$."
]
},
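{
"cell_type": "markdown",
"metadata": {},
"source": [
"The first failure mode is easy to quantify: for sampling with replacement, the chance that a sample of size $n$ contains only 0s is $(1-p)^n$, and whenever that happens the interval has zero length and cannot cover. A sketch, using the same values of $p$ and $n$ as above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for p in [0.001, 0.01, 0.1]:\n",
"    for n in [25, 50, 100, 400]:\n",
"        print('p={:5.3f} n={:3d}  P(all zeros)={:6.4f}'.format(p, n, (1-p)**n))"
]
},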
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "fd6787dcfb3841cc8fb66c379c48b095"
}
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
""
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def plotBinomialPmf(n, p):\n",
"    '''\n",
"    Plots the sampling distribution of the sample mean of a sample of size n\n",
"    drawn with replacement from a {0, 1} population; i.e., a scaled binomial.\n",
"    Plots a vertical red line at the expected value.\n",
"    '''\n",
"    fig, ax = plt.subplots(nrows=1, ncols=1)\n",
"    x = np.arange(n+1)\n",
"    xFrac = x.astype(float)/ x.max()\n",
"    width = 1.0/n\n",
"    ax.bar(xFrac, binom.pmf(x, n, p), width, align=\"center\")\n",
"    plt.xlim([-width,1+width])\n",
"    plt.axvline(x=p, color='r')\n",
"    plt.show()\n",
"\n",
"interact(plotBinomialPmf, n=widgets.IntSlider(min=5, max=300, step=5, value=30),\\\n",
"         p=widgets.FloatSlider(min=0.001, max=1, step=0.001, value=.001))"
]
},
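{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to quantify the skewness visible in the plot: the skewness of the Binomial$(n, p)$ distribution is $(1-2p)/\\sqrt{np(1-p)}$ (a standard formula), which grows without bound as $p \\rightarrow 0$ for fixed $n$. A sketch (`binomSkew` is illustrative and not used elsewhere):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def binomSkew(n, p):\n",
"    '''Skewness of the Binomial(n, p) distribution.'''\n",
"    return (1-2*p)/math.sqrt(n*p*(1-p))\n",
"\n",
"[(p, binomSkew(100, p)) for p in [0.001, 0.01, 0.1, 0.5]]"
]
},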
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Nonstandard mixtures of distributions\n",
"\n",
"+ It is common in the statistical study of auditing to consider \"nonstandard mixtures\" of a point mass and a continuous distribution\n",
"+ E.g., NRC Monograph and [1989 Statistical Science article](http://projecteuclid.org/download/pdf_1/euclid.ss/1177012665)\n",
"+ Let's look at a mixture of $U[0,1]$ and a pointmass at zero"
]
},
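{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, the moments of this mixture are available in closed form: with mass $p$ at 0 and weight $1-p$ on $U[0,1]$, the population mean is $(1-p)/2$ and $E X^2 = (1-p)/3$. A sketch (`mixMoments` is illustrative and not used elsewhere):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def mixMoments(p):\n",
"    '''Mean and SD of the mixture: pointmass at 0 with probability p, else U[0,1].'''\n",
"    mean = (1-p)/2\n",
"    var = (1-p)/3 - mean**2\n",
"    return mean, math.sqrt(var)\n",
"\n",
"[(p, mixMoments(p)) for p in [0.9, 0.99, 0.999]]"
]
},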
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"Simulated coverage probability and expected length of Student-t confidence intervals for mixture of U[0,1] and pointmass at 0.<br />Nominal coverage probability 95.0%.<br />Estimated from 10000 replications."
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
" mass at 0 sample size Student-t cov Student-t len\n",
"0 0.900 25 90.31% 0.6428\n",
"1 0.900 50 98.95% 0.6719\n",
"2 0.900 100 99.96% 0.68\n",
"3 0.900 400 100.0% 0.6869\n",
"4 0.990 25 22.61% 0.1011\n",
"5 0.990 50 39.54% 0.1297\n",
"6 0.990 100 62.67% 0.1627\n",
"7 0.990 400 98.05% 0.2098\n",
"8 0.999 25 2.26% 0.0092\n",
"9 0.999 50 4.95% 0.0145\n",
"10 0.999 100 9.43% 0.0191\n",
"11 0.999 400 32.97% 0.036"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Nonstandard mixture: a pointmass at zero and a uniform[0,1]\n",
"ns = np.array([25, 50, 100, 400]) # sample sizes\n",
"ps = np.array([0.9, 0.99, 0.999]) # mixture fraction, weight of pointmass\n",
"alpha = 0.05 # 1- (confidence level)\n",
"reps = int(1.0e4) # just for demonstration\n",
"\n",
"simTable = pd.DataFrame(columns=('mass at 0', 'sample size', 'Student-t cov', 'Student-t len'))\n",
"for p in ps:\n",
"    popMean = (1-p)*0.5  # p*0 + (1-p)*.5\n",
"    for n in ns:\n",
"        tCrit = sp.stats.t.ppf(q=1-alpha/2, df=n-1)\n",
"        cover = 0\n",
"        totLen = 0.0\n",
"        samMean = np.zeros(reps)\n",
"        for rep in range(int(reps)):\n",
"            sam = np.random.uniform(size=n)\n",
"            ptMass = np.random.uniform(size=n)\n",
"            sam[ptMass < p] = 0.0\n",
"            samMean[rep] = np.mean(sam)\n",
"            samSE = np.std(sam, ddof=1)/math.sqrt(n)  # estimated SE of the sample mean, s/sqrt(n)\n",
"            cover += ( math.fabs(samMean[rep]-popMean) < tCrit*samSE )\n",
"            totLen += 2*tCrit*samSE\n",
"        simTable.loc[len(simTable)] = p, n, str(100*float(cover)/float(reps)) + '%', str(round(totLen/reps,4))\n",
"#\n",
"ansStr = 'Simulated coverage probability and expected length of Student-t confidence intervals for mixture of U[0,1] and ' +\\\n",
"    'pointmass at 0.<br />' +\\\n",
"    'Nominal coverage probability ' + str(100*(1-alpha)) +\\\n",
"    '%.<br />Estimated from ' + str(reps) + ' replications.'\n",
"\n",
"display(HTML(ansStr))\n",
"display(simTable)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Again the distribution of the sample mean is skewed, which hurts the coverage of Student-t intervals. \n",
"\n",
"When the mass at 0 gets large enough, there is again a large chance that the sample will contain _only_ 0s, in which case the sample SD will be zero, the length of the interval will be zero, and the interval won't cover."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Summary\n",
"\n",
"* The actual coverage probability of confidence intervals based on the normal approximation can be *much* less than the nominal coverage probability, even for large-ish sample sizes, for skewed population distributions.\n",
"* In many applications, the population _is_ very skewed: \"Nonstandard mixture distributions\"\n",
"* For example, accounting overstatements might be mostly 0, and occasionally large. Then:\n",
"    * It can be very likely that the sample variance is zero\n",
"    * That leads to apparent certainty that the result is correct."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## What's next?\n",
"We will consider some conservative methods for constructing confidence bounds by inverting hypothesis tests.\n",
"\n",
"- [Next: Duality between hypothesis tests and confidence sets](duality.ipynb)\n",
"- [Previous: Canonical examples of real-world problems we will consider](canonical.ipynb)\n",
"- [Index](index.ipynb)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.1"
}
},
"nbformat": 4,
"nbformat_minor": 1
}