{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Chapter 14\n", "\n", "Examples and Exercises from Think Stats, 2nd Edition\n", "\n", "http://thinkstats2.com\n", "\n", "Copyright 2016 Allen B. Downey\n", "\n", "MIT License: https://opensource.org/licenses/MIT\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from os.path import basename, exists\n", "\n", "\n", "def download(url):\n", " filename = basename(url)\n", " if not exists(filename):\n", " from urllib.request import urlretrieve\n", "\n", " local, _ = urlretrieve(url, filename)\n", " print(\"Downloaded \" + local)\n", "\n", "\n", "download(\"https://github.com/AllenDowney/ThinkStats2/raw/master/code/thinkstats2.py\")\n", "download(\"https://github.com/AllenDowney/ThinkStats2/raw/master/code/thinkplot.py\")" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "\n", "import random\n", "\n", "import thinkstats2\n", "import thinkplot" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Analytic methods\n", "\n", "If we know the parameters of the sampling distribution, we can compute confidence intervals and p-values analytically, which is computationally faster than resampling." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "import scipy.stats\n", "\n", "\n", "def EvalNormalCdfInverse(p, mu=0, sigma=1):\n", " return scipy.stats.norm.ppf(p, loc=mu, scale=sigma)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's the confidence interval for the estimated mean." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "EvalNormalCdfInverse(0.05, mu=90, sigma=2.5)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "EvalNormalCdfInverse(0.95, mu=90, sigma=2.5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`normal.py` provides a `Normal` class that encapsulates what we know about arithmetic operations on normal distributions." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "download(\"https://github.com/AllenDowney/ThinkStats2/raw/master/code/normal.py\")\n", "download(\"https://github.com/AllenDowney/ThinkStats2/raw/master/code/hypothesis.py\")\n", "download(\"https://github.com/AllenDowney/ThinkStats2/raw/master/code/nsfg2.py\")\n", "\n", "download(\"https://github.com/AllenDowney/ThinkStats2/raw/master/code/nsfg.py\")\n", "download(\"https://github.com/AllenDowney/ThinkStats2/raw/master/code/first.py\")" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "from normal import Normal\n", "\n", "dist = Normal(90, 7.5**2)\n", "dist" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use it to compute the sampling distribution of the mean." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "dist_xbar = dist.Sum(9) / 9\n", "dist_xbar.sigma" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And then compute a confidence interval." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "dist_xbar.Percentile(5), dist_xbar.Percentile(95)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Central Limit Theorem\n", "\n", "If you add up independent variates from a distribution with finite mean and variance, the sum converges on a normal distribution.\n", "\n", "The following function generates samples with difference sizes from an exponential distribution." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "def MakeExpoSamples(beta=2.0, iters=1000):\n", " \"\"\"Generates samples from an exponential distribution.\n", "\n", " beta: parameter\n", " iters: number of samples to generate for each size\n", "\n", " returns: list of samples\n", " \"\"\"\n", " samples = []\n", " for n in [1, 10, 100]:\n", " sample = [np.sum(np.random.exponential(beta, n)) for _ in range(iters)]\n", " samples.append((n, sample))\n", " return samples" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This function generates normal probability plots for samples with various sizes." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "def NormalPlotSamples(samples, plot=1, ylabel=\"\"):\n", " \"\"\"Makes normal probability plots for samples.\n", "\n", " samples: list of samples\n", " label: string\n", " \"\"\"\n", " for n, sample in samples:\n", " thinkplot.SubPlot(plot)\n", " thinkstats2.NormalProbabilityPlot(sample)\n", "\n", " thinkplot.Config(\n", " title=\"n=%d\" % n,\n", " legend=False,\n", " xticks=[],\n", " yticks=[],\n", " xlabel=\"random normal variate\",\n", " ylabel=ylabel,\n", " )\n", " plot += 1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following plot shows how the sum of exponential variates converges to normal as sample size increases." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "thinkplot.PrePlot(num=3, rows=2, cols=3)\n", "samples = MakeExpoSamples()\n", "NormalPlotSamples(samples, plot=1, ylabel=\"sum of expo values\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The lognormal distribution has higher variance, so it requires a larger sample size before it converges to normal." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "def MakeLognormalSamples(mu=1.0, sigma=1.0, iters=1000):\n", " \"\"\"Generates samples from a lognormal distribution.\n", "\n", " mu: parmeter\n", " sigma: parameter\n", " iters: number of samples to generate for each size\n", "\n", " returns: list of samples\n", " \"\"\"\n", " samples = []\n", " for n in [1, 10, 100]:\n", " sample = [np.sum(np.random.lognormal(mu, sigma, n)) for _ in range(iters)]\n", " samples.append((n, sample))\n", " return samples" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "thinkplot.PrePlot(num=3, rows=2, cols=3)\n", "samples = MakeLognormalSamples()\n", "NormalPlotSamples(samples, ylabel=\"sum of lognormal values\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The Pareto distribution has infinite variance, and sometimes infinite mean, depending on the parameters. It violates the requirements of the CLT and does not generally converge to normal." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "def MakeParetoSamples(alpha=1.0, iters=1000):\n", " \"\"\"Generates samples from a Pareto distribution.\n", "\n", " alpha: parameter\n", " iters: number of samples to generate for each size\n", "\n", " returns: list of samples\n", " \"\"\"\n", " samples = []\n", "\n", " for n in [1, 10, 100]:\n", " sample = [np.sum(np.random.pareto(alpha, n)) for _ in range(iters)]\n", " samples.append((n, sample))\n", " return samples" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "thinkplot.PrePlot(num=3, rows=2, cols=3)\n", "samples = MakeParetoSamples()\n", "NormalPlotSamples(samples, ylabel=\"sum of Pareto values\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If the random variates are correlated, that also violates the CLT, so the sums don't generally converge.\n", "\n", "To generate correlated values, we generate correlated normal values and then transform to whatever distribution we want." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "def GenerateCorrelated(rho, n):\n", " \"\"\"Generates a sequence of correlated values from a standard normal dist.\n", "\n", " rho: coefficient of correlation\n", " n: length of sequence\n", "\n", " returns: iterator\n", " \"\"\"\n", " x = random.gauss(0, 1)\n", " yield x\n", "\n", " sigma = np.sqrt(1 - rho**2)\n", " for _ in range(n - 1):\n", " x = random.gauss(x * rho, sigma)\n", " yield x" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "def GenerateExpoCorrelated(rho, n):\n", " \"\"\"Generates a sequence of correlated values from an exponential dist.\n", "\n", " rho: coefficient of correlation\n", " n: length of sequence\n", "\n", " returns: NumPy array\n", " \"\"\"\n", " normal = list(GenerateCorrelated(rho, n))\n", " uniform = scipy.stats.norm.cdf(normal)\n", " expo = scipy.stats.expon.ppf(uniform)\n", " return expo" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "def MakeCorrelatedSamples(rho=0.9, iters=1000):\n", " \"\"\"Generates samples from a correlated exponential distribution.\n", "\n", " rho: correlation\n", " iters: number of samples to generate for each size\n", "\n", " returns: list of samples\n", " \"\"\"\n", " samples = []\n", " for n in [1, 10, 100]:\n", " sample = [np.sum(GenerateExpoCorrelated(rho, n)) for _ in range(iters)]\n", " samples.append((n, sample))\n", " return samples" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "thinkplot.PrePlot(num=3, rows=2, cols=3)\n", "samples = MakeCorrelatedSamples()\n", "NormalPlotSamples(samples, ylabel=\"sum of correlated exponential values\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Difference in means\n", "\n", "Let's use analytic methods to compute a CI and p-value for an observed difference in means.\n", "\n", "The distribution of pregnancy length is not normal, but it has finite mean and variance, so the sum (or mean) of a few thousand samples is very close to normal." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "download(\"https://github.com/AllenDowney/ThinkStats2/raw/master/code/2002FemPreg.dct\")\n", "download(\n", " \"https://github.com/AllenDowney/ThinkStats2/raw/master/code/2002FemPreg.dat.gz\"\n", ")" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "import first\n", "\n", "live, firsts, others = first.MakeFrames()\n", "delta = firsts.prglngth.mean() - others.prglngth.mean()\n", "delta" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following function computes the sampling distribution of the mean for a set of values and a given sample size." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "def SamplingDistMean(data, n):\n", " \"\"\"Computes the sampling distribution of the mean.\n", "\n", " data: sequence of values representing the population\n", " n: sample size\n", "\n", " returns: Normal object\n", " \"\"\"\n", " mean, var = data.mean(), data.var()\n", " dist = Normal(mean, var)\n", " return dist.Sum(n) / n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here are the sampling distributions for the means of the two groups under the null hypothesis." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "dist1 = SamplingDistMean(live.prglngth, len(firsts))\n", "dist2 = SamplingDistMean(live.prglngth, len(others))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And the sampling distribution for the difference in means." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "dist_diff = dist1 - dist2\n", "dist" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Under the null hypothesis, here's the chance of exceeding the observed difference." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "1 - dist_diff.Prob(delta)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And the chance of falling below the negated difference." ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "dist_diff.Prob(-delta)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The sum of these probabilities is the two-sided p-value." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Testing a correlation\n", "\n", "Under the null hypothesis (that there is no correlation), the sampling distribution of the observed correlation (suitably transformed) is a \"Student t\" distribution." ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "def StudentCdf(n):\n", " \"\"\"Computes the CDF correlations from uncorrelated variables.\n", "\n", " n: sample size\n", "\n", " returns: Cdf\n", " \"\"\"\n", " ts = np.linspace(-3, 3, 101)\n", " ps = scipy.stats.t.cdf(ts, df=n - 2)\n", " rs = ts / np.sqrt(n - 2 + ts**2)\n", " return thinkstats2.Cdf(rs, ps)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following is a `HypothesisTest` that uses permutation to estimate the sampling distribution of a correlation. " ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "import hypothesis\n", "\n", "\n", "class CorrelationPermute(hypothesis.CorrelationPermute):\n", " \"\"\"Tests correlations by permutation.\"\"\"\n", "\n", " def TestStatistic(self, data):\n", " \"\"\"Computes the test statistic.\n", "\n", " data: tuple of xs and ys\n", " \"\"\"\n", " xs, ys = data\n", " return np.corrcoef(xs, ys)[0][1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can estimate the sampling distribution by permutation and compare it to the Student t distribution." ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "def ResampleCorrelations(live):\n", " \"\"\"Tests the correlation between birth weight and mother's age.\n", "\n", " live: DataFrame for live births\n", "\n", " returns: sample size, observed correlation, CDF of resampled correlations\n", " \"\"\"\n", " live2 = live.dropna(subset=[\"agepreg\", \"totalwgt_lb\"])\n", " data = live2.agepreg.values, live2.totalwgt_lb.values\n", " ht = CorrelationPermute(data)\n", " p_value = ht.PValue()\n", " return len(live2), ht.actual, ht.test_cdf" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "n, r, cdf = ResampleCorrelations(live)\n", "\n", "model = StudentCdf(n)\n", "thinkplot.Plot(model.xs, model.ps, color=\"gray\", alpha=0.5, label=\"Student t\")\n", "thinkplot.Cdf(cdf, label=\"sample\")\n", "\n", "thinkplot.Config(xlabel=\"correlation\", ylabel=\"CDF\", legend=True, loc=\"lower right\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That confirms the analytic result. Now we can use the CDF of the Student t distribution to compute a p-value." ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "t = r * np.sqrt((n - 2) / (1 - r**2))\n", "p_value = 1 - scipy.stats.t.cdf(t, df=n - 2)\n", "print(r, p_value)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Chi-squared test\n", "\n", "The reason the chi-squared statistic is useful is that we can compute its distribution under the null hypothesis analytically." ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": [ "def ChiSquaredCdf(n):\n", " \"\"\"Discrete approximation of the chi-squared CDF with df=n-1.\n", "\n", " n: sample size\n", "\n", " returns: Cdf\n", " \"\"\"\n", " xs = np.linspace(0, 25, 101)\n", " ps = scipy.stats.chi2.cdf(xs, df=n - 1)\n", " return thinkstats2.Cdf(xs, ps)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Again, we can confirm the analytic result by comparing values generated by simulation with the analytic distribution." ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "data = [8, 9, 19, 5, 8, 11]\n", "dt = hypothesis.DiceChiTest(data)\n", "p_value = dt.PValue(iters=1000)\n", "n, chi2, cdf = len(data), dt.actual, dt.test_cdf\n", "\n", "model = ChiSquaredCdf(n)\n", "thinkplot.Plot(model.xs, model.ps, color=\"gray\", alpha=0.3, label=\"chi squared\")\n", "thinkplot.Cdf(cdf, label=\"sample\")\n", "\n", "thinkplot.Config(xlabel=\"chi-squared statistic\", ylabel=\"CDF\", loc=\"lower right\")" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "And then we can use the analytic distribution to compute p-values." ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [], "source": [ "p_value = 1 - scipy.stats.chi2.cdf(chi2, df=n - 1)\n", "print(chi2, p_value)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## Exercises" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "**Exercise:** In Section 5.4, we saw that the distribution of adult weights is approximately lognormal. One possible explanation is that the weight a person gains each year is proportional to their current weight. In that case, adult weight is the product of a large number of multiplicative factors:\n", "\n", "w = w0 f1 f2 ... fn \n", "\n", "where w is adult weight, w0 is birth weight, and fi is the weight gain factor for year i.\n", "\n", "The log of a product is the sum of the logs of the factors:\n", "\n", "logw = logw0 + logf1 + logf2 + ... + logfn \n", "\n", "So by the Central Limit Theorem, the distribution of logw is approximately normal for large n, which implies that the distribution of w is lognormal.\n", "\n", "To model this phenomenon, choose a distribution for f that seems reasonable, then generate a sample of adult weights by choosing a random value from the distribution of birth weights, choosing a sequence of factors from the distribution of f, and computing the product. What value of n is needed to converge to a lognormal distribution?" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Exercise:** In Section 14.6 we used the Central Limit Theorem to find the sampling distribution of the difference in means, δ, under the null hypothesis that both samples are drawn from the same population.\n", "\n", "We can also use this distribution to find the standard error of the estimate and confidence intervals, but that would only be approximately correct. To be more precise, we should compute the sampling distribution of δ under the alternate hypothesis that the samples are drawn from different populations.\n", "\n", "Compute this distribution and use it to calculate the standard error and a 90% confidence interval for the difference in means." ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Exercise:** [In a recent paper](http://ieeexplore.ieee.org/document/7044435/), Stein et al. investigate the effects of an intervention intended to mitigate gender-stereotypical task allocation within student engineering teams.\n", "\n", "Before and after the intervention, students responded to a survey that asked them to rate their contribution to each aspect of class projects on a 7-point scale.\n", "\n", "Before the intervention, male students reported higher scores for the programming aspect of the project than female students; on average men reported a score of 3.57 with standard error 0.28. Women reported 1.91, on average, with standard error 0.32.\n", "\n", "Compute the sampling distribution of the gender gap (the difference in means), and test whether it is statistically significant. Because you are given standard errors for the estimated means, you don’t need to know the sample size to figure out the sampling distributions.\n", "\n", "After the intervention, the gender gap was smaller: the average score for men was 3.44 (SE 0.16); the average score for women was 3.18 (SE 0.16). Again, compute the sampling distribution of the gender gap and test it.\n", "\n", "Finally, estimate the change in gender gap; what is the sampling distribution of this change, and is it statistically significant?" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.11" } }, "nbformat": 4, "nbformat_minor": 1 }