{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Chapter 6\n", "\n", "Examples and Exercises from Think Stats, 2nd Edition\n", "\n", "http://thinkstats2.com\n", "\n", "Copyright 2016 Allen B. Downey\n", "\n", "MIT License: https://opensource.org/licenses/MIT\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from os.path import basename, exists\n", "\n", "\n", "def download(url):\n", " filename = basename(url)\n", " if not exists(filename):\n", " from urllib.request import urlretrieve\n", "\n", " local, _ = urlretrieve(url, filename)\n", " print(\"Downloaded \" + local)\n", "\n", "\n", "download(\"https://github.com/AllenDowney/ThinkStats2/raw/master/code/thinkstats2.py\")\n", "download(\"https://github.com/AllenDowney/ThinkStats2/raw/master/code/thinkplot.py\")" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "import thinkstats2\n", "import thinkplot" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I'll start with the data from the BRFSS again." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "download(\"https://github.com/AllenDowney/ThinkStats2/raw/master/code/brfss.py\")\n", "download(\"https://github.com/AllenDowney/ThinkStats2/raw/master/code/CDBRFS08.ASC.gz\")" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import brfss\n", "\n", "df = brfss.ReadBrfss(nrows=None)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here are the mean and standard deviation of female height in cm." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "female = df[df.sex==2]\n", "female_heights = female.htm3.dropna()\n", "mean, std = female_heights.mean(), female_heights.std()\n", "mean, std" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`NormalPdf` returns a Pdf object that represents the normal distribution with the given parameters.\n", "\n", "`Density` returns a probability density, which doesn't mean much by itself." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "pdf = thinkstats2.NormalPdf(mean, std)\n", "pdf.Density(mean + std)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`thinkplot` provides `Pdf`, which plots the probability density with a smooth curve." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "thinkplot.Pdf(pdf, label='normal')\n", "thinkplot.Config(xlabel='x', ylabel='PDF', xlim=[140, 186])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`Pdf` provides `MakePmf`, which returns a `Pmf` object that approximates the `Pdf`. " ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "pmf = pdf.MakePmf()\n", "thinkplot.Pmf(pmf, label='normal')\n", "thinkplot.Config(xlabel='x', ylabel='PDF', xlim=[140, 186])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you have a `Pmf`, you can also plot it using `Pdf`, if you have reason to think it should be represented as a smooth curve." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "thinkplot.Pdf(pmf, label='normal')\n", "thinkplot.Config(xlabel='x', ylabel='PDF', xlim=[140, 186])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using a sample from the actual distribution, we can estimate the PDF using Kernel Density Estimation (KDE).\n", "\n", "If you run this a few times, you'll see how much variation there is in the estimate." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "thinkplot.Pdf(pdf, label='normal')\n", "\n", "sample = np.random.normal(mean, std, 500)\n", "sample_pdf = thinkstats2.EstimatedPdf(sample, label='sample')\n", "thinkplot.Pdf(sample_pdf, label='sample KDE')\n", "thinkplot.Config(xlabel='x', ylabel='PDF', xlim=[140, 186])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Moments\n", "\n", "Raw moments are just sums of powers." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "def RawMoment(xs, k):\n", " return sum(x**k for x in xs) / len(xs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first raw moment is the mean. The other raw moments don't mean much." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "RawMoment(female_heights, 1), RawMoment(female_heights, 2), RawMoment(female_heights, 3)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "def Mean(xs):\n", " return RawMoment(xs, 1)\n", "\n", "Mean(female_heights)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The central moments are powers of distances from the mean." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "def CentralMoment(xs, k):\n", " mean = RawMoment(xs, 1)\n", " return sum((x - mean)**k for x in xs) / len(xs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first central moment is approximately 0. The second central moment is the variance." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "CentralMoment(female_heights, 1), CentralMoment(female_heights, 2), CentralMoment(female_heights, 3)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "def Var(xs):\n", " return CentralMoment(xs, 2)\n", "\n", "Var(female_heights)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The standardized moments are ratios of central moments, with powers chosen to make the dimensions cancel." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "def StandardizedMoment(xs, k):\n", " var = CentralMoment(xs, 2)\n", " std = np.sqrt(var)\n", " return CentralMoment(xs, k) / std**k" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The third standardized moment is skewness." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "StandardizedMoment(female_heights, 1), StandardizedMoment(female_heights, 2), StandardizedMoment(female_heights, 3)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "def Skewness(xs):\n", " return StandardizedMoment(xs, 3)\n", "\n", "Skewness(female_heights)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Normally a negative skewness indicates that the distribution has a longer tail on the left. In that case, the mean is usually less than the median." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "def Median(xs):\n", " cdf = thinkstats2.Cdf(xs)\n", " return cdf.Value(0.5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But in this case the mean is greater than the median, which indicates skew to the right." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "Mean(female_heights), Median(female_heights)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because the skewness is based on the third moment, it is not robust; that is, it depends strongly on a few outliers. Pearson's median skewness is more robust." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "def PearsonMedianSkewness(xs):\n", " median = Median(xs)\n", " mean = RawMoment(xs, 1)\n", " var = CentralMoment(xs, 2)\n", " std = np.sqrt(var)\n", " gp = 3 * (mean - median) / std\n", " return gp" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pearson's skewness is positive, indicating that the distribution of female heights is slightly skewed to the right." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "PearsonMedianSkewness(female_heights)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Birth weights\n", "\n", "Let's look at the distribution of birth weights again." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "download(\"https://github.com/AllenDowney/ThinkStats2/raw/master/code/nsfg.py\")\n", "download(\"https://github.com/AllenDowney/ThinkStats2/raw/master/code/first.py\")\n", "download(\"https://github.com/AllenDowney/ThinkStats2/raw/master/code/2002FemPreg.dct\")\n", "download(\n", " \"https://github.com/AllenDowney/ThinkStats2/raw/master/code/2002FemPreg.dat.gz\"\n", ")" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "import first\n", "\n", "live, firsts, others = first.MakeFrames()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Based on KDE, it looks like the distribution is skewed to the left." ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "birth_weights = live.totalwgt_lb.dropna()\n", "pdf = thinkstats2.EstimatedPdf(birth_weights)\n", "thinkplot.Pdf(pdf, label='birth weight')\n", "thinkplot.Config(xlabel='Birth weight (pounds)', ylabel='PDF')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The mean is less than the median, which is consistent with left skew." ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "Mean(birth_weights), Median(birth_weights)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And both ways of computing skew are negative, which is consistent with left skew." ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "Skewness(birth_weights), PearsonMedianSkewness(birth_weights)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Adult weights\n", "\n", "Now let's look at adult weights from the BRFSS. The distribution looks skewed to the right." ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "adult_weights = df.wtkg2.dropna()\n", "pdf = thinkstats2.EstimatedPdf(adult_weights)\n", "thinkplot.Pdf(pdf, label='Adult weight')\n", "thinkplot.Config(xlabel='Adult weight (kg)', ylabel='PDF')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The mean is greater than the median, which is consistent with skew to the right." ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "Mean(adult_weights), Median(adult_weights)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And both ways of computing skewness are positive." ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": [ "Skewness(adult_weights), PearsonMedianSkewness(adult_weights)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercises" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The distribution of income is famously skewed to the right. In this exercise, we’ll measure how strong that skew is.\n", "The Current Population Survey (CPS) is a joint effort of the Bureau of Labor Statistics and the Census Bureau to study income and related variables. Data collected in 2013 is available from http://www.census.gov/hhes/www/cpstables/032013/hhinc/toc.htm. I downloaded `hinc06.xls`, which is an Excel spreadsheet with information about household income, and converted it to `hinc06.csv`, a CSV file you will find in the repository for this book. You will also find `hinc2.py`, which reads this file and transforms the data.\n", "\n", "The dataset is in the form of a series of income ranges and the number of respondents who fell in each range. The lowest range includes respondents who reported annual household income “Under \\$5000.” The highest range includes respondents who made “\\$250,000 or more.”\n", "\n", "To estimate mean and other statistics from these data, we have to make some assumptions about the lower and upper bounds, and how the values are distributed in each range. `hinc2.py` provides `InterpolateSample`, which shows one way to model this data. It takes a `DataFrame` with a column, `income`, that contains the upper bound of each range, and `freq`, which contains the number of respondents in each frame.\n", "\n", "It also takes `log_upper`, which is an assumed upper bound on the highest range, expressed in `log10` dollars. The default value, `log_upper=6.0` represents the assumption that the largest income among the respondents is $10^6$, or one million dollars.\n", "\n", "`InterpolateSample` generates a pseudo-sample; that is, a sample of household incomes that yields the same number of respondents in each range as the actual data. It assumes that incomes in each range are equally spaced on a `log10` scale." ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [], "source": [ "def InterpolateSample(df, log_upper=6.0):\n", " \"\"\"Makes a sample of log10 household income.\n", "\n", " Assumes that log10 income is uniform in each range.\n", "\n", " df: DataFrame with columns income and freq\n", " log_upper: log10 of the assumed upper bound for the highest range\n", "\n", " returns: NumPy array of log10 household income\n", " \"\"\"\n", " # compute the log10 of the upper bound for each range\n", " df['log_upper'] = np.log10(df.income)\n", "\n", " # get the lower bounds by shifting the upper bound and filling in\n", " # the first element\n", " df['log_lower'] = df.log_upper.shift(1)\n", " df.loc[0, 'log_lower'] = 3.0\n", "\n", " # plug in a value for the unknown upper bound of the highest range\n", " df.loc[41, 'log_upper'] = log_upper\n", " \n", " # use the freq column to generate the right number of values in\n", " # each range\n", " arrays = []\n", " for _, row in df.iterrows():\n", " vals = np.linspace(row.log_lower, row.log_upper, int(row.freq))\n", " arrays.append(vals)\n", "\n", " # collect the arrays into a single sample\n", " log_sample = np.concatenate(arrays)\n", " return log_sample\n" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [], "source": [ "download(\"https://github.com/AllenDowney/ThinkStats2/raw/master/code/hinc.py\")\n", "download(\"https://github.com/AllenDowney/ThinkStats2/raw/master/code/hinc06.csv\")" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [], "source": [ "import hinc\n", "income_df = hinc.ReadData()" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [], "source": [ "log_sample = InterpolateSample(income_df, log_upper=6.0)" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [], "source": [ "log_cdf = thinkstats2.Cdf(log_sample)\n", "thinkplot.Cdf(log_cdf)\n", "thinkplot.Config(xlabel='Household income (log $)',\n", " ylabel='CDF')" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [], "source": [ "sample = np.power(10, log_sample)" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [], "source": [ "cdf = thinkstats2.Cdf(sample)\n", "thinkplot.Cdf(cdf)\n", "thinkplot.Config(xlabel='Household income ($)',\n", " ylabel='CDF')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Compute the median, mean, skewness and Pearson’s skewness of the resulting sample. What fraction of households report a taxable income below the mean? How do the results depend on the assumed upper bound?" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All of this is based on an assumption that the highest income is one million dollars, but that's certainly not correct. What happens to the skew if the upper bound is 10 million?\n", "\n", "Without better information about the top of this distribution, we can't say much about the skewness of the distribution." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.11" } }, "nbformat": 4, "nbformat_minor": 1 }