{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Homework 2\n", "\n", "Visualize, describe, and model distributions\n", "\n", "Allen Downey\n", "\n", "[MIT License](https://en.wikipedia.org/wiki/MIT_License)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "\n", "import pandas as pd\n", "import numpy as np\n", "\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "sns.set(style='white')\n", "\n", "from utils import decorate\n", "from thinkstats2 import Pmf, Cdf\n", "\n", "import thinkstats2\n", "import thinkplot" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here are some of the functions from Chapter 5." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "def MakeNormalModel(values, label=''):\n", " \"\"\"Plots a CDF with a Normal model.\n", "\n", " values: sequence\n", " \"\"\"\n", " cdf = thinkstats2.Cdf(values, label=label)\n", "\n", " mean, var = thinkstats2.TrimmedMeanVar(values)\n", " std = np.sqrt(var)\n", " print('n, mean, std', len(values), mean, std)\n", "\n", " xmin = mean - 4 * std\n", " xmax = mean + 4 * std\n", "\n", " xs, ps = thinkstats2.RenderNormalCdf(mean, std, xmin, xmax)\n", " thinkplot.Plot(xs, ps, label='model', linewidth=4, color='0.8')\n", " thinkplot.Cdf(cdf)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "def MakeNormalPlot(values, label=''):\n", " \"\"\"Generates a normal probability plot.\n", "\n", " values: sequence\n", " \"\"\"\n", " mean, var = thinkstats2.TrimmedMeanVar(values, p=0.01)\n", " std = np.sqrt(var)\n", "\n", " xs = [-5, 5]\n", " xs, ys = thinkstats2.FitLine(xs, mean, std)\n", " thinkplot.Plot(xs, ys, color='0.8', label='model')\n", "\n", " xs, ys = thinkstats2.NormalProbability(values)\n", " thinkplot.Plot(xs, ys, '+', alpha=0.3, label=label)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Read the GSS data again." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "%time gss = pd.read_hdf('gss.hdf5', 'gss')\n", "gss.shape" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "gss.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Most variables use special codes to indicate missing data. We have to be careful not to use these codes as numerical data; one way to manage that is to replace them with `NaN`, which Pandas recognizes as a missing value." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "def replace_invalid(df):\n", " df.realinc.replace([0], np.nan, inplace=True) \n", " df.educ.replace([98,99], np.nan, inplace=True)\n", " # 89 means 89 or older\n", " df.age.replace([98, 99], np.nan, inplace=True) \n", " df.cohort.replace([9999], np.nan, inplace=True)\n", " df.adults.replace([9], np.nan, inplace=True)\n", "\n", "replace_invalid(gss)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Distribution of age\n", "\n", "Here's the CDF of ages." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "cdf_age = Cdf(gss.age)\n", "thinkplot.cdf(cdf_age, label='age')\n", "\n", "decorate(title='Distribution of age', \n", " xlabel='Age (years)', \n", " ylabel='CDF')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Exercise**: Each of the following cells shows the distribution of ages under various transforms, compared to various models. In each text cell, add a sentence or two that interprets the result. What can we say about the distribution of ages based on each figure?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1) Here's the CDF of ages compared to a normal distribution with the same mean and standard deviation.\n", "\n", "Interpretation:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "MakeNormalModel(gss.age.dropna(), label='')\n", "\n", "decorate(title='Distribution of age', \n", " xlabel='Age (years)', \n", " ylabel='CDF')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2) Here's a normal probability plot for the distribution of ages.\n", "\n", "Interpretation:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "MakeNormalPlot(gss.age.dropna(), label='')\n", "\n", "decorate(title='Normal probability plot', \n", " xlabel='Standard normal sample', \n", " ylabel='Age (years)')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "3) Here's the complementary CDF on a log-y scale.\n", "\n", "Interpretation:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "thinkplot.cdf(cdf_age, label='age', complement=True)\n", "\n", "decorate(title='Distribution of age', \n", " xlabel='Age (years)', \n", " ylabel='Complementary CDF, log scale',\n", " yscale='log')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "4) Here's the CDF of ages on a log-x scale.\n", "\n", "Interpretation:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "thinkplot.cdf(cdf_age, label='age')\n", "\n", "decorate(title='Distribution of age', \n", " xlabel='Age (years)', \n", " ylabel='CDF',\n", " xscale='log')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "5) Here's the CDF of the logarithm of ages, compared to a normal model.\n", "\n", "Interpretation:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "values = np.log10(gss.age.dropna())\n", "MakeNormalModel(values, label='')\n", "\n", "decorate(title='Distribution of age', \n", " xlabel='Age (log10 years)', \n", " ylabel='CDF')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "6) Here's a normal probability plot for the logarithm of ages.\n", "\n", "Interpretation:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "scrolled": true }, "outputs": [], "source": [ "MakeNormalPlot(values, label='')\n", "\n", "decorate(title='Distribution of age', \n", " xlabel='Standard normal sample', \n", " ylabel='Age (log10 years)')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "7) Here's the complementary CDF on a log-log scale.\n", "\n", "Interpretation:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "scrolled": true }, "outputs": [], "source": [ "thinkplot.cdf(cdf_age, label='age', complement=True)\n", "\n", "decorate(title='Distribution of age', \n", " xlabel='Age (years)', \n", " ylabel='Complementary CDF, log scale',\n", " xscale='log',\n", " yscale='log')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "8) Here's a test to see whether ages are well-modeled by a Weibull distribution.\n", "\n", "Interpretation:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "scrolled": true }, "outputs": [], "source": [ "thinkplot.cdf(cdf_age, label='age', transform='Weibull')\n", "\n", "decorate(title='Distribution of age', \n", " xlabel='Age (years)', \n", " ylabel='log Complementary CDF, log scale',\n", " xscale='log',\n", " yscale='log')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Distribution of income\n", "\n", "Here's the CDF of `realinc`." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "cdf_realinc = Cdf(gss.realinc)\n", "thinkplot.cdf(cdf_realinc, label='income')\n", "\n", "decorate(title='Distribution of income', \n", " xlabel='Income (1986 $)', \n", " ylabel='CDF')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Exercise:** Use visualizations like the ones in the previous exercise to see whether there is an analytic model that describes the distribution of `gss.realinc` well." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "# Solution goes here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2) Here's a normal probability plot for the values." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "# Solution goes here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "3) Here's the complementary CDF on a log-y scale." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "# Solution goes here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "4) Here's the CDF on a log-x scale." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "# Solution goes here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "5) Here's the CDF of the logarithm of the values, compared to a normal model." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "# Solution goes here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "6) Here's a normal probability plot for the logarithm of the values." ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "scrolled": true }, "outputs": [], "source": [ "# Solution goes here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "7) Here's the complementary CDF on a log-log scale." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "# Solution goes here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "8) Here's a test to see whether the values are well-modeled by a Weibull distribution.\n", "\n", "Interpretation:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "scrolled": true }, "outputs": [], "source": [ "# Solution goes here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## BRFSS\n" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "%time brfss = pd.read_hdf('brfss.hdf5', 'brfss')\n", "brfss.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's look at the distribution of height in the BRFSS dataset. Here's the CDF." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "heights = brfss.HTM4\n", "\n", "cdf_heights = Cdf(heights)\n", "thinkplot.Cdf(cdf_heights)\n", "\n", "decorate(xlabel='Height (cm)', ylabel='CDF')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To see whether a normal model describes this data well, we can use KDE to estimate the PDF." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "from scipy.stats import gaussian_kde" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's an example using the default bandwidth method." ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "kde = gaussian_kde(heights.dropna())\n", "\n", "xs = np.linspace(heights.min(), heights.max())\n", "ds = kde.evaluate(xs)\n", "ds /= ds.sum()\n", "\n", "plt.plot(xs, ds, label='KDE heights')\n", "\n", "decorate(xlabel='Height (cm)', ylabel='PDF')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It doesn't work very well; we can improve it by overriding the bandwidth with a constant." ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "kde = gaussian_kde(heights.dropna(), bw_method=0.3)\n", "\n", "ds = kde.evaluate(xs)\n", "ds /= ds.sum()\n", "\n", "plt.plot(xs, ds, label='KDE heights')\n", "\n", "decorate(xlabel='Height (cm)', ylabel='PDF')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can generate a normal model with the same mean and standard deviation." ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "mean = heights.mean()\n", "std = heights.std()\n", "\n", "mean, std" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's the model compared to the estimated PDF." ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "normal_pdf = thinkstats2.NormalPdf(mean, std)\n", "\n", "ps = normal_pdf.Density(xs)\n", "ps /= ps.sum()\n", "\n", "plt.plot(xs, ps, color='gray', label='Normal model')\n", "plt.plot(xs, ds, label='KDE heights')\n", "\n", "decorate(xlabel='Height (cm)', ylabel='PDF')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data don't fit the model particularly well, possibly because the distribution of heights is a mixture of two distributions, for men and women.\n", "\n", "**Exercise:** Generate a similar figure for just women's heights and see if the normal model does any better." ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "# Solution goes here" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "# Solution goes here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Exercise:** Generate a similar figure for men's weights, `brfss.WTKG3`. How well does the normal model fit?" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": [ "# Solution goes here" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "# Solution goes here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Exercise:** Try it one more time with the log of men's weights. How well does the normal model fit? What does that imply about the distribution of weight?" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [], "source": [ "# Solution goes here" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [], "source": [ "# Solution goes here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Skewness\n", "\n", "Let's look at the skewness of the distribution of weights for men and women." ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [], "source": [ "male = (brfss.SEX == 1)\n", "male_weights = brfss.loc[male, 'WTKG3']" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [], "source": [ "female = (brfss.SEX == 2)\n", "female_weights = brfss.loc[female, 'WTKG3']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we've seen, these distributions are skewed to the right, so we expect the mean to be higher than the median. " ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [], "source": [ "male_weights.mean(), male_weights.median()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can compute the moment-based sample skewness using Pandas or `thinkstats2`. The results are almost the same." ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [], "source": [ "male_weights.skew(), thinkstats2.Skewness(male_weights.dropna())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But moment-based sample skewness is a terrible statistic! A more robust alternative is Pearson's median skewness:" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [], "source": [ "thinkstats2.PearsonMedianSkewness(male_weights.dropna())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Exercise:** Compute the same statistics for women. Which distribution is more skewed?" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [], "source": [ "# Solution goes here" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [], "source": [ "# Solution goes here" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [], "source": [ "# Solution goes here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Exercise:** Explore the GSS or BRFSS dataset and find something interesting!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.7" } }, "nbformat": 4, "nbformat_minor": 1 }