{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Homework 2\n",
    "\n",
    "Visualize, describe, and model distributions\n",
    "\n",
    "Allen Downey\n",
    "\n",
    "[MIT License](https://en.wikipedia.org/wiki/MIT_License)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "sns.set(style='white')\n",
    "\n",
    "from utils import decorate\n",
    "from thinkstats2 import Pmf, Cdf\n",
    "\n",
    "import thinkstats2\n",
    "import thinkplot"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here are some of the functions from Chapter 5."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "def MakeNormalModel(values, label=''):\n",
    "    \"\"\"Plots a CDF with a Normal model.\n",
    "\n",
    "    values: sequence\n",
    "    \"\"\"\n",
    "    cdf = thinkstats2.Cdf(values, label=label)\n",
    "\n",
    "    mean, var = thinkstats2.TrimmedMeanVar(values)\n",
    "    std = np.sqrt(var)\n",
    "    print('n, mean, std', len(values), mean, std)\n",
    "\n",
    "    xmin = mean - 4 * std\n",
    "    xmax = mean + 4 * std\n",
    "\n",
    "    xs, ps = thinkstats2.RenderNormalCdf(mean, std, xmin, xmax)\n",
    "    thinkplot.Plot(xs, ps, label='model', linewidth=4, color='0.8')\n",
    "    thinkplot.Cdf(cdf)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "def MakeNormalPlot(values, label=''):\n",
    "    \"\"\"Generates a normal probability plot.\n",
    "\n",
    "    values: sequence\n",
    "    \"\"\"\n",
    "    mean, var = thinkstats2.TrimmedMeanVar(values, p=0.01)\n",
    "    std = np.sqrt(var)\n",
    "\n",
    "    xs = [-5, 5]\n",
    "    xs, ys = thinkstats2.FitLine(xs, mean, std)\n",
    "    thinkplot.Plot(xs, ys, color='0.8', label='model')\n",
    "\n",
    "    xs, ys = thinkstats2.NormalProbability(values)\n",
    "    thinkplot.Plot(xs, ys, '+', alpha=0.3, label=label)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Read the GSS data again."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "%time gss = pd.read_hdf('gss.hdf5', 'gss')\n",
    "gss.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "gss.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Most variables use special codes to indicate missing data.  We have to be careful not to use these codes as numerical data; one way to manage that is to replace them with `NaN`, which Pandas recognizes as a missing value."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "def replace_invalid(df):\n",
    "    df.realinc.replace([0], np.nan, inplace=True)                  \n",
    "    df.educ.replace([98,99], np.nan, inplace=True)\n",
    "    # 89 means 89 or older\n",
    "    df.age.replace([98, 99], np.nan, inplace=True) \n",
    "    df.cohort.replace([9999], np.nan, inplace=True)\n",
    "    df.adults.replace([9], np.nan, inplace=True)\n",
    "\n",
    "replace_invalid(gss)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Distribution of age\n",
    "\n",
    "Here's the CDF of ages."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "cdf_age = Cdf(gss.age)\n",
    "thinkplot.cdf(cdf_age, label='age')\n",
    "\n",
    "decorate(title='Distribution of age', \n",
    "         xlabel='Age (years)', \n",
    "         ylabel='CDF')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Exercise**:  Each of the following cells shows the distribution of ages under various transforms, compared to various models.  In each text cell, add a sentence or two that interprets the result.  What can we say about the distribution of ages based on each figure?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1) Here's the CDF of ages compared to a normal distribution with the same mean and standard deviation.\n",
    "\n",
    "Interpretation:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "MakeNormalModel(gss.age.dropna(), label='')\n",
    "\n",
    "decorate(title='Distribution of age', \n",
    "         xlabel='Age (years)', \n",
    "         ylabel='CDF')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "2) Here's a normal probability plot for the distribution of ages.\n",
    "\n",
    "Interpretation:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "MakeNormalPlot(gss.age.dropna(), label='')\n",
    "\n",
    "decorate(title='Normal probability plot', \n",
    "         xlabel='Standard normal sample', \n",
    "         ylabel='Age (years)')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "3) Here's the complementary CDF on a log-y scale.\n",
    "\n",
    "Interpretation:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "thinkplot.cdf(cdf_age, label='age', complement=True)\n",
    "\n",
    "decorate(title='Distribution of age', \n",
    "         xlabel='Age (years)', \n",
    "         ylabel='Complementary CDF, log scale',\n",
    "         yscale='log')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "4) Here's the CDF of ages on a log-x scale.\n",
    "\n",
    "Interpretation:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "thinkplot.cdf(cdf_age, label='age')\n",
    "\n",
    "decorate(title='Distribution of age', \n",
    "         xlabel='Age (years)', \n",
    "         ylabel='CDF',\n",
    "         xscale='log')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "5) Here's the CDF of the logarithm of ages, compared to a normal model.\n",
    "\n",
    "Interpretation:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "values = np.log10(gss.age.dropna())\n",
    "MakeNormalModel(values, label='')\n",
    "\n",
    "decorate(title='Distribution of age', \n",
    "         xlabel='Age (log10 years)', \n",
    "         ylabel='CDF')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "6) Here's a normal probability plot for the logarithm of ages.\n",
    "\n",
    "Interpretation:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "MakeNormalPlot(values, label='')\n",
    "\n",
    "decorate(title='Distribution of age', \n",
    "         xlabel='Standard normal sample', \n",
    "         ylabel='Age (log10 years)')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "7) Here's the complementary CDF on a log-log scale.\n",
    "\n",
    "Interpretation:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "thinkplot.cdf(cdf_age, label='age', complement=True)\n",
    "\n",
    "decorate(title='Distribution of age', \n",
    "         xlabel='Age (years)', \n",
    "         ylabel='Complementary CDF, log scale',\n",
    "         xscale='log',\n",
    "         yscale='log')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "8) Here's a test to see whether ages are well-modeled by a Weibull distribution.\n",
    "\n",
    "Interpretation:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "thinkplot.cdf(cdf_age, label='age', transform='Weibull')\n",
    "\n",
    "decorate(title='Distribution of age', \n",
    "         xlabel='Age (years)', \n",
    "         ylabel='log Complementary CDF, log scale',\n",
    "         xscale='log',\n",
    "         yscale='log')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Distribution of income\n",
    "\n",
    "Here's the CDF of `realinc`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "cdf_realinc = Cdf(gss.realinc)\n",
    "thinkplot.cdf(cdf_realinc, label='income')\n",
    "\n",
    "decorate(title='Distribution of income', \n",
    "         xlabel='Income (1986 $)', \n",
    "         ylabel='CDF')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Exercise:** Use visualizations like the ones in the previous exercise to see whether there is an analytic model that describes the distribution of `gss.realinc` well."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Solution goes here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "2) Here's a normal probability plot for the values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Solution goes here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "3) Here's the complementary CDF on a log-y scale."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Solution goes here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "4) Here's the CDF on a log-x scale."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Solution goes here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "5) Here's the CDF of the logarithm of the values, compared to a normal model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Solution goes here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "6) Here's a normal probability plot for the logarithm of the values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# Solution goes here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "7) Here's the complementary CDF on a log-log scale."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Solution goes here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "8) Here's a test to see whether the values are well-modeled by a Weibull distribution.\n",
    "\n",
    "Interpretation:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# Solution goes here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## BRFSS\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [],
   "source": [
    "%time brfss = pd.read_hdf('brfss.hdf5', 'brfss')\n",
    "brfss.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's look at the distribution of height in the BRFSS dataset.  Here's the CDF."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [],
   "source": [
    "heights = brfss.HTM4\n",
    "\n",
    "cdf_heights = Cdf(heights)\n",
    "thinkplot.Cdf(cdf_heights)\n",
    "\n",
    "decorate(xlabel='Height (cm)', ylabel='CDF')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see whether a normal model describes this data well, we can use KDE to estimate the PDF."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [],
   "source": [
    "from scipy.stats import gaussian_kde"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here's an example using the default bandwidth method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [],
   "source": [
    "kde = gaussian_kde(heights.dropna())\n",
    "\n",
    "xs = np.linspace(heights.min(), heights.max())\n",
    "ds = kde.evaluate(xs)\n",
    "ds /= ds.sum()\n",
    "\n",
    "plt.plot(xs, ds, label='KDE heights')\n",
    "\n",
    "decorate(xlabel='Height (cm)', ylabel='PDF')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It doesn't work very well; we can improve it by overriding the bandwidth with a constant."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [],
   "source": [
    "kde = gaussian_kde(heights.dropna(), bw_method=0.3)\n",
    "\n",
    "ds = kde.evaluate(xs)\n",
    "ds /= ds.sum()\n",
    "\n",
    "plt.plot(xs, ds, label='KDE heights')\n",
    "\n",
    "decorate(xlabel='Height (cm)', ylabel='PDF')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can generate a normal model with the same mean and standard deviation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [],
   "source": [
    "mean = heights.mean()\n",
    "std = heights.std()\n",
    "\n",
    "mean, std"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here's the model compared to the estimated PDF."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [],
   "source": [
    "normal_pdf = thinkstats2.NormalPdf(mean, std)\n",
    "\n",
    "ps = normal_pdf.Density(xs)\n",
    "ps /= ps.sum()\n",
    "\n",
    "plt.plot(xs, ps, color='gray', label='Normal model')\n",
    "plt.plot(xs, ds, label='KDE heights')\n",
    "\n",
    "decorate(xlabel='Height (cm)', ylabel='PDF')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The data don't fit the model particularly well, possibly because the distribution of heights is a mixture of two distributions, for men and women.\n",
    "\n",
    "**Exercise:** Generate a similar figure for just women's heights and see if the normal model does any better."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Solution goes here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Solution goes here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Exercise:** Generate a similar figure for men's weights, `brfss.WTKG3`.  How well does the normal model fit?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Solution goes here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Solution goes here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Exercise:** Try it one more time with the log of men's weights.  How well does the normal model fit?  What does that imply about the distribution of weight?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Solution goes here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Solution goes here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Skewness\n",
    "\n",
    "Let's look at the skewness of the distribution of weights for men and women."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [],
   "source": [
    "male = (brfss.SEX == 1)\n",
    "male_weights = brfss.loc[male, 'WTKG3']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [],
   "source": [
    "female = (brfss.SEX == 2)\n",
    "female_weights = brfss.loc[female, 'WTKG3']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we've seen, these distributions are skewed to the right, so we expect the mean to be higher than the median. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [],
   "source": [
    "male_weights.mean(), male_weights.median()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can compute the moment-based sample skewness using Pandas or `thinkstats2`.  The results are almost the same."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [],
   "source": [
    "male_weights.skew(), thinkstats2.Skewness(male_weights.dropna())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "But moment-based sample skewness is a terrible statistic!  A more robust alternative is Pearson's median skewness:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [],
   "source": [
    "thinkstats2.PearsonMedianSkewness(male_weights.dropna())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Exercise:** Compute the same statistics for women.  Which distribution is more skewed?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Solution goes here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Solution goes here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Solution goes here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Exercise:** Explore the GSS or BRFSS dataset and find something interesting!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}