{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Chapter 10\n",
    "\n",
    "Examples and Exercises from Think Stats, 2nd Edition\n",
    "\n",
    "http://thinkstats2.com\n",
    "\n",
    "Copyright 2016 Allen B. Downey\n",
    "\n",
    "MIT License: https://opensource.org/licenses/MIT\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from os.path import basename, exists\n",
    "\n",
    "\n",
    "def download(url):\n",
    "    filename = basename(url)\n",
    "    if not exists(filename):\n",
    "        from urllib.request import urlretrieve\n",
    "\n",
    "        local, _ = urlretrieve(url, filename)\n",
    "        print(\"Downloaded \" + local)\n",
    "\n",
    "\n",
    "download(\"https://github.com/AllenDowney/ThinkStats2/raw/master/code/thinkstats2.py\")\n",
    "download(\"https://github.com/AllenDowney/ThinkStats2/raw/master/code/thinkplot.py\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "import random\n",
    "\n",
    "import thinkstats2\n",
    "import thinkplot"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Least squares\n",
    "\n",
    "One more time, let's load up the NSFG data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "download(\"https://github.com/AllenDowney/ThinkStats2/raw/master/code/nsfg.py\")\n",
    "download(\"https://github.com/AllenDowney/ThinkStats2/raw/master/code/first.py\")\n",
    "download(\"https://github.com/AllenDowney/ThinkStats2/raw/master/code/2002FemPreg.dct\")\n",
    "download(\n",
    "    \"https://github.com/AllenDowney/ThinkStats2/raw/master/code/2002FemPreg.dat.gz\"\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "import first\n",
    "live, firsts, others = first.MakeFrames()\n",
    "live = live.dropna(subset=['agepreg', 'totalwgt_lb'])\n",
    "ages = live.agepreg\n",
    "weights = live.totalwgt_lb"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "The following function computes the intercept and slope of the least squares fit."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "from thinkstats2 import Mean, MeanVar, Var, Std, Cov\n",
    "\n",
    "def LeastSquares(xs, ys):\n",
    "    meanx, varx = MeanVar(xs)\n",
    "    meany = Mean(ys)\n",
    "\n",
    "    slope = Cov(xs, ys, meanx, meany) / varx\n",
    "    inter = meany - slope * meanx\n",
    "\n",
    "    return inter, slope"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "Here's the least squares fit to birth weight as a function of mother's age."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "inter, slope = LeastSquares(ages, weights)\n",
    "inter, slope"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "The intercept is often easier to interpret if we evaluate it at the mean of the independent variable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "inter + slope * 25"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "And the slope is easier to interpret if we express it in pounds per decade (or ounces per year)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "slope * 10"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "The following function evaluates the fitted line at the given `xs`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "def FitLine(xs, inter, slope):\n",
    "    fit_xs = np.sort(xs)\n",
    "    fit_ys = inter + slope * fit_xs\n",
    "    return fit_xs, fit_ys"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "And here's an example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "fit_xs, fit_ys = FitLine(ages, inter, slope)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "Here's a scatterplot of the data with the fitted line."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "thinkplot.Scatter(ages, weights, color='blue', alpha=0.1, s=10)\n",
    "thinkplot.Plot(fit_xs, fit_ys, color='white', linewidth=3)\n",
    "thinkplot.Plot(fit_xs, fit_ys, color='red', linewidth=2)\n",
    "thinkplot.Config(xlabel=\"Mother's age (years)\",\n",
    "                 ylabel='Birth weight (lbs)',\n",
    "                 axis=[10, 45, 0, 15],\n",
    "                 legend=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## Residuals\n",
    "\n",
    "The following functon computes the residuals."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "def Residuals(xs, ys, inter, slope):\n",
    "    xs = np.asarray(xs)\n",
    "    ys = np.asarray(ys)\n",
    "    res = ys - (inter + slope * xs)\n",
    "    return res"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "Now we can add the residuals as a column in the DataFrame."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "live['residual'] = Residuals(ages, weights, inter, slope)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "To visualize the residuals, I'll split the respondents into groups by age, then plot the percentiles of the residuals versus the average age in each group.\n",
    "\n",
    "First I'll make the groups and compute the average age in each group."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "bins = np.arange(10, 48, 3)\n",
    "indices = np.digitize(live.agepreg, bins)\n",
    "groups = live.groupby(indices)\n",
    "\n",
    "age_means = [group.agepreg.mean() for _, group in groups][1:-1]\n",
    "age_means"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "Next I'll compute the CDF of the residuals in each group."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "cdfs = [thinkstats2.Cdf(group.residual) for _, group in groups][1:-1]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "The following function plots percentiles of the residuals against the average age in each group."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "def PlotPercentiles(age_means, cdfs):\n",
    "    thinkplot.PrePlot(3)\n",
    "    for percent in [75, 50, 25]:\n",
    "        weight_percentiles = [cdf.Percentile(percent) for cdf in cdfs]\n",
    "        label = '%dth' % percent\n",
    "        thinkplot.Plot(age_means, weight_percentiles, label=label)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following figure shows the 25th, 50th, and 75th percentiles.\n",
    "\n",
    "Curvature in the residuals suggests a non-linear relationship."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "PlotPercentiles(age_means, cdfs)\n",
    "\n",
    "thinkplot.Config(xlabel=\"Mother's age (years)\",\n",
    "                 ylabel='Residual (lbs)',\n",
    "                 xlim=[10, 45])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## Sampling distribution\n",
    "\n",
    "To estimate the sampling distribution of `inter` and `slope`, I'll use resampling."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "def SampleRows(df, nrows, replace=False):\n",
    "    \"\"\"Choose a sample of rows from a DataFrame.\n",
    "\n",
    "    df: DataFrame\n",
    "    nrows: number of rows\n",
    "    replace: whether to sample with replacement\n",
    "\n",
    "    returns: DataDf\n",
    "    \"\"\"\n",
    "    indices = np.random.choice(df.index, nrows, replace=replace)\n",
    "    sample = df.loc[indices]\n",
    "    return sample\n",
    "\n",
    "def ResampleRows(df):\n",
    "    \"\"\"Resamples rows from a DataFrame.\n",
    "\n",
    "    df: DataFrame\n",
    "\n",
    "    returns: DataFrame\n",
    "    \"\"\"\n",
    "    return SampleRows(df, len(df), replace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "The following function resamples the given dataframe and returns lists of estimates for `inter` and `slope`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "def SamplingDistributions(live, iters=101):\n",
    "    t = []\n",
    "    for _ in range(iters):\n",
    "        sample = ResampleRows(live)\n",
    "        ages = sample.agepreg\n",
    "        weights = sample.totalwgt_lb\n",
    "        estimates = LeastSquares(ages, weights)\n",
    "        t.append(estimates)\n",
    "\n",
    "    inters, slopes = zip(*t)\n",
    "    return inters, slopes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "Here's an example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "inters, slopes = SamplingDistributions(live, iters=1001)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "The following function takes a list of estimates and prints the mean, standard error, and 90% confidence interval."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [],
   "source": [
    "def Summarize(estimates, actual=None):\n",
    "    mean = Mean(estimates)\n",
    "    stderr = Std(estimates, mu=actual)\n",
    "    cdf = thinkstats2.Cdf(estimates)\n",
    "    ci = cdf.ConfidenceInterval(90)\n",
    "    print('mean, SE, CI', mean, stderr, ci)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "Here's  the summary for `inter`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [],
   "source": [
    "Summarize(inters)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "And for `slope`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [],
   "source": [
    "Summarize(slopes)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "**Exercise:** Use `ResampleRows` and generate a list of estimates for the mean birth weight.  Use `Summarize` to compute the SE and CI for these estimates."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## Visualizing uncertainty\n",
    "\n",
    "To show the uncertainty of the estimated slope and intercept, we can generate a fitted line for each resampled estimate and plot them on top of each other."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [],
   "source": [
    "for slope, inter in zip(slopes, inters):\n",
    "    fxs, fys = FitLine(age_means, inter, slope)\n",
    "    thinkplot.Plot(fxs, fys, color='gray', alpha=0.01)\n",
    "    \n",
    "thinkplot.Config(xlabel=\"Mother's age (years)\",\n",
    "                 ylabel='Residual (lbs)',\n",
    "                 xlim=[10, 45])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "Or we can make a neater (and more efficient plot) by computing fitted lines and finding percentiles of the fits for each value of the dependent variable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [],
   "source": [
    "def PlotConfidenceIntervals(xs, inters, slopes, percent=90, **options):\n",
    "    fys_seq = []\n",
    "    for inter, slope in zip(inters, slopes):\n",
    "        fxs, fys = FitLine(xs, inter, slope)\n",
    "        fys_seq.append(fys)\n",
    "\n",
    "    p = (100 - percent) / 2\n",
    "    percents = p, 100 - p\n",
    "    low, high = thinkstats2.PercentileRows(fys_seq, percents)\n",
    "    thinkplot.FillBetween(fxs, low, high, **options)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "This example shows the confidence interval for the fitted values at each mother's age."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [],
   "source": [
    "PlotConfidenceIntervals(age_means, inters, slopes, percent=90, \n",
    "                        color='gray', alpha=0.3, label='90% CI')\n",
    "PlotConfidenceIntervals(age_means, inters, slopes, percent=50,\n",
    "                        color='gray', alpha=0.5, label='50% CI')\n",
    "\n",
    "thinkplot.Config(xlabel=\"Mother's age (years)\",\n",
    "                 ylabel='Residual (lbs)',\n",
    "                 xlim=[10, 45])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## Coefficient of determination\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The coefficient compares the variance of the residuals to the variance of the dependent variable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [],
   "source": [
    "def CoefDetermination(ys, res):\n",
    "    return 1 - Var(res) / Var(ys)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For birth weight and mother's age $R^2$ is very small, indicating that the mother's age predicts a small part of the variance in birth weight."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [],
   "source": [
    "inter, slope = LeastSquares(ages, weights)\n",
    "res = Residuals(ages, weights, inter, slope)\n",
    "r2 = CoefDetermination(weights, res)\n",
    "r2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "We can confirm that $R^2 = \\rho^2$:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [],
   "source": [
    "print('rho', thinkstats2.Corr(ages, weights))\n",
    "print('R', np.sqrt(r2))    "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "To express predictive power, I think it's useful to compare the standard deviation of the residuals to the standard deviation of the dependent variable, as a measure RMSE if you try to guess birth weight with and without taking into account mother's age."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [],
   "source": [
    "print('Std(ys)', Std(weights))\n",
    "print('Std(res)', Std(res))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "As another example of the same idea, here's how much we can improve guesses about IQ if we know someone's SAT scores."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [],
   "source": [
    "var_ys = 15**2\n",
    "rho = 0.72\n",
    "r2 = rho**2\n",
    "var_res = (1 - r2) * var_ys\n",
    "std_res = np.sqrt(var_res)\n",
    "std_res"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## Hypothesis testing with slopes\n",
    "\n",
    "Here's a `HypothesisTest` that uses permutation to test whether the observed slope is statistically significant."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [],
   "source": [
    "class SlopeTest(thinkstats2.HypothesisTest):\n",
    "\n",
    "    def TestStatistic(self, data):\n",
    "        ages, weights = data\n",
    "        _, slope = thinkstats2.LeastSquares(ages, weights)\n",
    "        return slope\n",
    "\n",
    "    def MakeModel(self):\n",
    "        _, weights = self.data\n",
    "        self.ybar = weights.mean()\n",
    "        self.res = weights - self.ybar\n",
    "\n",
    "    def RunModel(self):\n",
    "        ages, _ = self.data\n",
    "        weights = self.ybar + np.random.permutation(self.res)\n",
    "        return ages, weights"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "And it is."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [],
   "source": [
    "ht = SlopeTest((ages, weights))\n",
    "pvalue = ht.PValue()\n",
    "pvalue"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "Under the null hypothesis, the largest slope we observe after 1000 tries is substantially less than the observed value."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [],
   "source": [
    "ht.actual, ht.MaxTestStat()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "We can also use resampling to estimate the sampling distribution of the slope."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [],
   "source": [
    "sampling_cdf = thinkstats2.Cdf(slopes)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "The distribution of slopes under the null hypothesis, and the sampling distribution of the slope under resampling, have the same shape, but one has mean at 0 and the other has mean at the observed slope.\n",
    "\n",
    "To compute a p-value, we can count how often the estimated slope under the null hypothesis exceeds the observed slope, or how often the estimated slope under resampling falls below 0."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [],
   "source": [
    "thinkplot.PrePlot(2)\n",
    "thinkplot.Plot([0, 0], [0, 1], color='0.8')\n",
    "ht.PlotCdf(label='null hypothesis')\n",
    "\n",
    "thinkplot.Cdf(sampling_cdf, label='sampling distribution')\n",
    "\n",
    "thinkplot.Config(xlabel='slope (lbs / year)',\n",
    "                   ylabel='CDF',\n",
    "                   xlim=[-0.03, 0.03],\n",
    "                   legend=True, loc='upper left')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "Here's how to get a p-value from the sampling distribution."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [],
   "source": [
    "pvalue = sampling_cdf[0]\n",
    "pvalue"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## Resampling with weights\n",
    "\n",
    "Resampling provides a convenient way to take into account the sampling weights associated with respondents in a stratified survey design.\n",
    "\n",
    "The following function resamples rows with probabilities proportional to weights."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [],
   "source": [
    "def ResampleRowsWeighted(df, column='finalwgt'):\n",
    "    weights = df[column]\n",
    "    cdf = thinkstats2.Cdf(dict(weights))\n",
    "    indices = cdf.Sample(len(weights))\n",
    "    sample = df.loc[indices]\n",
    "    return sample"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can use it to estimate the mean birthweight and compute SE and CI."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [],
   "source": [
    "iters = 100\n",
    "estimates = [ResampleRowsWeighted(live).totalwgt_lb.mean()\n",
    "             for _ in range(iters)]\n",
    "Summarize(estimates)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "And here's what the same calculation looks like if we ignore the weights."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [],
   "source": [
    "estimates = [thinkstats2.ResampleRows(live).totalwgt_lb.mean()\n",
    "             for _ in range(iters)]\n",
    "Summarize(estimates)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "The difference is non-negligible, which suggests that there are differences in birth weight between the strata in the survey."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exercises\n",
    "\n",
    "**Exercise:** Using the data from the BRFSS, compute the linear least squares fit for log(weight) versus height. How would you best present the estimated parameters for a model like this where one of the variables is log-transformed? If you were trying to guess someone’s weight, how much would it help to know their height?\n",
    "\n",
    "Like the NSFG, the BRFSS oversamples some groups and provides a sampling weight for each respondent. In the BRFSS data, the variable name for these weights is totalwt. Use resampling, with and without weights, to estimate the mean height of respondents in the BRFSS, the standard error of the mean, and a 90% confidence interval. How much does correct weighting affect the estimates?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Read the BRFSS data and extract heights and log weights."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [],
   "source": [
    "download(\"https://github.com/AllenDowney/ThinkStats2/raw/master/code/brfss.py\")\n",
    "download(\"https://github.com/AllenDowney/ThinkStats2/raw/master/code/CDBRFS08.ASC.gz\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [],
   "source": [
    "import brfss\n",
    "\n",
    "df = brfss.ReadBrfss(nrows=None)\n",
    "df = df.dropna(subset=['htm3', 'wtkg2'])\n",
    "heights, weights = df.htm3, df.wtkg2\n",
    "log_weights = np.log10(weights)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Estimate intercept and slope."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Make a scatter plot of the data and show the fitted line."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Make the same plot but apply the inverse transform to show weights on a linear (not log) scale."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Plot percentiles of the residuals."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Compute correlation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Compute coefficient of determination."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Confirm that $R^2 = \\rho^2$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Compute Std(ys), which is the RMSE of predictions that don't use height."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Compute Std(res), the RMSE of predictions that do use height."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "How much does height information reduce RMSE?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Use resampling to compute sampling distributions for inter and slope."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Plot the sampling distribution of slope."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Compute the p-value of the slope."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Compute the 90% confidence interval of slope."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Compute the mean of the sampling distribution."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Compute the standard deviation of the sampling distribution, which is the standard error."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Resample rows without weights, compute mean height, and summarize results."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Resample rows with weights.  Note that the weight column in this dataset is called `finalwt`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}