{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Chapter 2\n", "\n", "Examples and Exercises from Think Stats, 2nd Edition\n", "\n", "http://thinkstats2.com\n", "\n", "Copyright 2016 Allen B. Downey\n", "\n", "MIT License: https://opensource.org/licenses/MIT\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from os.path import basename, exists\n", "\n", "\n", "def download(url):\n", "    filename = basename(url)\n", "    if not exists(filename):\n", "        from urllib.request import urlretrieve\n", "\n", "        local, _ = urlretrieve(url, filename)\n", "        print(\"Downloaded \" + local)\n", "\n", "\n", "download(\"https://github.com/AllenDowney/ThinkStats2/raw/master/code/thinkstats2.py\")\n", "download(\"https://github.com/AllenDowney/ThinkStats2/raw/master/code/thinkplot.py\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Given a list of values, there are several ways to count the frequency of each value."
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "t = [1, 2, 2, 3, 5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can use a Python dictionary:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "hist = {}\n", "for x in t:\n", "    hist[x] = hist.get(x, 0) + 1\n", "\n", "hist" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can use a `Counter` (which is a dictionary with additional methods):" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "from collections import Counter\n", "counter = Counter(t)\n", "counter" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or you can use the `Hist` object provided by `thinkstats2`:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "import thinkstats2\n", "hist = thinkstats2.Hist([1, 2, 2, 3, 5])\n", "hist" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`Hist` provides `Freq`, which looks up the frequency of a value." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "hist.Freq(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also use the bracket operator, which does the same thing." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "hist[2]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If the value does not appear, it has frequency 0."
] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "hist[4]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `Values` method returns the values:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "hist.Values()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So you can iterate over the values and their frequencies like this:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "for val in sorted(hist.Values()):\n", "    print(val, hist[val])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or you can use the `Items` method:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "for val, freq in hist.Items():\n", "    print(val, freq)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`thinkplot` is a wrapper for `matplotlib` that provides functions that work with the objects in `thinkstats2`.\n", "\n", "For example, `thinkplot.Hist` plots the values and their frequencies as a bar graph.\n", "\n", "`thinkplot.Config` takes parameters that label the x and y axes, among other things." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "import thinkplot\n", "thinkplot.Hist(hist)\n", "thinkplot.Config(xlabel='value', ylabel='frequency')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As an example, I'll replicate some of the figures from the book.\n", "\n", "First, I'll load the data from the pregnancy file and select the records for live births."
] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "download(\"https://github.com/AllenDowney/ThinkStats2/raw/master/code/nsfg.py\")\n", "\n", "download(\"https://github.com/AllenDowney/ThinkStats2/raw/master/code/2002FemPreg.dct\")\n", "download(\n", " \"https://github.com/AllenDowney/ThinkStats2/raw/master/code/2002FemPreg.dat.gz\"\n", ")" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "import nsfg" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "preg = nsfg.ReadFemPreg()\n", "live = preg[preg.outcome == 1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's the histogram of birth weights in pounds. Notice that `Hist` works with anything iterable, including a Pandas Series. The `label` attribute appears in the legend when you plot the `Hist`. " ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "hist = thinkstats2.Hist(live.birthwgt_lb, label='birthwgt_lb')\n", "thinkplot.Hist(hist)\n", "thinkplot.Config(xlabel='Birth weight (pounds)', ylabel='Count')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before plotting the ages, I'll apply `floor` to round down:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "ages = np.floor(live.agepreg)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "hist = thinkstats2.Hist(ages, label='agepreg')\n", "thinkplot.Hist(hist)\n", "thinkplot.Config(xlabel='years', ylabel='Count')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As an exercise, plot the histogram of pregnancy lengths (column `prglngth`)." 
] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`Hist` provides `Smallest`, which selects the lowest values and their frequencies." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "for weeks, freq in hist.Smallest(10):\n", "    print(weeks, freq)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use `Largest` to display the longest pregnancy lengths." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From live births, we can select first babies and others using `birthord`, then compute histograms of pregnancy length for the two groups." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "firsts = live[live.birthord == 1]\n", "others = live[live.birthord != 1]\n", "\n", "first_hist = thinkstats2.Hist(firsts.prglngth, label='first')\n", "other_hist = thinkstats2.Hist(others.prglngth, label='other')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use `width` and `align` to plot two histograms side-by-side."
] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "width = 0.45\n", "thinkplot.PrePlot(2)\n", "thinkplot.Hist(first_hist, align='right', width=width)\n", "thinkplot.Hist(other_hist, align='left', width=width)\n", "thinkplot.Config(xlabel='weeks', ylabel='Count', xlim=[27, 46])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`Series` provides methods to compute summary statistics:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "mean = live.prglngth.mean()\n", "var = live.prglngth.var()\n", "std = live.prglngth.std()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here are the mean and standard deviation:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "mean, std" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As an exercise, confirm that `std` is the square root of `var`:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here are the mean pregnancy lengths for first babies and others:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "firsts.prglngth.mean(), others.prglngth.mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And here's the difference (in weeks):" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "firsts.prglngth.mean() - others.prglngth.mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This function computes the Cohen effect size, which is the difference in means expressed in number of standard deviations:" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "def CohenEffectSize(group1, group2):\n", "    \"\"\"Computes Cohen's effect size for two groups.\n", "\n", "    group1: Series or DataFrame\n", "    group2: Series or DataFrame\n", "\n", "    returns: float if the arguments are Series;\n", "             Series if the arguments are DataFrames\n", "    \"\"\"\n", "    diff = group1.mean() - group2.mean()\n", "\n", "    var1 = group1.var()\n", "    var2 = group2.var()\n", "    n1, n2 = len(group1), len(group2)\n", "\n", "    pooled_var = (n1 * var1 + n2 * var2) / (n1 + n2)\n", "    d = diff / np.sqrt(pooled_var)\n", "    return d"
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Compute the Cohen effect size for the difference in pregnancy length for first babies and others." ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## Exercises" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using the variable `totalwgt_lb`, investigate whether first babies are lighter or heavier than others.\n", "\n", "Compute Cohen’s effect size to quantify the difference between the groups. How does it compare to the difference in pregnancy length?" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the next few exercises, we'll load the respondent file:" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [], "source": [ "download(\"https://github.com/AllenDowney/ThinkStats2/raw/master/code/2002FemResp.dct\")\n", "download(\"https://github.com/AllenDowney/ThinkStats2/raw/master/code/2002FemResp.dat.gz\")" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [], "source": [ "resp = nsfg.ReadFemResp()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Make a histogram of `totincr`, the total income for the respondent's family. To interpret the codes, see the [codebook](ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Dataset_Documentation/NSFG/Cycle6Codebook-Pregnancy.pdf)."
] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Make a histogram of `age_r`, the respondent's age at the time of interview." ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Make a histogram of `numfmhh`, the number of people in the respondent's household." ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Make a histogram of `parity`, the number of children borne by the respondent. How would you describe this distribution?" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use `Hist.Largest` to find the largest values of `parity`." ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's investigate whether people with higher income have higher parity. Keep in mind that in this study, we are observing different people at different times during their lives, so this data is not the best choice for answering this question. But for now let's take it at face value.\n", "\n", "Use `totincr` to select the respondents with the highest income (level 14). Plot the histogram of parity for just the high income respondents." ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Find the largest parities for high income respondents."
] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Compare the mean parity for high income respondents and others." ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Compute the Cohen effect size for this difference. How does it compare with the difference in pregnancy length for first babies and others?" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.11" } }, "nbformat": 4, "nbformat_minor": 1 }