{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Chapter 13\n", "\n", "Examples and Exercises from Think Stats, 2nd Edition\n", "\n", "http://thinkstats2.com\n", "\n", "Copyright 2016 Allen B. Downey\n", "\n", "MIT License: https://opensource.org/licenses/MIT\n" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "from os.path import basename, exists\n", "\n", "\n", "def download(url):\n", " filename = basename(url)\n", " if not exists(filename):\n", " from urllib.request import urlretrieve\n", "\n", " local, _ = urlretrieve(url, filename)\n", " print(\"Downloaded \" + local)\n", "\n", "\n", "download(\"https://github.com/AllenDowney/ThinkStats2/raw/master/code/thinkstats2.py\")\n", "download(\"https://github.com/AllenDowney/ThinkStats2/raw/master/code/thinkplot.py\")" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [], "source": [ "import thinkstats2\n", "import thinkplot" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 89, "metadata": {}, "outputs": [], "source": [ "try:\n", " import empiricaldist\n", "except ImportError:\n", " !pip install empiricaldist" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Survival analysis\n", "\n", "If we have an unbiased sample of complete lifetimes, we can compute the survival function from the CDF and the hazard function from the survival function.\n", "\n", "Here's the distribution of pregnancy length in the NSFG dataset." ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [], "source": [ "download(\"https://github.com/AllenDowney/ThinkStats2/raw/master/code/nsfg.py\")\n", "download(\"https://github.com/AllenDowney/ThinkStats2/raw/master/code/2002FemPreg.dct\")\n", "download(\n", " \"https://github.com/AllenDowney/ThinkStats2/raw/master/code/2002FemPreg.dat.gz\"\n", ")" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [], "source": [ "import nsfg\n", "\n", "preg = nsfg.ReadFemPreg()\n", "complete = preg.query(\"outcome in [1, 3, 4]\").prglngth\n", "cdf = thinkstats2.Cdf(complete, label=\"cdf\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The survival function is just the complementary CDF." ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [], "source": [ "download(\"https://github.com/AllenDowney/ThinkStats2/raw/master/code/survival.py\")" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [], "source": [ "import survival\n", "\n", "\n", "def MakeSurvivalFromCdf(cdf, label=\"\"):\n", " \"\"\"Makes a survival function based on a CDF.\n", "\n", " cdf: Cdf\n", "\n", " returns: SurvivalFunction\n", " \"\"\"\n", " ts = cdf.xs\n", " ss = 1 - cdf.ps\n", " return survival.SurvivalFunction(ts, ss, label)" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [], "source": [ "sf = MakeSurvivalFromCdf(cdf, label=\"survival\")" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [], "source": [ "print(cdf[13])\n", "print(sf[13])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's the CDF and SF." ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [], "source": [ "thinkplot.Plot(sf)\n", "thinkplot.Cdf(cdf, alpha=0.2)\n", "thinkplot.Config(loc=\"center left\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And here's the hazard function." ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [], "source": [ "hf = sf.MakeHazardFunction(label=\"hazard\")\n", "print(hf[39])" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [], "source": [ "thinkplot.Plot(hf)\n", "thinkplot.Config(ylim=[0, 0.75], loc=\"upper left\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Age at first marriage\n", "\n", "We'll use the NSFG respondent file to estimate the hazard function and survival function for age at first marriage." ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [], "source": [ "download(\"https://github.com/AllenDowney/ThinkStats2/raw/master/code/2002FemResp.dct\")\n", "download(\n", " \"https://github.com/AllenDowney/ThinkStats2/raw/master/code/2002FemResp.dat.gz\"\n", ")" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [], "source": [ "resp6 = nsfg.ReadFemResp()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have to clean up a few variables." ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [], "source": [ "resp6.cmmarrhx.replace([9997, 9998, 9999], np.nan, inplace=True)\n", "resp6[\"agemarry\"] = (resp6.cmmarrhx - resp6.cmbirth) / 12.0\n", "resp6[\"age\"] = (resp6.cmintvw - resp6.cmbirth) / 12.0" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And the extract the age at first marriage for people who are married, and the age at time of interview for people who are not." ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [], "source": [ "complete = resp6[resp6.evrmarry == 1].agemarry.dropna()\n", "ongoing = resp6[resp6.evrmarry == 0].age" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following function uses Kaplan-Meier to estimate the hazard function." ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [], "source": [ "from collections import Counter\n", "\n", "\n", "def EstimateHazardFunction(complete, ongoing, label=\"\", verbose=False):\n", " \"\"\"Estimates the hazard function by Kaplan-Meier.\n", "\n", " http://en.wikipedia.org/wiki/Kaplan%E2%80%93Meier_estimator\n", "\n", " complete: list of complete lifetimes\n", " ongoing: list of ongoing lifetimes\n", " label: string\n", " verbose: whether to display intermediate results\n", " \"\"\"\n", " if np.sum(np.isnan(complete)):\n", " raise ValueError(\"complete contains NaNs\")\n", " if np.sum(np.isnan(ongoing)):\n", " raise ValueError(\"ongoing contains NaNs\")\n", "\n", " hist_complete = Counter(complete)\n", " hist_ongoing = Counter(ongoing)\n", "\n", " ts = list(hist_complete | hist_ongoing)\n", " ts.sort()\n", "\n", " at_risk = len(complete) + len(ongoing)\n", "\n", " lams = pd.Series(index=ts, dtype=float)\n", " for t in ts:\n", " ended = hist_complete[t]\n", " censored = hist_ongoing[t]\n", "\n", " lams[t] = ended / at_risk\n", " if verbose:\n", " print(t, at_risk, ended, censored, lams[t])\n", " at_risk -= ended + censored\n", "\n", " return survival.HazardFunction(lams, label=label)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is the hazard function and corresponding survival function. " ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [], "source": [ "hf = EstimateHazardFunction(complete, ongoing)\n", "thinkplot.Plot(hf)\n", "thinkplot.Config(xlabel=\"Age (years)\", ylabel=\"Hazard\")" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [], "source": [ "sf = hf.MakeSurvival()\n", "thinkplot.Plot(sf)\n", "thinkplot.Config(xlabel=\"Age (years)\", ylabel=\"Prob unmarried\", ylim=[0, 1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Quantifying uncertainty\n", "\n", "To see how much the results depend on random sampling, we'll use a resampling process again." ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [], "source": [ "def EstimateMarriageSurvival(resp):\n", " \"\"\"Estimates the survival curve.\n", "\n", " resp: DataFrame of respondents\n", "\n", " returns: pair of HazardFunction, SurvivalFunction\n", " \"\"\"\n", " # NOTE: Filling missing values would be better than dropping them.\n", " complete = resp[resp.evrmarry == 1].agemarry.dropna()\n", " ongoing = resp[resp.evrmarry == 0].age\n", "\n", " hf = EstimateHazardFunction(complete, ongoing)\n", " sf = hf.MakeSurvival()\n", "\n", " return hf, sf" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [], "source": [ "def ResampleSurvival(resp, iters=101):\n", " \"\"\"Resamples respondents and estimates the survival function.\n", "\n", " resp: DataFrame of respondents\n", " iters: number of resamples\n", " \"\"\"\n", " _, sf = EstimateMarriageSurvival(resp)\n", " thinkplot.Plot(sf)\n", "\n", " low, high = resp.agemarry.min(), resp.agemarry.max()\n", " ts = np.arange(low, high, 1 / 12.0)\n", "\n", " ss_seq = []\n", " for _ in range(iters):\n", " sample = thinkstats2.ResampleRowsWeighted(resp)\n", " _, sf = EstimateMarriageSurvival(sample)\n", " ss_seq.append(sf.Probs(ts))\n", "\n", " low, high = thinkstats2.PercentileRows(ss_seq, [5, 95])\n", " thinkplot.FillBetween(ts, low, high, color=\"gray\", label=\"90% CI\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following plot shows the survival function based on the raw data and a 90% CI based on resampling." ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [], "source": [ "ResampleSurvival(resp6)\n", "thinkplot.Config(\n", " xlabel=\"Age (years)\",\n", " ylabel=\"Prob unmarried\",\n", " xlim=[12, 46],\n", " ylim=[0, 1],\n", " loc=\"upper right\",\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The SF based on the raw data falls outside the 90% CI because the CI is based on weighted resampling, and the raw data is not. You can confirm that by replacing `ResampleRowsWeighted` with `ResampleRows` in `ResampleSurvival`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## More data\n", "\n", "To generate survivial curves for each birth cohort, we need more data, which we can get by combining data from several NSFG cycles." ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [], "source": [ "download(\n", " \"https://github.com/AllenDowney/ThinkStats2/raw/master/code/1995FemRespData.dat.gz\"\n", ")\n", "download(\n", " \"https://github.com/AllenDowney/ThinkStats2/raw/master/code/2006_2010_FemRespSetup.dct\"\n", ")\n", "download(\n", " \"https://github.com/AllenDowney/ThinkStats2/raw/master/code/2006_2010_FemResp.dat.gz\"\n", ")" ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [], "source": [ "resp5 = survival.ReadFemResp1995()\n", "resp6 = survival.ReadFemResp2002()\n", "resp7 = survival.ReadFemResp2010()" ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [], "source": [ "resps = [resp5, resp6, resp7]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following is the code from `survival.py` that generates SFs broken down by decade of birth." ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [], "source": [ "def AddLabelsByDecade(groups, **options):\n", " \"\"\"Draws fake points in order to add labels to the legend.\n", "\n", " groups: GroupBy object\n", " \"\"\"\n", " thinkplot.PrePlot(len(groups))\n", " for name, _ in groups:\n", " label = \"%d0s\" % name\n", " thinkplot.Plot([15], [1], label=label, **options)\n", "\n", "\n", "def EstimateMarriageSurvivalByDecade(groups, **options):\n", " \"\"\"Groups respondents by decade and plots survival curves.\n", "\n", " groups: GroupBy object\n", " \"\"\"\n", " thinkplot.PrePlot(len(groups))\n", " for _, group in groups:\n", " _, sf = EstimateMarriageSurvival(group)\n", " thinkplot.Plot(sf, **options)\n", "\n", "\n", "def PlotResampledByDecade(resps, iters=11, predict_flag=False, omit=None):\n", " \"\"\"Plots survival curves for resampled data.\n", "\n", " resps: list of DataFrames\n", " iters: number of resamples to plot\n", " predict_flag: whether to also plot predictions\n", " \"\"\"\n", " for i in range(iters):\n", " samples = [thinkstats2.ResampleRowsWeighted(resp) for resp in resps]\n", " sample = pd.concat(samples, ignore_index=True)\n", " groups = sample.groupby(\"decade\")\n", "\n", " if omit:\n", " groups = [(name, group) for name, group in groups if name not in omit]\n", "\n", " # TODO: refactor this to collect resampled estimates and\n", " # plot shaded areas\n", " if i == 0:\n", " AddLabelsByDecade(groups, alpha=0.7)\n", "\n", " if predict_flag:\n", " PlotPredictionsByDecade(groups, alpha=0.1)\n", " EstimateMarriageSurvivalByDecade(groups, alpha=0.1)\n", " else:\n", " EstimateMarriageSurvivalByDecade(groups, alpha=0.2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here are the results for the combined data." ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [], "source": [ "PlotResampledByDecade(resps)\n", "thinkplot.Config(\n", " xlabel=\"Age (years)\", ylabel=\"Prob unmarried\", xlim=[13, 45], ylim=[0, 1]\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can generate predictions by assuming that the hazard function of each generation will be the same as for the previous generation." ] }, { "cell_type": "code", "execution_count": 76, "metadata": {}, "outputs": [], "source": [ "def PlotPredictionsByDecade(groups, **options):\n", " \"\"\"Groups respondents by decade and plots survival curves.\n", "\n", " groups: GroupBy object\n", " \"\"\"\n", " hfs = []\n", " for _, group in groups:\n", " hf, sf = EstimateMarriageSurvival(group)\n", " hfs.append(hf)\n", "\n", " thinkplot.PrePlot(len(hfs))\n", " for i, hf in enumerate(hfs):\n", " if i > 0:\n", " hf.Extend(hfs[i - 1])\n", " sf = hf.MakeSurvival()\n", " thinkplot.Plot(sf, **options)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And here's what that looks like." ] }, { "cell_type": "code", "execution_count": 77, "metadata": {}, "outputs": [], "source": [ "PlotResampledByDecade(resps, predict_flag=True)\n", "thinkplot.Config(\n", " xlabel=\"Age (years)\", ylabel=\"Prob unmarried\", xlim=[13, 45], ylim=[0, 1]\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Remaining lifetime\n", "\n", "Distributions with difference shapes yield different behavior for remaining lifetime as a function of age." ] }, { "cell_type": "code", "execution_count": 78, "metadata": {}, "outputs": [], "source": [ "preg = nsfg.ReadFemPreg()\n", "\n", "complete = preg.query(\"outcome in [1, 3, 4]\").prglngth\n", "print(\"Number of complete pregnancies\", len(complete))\n", "ongoing = preg[preg.outcome == 6].prglngth\n", "print(\"Number of ongoing pregnancies\", len(ongoing))\n", "\n", "hf = EstimateHazardFunction(complete, ongoing)\n", "sf1 = hf.MakeSurvival()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's the expected remaining duration of a pregnancy as a function of the number of weeks elapsed. After week 36, the process becomes \"memoryless\"." ] }, { "cell_type": "code", "execution_count": 79, "metadata": {}, "outputs": [], "source": [ "rem_life1 = sf1.RemainingLifetime()\n", "thinkplot.Plot(rem_life1)\n", "thinkplot.Config(\n", " title=\"Remaining pregnancy length\", xlabel=\"Weeks\", ylabel=\"Mean remaining weeks\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And here's the median remaining time until first marriage as a function of age." ] }, { "cell_type": "code", "execution_count": 80, "metadata": {}, "outputs": [], "source": [ "hf, sf2 = EstimateMarriageSurvival(resp6)" ] }, { "cell_type": "code", "execution_count": 81, "metadata": {}, "outputs": [], "source": [ "func = lambda pmf: pmf.Percentile(50)\n", "rem_life2 = sf2.RemainingLifetime(filler=np.inf, func=func)\n", "\n", "thinkplot.Plot(rem_life2)\n", "thinkplot.Config(\n", " title=\"Years until first marriage\",\n", " ylim=[0, 15],\n", " xlim=[11, 31],\n", " xlabel=\"Age (years)\",\n", " ylabel=\"Median remaining years\",\n", ")" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## Exercises" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "**Exercise:** In NSFG Cycles 6 and 7, the variable `cmdivorcx` contains the date of divorce for the respondent’s first marriage, if applicable, encoded in century-months.\n", "\n", "Compute the duration of marriages that have ended in divorce, and the duration, so far, of marriages that are ongoing. Estimate the hazard and survival curve for the duration of marriage.\n", "\n", "Use resampling to take into account sampling weights, and plot data from several resamples to visualize sampling error.\n", "\n", "Consider dividing the respondents into groups by decade of birth, and possibly by age at first marriage." ] }, { "cell_type": "code", "execution_count": 82, "metadata": {}, "outputs": [], "source": [ "def CleanData(resp):\n", " \"\"\"Cleans respondent data.\n", "\n", " resp: DataFrame\n", " \"\"\"\n", " resp.cmdivorcx.replace([9998, 9999], np.nan, inplace=True)\n", "\n", " resp[\"notdivorced\"] = resp.cmdivorcx.isnull().astype(int)\n", " resp[\"duration\"] = (resp.cmdivorcx - resp.cmmarrhx) / 12.0\n", " resp[\"durationsofar\"] = (resp.cmintvw - resp.cmmarrhx) / 12.0\n", "\n", " month0 = pd.to_datetime(\"1899-12-15\")\n", " dates = [month0 + pd.DateOffset(months=cm) for cm in resp.cmbirth]\n", " resp[\"decade\"] = (pd.DatetimeIndex(dates).year - 1900) // 10" ] }, { "cell_type": "code", "execution_count": 83, "metadata": {}, "outputs": [], "source": [ "CleanData(resp6)\n", "married6 = resp6[resp6.evrmarry == 1]\n", "\n", "CleanData(resp7)\n", "married7 = resp7[resp7.evrmarry == 1]" ] }, { "cell_type": "code", "execution_count": 84, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 85, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 86, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 87, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 88, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.11" } }, "nbformat": 4, "nbformat_minor": 1 }