{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Think Bayes\n", "\n", "This notebook presents example code and exercise solutions for Think Bayes.\n", "\n", "Copyright 2016 Allen B. Downey\n", "\n", "MIT License: https://opensource.org/licenses/MIT" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Configure Jupyter so figures appear in the notebook\n", "%matplotlib inline\n", "\n", "# Configure Jupyter to display the assigned value after an assignment\n", "%config InteractiveShell.ast_node_interactivity='last_expr_or_assign'\n", "\n", "# import classes from thinkbayes2\n", "from thinkbayes2 import Pmf, Suite\n", "\n", "import pandas as pd\n", "import numpy as np\n", "\n", "import thinkplot" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Interpreting medical tests\n", "\n", "Suppose you are a doctor treating a 40-year old female patient. After she gets a routine screening mammogram, the result comes back positive (defined below).\n", "\n", "The patient asks whether this result indicates that she has breast cancer. You interpret this question as, \"What is the probability that this patient has breast cancer, given a positive test result?\"\n", "\n", "How would you respond?\n", "\n", "The following background information from the Breast Cancer Screening Consortium (BCSC) might help:\n", "\n", "[Cancer Rate (per 1,000 examinations) and Cancer Detection Rate (per 1,000 examinations) for 1,838,372 Screening Mammography Examinations from 2004 to 2008 by Age -- based on BCSC data through 2009](http://www.bcsc-research.org/statistics/performance/screening/2009/rate_age.html).\n", "\n", "[Performance Measures for 1,838,372 Screening Mammography Examinations1 from 2004 to 2008 by Age -- based on BCSC data through 2009](http://www.bcsc-research.org/statistics/performance/screening/2009/perf_age.html)." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "class BayesTable(pd.DataFrame):\n", " def __init__(self, hypo, prior=1, **options):\n", " columns = ['prior', 'likelihood', 'unnorm', 'posterior']\n", " super().__init__(index=hypo, columns=columns, **options)\n", " self.prior = prior\n", " \n", " def mult(self):\n", " self.unnorm = self.prior * self.likelihood\n", " \n", " def norm(self):\n", " nc = np.sum(self.unnorm)\n", " self.posterior = self.unnorm / nc\n", " return nc\n", " \n", " def update(self):\n", " self.mult()\n", " return self.norm()\n", " \n", " def reset(self):\n", " return BayesTable(self.hypo, self.posterior)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Assumptions and interpretation\n", "\n", "According to [the first table](http://www.bcsc-research.org/statistics/performance/screening/2009/rate_age.html), the cancer rate per 1000 examinations is 2.65 for women age 40-44. The notes explain that this rate is based on \"the number of examinations with a tissue diagnosis of ductal carcinoma in situ or invasive cancer within 1 year following the examination and before the next screening mammography examination\", so it would be more precise to say that it is the rate of diagnosis within a year of the examination, not the rate of actual cancers.\n", "\n", "Since untreated invasive breast cancer is likely to become symptomatic, we expect a large fraction of cancers to be diagnosed eventually. But there might be a long delay between developing a cancer and diagnosis, and a patient might die of another cause before diagnosis. So we should consider this rate as a lower bound on the probability that a patient has cancer at the time of the examination.\n", "\n", "According to [the second table](http://www.bcsc-research.org/statistics/performance/screening/2009/perf_age.html), the sensitivity of the test for women in this age group is 73.4%; the specificity is 87.7%. From these, we can get the conditional probabilities:\n", "\n", "```\n", "P(positive test | cancer) = sensitivity\n", "P(positive test | no cancer) = (1 - specificity)\n", "```\n", "\n", "Now we can use a Bayes table to compute the probability we are interested in, `P(cancer | positive test)`" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
priorlikelihoodunnormposterior
cancer0.00265NaNNaNNaN
no cancer0.99735NaNNaNNaN
\n", "
" ], "text/plain": [ " prior likelihood unnorm posterior\n", "cancer 0.00265 NaN NaN NaN\n", "no cancer 0.99735 NaN NaN NaN" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "base_rate = 2.65 / 1000\n", "hypo = ['cancer', 'no cancer']\n", "prior = [base_rate, 1-base_rate]\n", "table = BayesTable(hypo, prior)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
priorlikelihoodunnormposterior
cancer0.002650.734NaNNaN
no cancer0.997350.123NaNNaN
\n", "
" ], "text/plain": [ " prior likelihood unnorm posterior\n", "cancer 0.00265 0.734 NaN NaN\n", "no cancer 0.99735 0.123 NaN NaN" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sensitivity = 0.734\n", "specificity = 0.877\n", "table.likelihood = [sensitivity, 1-specificity]\n", "table" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "5.967479674796748" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "likelihood_ratio = table.likelihood['cancer'] / table.likelihood['no cancer']" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
priorlikelihoodunnormposterior
cancer0.002650.7340.0019450.015608
no cancer0.997350.1230.1226740.984392
\n", "
" ], "text/plain": [ " prior likelihood unnorm posterior\n", "cancer 0.00265 0.734 0.001945 0.015608\n", "no cancer 0.99735 0.123 0.122674 0.984392" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "table.update()\n", "table" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1.5608355537652119" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "table.posterior['cancer'] * 100" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So there is a 1.56% chance that this patient has cancer, given that the initial screening mammogram was positive.\n", "\n", "This result is called the positive predictive value (PPV) of the test, which we could have read from [the second table](http://www.bcsc-research.org/statistics/performance/screening/2009/perf_age.html)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This data was the basis, in 2009, for the recommendation of the US Preventive Services Task Force," ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [], "source": [ "def compute_ppv(base_rate, sensitivity, specificity):\n", " pmf = Pmf()\n", " pmf['cancer'] = base_rate * sensitivity\n", " pmf['no cancer'] = (1 - base_rate) * (1 - specificity)\n", " pmf.Normalize()\n", " return pmf" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Pmf({'cancer': 0.01560835553765212, 'no cancer': 0.9843916444623478})" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pmf = compute_ppv(base_rate, sensitivity, specificity)" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "40 2.65\n", "50 4.28\n", "60 5.70\n", "70 6.76\n", "80 8.51\n", "dtype: float64" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ages = [40, 50, 60, 70, 80]\n", "rates = pd.Series([2.65, 4.28, 5.70, 6.76, 8.51], index=ages)\n", " " ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "40 1.116493987314525\n", "50 1.1473441243499096\n", "60 1.1603294783259839\n", "70 1.166569488592548\n", "80 1.173548315582017\n" ] } ], "source": [ "for age, rate in rates.items():\n", " pmf = compute_ppv(rate, sensitivity, specificity)\n", " print(age, pmf['cancer'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 1 }