{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Bayesian interpretation of medical tests\n", "-----------------------------------------\n", "\n", "This notebooks explores several problems related to interpreting the results of medical tests.\n", "\n", "Copyright 2016 Allen Downey\n", "\n", "MIT License: http://opensource.org/licenses/MIT" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from __future__ import print_function, division\n", "\n", "from thinkbayes2 import Pmf, Suite\n", "\n", "from fractions import Fraction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Medical tests\n", "\n", "Suppose we test a patient to see if they have a disease, and the test comes back positive. What is the probability that the patient is actually sick (that is, has the disease)?\n", "\n", "To answer this question, we need to know:\n", "\n", "* The prevalence of the disease in the population the patient is from. Let's assume the patient is identified as a member of a population where the known prevalence is `p`.\n", "\n", "* The sensitivity of the test, `s`, which is the probability of a positive test if the patient is sick.\n", "\n", "* The false positive rate of the test, `t`, which is the probability of a positive test if the patient is not sick.\n", "\n", "Given these parameters, we can compute the probability that the patient is sick, given a positive test.\n", "\n", "### Test class\n", "\n", "To do that, I'll define a `Test` class that extends `Suite`, so it inherits `Update` and provides `Likelihood`.\n", "\n", "The instance variables of `Test` are:\n", "\n", "* `p`, `s`, and `t`: Copies of the parameters.\n", "* `d`: a dictionary that maps from hypotheses to their probabilities. The hypotheses are the strings `sick` and `notsick`.\n", "* `likelihood`: a dictionary that encodes the likelihood of the possible data values `pos` and `neg` under the hypotheses.\n", "\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [], "source": [ "class Test(Suite):\n", " \"\"\"Represents beliefs about a patient based on a medical test.\"\"\"\n", " \n", " def __init__(self, p, s, t, label='Test'):\n", " # initialize the prior probabilities\n", " d = dict(sick=p, notsick=1-p)\n", " super(Test, self).__init__(d, label)\n", " \n", " # store the parameters\n", " self.p = p\n", " self.s = s\n", " self.t = t\n", " \n", " # make a nested dictionary to compute likelihoods\n", " self.likelihood = dict(pos=dict(sick=s, notsick=t),\n", " neg=dict(sick=1-s, notsick=1-t))\n", " \n", " def Likelihood(self, data, hypo):\n", " \"\"\"\n", " data: 'pos' or 'neg'\n", " hypo: 'sick' or 'notsick'\n", " \"\"\"\n", " return self.likelihood[data][hypo]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can create a `Test` object with parameters chosen for demonstration purposes (most medical tests are better than this!):" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "notsick 9/10\n", "sick 1/10\n" ] } ], "source": [ "p = Fraction(1, 10) # prevalence\n", "s = Fraction(9, 10) # sensitivity\n", "t = Fraction(3, 10) # false positive rate\n", "\n", "test = Test(p, s, t)\n", "test.Print()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you are curious, here's the nested dictionary that computes the likelihoods:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{'neg': {'notsick': Fraction(7, 10), 'sick': Fraction(1, 10)},\n", " 'pos': {'notsick': Fraction(3, 10), 'sick': Fraction(9, 10)}}" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test.likelihood" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And here's how we update the `Test` object with a positive outcome:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "notsick 3/4\n", "sick 1/4\n" ] } ], "source": [ "test.Update('pos')\n", "test.Print()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The positive test provides evidence that the patient is sick, increasing the probability from 0.1 to 0.25." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Uncertainty about `t`\n", "\n", "So far, this is basic Bayesian inference. Now let's add a wrinkle. Suppose that we don't know the value of `t` with certainty, but we have reason to believe that `t` is either 0.2 or 0.4 with equal probability.\n", "\n", "Again, we would like to know the probability that a patient who tests positive actually has the disease. As we did with the Red Die problem, we will consider several scenarios:\n", "\n", "**Scenario A**: The patients are drawn at random from the relevant population, and the reason we are uncertain about `t` is that either (1) there are two versions of the test, with different false positive rates, and we don't know which test was used, or (2) there are two groups of people, the false positive rate is different for different groups, and we don't know which group the patient is in.\n", "\n", "**Scenario B**: As in Scenario A, the patients are drawn at random from the relevant population, but the reason we are uncertain about `t` is that previous studies of the test have been contradictory. That is, there is only one version of the test, and we have reason to believe that `t` is the same for all groups, but we are not sure what the correct value of `t` is.\n", "\n", "**Scenario C**: As in Scenario A, there are two versions of the test or two groups of people. But now the patients are being filtered so we only see the patients who tested positive and we don't know how many patients tested negative. For example, suppose you are a specialist and patients are only referred to you after they test positive.\n", "\n", "**Scenario D**: As in Scenario B, we have reason to think that `t` is the same for all patients, and as in Scenario C, we only see patients who test positive and don't know how many tested negative." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Scenario A\n", "\n", "We can represent this scenario with a hierarchical model, where the levels of the hierarchy are:\n", "\n", "* At the top level, the possible values of `t` and their probabilities.\n", "* At the bottom level, the probability that the patient is sick or not, conditioned on `t`.\n", "\n", "To represent the hierarchy, I'll define a `MetaTest`, which is a `Suite` that contains `Test` objects with different values of `t` as hypotheses." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [], "source": [ "class MetaTest(Suite):\n", " \"\"\"Represents a set of tests with different values of `t`.\"\"\"\n", " \n", " def Likelihood(self, data, hypo):\n", " \"\"\"\n", " data: 'pos' or 'neg'\n", " hypo: Test object\n", " \"\"\"\n", " # the return value from `Update` is the total probability of the\n", " # data for a hypothetical value of `t`\n", " return hypo.Update(data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To update a `MetaTest`, we update each of the hypothetical `Test` objects. The return value from `Update` is the normalizing constant, which is the total probability of the data under the hypothesis.\n", "\n", "We use the normalizing constants from the bottom level of the hierarchy as the likelihoods at the top level.\n", "\n", "Here's how we create the `MetaTest` for the scenario we described:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Test(t=0.2) 1/2\n", "Test(t=0.4) 1/2\n" ] } ], "source": [ "q = Fraction(1, 2)\n", "t1 = Fraction(2, 10)\n", "t2 = Fraction(4, 10)\n", "\n", "test1 = Test(p, s, t1, 'Test(t=0.2)')\n", "test2 = Test(p, s, t2, 'Test(t=0.4)')\n", "\n", "metatest = MetaTest({test1:q, test2:1-q})\n", "metatest.Print()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "At the top level, there are two tests, with different values of `t`. Initially, they are equally likely.\n", "\n", "When we update the `MetaTest`, it updates the embedded `Test` objects and then the `MetaTest` itself." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Fraction(9, 25)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "metatest.Update('pos')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here are the results." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Test(t=0.4) 5/8\n", "Test(t=0.2) 3/8\n" ] } ], "source": [ "metatest.Print()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because a positive test is more likely if `t=0.4`, the positive test is evidence in favor of the hypothesis that `t=0.4`.\n", "\n", "This `MetaTest` object represents what we should believe about `t` after seeing the test, as well as what we should believe about the probability that the patient is sick." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Marginal distributions\n", "\n", "To compute the probability that the patient is sick, we have to compute the marginal probabilities of `sick` and `notsick`, averaging over the possible values of `t`. The following function computes this distribution:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def MakeMixture(metapmf, label='mix'):\n", " \"\"\"Make a mixture distribution.\n", "\n", " Args:\n", " metapmf: Pmf that maps from Pmfs to probs.\n", " label: string label for the new Pmf.\n", "\n", " Returns: Pmf object.\n", " \"\"\"\n", " mix = Pmf(label=label)\n", " for pmf, p1 in metapmf.Items():\n", " for x, p2 in pmf.Items():\n", " mix.Incr(x, p1 * p2)\n", " return mix" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's the posterior predictive distribution:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "notsick 3/4\n", "sick 1/4\n" ] } ], "source": [ "predictive = MakeMixture(metatest)\n", "predictive.Print()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After seeing the test, the probability that the patient is sick is 0.25, which is the same result we got with `t=0.3`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Two patients\n", "\n", "Now suppose you test two patients and they both test positive. What is the probability that they are both sick?\n", "\n", "To answer that, I define a few more functions to work with Metatests:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def MakeMetaTest(p, s, pmf_t):\n", " \"\"\"Makes a MetaTest object with the given parameters.\n", " \n", " p: prevalence\n", " s: sensitivity\n", " pmf_t: Pmf of possible values for `t`\n", " \"\"\"\n", " tests = {}\n", " for t, q in pmf_t.Items():\n", " label = 'Test(t=%s)' % str(t)\n", " tests[Test(p, s, t, label)] = q\n", " return MetaTest(tests)\n", "\n", "def Marginal(metatest):\n", " \"\"\"Extracts the marginal distribution of t.\n", " \"\"\"\n", " marginal = Pmf()\n", " for test, prob in metatest.Items():\n", " marginal[test.t] = prob\n", " return marginal\n", "\n", "def Conditional(metatest, t):\n", " \"\"\"Extracts the distribution of sick/notsick conditioned on t.\"\"\"\n", " for test, prob in metatest.Items():\n", " if test.t == t:\n", " return test" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`MakeMetaTest` makes a `MetaTest` object starting with a given PMF of `t`.\n", "\n", "`Marginal` extracts the PMF of `t` from a `MetaTest`.\n", "\n", "`Conditional` takes a specified value for `t` and returns the PMF of `sick` and `notsick` conditioned on `t`.\n", "\n", "I'll test these functions using the same parameters from above:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Test(t=1/5) 1/2\n", "Test(t=2/5) 1/2\n" ] } ], "source": [ "pmf_t = Pmf({t1:q, t2:1-q})\n", "metatest = MakeMetaTest(p, s, pmf_t)\n", "metatest.Print()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here are the results" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Test(t=2/5) 5/8\n", "Test(t=1/5) 3/8\n" ] } ], "source": [ "metatest = MakeMetaTest(p, s, pmf_t)\n", "metatest.Update('pos')\n", "metatest.Print()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Same as before. Now we can extract the posterior distribution of `t`." ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1/5 3/8\n", "2/5 5/8\n" ] } ], "source": [ "Marginal(metatest).Print()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Having seen one positive test, we are a little more inclined to believe that `t=0.4`; that is, that the false positive rate for this patient/test is high.\n", "\n", "And we can extract the conditional distributions for the patient:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "notsick 2/3\n", "sick 1/3\n" ] } ], "source": [ "cond1 = Conditional(metatest, t1)\n", "cond1.Print()" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "notsick 4/5\n", "sick 1/5\n" ] } ], "source": [ "cond2 = Conditional(metatest, t2)\n", "cond2.Print()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we can make the posterior marginal distribution of sick/notsick, which is a weighted mixture of the conditional distributions:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "notsick 3/4\n", "sick 1/4\n" ] } ], "source": [ "MakeMixture(metatest).Print()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "At this point we have a `MetaTest` that contains our updated information about the test (the distribution of `t`) and about the patient that tested positive.\n", "\n", "Now, to compute the probability that both patients are sick, we have to know the distribution of `t` for both patients. And that depends on details of the scenario.\n", "\n", "In Scenario A, the reason we are uncertain about `t` is either (1) there are two versions of the test, with different false positive rates, and we don't know which test was used, or (2) there are two groups of people, the false positive rate is different for different groups, and we don't know which group the patient is in.\n", "\n", "So the value of `t` for each patient is an independent choice from `pmf_t`; that is, if we learn something about `t` for one patient, that tells us nothing about `t` for other patients.\n", "\n", "So if we consider two patients who have tested positive, the MetaTest we just computed represents our belief about each of the two patients independently.\n", "\n", "To compute the probability that both patients are sick, we can convolve the two distributions." ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Pmf({'notsicksick': Fraction(4, 25), 'sicknotsick': Fraction(4, 25), 'notsicknotsick': Fraction(16, 25), 'sicksick': Fraction(1, 25)}) 25/64\n", "Pmf({'notsicksick': Fraction(4, 15), 'sicknotsick': Fraction(2, 15), 'notsicknotsick': Fraction(8, 15), 'sicksick': Fraction(1, 15)}) 15/64\n", "Pmf({'notsicksick': Fraction(2, 15), 'sicknotsick': Fraction(4, 15), 'notsicknotsick': Fraction(8, 15), 'sicksick': Fraction(1, 15)}) 15/64\n", "Pmf({'notsicksick': Fraction(2, 9), 'sicknotsick': Fraction(2, 9), 'notsicknotsick': Fraction(4, 9), 'sicksick': Fraction(1, 9)}) 9/64\n" ] } ], "source": [ "convolution = metatest + metatest\n", "convolution.Print()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then we can compute the posterior marginal distribution of sick/notsick for the two patients:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "notsicknotsick 9/16\n", "notsicksick 3/16\n", "sicknotsick 3/16\n", "sicksick 1/16\n" ] } ], "source": [ "marginal = MakeMixture(metatest+metatest)\n", "marginal.Print()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So in Scenario A the probability that both patients are sick is 1/16.\n", "\n", "As an aside, we could have computed the marginal distributions first and then convolved them, which is computationally more efficient:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "notsicknotsick 9/16\n", "notsicksick 3/16\n", "sicknotsick 3/16\n", "sicksick 1/16\n" ] } ], "source": [ "marginal = MakeMixture(metatest) + MakeMixture(metatest)\n", "marginal.Print()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can confirm that this result is correct by simulation. Here's a generator that generates random pairs of patients:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from random import random\n", "\n", "def flip(p):\n", " return random() < p\n", "\n", "def generate_pair_A(p, s, pmf_t):\n", " while True:\n", " sick1, sick2 = flip(p), flip(p)\n", " \n", " t = pmf_t.Random()\n", " test1 = flip(s) if sick1 else flip(t)\n", "\n", " t = pmf_t.Random()\n", " test2 = flip(s) if sick2 else flip(t)\n", "\n", " yield test1, test2, sick1, sick2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And here's a function that runs the simulation for a given number of iterations:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def run_simulation(generator, iters=100000):\n", " pmf_t = Pmf([0.2, 0.4])\n", " pair_iterator = generator(0.1, 0.9, pmf_t)\n", "\n", " outcomes = Pmf()\n", " for i in range(iters):\n", " test1, test2, sick1, sick2 = next(pair_iterator)\n", " if test1 and test2:\n", " outcomes[sick1, sick2] += 1\n", "\n", " outcomes.Normalize()\n", " return outcomes" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(False, False) 0.5703058787271746\n", "(False, True) 0.1794437167732491\n", "(True, False) 0.1884582787579937\n", "(True, True) 0.06179212574158256\n" ] } ], "source": [ "outcomes = run_simulation(generate_pair_A)\n", "outcomes.Print()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we increase `iters`, the probablity of (True, True) converges on 1/16, which is what we got from the analysis.\n", "\n", "Good so far!" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.1" } }, "nbformat": 4, "nbformat_minor": 0 }