{ "cells": [ { "cell_type": "markdown", "id": "726afc96", "metadata": {}, "source": [ "# Bayesian Network Example with `pgmpy`, `pomegranate`, and `bnlearn`" ] }, { "cell_type": "markdown", "id": "74fa9e85", "metadata": {}, "source": [ "Starting with `pgmpy` (probabilistic graphical models in Python), we'll build some simple Bayesian Networks. \n", "\n", "This demo uses the `TabularCPD` object to create tables of Conditional Probability Distributions (CPD). Most of it is self-explanatory and well documented in the help, but here are a few parameters that I didn't understand at first glance:\n", "* `variable_card` = variable cardinality, i.e. the number of states the variable can take\n", "* `evidence` = list of the *names* of the parent variables that this variable is conditioned on\n", "* `evidence_card` = list of the cardinalities of the evidence variables, in the same order as `evidence`" ] }, { "cell_type": "markdown", "id": "a42f03fa", "metadata": {}, "source": [ ">In pgmpy we define the network structure and the CPDs (conditional probability distributions) separately and then associate them with the structure. 
Here’s an example for defining the above model:" ] }, { "cell_type": "code", "execution_count": 1, "id": "e13176a0", "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 2, "id": "7005e698", "metadata": {}, "outputs": [], "source": [ "from pgmpy.models import BayesianNetwork\n", "from pgmpy.factors.discrete import TabularCPD\n", "from pgmpy.inference import VariableElimination" ] }, { "cell_type": "markdown", "id": "ad5f6ae8", "metadata": {}, "source": [ "## Example 1\n", "\n", "This example comes from the excellent and short course by [Phillip Loick on Bayesian Statistics (Udemy)](https://www.udemy.com/course/bayesian-statistics/):" ] }, { "cell_type": "markdown", "id": "43cdda72", "metadata": {}, "source": [ "![fig2.png](fig2.png)" ] }, { "cell_type": "markdown", "id": "47a85b25", "metadata": {}, "source": [ "This is a fun model that models whether or not two people (John and Kate) will run based on the temperature and whether or not these two will meet. " ] }, { "cell_type": "code", "execution_count": 3, "id": "9aa4db63", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 1. Instantiate the model with node edges as a list\n", "model = BayesianNetwork([('T', 'J'), ('T', 'K'), ('J', 'M'), ('K', 'M')])\n", "\n", "# 2. 
Define the distributions\n", "cpd_t = TabularCPD(variable='T', variable_card=2, values=[[0.4], [0.6]], state_names={'T':['low', 'high']})\n", "cpd_j = TabularCPD(variable='J', variable_card=2, \n", " values=[[0.5, 0.7], \n", " [0.5, 0.3]], \n", " evidence=['T'],\n", " evidence_card=[2],\n", " state_names={'J':['yes', 'no'],\n", " 'T':['low', 'high']})\n", "cpd_k = TabularCPD(variable='K', variable_card=2, \n", " values=[[0.4, 0.75], \n", " [0.6, 0.25]], \n", " evidence=['T'],\n", " evidence_card=[2],\n", " state_names={'K':['yes', 'no'],\n", " 'T':['low', 'high']})\n", "cpd_m = TabularCPD(variable='M', variable_card = 2,\n", " values=[[.5, 0, 0, 0],\n", " [.5, 1, 1, 1]],\n", " evidence=['J', 'K'],\n", " evidence_card=[2, 2],\n", " state_names={'M': ['yes', 'no'],\n", " 'J': ['yes', 'no'],\n", " 'K': ['yes', 'no']})\n", "\n", "# 3. Add CPDs to the model\n", "model.add_cpds(cpd_t, cpd_j, cpd_k, cpd_m)\n", "# 4. Check the model validity (i.e. probabilities all sum to 1)\n", "model.check_model()" ] }, { "cell_type": "markdown", "id": "37dc4ae4", "metadata": {}, "source": [ "Let's take a closer look at the 'M' node, because the CPD that `pgmpy` prints is laid out **differently** from the conditional probability table in the first figure:" ] }, { "cell_type": "code", "execution_count": 4, "id": "1a3b3f58", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+--------+--------+--------+--------+-------+\n", "| J | J(yes) | J(yes) | J(no) | J(no) |\n", "+--------+--------+--------+--------+-------+\n", "| K | K(yes) | K(no) | K(yes) | K(no) |\n", "+--------+--------+--------+--------+-------+\n", "| M(yes) | 0.5 | 0.0 | 0.0 | 0.0 |\n", "+--------+--------+--------+--------+-------+\n", "| M(no) | 0.5 | 1.0 | 1.0 | 1.0 |\n", "+--------+--------+--------+--------+-------+\n" ] } ], "source": [ "# Print a CPD with its state names defined.\n", "print(model.get_cpds('M'))" ] }, { "cell_type": "markdown", "id": "b6f7f89f", "metadata": {}, "source": [ "This 
is because `pgmpy` expects you to have the `variable` states on rows and then like a multi-index of `evidence` on columns. Recall the `evidence` and `evidence_card` items we called out earlier as `['J', 'K']` in the `TabularCPD` call for variable `M`, so the columns show up in the order of the `evidence` and the `state_names` for each." ] }, { "cell_type": "markdown", "id": "38661aa8", "metadata": {}, "source": [ "## Inference with Variable Elimination" ] }, { "cell_type": "markdown", "id": "35661af2", "metadata": {}, "source": [ "Let’s take an example of inference using Variable Elimination in `pgmpy`. Here we'll use our model to compute the probability distribution that the two meet, and `pgmpy` will make a pretty table: \n", "\n", "In other words, what is the probability that they meet $P(M)=?$" ] }, { "cell_type": "code", "execution_count": 5, "id": "a0efc9ac", "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "53a743b5b6f3456abe871c889db86218", "version_major": 2, "version_minor": 0 }, "text/plain": [ " 0%| | 0/3 [00:00" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "T.plot()" ] }, { "cell_type": "markdown", "id": "cec0f78e", "metadata": {}, "source": [ "A few other useful methods:\n", "\n", "* `sample(n)`: draw `n` samples from the distribution\n", "* `probability(X)`: predict the probability of X under this distribution" ] }, { "cell_type": "code", "execution_count": 18, "id": "210e7417", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['low', 'high', 'high'], dtype='" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# 1. Instantiate the BayesianNetwork model\n", "model = pm.BayesianNetwork(\"Calculating Posterior\")\n", "\n", "# 2. 
Add more distributions\n", "K = pm.ConditionalProbabilityTable(\n", " table=[['low', 'yes', 0.4],\n", " ['low', 'no', 0.6],\n", " ['high', 'yes', 0.75],\n", " ['high', 'no', 0.25]], parents=[T])\n", "M = pm.ConditionalProbabilityTable(\n", " table=[['yes', 'yes', 'yes', 0.5],\n", " ['yes', 'yes', 'no', 0.5],\n", " ['yes', 'no', 'yes', 0],\n", " ['yes', 'no', 'no', 1],\n", " ['no', 'yes', 'yes', 0],\n", " ['no', 'yes', 'no', 1],\n", " ['no', 'no', 'yes', 0],\n", " ['no', 'no', 'no', 1]], parents=[J, K])\n", "\n", "# 3. Define nodes in our network that follow these distributions\n", "n0 = pm.Node(T, name='Temperature')\n", "n1 = pm.Node(J, name='John')\n", "n2 = pm.Node(K, name='Kate')\n", "n3 = pm.Node(M, name='Meet')\n", "model.add_states(n0, n1, n2, n3)\n", "\n", "# 4. Define the Edges for each Node in our model\n", "model.add_edge(n0, n1)\n", "model.add_edge(n0, n2)\n", "model.add_edge(n1, n3)\n", "model.add_edge(n2, n3)\n", "\n", "# 5. Bake the model\n", "model.bake()\n", "\n", "# Optional: Plot\n", "model.plot()" ] }, { "cell_type": "markdown", "id": "949e4c19", "metadata": {}, "source": [ "I like the ability to plot (requires `matplotlib` and `pygraphviz`). Great feature!" 
] }, { "cell_type": "markdown", "id": "a8214e43", "metadata": {}, "source": [ "## Inference\n", "\n", "We would like to do a simple inference - what's the probability that they meet?\n", "\n", "In other words, what is the probability that they meet $P(M=\\text{'yes'})$?\n", "\n", "In `pomegranate` you can do this by specifying all of the probabilities like such:" ] }, { "cell_type": "code", "execution_count": 22, "id": "048ed4f5", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.19749999999999995" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.probability([['high', 'yes', 'yes', 'yes'],\n", " ['high', 'yes', 'no', 'yes'],\n", " ['high', 'no', 'yes', 'yes'], \n", " ['high', 'no', 'no', 'yes'],\n", " ['low', 'yes', 'yes', 'yes'],\n", " ['low', 'yes', 'no', 'yes'],\n", " ['low', 'no', 'yes', 'yes'], \n", " ['low', 'no', 'no', 'yes'], \n", " ]).sum()" ] }, { "cell_type": "markdown", "id": "38f5a580", "metadata": {}, "source": [ "There is also a `marginal` method to get the marginal distribution of each of the variables:" ] }, { "cell_type": "code", "execution_count": 23, "id": "dce7c67a", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(4,)" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.marginal().shape" ] }, { "cell_type": "code", "execution_count": 24, "id": "740191e3", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{\n", " \"class\" : \"Distribution\",\n", " \"dtype\" : \"str\",\n", " \"name\" : \"DiscreteDistribution\",\n", " \"parameters\" : [\n", " {\n", " \"yes\" : 0.18910000000000005,\n", " \"no\" : 0.8109000000000001\n", " }\n", " ],\n", " \"frozen\" : false\n", "}" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.marginal()[3]" ] }, { "cell_type": "markdown", "id": "844815e1", "metadata": {}, "source": [ "Interesting that the marginal distribution is different than what I 
had calculated manually! The manual sum agrees with a hand calculation, 0.5 × P(J=yes, K=yes) = 0.5 × (0.4 × 0.5 × 0.4 + 0.6 × 0.7 × 0.75) = 0.1975, so the gap probably comes from `pomegranate` estimating marginals with loopy belief propagation, which is only approximate when the graph contains a loop (here T, J, K, M), rather than from a bug in the model." ] }, { "cell_type": "markdown", "id": "4bf8fc22", "metadata": {}, "source": [ "## Impressions\n", "\n", "It looks like it's still too early for `pomegranate` when it comes to Bayesian Networks, compared to `pgmpy`. Using `pomegranate` still felt very much like a work in progress: the docstrings and help/error messages seem incomplete, while `pgmpy` has extensive documentation. " ] }, { "cell_type": "markdown", "id": "f7f7e927", "metadata": {}, "source": [ "# Classic Disease Model with `pgmpy` and `bnlearn`" ] }, { "cell_type": "markdown", "id": "107cbde0", "metadata": {}, "source": [ "For this next example, we'll use the classic Disease model: \n", "\n", "Let's say that there's a disease that affects 2% of the population, and you have a 95% chance of testing positive (correctly) if you have the disease, and a 10% chance of testing positive if you don't have the disease. What's the probability that you have the disease, given that you tested positive?\n", "\n", "The Conditional Probability Table would look like:\n", "\n", "
| Disease | Test Result: Positive | Test Result: Negative |
|:---:|:---:|:---:|
| Yes | 0.95 | 0.05 |
| No | 0.10 | 0.90 |
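
The question posed above can be answered by hand with Bayes' rule before building any network. The snippet below is a minimal sketch that uses only the three numbers stated in the problem (2% prevalence, 95% true positive rate, 10% false positive rate); the variable names are made up for illustration.

```python
# Bayes' rule for the classic disease example, using only the stated numbers.
prevalence = 0.02            # P(D = yes)
p_pos_given_disease = 0.95   # P(T = positive | D = yes)
p_pos_given_healthy = 0.10   # P(T = positive | D = no)

# Law of total probability: overall chance of a positive test
p_pos = prevalence * p_pos_given_disease + (1 - prevalence) * p_pos_given_healthy

# Posterior P(D = yes | T = positive)
posterior = prevalence * p_pos_given_disease / p_pos
print(round(posterior, 4))   # 0.1624
```

Any Bayesian Network built from the same numbers should reproduce this value when queried with one positive test as evidence.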
" ] }, { "cell_type": "markdown", "id": "0b2a1b7c", "metadata": {}, "source": [ "An alternative to `pgmpy` is `bnlearn`, which extends some of the features from `pgmpy` and just...does more. It can even use `pgmpy` objects like `TabularCPD`, and it looks like there are a lot more convenience functions included in it to help learn from data (including the ability to handle dataframes)." ] }, { "cell_type": "code", "execution_count": 25, "id": "5c06ec3c", "metadata": {}, "outputs": [], "source": [ "# Import the library\n", "import bnlearn as bn" ] }, { "cell_type": "markdown", "id": "22f93dc1", "metadata": {}, "source": [ "`bnlearn.make_DAG` takes an argument `DAG`, which should be your list of edges, and a `CPD` argument for the conditional probability distributions that you created with pgmpy's `TabularCPD`. \n", "\n", "For fun, we'll create this model with two tests ($T1$, $T2$) so we can see how the result changes if you take multiple tests." ] }, { "cell_type": "code", "execution_count": 26, "id": "505599fd", "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[bnlearn] >bayes DAG created.\n", "[bnlearn] >Add CPD: D\n", "[bnlearn] >Add CPD: T1\n", "[bnlearn] >Add CPD: T2\n", "[bnlearn] >Checking CPDs..\n", "[bnlearn] >Check for DAG structure. 
Correct: True\n" ] } ], "source": [ "# Define the network structure\n", "edges = [('D', 'T1'), ('D', 'T2')]\n", "\n", "d = TabularCPD(variable='D', variable_card=2, values=[[0.02], [0.98]],\n", " state_names={'D':['yes', 'no']})\n", "t1 = TabularCPD(variable='T1', variable_card=2, \n", " values=[[0.95, 0.1],\n", " [0.05, 0.9]],\n", " evidence=['D'],\n", " evidence_card=[2],\n", " state_names={'D':['yes', 'no'],\n", " 'T1':['positive', 'negative']})\n", "t2 = TabularCPD(variable='T2', variable_card=2, \n", " values=[[0.95, 0.1],\n", " [0.05, 0.9]],\n", " evidence=['D'],\n", " evidence_card=[2],\n", " state_names={'D':['yes', 'no'],\n", " 'T2':['positive', 'negative']})\n", "\n", "# Make the actual Bayesian DAG with the previously defined CPD's \n", "model = bn.make_DAG(DAG=edges, CPD=[d, t1, t2])" ] }, { "cell_type": "code", "execution_count": 27, "id": "85761655", "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[bnlearn]> Set node properties.\n", "[bnlearn]> Set edge properties.\n", "[bnlearn] >Plot based on Bayesian model\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Set interactive=False for web publishing\n", "# If you have `pyvis` installed you can get an interactive graph!\n", "bn.plot(model, title='Disease Test', interactive=False);" ] }, { "cell_type": "markdown", "id": "b3caf7b0", "metadata": {}, "source": [ "There's also a handy `print_CPD` function that pretty-prints all the CPDs." ] }, { "cell_type": "code", "execution_count": 28, "id": "fcdc123d", "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPD of D:\n", "+--------+------+\n", "| D(yes) | 0.02 |\n", "+--------+------+\n", "| D(no) | 0.98 |\n", "+--------+------+\n", "CPD of T1:\n", "+--------------+--------+-------+\n", "| D | D(yes) | D(no) |\n", "+--------------+--------+-------+\n", "| T1(positive) | 0.95 | 0.1 |\n", "+--------------+--------+-------+\n", "| T1(negative) | 0.05 | 0.9 |\n", "+--------------+--------+-------+\n", "CPD of T2:\n", "+--------------+--------+-------+\n", "| D | D(yes) | D(no) |\n", "+--------------+--------+-------+\n", "| T2(positive) | 0.95 | 0.1 |\n", "+--------------+--------+-------+\n", "| T2(negative) | 0.05 | 0.9 |\n", "+--------------+--------+-------+\n", "[bnlearn] >Independencies:\n", "(T2 ⟂ T1 | D)\n", "(T1 ⟂ T2 | D)\n", "[bnlearn] >Nodes: ['D', 'T1', 'T2']\n", "[bnlearn] >Edges: [('D', 'T1'), ('D', 'T2')]\n" ] } ], "source": [ "bn.print_CPD(model)" ] }, { "cell_type": "markdown", "id": "7604b7cc", "metadata": {}, "source": [ "### Inference\n", "\n", "You can also use `bnlearn.inference` to perform inference using Variable Elimination, similar to what we saw in `pgmpy`:" ] }, { "cell_type": "code", "execution_count": 29, "id": "2ee90ac5", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[bnlearn] >Variable Elimination..\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "f71bd228c62148268e82b961eb89c365", "version_major": 2, 
"version_minor": 0 }, "text/plain": [ "0it [00:00, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "12c596bcd25c49da86da1815db84d766", "version_major": 2, "version_minor": 0 }, "text/plain": [ "0it [00:00, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "+----+-----+----------+\n", "| | D | p |\n", "+====+=====+==========+\n", "| 0 | 0 | 0.162393 |\n", "+----+-----+----------+\n", "| 1 | 1 | 0.837607 |\n", "+----+-----+----------+\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bn.inference.fit(model, variables=['D'], evidence={'T1':'positive'})" ] }, { "cell_type": "markdown", "id": "aa7ba68c", "metadata": {}, "source": [ "*Note: Interesting that it doesn't show the 'state' names here, so the above result is 'yes' for 0 and 'no' for 1 because that's the order in which we defined the states.*\n", "\n", "Surprisingly, with the given assumptions, with a single positive test there would only be a 16% probability that you'd have the disease! What's it look like if you had two positive tests?" 
] }, { "cell_type": "code", "execution_count": 30, "id": "65c03a34", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[bnlearn] >Variable Elimination..\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "c2b1a32bf04547119ef10418ea6dcb8d", "version_major": 2, "version_minor": 0 }, "text/plain": [ "0it [00:00, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "67023c90e90c4d3381cc2d1003fe9064", "version_major": 2, "version_minor": 0 }, "text/plain": [ "0it [00:00, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "+----+-----+----------+\n", "| | D | p |\n", "+====+=====+==========+\n", "| 0 | 0 | 0.648115 |\n", "+----+-----+----------+\n", "| 1 | 1 | 0.351885 |\n", "+----+-----+----------+\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bn.inference.fit(model, variables=['D'], evidence={'T1':'positive', 'T2':'positive'})" ] }, { "cell_type": "markdown", "id": "cac16449", "metadata": {}, "source": [ "So we see here that there's a 65% probability of having this disease after testing positive *twice*. A pretty dramatic improvement from the 16% earlier. But still, this is really low!" ] }, { "cell_type": "markdown", "id": "554877c9", "metadata": {}, "source": [ "## What about a real-world example with Covid testing?\n", "\n", "The above example uses numbers to help surprise the learner about how counterintuitive statistics can be. 
But should we be skeptical of *any* test, especially in 2022?\n", "\n", "If we change the prevalence and sensitivity/specificity to, say, real-world numbers for Covid-19, the numbers become more reassuring.\n", "\n", "We can check real world numbers thanks to resources like:\n", "* [FDA Emergency Use Authorization statistics for Covid Tests](https://www.fda.gov/medical-devices/coronavirus-disease-2019-covid-19-emergency-use-authorizations-medical-devices/eua-authorized-serology-test-performance)\n", "* [CDC Covid Data Tracker for disease prevalence](https://covid.cdc.gov/covid-data-tracker/#datatracker-home)" ] }, { "cell_type": "markdown", "id": "7221e59c", "metadata": {}, "source": [ "### Estimating disease prevalence\n", "\n", "You can use the [CDC Covid Tracker](https://covid.cdc.gov/covid-data-tracker/#datatracker-home) to get a sense of positive cases per 100K people. Let's pull the data for Multnomah County, host of the fine city of Portland, Oregon:\n", "\n", "|![covid_prevalence](covid_prevalence.png)|\n", "|:---:|\n", "|Source: [CDC Covid Tracker](https://covid.cdc.gov/covid-data-tracker/#datatracker-home)|\n" ] }, { "cell_type": "markdown", "id": "03b32cab", "metadata": {}, "source": [ "Let's say that this published rate only captures 1/4 of actual cases, due to folks not reporting (i.e. 
people testing at home, people who are asymptomatic and don't get tested, etc.), then the true prevalence would be about 2.6%:" ] }, { "cell_type": "code", "execution_count": 31, "id": "c182253d", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.025874400000000002" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(646.86 * 4)/100000" ] }, { "cell_type": "markdown", "id": "3ffa4c08", "metadata": {}, "source": [ "### FDA data on covid test performance\n", "\n", "You can also get sensitivity and specificity data on the covid tests that the FDA authorized for emergency use in the early days of the pandemic: \n", "\n", "Here we'll use data on the Abbott AdviseDx SARS-CoV-2 IgG II (Alinity), which has a sensitivity of 98.1% (51/52) and a specificity of 99.6% (2000/2008). Sensitivity and specificity are probabilities of the test result given the disease state, which is exactly what the CPD for a test node needs, so we can substitute the following:\n", "\n", "$$\n", "\\begin{aligned}\n", "\\text{Sensitivity} &= P(T=\\text{'positive'}|D=\\text{yes}) = 51/52 = 98.0769\\%\\\\\n", "\\text{Specificity} &= P(T=\\text{'negative'}|D=\\text{no}) = 2000/2008 = 99.6016\\%\\\\\n", "\\end{aligned}\n", "$$" ] }, { "cell_type": "markdown", "id": "e7f4fafd", "metadata": {}, "source": [ "Or as a conditional probability table:\n", "\n", "
| Disease | Test Result: Positive | Test Result: Negative |
|:---:|:---:|:---:|
| Yes | 0.981 | 0.019 |
| No | 0.004 | 0.996 |
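
As a cross-check on the table, the posterior can also be computed directly from sensitivity and specificity with Bayes' rule, without any library. This is only a sketch: `posterior_after_positives` is a hypothetical helper name, the 2.6% prevalence is the rough estimate from earlier, and repeated tests are assumed to be conditionally independent given the disease state.

```python
def posterior_after_positives(prevalence, sens, spec, n_positive=1):
    '''Sketch: P(D = yes) after n_positive independent positive tests.'''
    p = prevalence
    for _ in range(n_positive):
        true_pos = p * sens                # has the disease and tests positive
        false_pos = (1 - p) * (1 - spec)   # no disease but tests positive
        p = true_pos / (true_pos + false_pos)
    return p

sens = 51 / 52        # sensitivity, P(T = positive | D = yes)
spec = 2000 / 2008    # specificity, P(T = negative | D = no)

one = posterior_after_positives(0.026, sens, spec, n_positive=1)
two = posterior_after_positives(0.026, sens, spec, n_positive=2)
print(round(one, 3), round(two, 4))   # 0.868 0.9994
```

Updating the prior one test at a time gives the same answer as conditioning on both results at once, but only because the tests are assumed independent given `D`.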
" ] }, { "cell_type": "markdown", "id": "81efc987", "metadata": {}, "source": [ "For convenience, from here on out we'll redefine this as a function:" ] }, { "cell_type": "code", "execution_count": 32, "id": "d18b7c0e", "metadata": {}, "outputs": [], "source": [ "def disease_test(prevalence, sens, spec):\n", " \"\"\"\n", " Parameters\n", " ----------\n", " prevalence : float\n", " Estimated percent of population that has the disease\n", " sens : float\n", " Sensitivity, or true positive rate\n", " spec : float\n", " Specificity, or true negative rate\n", " \"\"\"\n", " edges = [('D', 'T1'), ('D', 'T2')]\n", "\n", " d = TabularCPD(variable='D', variable_card=2, values=[[prevalence], [1-prevalence]],\n", " state_names={'D':['yes', 'no']})\n", " t1 = TabularCPD(variable='T1', variable_card=2, \n", " values=[[sens, 1-spec],\n", " [1-sens, spec]],\n", " evidence=['D'],\n", " evidence_card=[2],\n", " state_names={'D':['yes', 'no'],\n", " 'T1':['positive', 'negative']})\n", " t2 = TabularCPD(variable='T2', variable_card=2, \n", " values=[[sens, 1-spec],\n", " [1-sens, spec]],\n", " evidence=['D'],\n", " evidence_card=[2],\n", " state_names={'D':['yes', 'no'],\n", " 'T2':['positive', 'negative']})\n", " # Make the actual Bayesian DAG with the previously defined CPD's \n", " model = bn.make_DAG(DAG=edges, CPD=[d, t1, t2], verbose=1) # quiet those messages!\n", " return model" ] }, { "cell_type": "code", "execution_count": 33, "id": "ee07c5b7", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[bnlearn] >Variable Elimination..\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "fdf72908948d4f5b993b90bde28f9ce8", "version_major": 2, "version_minor": 0 }, "text/plain": [ "0it [00:00, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "a759925e004e49a6874ebb8d9f733e7c", "version_major": 2, "version_minor": 0 }, "text/plain": [ "0it 
[00:00, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "+----+-----+----------+\n", "| | D | p |\n", "+====+=====+==========+\n", "| 0 | 0 | 0.867923 |\n", "+----+-----+----------+\n", "| 1 | 1 | 0.132077 |\n", "+----+-----+----------+\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model = disease_test(0.026, sens=51/52, spec=2000/2008)\n", "bn.inference.fit(model, variables=['D'], evidence={'T1':'positive'})" ] }, { "cell_type": "markdown", "id": "7225da2b", "metadata": {}, "source": [ "Reassuringly, the probability of having the disease is about 87%, after having tested positive with this test (and assuming prevalence of 2.6%)." ] }, { "cell_type": "code", "execution_count": 34, "id": "94ea9ce6", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[bnlearn] >Variable Elimination..\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "3690e9e4f79d4f8f92b703fde7c5f139", "version_major": 2, "version_minor": 0 }, "text/plain": [ "0it [00:00, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "85873fbc1de74c95a48a19c328c86c87", "version_major": 2, "version_minor": 0 }, "text/plain": [ "0it [00:00, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "+----+-----+-------------+\n", "| | D | p |\n", "+====+=====+=============+\n", "| 0 | 0 | 0.999382 |\n", "+----+-----+-------------+\n", "| 1 | 1 | 0.000617783 |\n", "+----+-----+-------------+\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bn.inference.fit(model, variables=['D'], evidence={'T1':'positive', 'T2':'positive'})" ] }, { "cell_type": "markdown", "id": "2bfe27a8", 
"metadata": {}, "source": [ "And after 2 tests, the probability rises to 99.9%. Not bad, right?\n", "\n", "Now, let's take a quick look at a rapid at-home test like the Abbott BinaxNOW test. There seems to be some conflicting data on this (e.g. the results are better if you're symptomatic), but we can try to use some available overall numbers. Skimming the results of Table 2 from [this Nov 2020 study](https://www.cdc.gov/mmwr/volumes/70/wr/mm7003e3.htm) gives us a Sensitivity of 52.5% and Specificity of 99.9%:\n", "\n", "|
|\n", "|:---:|\n", "|![binax_sens_spec](binax_sens_spec.png)|\n", "| From: [Prince-Guerra JL, Almendares O, Nolen LD, et al. Evaluation of Abbott BinaxNOW Rapid Antigen Test for SARS-CoV-2 Infection at Two Community-Based Testing Sites — Pima County, Arizona, November 3–17, 2020. MMWR Morb Mortal Wkly Rep 2021](https://www.cdc.gov/mmwr/volumes/70/wr/mm7003e3.htm)|" ] }, { "cell_type": "code", "execution_count": 35, "id": "43b56ba8", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[bnlearn] >Variable Elimination..\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "c667a113980c4dffb478d6a9e27a7771", "version_major": 2, "version_minor": 0 }, "text/plain": [ "0it [00:00, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "ec3023826c464b6cbfe49f3919427c45", "version_major": 2, "version_minor": 0 }, "text/plain": [ "0it [00:00, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "+----+-----+-----------+\n", "| | D | p |\n", "+====+=====+===========+\n", "| 0 | 0 | 0.933397 |\n", "+----+-----+-----------+\n", "| 1 | 1 | 0.0666028 |\n", "+----+-----+-----------+\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model = disease_test(0.026, sens=0.525, spec=0.999)\n", "bn.inference.fit(model, variables=['D'], evidence={'T1':'positive'})" ] }, { "cell_type": "markdown", "id": "89c118ce", "metadata": {}, "source": [ "So that's *quite* good, considering the study found that sensitivity was much higher among symptomatic patients. 
We can also take a look if you take two tests:" ] }, { "cell_type": "code", "execution_count": 36, "id": "c61c06ae", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[bnlearn] >Variable Elimination..\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "3ee1cb958bd9490fa0ae9b2674c6f063", "version_major": 2, "version_minor": 0 }, "text/plain": [ "0it [00:00, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "9467ad9668bf408f89698fd0592949b4", "version_major": 2, "version_minor": 0 }, "text/plain": [ "0it [00:00, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "+----+-----+-------------+\n", "| | D | p |\n", "+====+=====+=============+\n", "| 0 | 0 | 0.999864 |\n", "+----+-----+-------------+\n", "| 1 | 1 | 0.000135896 |\n", "+----+-----+-------------+\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bn.inference.fit(model, variables=['D'], evidence={'T1':'positive', 'T2':'positive'})" ] }, { "cell_type": "markdown", "id": "8ff65fbd", "metadata": {}, "source": [ "And now there's near certainty with two tests. What if you had one test positive and another negative, out of curiosity?" 
] }, { "cell_type": "code", "execution_count": 37, "id": "d21449b2", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[bnlearn] >Variable Elimination..\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "3acc19dbbb67438585bef61421861e53", "version_major": 2, "version_minor": 0 }, "text/plain": [ "0it [00:00, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "0eba0daac30f4fef9cd28aba4105733b", "version_major": 2, "version_minor": 0 }, "text/plain": [ "0it [00:00, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "+----+-----+----------+\n", "| | D | p |\n", "+====+=====+==========+\n", "| 0 | 0 | 0.869511 |\n", "+----+-----+----------+\n", "| 1 | 1 | 0.130489 |\n", "+----+-----+----------+\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bn.inference.fit(model, variables=['D'], evidence={'T1':'positive', 'T2':'negative'})" ] }, { "cell_type": "markdown", "id": "64667c1d", "metadata": {}, "source": [ "## References\n", "\n", "* [Phillip Loick on Bayesian Statistics (Udemy)](https://www.udemy.com/course/bayesian-statistics/)\n", "* `bnlearn`: https://github.com/erdogant/bnlearn, much newer and very promising \n", "* `pymc` thread on bayes nets: https://discourse.pymc.io/t/bayes-nets-belief-networks-and-pymc/5150/8\n", "* [FDA Emergency Use Authorization statistics for Covid Tests](https://www.fda.gov/medical-devices/coronavirus-disease-2019-covid-19-emergency-use-authorizations-medical-devices/eua-authorized-serology-test-performance)\n", "* [CDC Covid Data Tracker for disease prevalence](https://covid.cdc.gov/covid-data-tracker/#datatracker-home)\n", "* [Prince-Guerra JL, Almendares O, Nolen LD, et al. 
Evaluation of Abbott BinaxNOW Rapid Antigen Test for SARS-CoV-2 Infection at Two Community-Based Testing Sites — Pima County, Arizona, November 3–17, 2020. MMWR Morb Mortal Wkly Rep 2021](https://www.cdc.gov/mmwr/volumes/70/wr/mm7003e3.htm)" ] }, { "cell_type": "code", "execution_count": null, "id": "5c78553a", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "hide_input": false, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.12" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 5 }