{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Bayesian Statistics\n", "\n", "> Marcos Duarte \n", "> [Laboratory of Biomechanics and Motor Control](http://pesquisa.ufabc.edu.br/bmclab/) \n", "> Federal University of ABC, Brazil" ] }, { "cell_type": "markdown", "metadata": { "toc": true }, "source": [ "

Table of Contents

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Definition" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From [Wikipedia](https://en.wikipedia.org/wiki/Bayesian_statistics):\n", " \n", "> Bayesian statistics is a theory in the field of statistics based on the Bayesian interpretation of probability where probability expresses a degree of belief in an event. The degree of belief may be based on prior knowledge about the event, such as the results of previous experiments, or on personal beliefs about the event. \n", "This differs from a number of other interpretations of probability, such as the frequentist interpretation that views probability as the limit of the relative frequency of an event after a large number of trials.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Bayes' theorem\n", "\n", "Given two events $A$ and $B$, the conditional probability of $A$ given that $B$ is true is expressed by the Bayes' theorem:\n", "\n", "$$ P(A \\mid B) = \\frac{P(B \\mid A)P(A)}{P(B)} $$\n", "\n", "where $P(B) \\neq 0$. \n", "\n", "$P(A)$ is the prior probability of $A$ which expresses one's beliefs about $A$ before evidence is taken into account; \n", "$P(B)$ is the probability of the evidence $B$; \n", "$P(B∣A)$ is the likelihood function, which can be interpreted as the probability of the evidence $B$ given that $A$ is true. The likelihood quantifies the extent to which the evidence $B$ supports the proposition $A$; \n", "$P(A∣B)$ is the posterior probability, the probability of the proposition $A$ after taking the evidence $B$ into account." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Example: Probability of having a disease after a positive result\n", "\n", "This example is taken from [https://youtu.be/R13BD8qKeTg](https://youtu.be/R13BD8qKeTg). \n", "\n", "Suppose you test positive for a rare disease, with 0.1% of frequency of occurrence in the population. \n", "You investigate more about the test for the disease and you found out that the employed test correctly identifies as positive 99% of the people who have the disease (true positives) and incorrectly identifies as positive 1% of the people who don't have the disease (false positives). \n", "\n", "How certain you actually have the disease? \n", "\n", "Let's use Bayes' theorem to answer this problem:\n", "\n", "The two events are: having the disease ($A$) and the positive result ($B$). \n", "The question is: which is the probability, $P(A∣B)$, of actually having the disease given a positive result of the test for the disease. \n", "$P(A)$ is the prior probability, the probability of having the disease before taking the test. \n", "$P(B∣A)$ is the likelihood function, the probability of testing positive if you have the disease. \n", "$P(B)$ is the probability of testing positive for the disease.\n", "\n", "With no more information about you and the disease, let's assume that $P(A)$ is equal to the frequency of occurrence of the disease in the population, $0.001$.\n", "$P(B∣A)$ is known, $0.99$.\n", "Concerning $P(B)$, we know there are two ways of testing positive for the disease: someone has the disease and the test result is positive and someone doesn't have the disease but the the test result is positive. The probability for the first case is $0.001 \\times 0.99$ and for the second case is $0.999 \\times 0.01$. The total probability of testing positive is the sum of these two probabilities. \n", "Them for $P(A∣B)$:\n", "\n", "$$ P(A\\mid B) = \\frac{P(B\\mid A)P(A)}{P(B)} = \\frac{0.99 \\times 0.001}{0.001 \\times 0.99 + 0.999 \\times 0.01} = 0.09 $$\n", "\n", "So, there is a 9% chance that you actually have the disease. Given the accuracy of the test, 99%, and the fact that you tested positive, only 9% of chance seems surprising." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "ExecuteTime": { "end_time": "2021-05-25T14:52:47.955441Z", "start_time": "2021-05-25T14:52:47.949626Z" }, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "P(A|B) = 0.09016\n" ] } ], "source": [ "# frequency of occurrence\n", "FO = 0.001\n", "# true positive\n", "TP = 0.99\n", "# false positive\n", "FP = 0.01 \n", "# probabilities\n", "PBA = 0.99\n", "PA = FO\n", "PB = PA * TP + (1-PA) * FP\n", "# then\n", "PAB = PBA * PA / PB\n", "print('P(A|B) = {:.5f}'.format(PAB))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Given that the probability that you have the disease is only 10%, you decide to do another test. In statistical terms, this second test should be independent of the first test. \n", "Suppose you test positive again, now what is the probability that you actually have the disease?\n", "\n", "Now what is different is that we have more information about you and the disease, i.e., about the prior probability, P(A). Given that your first test was positive, we will use P(A) = 0.09016 and repeat the calculation." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2021-05-25T14:52:48.045590Z", "start_time": "2021-05-25T14:52:48.042829Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "P(A|B) = 0.90750\n" ] } ], "source": [ "# true positive\n", "TP = 0.99\n", "# false positive\n", "FP = 0.01 \n", "# probabilities\n", "PBA = 0.99\n", "PA = 0.09016 # new prior\n", "PB = PA * TP + (1-PA) * FP\n", "# then\n", "PAB = PBA * PA / PB\n", "print('P(A|B) = {:.5f}'.format(PAB))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With two positive tests, there is a 91% probability that you actually have the disease.\n", "\n", "Note that even after two positive tests, this probability is still below the reported accuracy of the test (99%)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Bayesian Clinical Diagnostic Model" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2021-05-25T14:52:48.223319Z", "start_time": "2021-05-25T14:52:48.217624Z" } }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from IPython.display import IFrame\n", "IFrame('https://kennis-research.shinyapps.io/Bayes-App/', width='100%', height=500)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## References\n", "\n", "- Cameron Davidson-Pilon (2015) [Bayesian Methods for Hackers: Probabilistic Programming and Bayesian Inference](https://camdavidsonpilon.github.io/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/). " ] } ], "metadata": { "hide_input": false, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": true, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": true, "toc_position": {}, "toc_section_display": true, "toc_window_display": false }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }