{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Probability Theory Review" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Preliminaries\n", "\n", "- Goal \n", " - Review of Probability Theory as a theory for rational/logical reasoning with uncertainties (i.e., a Bayesian interpretation)\n", "- Materials \n", " - Mandatory\n", " - These lecture notes\n", " - Optional\n", " - Bishop pp. 12-24\n", " - [Ariel Caticha, Entropic Inference and the Foundations of Physics (2012)](https://github.com/bertdv/BMLIP/blob/master/lessons/notebooks/files/Caticha-2012-Entropic-Inference-and-the-Foundations-of-Physics.pdf), pp.7-56 (ch.2: probability)\n", " - Great introduction to probability theory, in particular w.r.t. its correct interpretation as a state-of-knowledge.\n", " - Absolutely worth your time to read the whole chapter, even if you skip section 2.2.4 (pp.15-18) on Cox's proof.\n", " - [Edwin Jaynes, Probability Theory--The Logic of Science (2003)](http://www.med.mcgill.ca/epidemiology/hanley/bios601/GaussianModel/JaynesProbabilityTheory.pdf). \n", " - Brilliant book on Bayesian view on probability theory. Just for fun, scan the annotated bibliography and references.\n", " - [Aubrey Clayton, Bernoulli's Fallacy--Statistical Illogic and the Crisis of Modern Science (2021)](https://aubreyclayton.com/bernoulli)\n", " - A very readable account on the history of statistics and probability theory. Discusses why most popular statistics recipes are very poor scientific analysis tools. Use probability theory instead!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### [Data Analysis: A Bayesian Tutorial](https://www.amazon.com/Data-Analysis-Bayesian-Devinderjit-Sivia/dp/0198568320)\n", "\n", "- The following is an excerpt from the book [Data Analysis: A Bayesian Tutorial](https://www.amazon.com/Data-Analysis-Bayesian-Devinderjit-Sivia/dp/0198568320) (2006), by D.S. Sivia with J.S. Skilling:\n", "\n", "
\n", "\n", "- Does this fragment resonate with your own experience? \n", "\n", "- In this lesson we introduce *Probability Theory* (PT) again. As we will see in the next lessons, PT is all you need to make sense of machine learning, artificial intelligence, statistics, etc. \n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Example Problem: Disease Diagnosis\n", "\n", "- **Problem**: Given a disease with prevalence of 1% and a test procedure with sensitivity ('true positive' rate) of 95% and specificity ('true negative' rate) of 85% , what is the chance that somebody who tests positive actually has the disease?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- **Solution**: Use probabilistic inference, to be discussed in this lecture. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### The Design of Probability Theory\n", "\n", "- Define an **event** (or \"proposition\") $A$ as a statement, whose truth can be contemplated by a person, e.g., \n", "\n", "$$𝐴= \\texttt{'there is life on Mars'}$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- If we assume the fact $$I = \\texttt{'All known life forms require water'}$$ and a new piece of information $$x = \\texttt{'There is water on Mars'}$$ becomes available, how _should_ our degree of belief in event $A$ be affected (if we were rational)? " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "- [Richard T. Cox, 1946](https://aapt.scitation.org/doi/10.1119/1.1990764) developed a **calculus for rational reasoning** about how to represent and update the degree of _beliefs_ about the truth value of events when faced with new information. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- In developing this calculus, only some very agreeable assumptions were made, e.g.,\n", " - (Representation). Degrees of rational belief (or, as we shall later call them, probabilities) are represented by real numbers.\n", " - (Transitivity). If the belief in $A$ is greater than the belief in $B$, and the belief in $B$ is greater than the belief in $C$, then the belief in $A$ must be greater than the belief in $C$.\n", " - (Consistency). If the belief in an event can be inferred in two different ways, then the two ways must agree on the resulting belief." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "- This effort resulted in confirming that the **sum and product rules of Probability Theory** [(to be discussed below)](#PT-calculus) are the **only** proper rational way to process belief intensities. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- $\\Rightarrow$ Probability theory (PT) provides _the_ **theory of optimal processing of incomplete information** (see [Cox theorem](https://en.wikipedia.org/wiki/Cox%27s_theorem), and [Caticha](https://github.com/bertdv/BMLIP/blob/master/lessons/notebooks/files/Caticha-2012-Entropic-Inference-and-the-Foundations-of-Physics.pdf), pp.7-24), and as such provides the (only) proper quantitative framework for drawing conclusions from a finite (read: incomplete) data set." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Why Probability Theory for Machine Learning?\n", "\n", "- Machine learning concerns drawing conclusions about model parameter settings from (a finite set of) data and therefore PT provides the _optimal calculus for machine learning_. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- In general, nearly all interesting questions in machine learning can be stated in the following form (a conditional probability):\n", "\n", "$$p(\\texttt{whatever-we-want-to-know}\\, | \\,\\texttt{whatever-we-do-know})$$\n", "\n", "where $p(a|b)$ means the probability that $a$ is true, given that $b$ is true." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Examples\n", " - Predictions\n", " $$p(\\,\\texttt{future-observations}\\,|\\,\\texttt{past-observations}\\,)$$\n", " - Classify a received data point $x$ \n", " $$p(\\,x\\texttt{-belongs-to-class-}k \\,|\\,x\\,)$$\n", " - Update a model based on a new observation\n", " $$p(\\,\\texttt{model-parameters} \\,|\\,\\texttt{new-observation},\\,\\texttt{past-observations}\\,)$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Frequentist vs. Bayesian Interpretation of Probabilities\n", "\n", "- The interpretation of a probability as a **degree-of-belief** about the truth value of an event is also called the **Bayesian** interpretation. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- In the **Bayesian** interpretation, the probability is associated with a **state-of-knowledge** (usually held by a person, but formally by a rational agent). \n", " - For instance, in a coin tossing experiment, $p(\\texttt{tail}) = 0.4$ should be interpreted as the belief that there is a 40% chance that $\\texttt{tail}$ comes up if the coin were tossed.\n", " - Under the Bayesian interpretation, PT calculus (sum and product rules) **extends boolean logic to rational reasoning with uncertainty**. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "- The Bayesian interpretation contrasts with the **frequentist** interpretation of a probability as the relative frequency that an event would occur under repeated execution of an experiment.\n", "\n", " - For instance, if the experiment is tossing a coin, then $p(\\texttt{tail}) = 0.4$ means that in the limit of a large number of coin tosses, 40% of outcomes turn up as $\\texttt{tail}$. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- The Bayesian viewpoint is more generally applicable than the frequentist viewpoint, e.g., it is hard to apply the frequentist viewpoint to events like '$\\texttt{it will rain tomorrow}$'. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- The Bayesian viewpoint is clearly favored in the machine learning community. (In this class, we also strongly favor the Bayesian interpretation). " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Aubrey Clayton, in his wonderful book [Bernoulli's fallacy](https://aubreyclayton.com/bernoulli) (2021), writes about this issue: \n", " > “Compared with Bayesian methods, standard [frequentist] statistical techniques use only a small fraction of the available information about a research hypothesis (how well it predicts some observation), so naturally they will struggle when that limited information proves inadequate. Using standard statistical methods is like driving a car at night on a poorly lit highway: to keep from going in a ditch, we could build an elaborate system of bumpers and guardrails and equip the car with lane departure warnings and sophisticated navigation systems, and even then we could at best only drive to a few destinations. Or we could turn on the headlights.” " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- In this class, we aim to turn on the headlights and illuminate the elegance and power of the Bayesian approach to information processing. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Probability Theory Notation\n", "\n", "##### events\n", "- We define an **event** $A$ as a statement, whose truth can be contemplated by a person, e.g.,\n", "\n", "$$A = \\text{`it will rain tomorrow'}$$\n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- We write the denial of $A$, i.e. the event **not**-A, as $\\bar{A}$. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Given two events $A$ and $B$, we will shortly write the **conjunction** \"$A \\wedge B$\" as \"$A,B$\" or \"$AB$\". The conjunction $AB$ is true only if both $A$ and $B$ are true. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- We will write the **disjunction** \"$A \\lor B$\" as \"$A + B$\", which is true if either $A$ or $B$ is true or both $A$ and $B$ are true. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Note that, if $X$ is a variable, then an assignment $X=x$ (with $x$ a value, e.g., $X=5$) can be interpreted as an event. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "##### probabilities\n", "\n", "- For any event $A$, with background knowledge $I$, the **conditional probability of $A$ given $I$**, is written as \n", "$$p(A|I)\\,.$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- All probabilities are in principle conditional probabilities of the type $p(A|I)$, since there is always some background knowledge. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "##### Unfortunately, PT notation is usually rather sloppy :(\n", "\n", "- We often write $p(A)$ rather than $p(A|I)$ if the background knowledge $I$ is assumed to be obviously present. E.g., $p(A)$ rather than $p(\\,A\\,|\\,\\text{the-sun-comes-up-tomorrow}\\,)$." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- (In the context of variable assignments) we often write $p(x)$ rather than $p(X=x)$, assuming that the reader understands the context. " ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- In an apparent effort to further abuse notational conventions, $p(X)$ denotes the full distribution over variable $X$, i.e., the distribution for all assignments for $X$. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- If $X$ is a *discretely* valued variable, then $p(X=x)$ is a probability *mass* function (PMF) with $0\\le p(X=x)\\le 1$ and normalization $\\sum_x p(x) =1$. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- If $X$ is *continuously* valued, then $p(X=x)$ is a probability *density* function (PDF) with $p(X=x)\\ge 0$ and normalization $\\int_x p(x)\\mathrm{d}x=1$. \n", " - Note that if $X$ is continuously valued, then the value of the PDF $p(x)$ is not necessarily $\\le 1$. E.g., a uniform distribution on the continuous domain $[0,.5]$ has value $p(x) = 2$.\n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Often, we do not bother to specify if $p(x)$ refers to a continuous or discrete variable. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Probability Theory Calculus\n", " \n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Let $p(A|I)$ indicate the belief in event $A$, given that $I$ is true. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- The following product and sum rules are also known as the **axioms of probability theory**, but as discussed above, under some mild assumptions, they can be derived as the unique rules for *rational reasoning under uncertainty* ([Cox theorem, 1946](https://en.wikipedia.org/wiki/Cox%27s_theorem), and [Caticha, 2012](https://github.com/bertdv/BMLIP/blob/master/lessons/notebooks/files/Caticha-2012-Entropic-Inference-and-the-Foundations-of-Physics.pdf), pp.7-26)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- **Sum rule**. The disjunction for two events $A$ and $B$ given background $I$ is given by\n", "$$ \\boxed{p(A+B|I) = p(A|I) + p(B|I) - p(A,B|I)}$$\n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- **Product rule**. The conjuction of two events $A$ and $B$ with given background $I$ is given by \n", "$$ \\boxed{p(A,B|I) = p(A|B,I)\\,p(B|I)}$$\n", " \n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- **All legitimate probabilistic relations can be derived from the sum and product rules!**" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Independent and Mutually Exclusive Events\n", "\n", "- It will be helpful to introduce some terms concerning special probabilistic relationships between events. \n", "\n", "- Two events $A$ and $B$ are said to be **independent** if the probability of one is not altered by information about the truth of the other, i.e., $p(A|B) = p(A)$\n", " - $\\Rightarrow$ If $A$ and $B$ are independent, given $I$, then the product rule simplifies to $$p(A,B|I) = p(A|I) p(B|I)$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Two events $A_1$ and $A_2$ are said to be **mutually exclusive** if they cannot be true simultanously, i.e., if $p(A_1,A_2)=0$.\n", " - $\\Rightarrow$ For mutually exclusive events, the sum rule simplifies to\n", " $$p(A_1+A_2) = p(A_1) + p(A_2)$$\n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- A set of events $A_1, A_2, \\ldots, A_N$ is said to be **collectively exhaustive** if one of the statements is necessarily true, i.e., $A_1+A_2+\\cdots +A_N=\\mathrm{TRUE}$, or equivalently \n", "$$p(A_1+A_2+\\cdots +A_N)=1$$\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Note that, if $A_1, A_2, \\ldots, A_n$ are both **mutually exclusive** and **collectively exhausitive** (MECE) events, then\n", " $$\\sum_{n=1}^N p(A_n) = p(A_1 + \\ldots + A_N) = 1$$\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### The Sum Rule and Marginalization\n", "\n", "- We mentioned that every inference problem in PT can be evaluated through the sum and product rules. Next, we present two useful corollaries: (1) _Marginalization_ and (2) _Bayes rule_ " ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- If $X \\in \\mathcal{X}$ and $Y \\in \\mathcal{Y}$ are variables over finite domains, then it follows that \n", "$$\n", "\\sum_{Y\\in \\mathcal{Y}} p(X,Y) = p(X) \\,.\n", "$$\n", " - Proof:\n", " $$\\begin{align*}\n", " \\sum_{Y\\in \\mathcal{Y}} p(X,Y) &= \\sum_{Y\\in \\mathcal{Y}} p(Y|X) p(X) \\\\\n", " &= p(X) \\underbrace{\\sum_{Y\\in \\mathcal{Y}} p(Y|X)}_{=1} \\\\\n", " &= p(X)\n", " \\end{align*}$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Summing $Y$ out of a joint distribution $p(X,Y)$ is called **marginalization** and the result $p(X)$ is sometimes referred to as the **marginal probability**. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Note that this is just a **generalized sum rule**. In fact, Bishop (p.14) (and some other authors as well) calls this the sum rule.\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Of course, in the continuous domain, the (generalized) sum rule becomes\n", "$$p(X)=\\int p(X,Y) \\,\\mathrm{d}Y$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### The Product Rule and Bayes Rule\n", "\n", "- Consider two variables $D$ and $\\theta$; it follows from symmetry arguments that \n", "$p(D,\\theta)=p(\\theta,D)$, and hence that\n", "$$p(D|\\theta)p(\\theta)=p(\\theta|D)p(D)$$ \n", "or, equivalently,\n", "$$ p(\\theta|D) = \\frac{p(D|\\theta) }{p(D)}p(\\theta)\\,.\\qquad \\text{(Bayes rule)}$$ " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- This last formula is called **Bayes rule** (or Bayes theorem). While Bayes rule is always true, a particularly useful application occurs when $D$ refers to an observed data set and $\\theta$ is set of model parameters. In that case,\n", "\n", " - the **prior** probability $p(\\theta)$ represents our **state-of-knowledge** about proper values for $\\theta$, before seeing the data $D$.\n", " - the **posterior** probability $p(\\theta|D)$ represents our state-of-knowledge about $\\theta$ after we have seen the data." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "$\\Rightarrow$ Bayes rule tells us how to update our knowledge about model parameters when facing new data. Hence, \n", "\n", "