{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Probability Theory Review" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Preliminaries\n", "\n", "- Goal \n", " - Review of probability theory as a theory for rational/logical reasoning with uncertainties (i.e., a Bayesian interpretation)\n", "- Materials \n", " - Mandatory\n", " - These lecture notes\n", " - [Ariel Caticha - 2012 - Entropic Inference and the Foundations of Physics](https://github.com/bertdv/BMLIP/blob/master/lessons/notebooks/files/Caticha-2012-Entropic-Inference-and-the-Foundations-of-Physics.pdf), pp.7-26 (sections 2.1 through 2.5), on deriving probability theory. You may skip section 2.3.4: Cox's proof (pp.15-18). \n", " - The assignment is only meant to appreciate how this line of \"axiomatic derivation\" of the rules of PT goes. I will not ask questions about any details of the derivations at the exam. \n", " \n", " - Optional\n", " - the pre-recorded video guide and live class of 2020\n", " - [Ariel Caticha - 2012 - Entropic Inference and the Foundations of Physics](https://github.com/bertdv/BMLIP/blob/master/lessons/notebooks/files/Caticha-2012-Entropic-Inference-and-the-Foundations-of-Physics.pdf), pp.7-56 (ch.2: probability)\n", " - Great introduction to probability theory, in particular w.r.t. its correct interpretation as a state-of-knowledge.\n", " - Absolutely worth your time to read the whole chapter!\n", " - [Edwin Jaynes - 2003 - Probability Theory -- The Logic of Science](https://archive.org/details/ProbabilityTheoryTheLogicOfScience). \n", " - Brilliant book on Bayesian view on probability theory. Just for fun, scan the annotated bibliography and references.\n", " - Bishop pp. 12-24" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Example Problem: Disease Diagnosis\n", "\n", "- **Problem**: Given a disease with prevalence of 1% and a test procedure with sensitivity ('true positive' rate) of 95% and specificity ('true negative' rate) of 85% , what is the chance that somebody who tests positive actually has the disease?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- **Solution**: Use probabilistic inference, to be discussed in this lecture. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### The Design of Probability Theory\n", "\n", "- Define an **event** (or \"proposition\") $A$ as a statement, whose truth can be contemplated by a person, e.g., \n", "\n", "$$𝐴= \\texttt{'there is life on Mars'}$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- If we assume the fact $$I = \\texttt{'All known life forms require water'}$$ and a new piece of information $$x = \\texttt{'There is water on Mars'}$$ becomes available, how _should_ our degree of belief in event $A$ be affected (if we were rational)? " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "- [Richard T. Cox, 1946](https://aapt.scitation.org/doi/10.1119/1.1990764) developed a **calculus for rational reasoning** about how to represent and update the degree of _beliefs_ about the truth value of events when faced with new information. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- In developing this calculus, only some very agreeable assumptions were made, e.g.,\n", " - (Transitivity). If the belief in $A$ is greater than the belief in $B$, and the belief in $B$ is greater than the belief in $C$, then the belief in $A$ must be greater than the belief in $C$.\n", " - (Consistency). If the belief in an event can be inferred in two different ways, then the two ways must agree on the resulting belief." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "- This effort resulted in confirming that the [sum and product rules of Probability Theory](#PT-calculus) are the **only** proper rational way to process belief intensities. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- $\\Rightarrow$ Probability theory (PT) provides _the_ **theory of optimal processing of incomplete information** (see [Cox theorem](https://en.wikipedia.org/wiki/Cox%27s_theorem), and [Caticha](https://github.com/bertdv/BMLIP/blob/master/lessons/notebooks/files/Caticha-2012-Entropic-Inference-and-the-Foundations-of-Physics.pdf), pp.7-24), and as such provides the (only) proper quantitative framework for drawing conclusions from a finite (read: incomplete) data set." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Why Probability Theory for Machine Learning?\n", "\n", "- Machine learning concerns drawing conclusions about model parameter settings from (a finite set of) data and therefore PT provides the _optimal calculus for machine learning_. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- In general, nearly all interesting questions in machine learning can be stated in the following form (a conditional probability):\n", "\n", "$$p(\\texttt{whatever-we-want-to-know}\\, | \\,\\texttt{whatever-we-do-know})$$\n", "\n", "where $p(a|b)$ means the probability that $a$ is true, given that $b$ is true." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Examples\n", " - Predictions\n", " $$p(\\,\\texttt{future-observations}\\,|\\,\\texttt{past-observations}\\,)$$\n", " - Classify a received data point $x$ \n", " $$p(\\,x\\texttt{-belongs-to-class-}k \\,|\\,x\\,)$$\n", " - Update a model based on a new observation\n", " $$p(\\,\\texttt{model-parameters} \\,|\\,\\texttt{new-observation},\\,\\texttt{past-observations}\\,)$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Frequentist vs. Bayesian Interpretation of Probabilities\n", "\n", "- The interpretation of a probability as a **degree-of-belief** about the truth value of an event is also called the **Bayesian** interpretation. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- In the **Bayesian** interpretation, the probability is associated with a **state-of-knowledge** (usually held by a person). \n", " - For instance, in a coin tossing experiment, $p(\\texttt{tail}) = 0.4$ should be interpreted as the belief that there is a 40% chance that $\\texttt{tail}$ comes up if the coin were tossed.\n", " - Under the Bayesian interpretation, PT calculus (sum and product rules) **extends boolean logic to rational reasoning with uncertainty**. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "- The Bayesian interpretation contrasts with the **frequentist** interpretation of a probability as the relative frequency that an event would occur under repeated execution of an experiment.\n", "\n", " - For instance, if the experiment is tossing a coin, then $p(\\texttt{tail}) = 0.4$ means that in the limit of a large number of coin tosses, 40% of outcomes turn up as $\\texttt{tail}$. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- The Bayesian viewpoint is more generally applicable than the frequentist viewpoint, e.g., it is hard to apply the frequentist viewpoint to events like '$\\texttt{it will rain tomorrow}$'. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- The Bayesian viewpoint is clearly favored in the machine learning community. (In this class, we also strongly favor the Bayesian interpretation). " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Probability Theory Notation\n", "\n", "##### events\n", "- Define an **event** $A$ as a statement, whose truth can be contemplated by a person, e.g.,\n", "\n", "$$A = \\text{'it will rain tomorrow'}$$\n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- We write the denial of $A$, i.e. the event **not**-A, as $\\bar{A}$. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Given two events $A$ and $B$, we write the **conjunction** \"$A \\wedge B$\" as \"$A,B$\" or \"$AB$\". The conjunction $AB$ is true only if both $A$ and $B$ are true. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- We will write the **disjunction** \"$A \\lor B$\" as \"$A + B$\", which is true if either $A$ or $B$ is true or both $A$ and $B$ are true. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Note that, if $X$ is a variable, then an assignment $X=x$ (with $x$ a value, e.g., $X=5$) can be interpreted as an event. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "##### probabilities\n", "\n", "- For any event $A$, with background knowledge $I$, the **conditional probability of $A$ given $I$**, is written as \n", "$$p(A|I)\\,.$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- All probabilities are in principle conditional probabilities of the type $p(A|I)$, since there is always some background knowledge. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "##### Unfortunately, PT notation is usually rather sloppy :(\n", "\n", "- We often write $p(A)$ rather than $p(A|I)$ if the background knowledge $I$ is assumed to be obviously present. E.g., $p(A)$ rather than $p(\\,A\\,|\\,\\text{the-sun-comes-up-tomorrow}\\,)$." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- (In the context of random variable assignments) we often write $p(x)$ rather than $p(X=x)$, assuming that the reader understands the context. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- In an apparent effort to further abuse notational conventions, $p(X)$ denotes the full distribution over random variable $X$, i.e., the distribution for all assignments for $X$. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- If $X$ is a *discretely* valued variable, then $p(X=x)$ is a probability *mass* function (PMF) with $0\\le p(X=x)\\le 1$ and normalization $\\sum_x p(x) =1$. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- If $X$ is *continuously* valued, then $p(X=x)$ is a probability *density* function (PDF) with $p(X=x)\\ge 0$ and normalization $\\int_x p(x)\\mathrm{d}x=1$. \n", " - Note that if $X$ is continuously valued, then the value of the PDF $p(x)$ is not necessarily $\\le 1$. E.g., a uniform distribution on the continuous domain $[0,.5]$ has value $p(x) = 2$.\n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Often, we do not bother to specify if $p(x)$ refers to a continuous or discrete variable. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Probability Theory Calculus\n", " \n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Let $p(A|I)$ indicate the belief in event $A$, given that $I$ is true. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- The following product and sum rules are also known as the **axioms of probability theory**, but as discussed above, under some mild assumptions, they can be derived as the unique rules for *rational reasoning under uncertainty* ([Cox theorem, 1946](https://en.wikipedia.org/wiki/Cox%27s_theorem), and [Caticha, 2012](https://github.com/bertdv/BMLIP/blob/master/lessons/notebooks/files/Caticha-2012-Entropic-Inference-and-the-Foundations-of-Physics.pdf), pp.7-26)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- **Sum rule**. The disjunction for two events $A$ and $B$ given background $I$ is given by\n", "$$ \\boxed{p(A+B|I) = p(A|I) + p(B|I) - p(A,B|I)}$$\n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- **Product rule**. The conjuction of two events $A$ and $B$ with given background $I$ is given by \n", "$$ \\boxed{p(A,B|I) = p(A|B,I)\\,p(B|I)}$$\n", " \n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- **All legitimate probabilistic relations can be derived from the sum and product rules!**" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Independent and Mutually Exclusive Events\n", "\n", "- Two events $A$ and $B$ are said to be **independent** if the probability of one is not altered by information about the truth of the other, i.e., $p(A|B) = p(A)$\n", " - $\\Rightarrow$ If $A$ and $B$ are independent, given $I$, then the product rule simplifies to $$p(A,B|I) = p(A|I) p(B|I)$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Two events $A_1$ and $A_2$ are said to be **mutually exclusive** if they cannot be true simultanously, i.e., if $p(A_1,A_2)=0$.\n", " - $\\Rightarrow$ For mutually exclusive events, the sum rule simplifies to\n", " $$p(A_1+A_2) = p(A_1) + p(A_2)$$\n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- A set of events $A_1, A_2, \\ldots, A_N$ is said to be **collectively exhaustive** if one of the statements is necessarily true, i.e., $A_1+A_2+\\cdots +A_N=\\mathrm{TRUE}$, or equivalently \n", "$$p(A_1+A_2+\\cdots +A_N)=1$$\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Note that, if $A_1, A_2, \\ldots, A_n$ are both **mutually exclusive** and **collectively exhausitive** (MECE) events, then\n", " $$\\sum_{n=1}^N p(A_n) = p(A_1 + \\ldots + A_N) = 1$$\n", " - More generally, if $\\{A_n\\}$ are MECE events, then $\\sum_{n=1}^N p(A_n,B) = p(B)$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### The Sum Rule and Marginalization\n", "\n", "- We mentioned that every inference problem in PT can be evaluated through the sum and product rules. Next, we present two useful corollaries: (1) _Marginalization_ and (2) _Bayes rule_ " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- If $X \\in \\mathcal{X}$ and $Y \\in \\mathcal{Y}$ are random variables over finite domains, than it follows from the above considerations about MECE events that \n", "$$\n", "\\sum_{Y\\in \\mathcal{Y}} p(X,Y) = p(X) \\,.\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Summing $Y$ out of a joint distribution $p(X,Y)$ is called **marginalization** and the result $p(X)$ is sometimes referred to as the **marginal probability**. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Note that this is just a **generalized sum rule**. In fact, Bishop (p.14) (and some other authors as well) calls this the sum rule.\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Of course, in the continuous domain, the (generalized) sum rule becomes\n", "$$p(X)=\\int p(X,Y) \\,\\mathrm{d}Y$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### The Product Rule and Bayes Rule\n", "\n", "- Consider two variables $D$ and $\\theta$; it follows from symmetry arguments that \n", "$$p(D,\\theta)=p(D|\\theta)p(\\theta)=p(\\theta|D)p(D)$$ \n", "and hence that\n", "$$ p(\\theta|D) = \\frac{p(D|\\theta) }{p(D)}p(\\theta)\\,.$$ " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- This formula is called **Bayes rule** (or Bayes theorem). While Bayes rule is always true, a particularly useful application occurs when $D$ refers to an observed data set and $\\theta$ is set of model parameters. In that case,\n", "\n", " - the **prior** probability $p(\\theta)$ represents our **state-of-knowledge** about proper values for $\\theta$, before seeing the data $D$.\n", " - the **posterior** probability $p(\\theta|D)$ represents our state-of-knowledge about $\\theta$ after we have seen the data." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "$\\Rightarrow$ Bayes rule tells us how to update our knowledge about model parameters when facing new data. Hence, \n", "\n", "