{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Bayesian Machine Learning" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Preliminaries\n", "\n", "- Goals\n", " - Introduction to Bayesian (i.e., probabilistic) modeling\n", "- Materials\n", " - Mandatory\n", " - These lecture notes\n", " - Optional\n", " - Bishop pp. 21-24 " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Example Problem: Predicting a Coin Toss\n", "\n", "- **Question**. We observe a the following sequence of heads (h) and tails (t) when tossing the same coin repeatedly $$D=\\{hthhtth\\}\\,.$$\n", "\n", "- What is the probability that heads (h) comes up next?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- **Answer** later in this lecture. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### The Bayesian Machine Learning Framework\n", "\n", "- Suppose that your task is to predict a 'new' datum $x$, based on $N$ observations $D=\\{x_1,\\dotsc,x_N\\}$." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- The Bayesian approach for this task involves three stages: \n", " 1. Model specification\n", " 1. parameter estimation (inference, learning)\n", " 1. Prediction (apply the model)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ " \n", "- Next, we discuss these three stages in a bit more detail." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### (1) Model specification\n", "\n", "Your first task is to propose a model with tuning parameters $\\theta$ for generating the data $D$." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- This involves specification of $p(D|\\theta)$ and a prior for the parameters $p(\\theta)$." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- _You_ choose the distribution $p(D|\\theta)$ based on your physical understanding of the data generating process.\n", " - Note that, for independent observations $x_n$,\n", "$$ p(D|\\theta) = \\prod_{n=1}^N p(x_n|\\theta)$$\n", "so usually you select a model for generating one observation $x_n$ and then use (in-)dependence assumptions to combine these models into a model for $D$. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- _You_ choose the prior $p(\\theta)$ to reflect what you know about the parameter values before you see the data $D$.\n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### (2) Parameter estimation\n", "\n", "- After model specification, you need to measure/collect a data set $D$. Then, use Bayes rule to find the posterior distribution for the parameters,\n", "$$\n", "p(\\theta|D) = \\frac{p(D|\\theta) p(\\theta)}{p(D)} \\propto p(D|\\theta) p(\\theta)\n", "$$ " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Note that there's **no need for you to design a _smart_ parameter estimation algorithm**. The only complexity lies in the computational issues. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- This \"recipe\" works only if the RHS factors can be evaluated; this is what machine learning is about \n", " $\\Rightarrow$ **Machine learning is easy, apart from computational details:)**\n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### (3) Prediction\n", "\n", "- Given the data $D$, our knowledge about the yet unobserved datum $x$ is captured by\n", "$$\\begin{align*}\n", "p(x|D) &= \\int p(x,\\theta|D) \\,\\mathrm{d}\\theta\\\\\n", " &= \\int p(x|\\theta,D) p(\\theta|D) \\,\\mathrm{d}\\theta\\\\\n", " &= \\int p(x|\\theta) p(\\theta|D) \\,\\mathrm{d}\\theta\\\\\n", "\\end{align*}$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Again, **no need to invent a special prediction algorithm**. Probability theory takes care of all that. The complexity of prediction is just computational: how to carry out the marginalization over $\\theta$." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- In order to execute prediction, you need to have access to the factors $p(x|\\theta)$ and $p(\\theta|D)$. Where do these factors come from? Are they available?\n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- What did we learn from $D$? Without access to $D$, we would predict new observations through\n", "$$\n", "p(x) = \\int p(x,\\theta) \\,\\mathrm{d}\\theta = \\int p(x|\\theta) p(\\theta) \\,\\mathrm{d}\\theta\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- NB The application of the learned posterior $p(\\theta|D)$ not necessarily has to be prediction. We use it here as an example, but other applications are of course also possible. \n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Bayesian Model Comparison \n", "\n", "- There appears to be a remaining problem: How good really were our model assumptions $p(x|\\theta)$ and $p(\\theta)$? " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Technically, this is a **model comparison** problem" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- [**Q**.] What if I have more candidate models, say $\\mathcal{M} = \\{m_1,\\ldots,m_K\\}$ where each model relates to specific prior $p(\\theta|m_k)$ and likelihood $p(D|\\theta,m_k)$? Can we evaluate the relative performance of a model against another model from the set?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- [**A**.]: Start again with **model specification**. Specify a prior $p(m_k)$ for each of the models and then solve the desired inference problem: \n", "$$\\begin{align*} \n", "p(m_k|D) &= \\frac{p(D|m_k) p(m_k)}{p(D)} \\\\\n", " &\\propto p(m_k) \\cdot p(D|m_k) \\\\\n", " &= p(m_k)\\cdot \\int_\\theta p(D,\\theta|m_k) \\,\\mathrm{d}\\theta\\\\\n", " &= \\underbrace{p(m_k)}_{\\substack{\\text{model}\\\\\\text{prior}}}\\cdot \\int_\\theta \\underbrace{p(D|\\theta,m_k)}_{\\text{likelihood}} \\,\\underbrace{p(\\theta|m_k)}_{\\text{prior}}\\, \\mathrm{d}\\theta\\\\\n", "\\end{align*}$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Bayesian Model Comparison (continued) \n", "\n", "- You, the engineer, have to choose the factors $p(D|\\theta,m_k)$, $p(\\theta|m_k)$ and $p(m_k)$. After that, for a given data set $D$, the model posterior $p(m_k|D)$ can be computed. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- If you need to work with one model,select the model with largest posterior $p(m_k|D)$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Alternatively, if you don't want to choose a model, you can do prediction by **Bayesian model averaging** to utilitize the predictive power from all models:\n", "$$\\begin{align*}\n", "p(x|D) &= \\sum_k \\int p(x,\\theta,m_k|D)\\,\\mathrm{d}\\theta \\\\\n", " &= \\sum_k \\underbrace{p(m_k|D)}_{\\substack{\\text{model}\\\\\\text{posterior}}} \\cdot \\int \\underbrace{p(\\theta|D)}_{\\substack{\\text{parameter}\\\\\\text{posterior}}} \\, \\underbrace{p(x|\\theta,m_k)}_{\\text{likelihood}} \\,\\mathrm{d}\\theta\n", "\\end{align*}$$ " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- $\\Rightarrow$ In a Bayesian framework, **model comparison** follows the same recipe as parameter estimation; it just works at one higher hierarchical level." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- More on this in part 2 (Tjalkens). " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Machine Learning and the Scientific Method Revisited\n", "\n", "- Bayesian probability theory provides a unified framework for information processing (and even the Scientific Method).\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Now Solve the Example Problem: Predicting a Coin Toss\n", "\n", "- We observe a the following sequence of heads (h) and tails (t) when tossing the same coin repeatedly $$D=\\{hthhtth\\}\\,.$$\n", "\n", "- What is the probability that heads (h) comes up next? We solve this in the next slides ..." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Coin toss example (1): Model Specification\n", "\n", "We observe a sequence of $N$ coin tosses $D=\\{x_1,\\ldots,x_N\\}$ with $n$ heads. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "##### Likelihood\n", "- Assume a Bernoulli distributed variable $p(x_k=h|\\mu)=\\mu$, which leads to a **binomial** distribution for the likelihood (assume $n$ times heads were thrown):\n", "$$ \n", "p(D|\\mu) = \\prod_{k=1}^N p(x_k|\\mu) = \\mu^n (1-\\mu)^{N-n}\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "##### Prior\n", "- Assume the prior belief is governed by a **beta distribution**\n", "$$\n", "p(\\mu) = \\mathcal{B}(\\mu|\\alpha,\\beta) = \\frac{\\Gamma(\\alpha+\\beta)}{\\Gamma(\\alpha)\\Gamma(\\beta)} \\mu^{\\alpha-1}(1-\\mu)^{\\beta-1}\n", "$$\n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- The Beta distribution is a **conjugate prior** for the Binomial distribution, which means that \n", "$$\n", "\\text{beta} \\propto \\text{binomial} \\times \\text{beta}\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- $\\alpha$ and $\\beta$ are called **hyperparameters**, since they parameterize the distribution for another parameter ($\\mu$). E.g., $\\alpha=\\beta=1$ (uniform).\n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Coin toss example (2): Parameter estimation\n", "\n", "- Infer posterior PDF over $\\mu$ through Bayes rule\n", "\n", "$$\\begin{align*}\n", "p(\\mu|D) &\\propto p(D|\\mu)\\,p(\\mu|\\alpha,\\beta) \\\\ \n", " &= \\mu^n (1-\\mu)^{N-n} \\times \\mu^{\\alpha-1} (1-\\mu)^{\\beta-1} \\\\\n", " &= \\mu^{n+\\alpha-1} (1-\\mu)^{N-n+\\beta-1} \n", "\\end{align*}$$\n", "\n", "hence the posterior is also beta distributed as\n", "\n", "$$\n", "p(\\mu|D) = \\mathcal{B}(\\mu|\\,n+\\alpha, N-n+\\beta)\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Essentially, **here ends the machine learning activity**" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Coin Toss Example (3): Prediction\n", "\n", "- Now, we want to **use** the trained model. Let's use it to predict future observations. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Marginalize over the parameter posterior to get the predictive PDF for a new coin toss $x_\\bullet$, given the data $D$,\n", "\n", "$$\\begin{align*}\n", "p(x_\\bullet=h|D) &= \\int_0^1 p(x_\\bullet=h|\\mu)\\,p(\\mu|D) \\,\\mathrm{d}\\mu \\\\\n", " &= \\int_0^1 \\mu \\times \\mathcal{B}(\\mu|\\,n+\\alpha, N-n+\\beta) \\,\\mathrm{d}\\mu \\\\\n", " &= \\frac{n+\\alpha}{N+\\alpha+\\beta} \\qquad \\mbox{(a.k.a. Laplace rule)}\\hfill\n", "\\end{align*}$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Finally, we're ready to solve our example problem: for $D=\\{hthhtth\\}$ and uniform prior ($\\alpha=\\beta=1$), we get\n", "\n", "$$ p(x_\\bullet=h|D)=\\frac{n+1}{N+2} = \\frac{4+1}{7+2} = \\frac{5}{9}$$\n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Coin Toss Example: What did we learn?\n", "\n", "- What did we learn from the data? Before seeing any data, we think that $$p(x_\\bullet=h)=\\left. p(x_\\bullet=h|D) \\right|_{n=N=0} = \\frac{\\alpha}{\\alpha + \\beta}\\,.$$ " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- After the $N$ coin tosses, we think that $p(x_\\bullet=h|D) = \\frac{n+\\alpha}{N+\\alpha+\\beta}$." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Note the following decomposition\n", "\n", "$$\\begin{align*}\n", " p(x_\\bullet=h|\\,D) &= \\frac{n+\\alpha}{N+\\alpha+\\beta} = \\frac{n}{N+\\alpha+\\beta} + \\frac{\\alpha}{N+\\alpha+\\beta} \\\\\n", " &= \\frac{N}{N+\\alpha+\\beta}\\cdot \\frac{n}{N} + \\frac{\\alpha+\\beta}{N+\\alpha+\\beta} \\cdot \\frac{\\alpha}{\\alpha+\\beta} \\\\\n", " &= \\underbrace{\\frac{\\alpha}{\\alpha+\\beta}}_{prior} + \\underbrace{\\frac{N}{N+\\alpha+\\beta}}_{gain}\\cdot \\big( \\underbrace{\\frac{n}{N}}_{MLE} - \\underbrace{\\frac{\\alpha}{\\alpha+\\beta}}_{prior} \\big)\n", "\\end{align*}$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Note that, since $0\\leq\\text{gain}\\lt 1$, the Bayesian estimate lies between prior and maximum likelihood estimate." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- For large $N$, the gain goes to $1$ and $p(x_\\bullet=h|D)$ goes to the maximum likelihood estimate (the relative frequency) $n/N$." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### CODE EXAMPLE\n", "\n", "**Bayesian evolution of $p(\\mu|D)$ for the coin toss**\n", "\n", "Let's see how $p(\\mu|D)$ evolves as we increase the number of coin tosses $N$. We'll use two different priors to demonstrate the effect of the prior on the posterior (set $N=0$ to inspect the prior)." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "
Bayesian | Maximum Likelihood | |
1. Model Specification | Choose a model $m$ with data generating distribution $p(x|\\theta,m)$ and parameter prior $p(\\theta|m)$ | Choose a model $m$ with same data generating distribution $p(x|\\theta,m)$. No need for priors. |
2. Learning | use Bayes rule to find the parameter posterior,\n", "$$\n", "p(\\theta|D) = \\propto p(D|\\theta) p(\\theta)\n", "$$ | By Maximum Likelihood (ML) optimization,\n", "$$ \n", " \\hat \\theta = \\arg \\max_{\\theta} p(D |\\theta)\n", "$$ |
3. Prediction | $$\n", "p(x|D) = \\int p(x|\\theta) p(\\theta|D) \\,\\mathrm{d}\\theta\n", "$$ | \n", "$$ \n", " p(x|D) = p(x|\\hat\\theta)\n", "$$ |