{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Bayesian Machine Learning"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Preliminaries\n",
"\n",
"- Goals\n",
" - Introduction to Bayesian (i.e., probabilistic) modeling\n",
"- Materials\n",
" - Mandatory\n",
" - These lecture notes\n",
" - Optional\n",
" - Bishop pp. 21-24 "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Example Problem: Predicting a Coin Toss\n",
"\n",
"- **Question**. We observe a the following sequence of heads (h) and tails (t) when tossing the same coin repeatedly $$D=\\{hthhtth\\}\\,.$$\n",
"\n",
"- What is the probability that heads (h) comes up next?"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"- **Answer** later in this lecture. "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### The Bayesian Machine Learning Framework\n",
"\n",
"- Suppose that your task is to predict a future observation $x$, based on $N$ past observations $D=\\{x_1,\\dotsc,x_N\\}$."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"- The Bayesian approach for this task involves three stages: \n",
" 1. Model specification\n",
" 1. Parameter estimation (i.e., learning from observed data; using Bayesian inference)\n",
" 1. Prediction (apply the model)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
" \n",
"- Next, we discuss these three stages in a bit more detail."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### (1) Model specification\n",
"\n",
"Your first task is to propose a model with tuning parameters $\\theta$ for generating the observations $x$."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"- This involves specification of $p(x|\\theta)$ and a prior for the parameters $p(\\theta)$."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"- _You_ choose the distribution $p(x|\\theta)$ based on your physical understanding of the data generating process.\n",
"- Note that, for a given data set $D=\\{x_1,x_2,\\dots,x_N\\}$ with independent observations $x_n$,\n",
"$$ p(D|\\theta) = \\prod_{n=1}^N p(x_n|\\theta)$$\n",
"so usually you select a model for generating one observation $x_n$ and then use (in-)dependence assumptions to combine these models into a model for your observed data set $D$. "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"- _You_ choose the prior $p(\\theta)$ to reflect what you know about the parameter values before you see the data $D$.\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### (2) Parameter estimation\n",
"\n",
"- After model specification, you need to measure/collect a data set $D$. Then, use Bayes rule to find the posterior distribution for the parameters,\n",
"$$\n",
"p(\\theta|D) = \\frac{p(D|\\theta) p(\\theta)}{p(D)} \\propto p(D|\\theta) p(\\theta)\n",
"$$ \n",
" - The denominator is only a normalization factor."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"- Note that there's **no need for you to design a _smart_ parameter estimation algorithm**. The only complexity lies in the computational issues. "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"- This \"recipe\" works only if the RHS factors can be evaluated; this is what machine learning is about \n",
" $\\Rightarrow$ **Machine learning is EASY, apart from computational details :) **\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### (3) Prediction\n",
"\n",
"- Given the data $D$, our knowledge about the yet unobserved datum $x$ is captured by\n",
"$$\\begin{align*}\n",
"p(x|D) &= \\int p(x,\\theta|D) \\,\\mathrm{d}\\theta\\\\\n",
" &= \\int p(x|\\theta,D) p(\\theta|D) \\,\\mathrm{d}\\theta\\\\\n",
" &= \\int p(x|\\theta) p(\\theta|D) \\,\\mathrm{d}\\theta\\\\\n",
"\\end{align*}$$"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"- Again, **no need to invent a special prediction algorithm**. Probability theory takes care of all that. The complexity of prediction is just computational: how to carry out the marginalization over $\\theta$."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"- In order to execute prediction, you need to have access to the factors $p(x|\\theta)$ and $p(\\theta|D)$. Where do these factors come from? Are they available?\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"- What did we learn from $D$? Without access to $D$, we would predict new observations through\n",
"$$\n",
"p(x) = \\int p(x,\\theta) \\,\\mathrm{d}\\theta = \\int p(x|\\theta) p(\\theta) \\,\\mathrm{d}\\theta\n",
"$$"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"- NB The application of the learned posterior $p(\\theta|D)$ not necessarily has to be prediction. We use it here as an example, but other applications are of course also possible. \n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Bayesian Model Comparison \n",
"\n",
"- There appears to be a remaining problem: How good really were our model assumptions $p(x|\\theta)$ and $p(\\theta)$? "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"- Technically, this is a **model comparison** problem"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"- [**Q**.] What if I have more candidate models, say $\\mathcal{M} = \\{m_1,\\ldots,m_K\\}$ where each model relates to specific prior $p(\\theta|m_k)$ and likelihood $p(D|\\theta,m_k)$? Can we evaluate the relative performance of a model against another model from the set?"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"- [**A**.]: Start again with **model specification**. Specify a prior $p(m_k)$ for each of the models and then solve the desired inference problem: \n",
"$$\\begin{align*} \n",
"p(m_k|D) &= \\frac{p(D|m_k) p(m_k)}{p(D)} \\\\\n",
" &\\propto p(m_k) \\cdot p(D|m_k) \\\\\n",
" &= p(m_k)\\cdot \\int_\\theta p(D,\\theta|m_k) \\,\\mathrm{d}\\theta\\\\\n",
" &= \\underbrace{p(m_k)}_{\\substack{\\text{model}\\\\\\text{prior}}}\\cdot \\int_\\theta \\underbrace{p(D|\\theta,m_k)}_{\\text{likelihood}} \\,\\underbrace{p(\\theta|m_k)}_{\\text{prior}}\\, \\mathrm{d}\\theta\\\\\n",
"\\end{align*}$$"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Bayesian Model Comparison (continued) \n",
"\n",
"- You, the engineer, have to choose the factors $p(D|\\theta,m_k)$, $p(\\theta|m_k)$ and $p(m_k)$. After that, for a given data set $D$, the model posterior $p(m_k|D)$ can be computed. "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"- If you need to work with one model,select the model with largest posterior $p(m_k|D)$"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"- Alternatively, if you don't want to choose a model, you can do prediction by **Bayesian model averaging** to utilitize the predictive power from all models:\n",
"$$\\begin{align*}\n",
"p(x|D) &= \\sum_k \\int p(x,\\theta,m_k|D)\\,\\mathrm{d}\\theta \\\\\n",
" &= \\sum_k \\underbrace{p(m_k|D)}_{\\substack{\\text{model}\\\\\\text{posterior}}} \\cdot \\int \\underbrace{p(\\theta|D)}_{\\substack{\\text{parameter}\\\\\\text{posterior}}} \\, \\underbrace{p(x|\\theta,m_k)}_{\\text{likelihood}} \\,\\mathrm{d}\\theta\n",
"\\end{align*}$$ "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"- $\\Rightarrow$ In a Bayesian framework, **model comparison** follows the same recipe as parameter estimation; it just works at one higher hierarchical level."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"- More on this in part 2 (Tjalkens). "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Machine Learning and the Scientific Method Revisited\n",
"\n",
"- Bayesian probability theory provides a unified framework for information processing (and even the Scientific Method).\n",
"\n",
"
"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Now Solve the Example Problem: Predicting a Coin Toss\n",
"\n",
"- We observe a the following sequence of heads (h) and tails (t) when tossing the same coin repeatedly $$D=\\{hthhtth\\}\\,.$$\n",
"\n",
"- What is the probability that heads (h) comes up next? We solve this in the next slides ..."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Coin toss example (1): Model Specification\n",
"\n",
"We observe a sequence of $N$ coin tosses $D=\\{x_1,\\ldots,x_N\\}$ with $n$ heads. "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"##### Likelihood\n",
"- Assume a Bernoulli distributed variable $p(x_k=h|\\mu)=\\mu$, which leads to a **binomial** distribution for the likelihood (assume $n$ times heads were thrown):\n",
"$$ \n",
"p(D|\\mu) = \\prod_{k=1}^N p(x_k|\\mu) = \\mu^n (1-\\mu)^{N-n}\n",
"$$"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"##### Prior\n",
"- Assume the prior belief is governed by a **beta distribution**\n",
"$$\n",
"p(\\mu) = \\mathcal{B}(\\mu|\\alpha,\\beta) = \\frac{\\Gamma(\\alpha+\\beta)}{\\Gamma(\\alpha)\\Gamma(\\beta)} \\mu^{\\alpha-1}(1-\\mu)^{\\beta-1}\n",
"$$\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"- The Beta distribution is a **conjugate prior** for the Binomial distribution, which means that \n",
"$$\n",
"\\text{beta} \\propto \\text{binomial} \\times \\text{beta}\n",
"$$"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"- $\\alpha$ and $\\beta$ are called **hyperparameters**, since they parameterize the distribution for another parameter ($\\mu$). E.g., $\\alpha=\\beta=1$ (uniform).\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Coin toss example (2): Parameter estimation\n",
"\n",
"- Infer posterior PDF over $\\mu$ through Bayes rule\n",
"\n",
"$$\\begin{align*}\n",
"p(\\mu|D) &\\propto p(D|\\mu)\\,p(\\mu|\\alpha,\\beta) \\\\ \n",
" &= \\mu^n (1-\\mu)^{N-n} \\times \\mu^{\\alpha-1} (1-\\mu)^{\\beta-1} \\\\\n",
" &= \\mu^{n+\\alpha-1} (1-\\mu)^{N-n+\\beta-1} \n",
"\\end{align*}$$\n",
"\n",
"hence the posterior is also beta distributed as\n",
"\n",
"$$\n",
"p(\\mu|D) = \\mathcal{B}(\\mu|\\,n+\\alpha, N-n+\\beta)\n",
"$$"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"- Essentially, **here ends the machine learning activity**"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Coin Toss Example (3): Prediction\n",
"\n",
"- Now, we want to **use** the trained model. Let's use it to predict future observations. "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"- Marginalize over the parameter posterior to get the predictive PDF for a new coin toss $x_\\bullet$, given the data $D$,\n",
"\n",
"$$\\begin{align*}\n",
"p(x_\\bullet=h|D) &= \\int_0^1 p(x_\\bullet=h|\\mu)\\,p(\\mu|D) \\,\\mathrm{d}\\mu \\\\\n",
" &= \\int_0^1 \\mu \\times \\mathcal{B}(\\mu|\\,n+\\alpha, N-n+\\beta) \\,\\mathrm{d}\\mu \\\\\n",
" &= \\frac{n+\\alpha}{N+\\alpha+\\beta} \\qquad \\mbox{(a.k.a. Laplace rule)}\\hfill\n",
"\\end{align*}$$"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"- Finally, we're ready to solve our example problem: for $D=\\{hthhtth\\}$ and uniform prior ($\\alpha=\\beta=1$), we get\n",
"\n",
"$$ p(x_\\bullet=h|D)=\\frac{n+1}{N+2} = \\frac{4+1}{7+2} = \\frac{5}{9}$$\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Coin Toss Example: What did we learn?\n",
"\n",
"- What did we learn from the data? Before seeing any data, we think that $$p(x_\\bullet=h)=\\left. p(x_\\bullet=h|D) \\right|_{n=N=0} = \\frac{\\alpha}{\\alpha + \\beta}\\,.$$ "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"- After the $N$ coin tosses, we think that $p(x_\\bullet=h|D) = \\frac{n+\\alpha}{N+\\alpha+\\beta}$."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"- Note the following decomposition\n",
"\n",
"$$\\begin{align*}\n",
" p(x_\\bullet=h|\\,D) &= \\frac{n+\\alpha}{N+\\alpha+\\beta} = \\frac{n}{N+\\alpha+\\beta} + \\frac{\\alpha}{N+\\alpha+\\beta} \\\\\n",
" &= \\frac{N}{N+\\alpha+\\beta}\\cdot \\frac{n}{N} + \\frac{\\alpha+\\beta}{N+\\alpha+\\beta} \\cdot \\frac{\\alpha}{\\alpha+\\beta} \\\\\n",
" &= \\underbrace{\\frac{\\alpha}{\\alpha+\\beta}}_{prior} + \\underbrace{\\frac{N}{N+\\alpha+\\beta}}_{gain}\\cdot \\big( \\underbrace{\\frac{n}{N}}_{MLE} - \\underbrace{\\frac{\\alpha}{\\alpha+\\beta}}_{prior} \\big)\n",
"\\end{align*}$$"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"- Note that, since $0\\leq\\text{gain}\\lt 1$, the Bayesian estimate lies between prior and maximum likelihood estimate."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"- For large $N$, the gain goes to $1$ and $p(x_\\bullet=h|D)$ goes to the maximum likelihood estimate (the relative frequency) $n/N$."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"#### CODE EXAMPLE\n",
"\n",
"**Bayesian evolution of $p(\\mu|D)$ for the coin toss**\n",
"\n",
"Let's see how $p(\\mu|D)$ evolves as we increase the number of coin tosses $N$. We'll use two different priors to demonstrate the effect of the prior on the posterior (set $N=0$ to inspect the prior)."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/html": [
" \n"
],
"text/plain": [
"HTML{String}(\" \\n\")"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
""
],
"text/plain": [
"HTML{String}(\"\")"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
""
],
"text/plain": [
"HTML{String}(\"\")"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
" \n"
],
"text/plain": [
"HTML{String}(\" \\n\")"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"
| Bayesian | Maximum Likelihood | |
| 1. Model Specification | Choose a model $m$ with data generating distribution $p(x|\\theta,m)$ and parameter prior $p(\\theta|m)$ | Choose a model $m$ with same data generating distribution $p(x|\\theta,m)$. No need for priors. |
| 2. Learning | use Bayes rule to find the parameter posterior,\n", "$$\n", "p(\\theta|D) = \\propto p(D|\\theta) p(\\theta)\n", "$$ | By Maximum Likelihood (ML) optimization,\n", "$$ \n", " \\hat \\theta = \\arg \\max_{\\theta} p(D |\\theta)\n", "$$ |
| 3. Prediction | $$\n", "p(x|D) = \\int p(x|\\theta) p(\\theta|D) \\,\\mathrm{d}\\theta\n", "$$ | \n", "$$ \n", " p(x|D) = p(x|\\hat\\theta)\n", "$$ |