{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# The General EM Algorithm"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"\n",
"### Preliminaries\n",
"\n",
"- Goal \n",
" - A more formal treatment of the general EM algorithm \n",
"- Materials\n",
" - Mandatory\n",
" - These lecture notes\n",
" - Optional \n",
" - Bishop pp. 55-57 for Jensen's inequality\n",
" - Bishop pp. 439-443 for EM applied to GMM\n",
" - Bishop pp. 450-455 for the general EM algorithm\n",
" - Liang (2015) [Technical Details about the EM algorithm](./files/Liang-2015-technical-details-about-the-EM-algorithm.pdf)\n",
" - Bo and Batzoglou (2008) [What is the expectation algorithm?](./files/Bo-2008-What-is-the-expectation-algorithm.pdf)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### The Kullback-Leibler Divergence\n",
"\n",
"- In order to prove that the EM algorithm works, we will need [Gibbs inequality](https://en.wikipedia.org/wiki/Gibbs%27_inequality), which is a famous theorem in information theory.\n",
"\n",
"- Definition: the **Kullback-Leibler divergence** (a.k.a. **relative entropy**) is a distance measure between two distributions $q$ and $p$ and is defined as\n",
"\n",
"$$\n",
"D_{\\text{KL}}(q \\parallel p) \\triangleq \\sum_z q(z) \\log \\frac{q(z)}{p(z)}\n",
"$$"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"- Theorem: **Gibbs Inequality** ([proof](https://en.wikipedia.org/wiki/Gibbs%27_inequality#Proof) uses Jensen inquality): \n",
"$$\\boxed{ D_{\\text{KL}}(q \\parallel p) \\geq 0 }$$\n",
"with equality only iff $p=q$."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"- Note that the KL divergence is an asymmetric distance measure, i.e. in general \n",
"\n",
"$$D_{\\text{KL}}(q \\parallel p) \\neq D_{\\text{KL}}(p \\parallel q)$$"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### EM by maximizing a Lower Bound on the Log-Likelihood\n",
"\n",
"- Consider a model for observations $x$, hidden variables $z$ and tuning parameters $\\theta$. Note that, for **any** distribution $q(z)$, we can expand the log-likelihood as follows: \n",
"\n",
"$$\\begin{align*}\n",
"\\mathrm{L}(\\theta) &\\triangleq \\log p(x|\\theta) \\\\\n",
" &= \\sum_z q(z) \\log p(x|\\theta) \\\\\n",
" &= \\sum_z q(z) \\left( \\log p(x|\\theta) - \\log \\frac{q(z)}{p(z|x,\\theta)}\\right) + \\sum_z q(z) \\log \\frac{q(z)}{p(z|x,\\theta)} \\\\\n",
" &= \\sum_z q(z) \\log \\frac{p(x,z|\\theta)}{q(z)} + \\underbrace{D_{\\text{KL}}\\left( q(z) \\parallel p(z|x,\\theta) \n",
"\\right)}_{\\text{Kullback-Leibler div.}} \\tag{1} \\\\\n",
" &\\geq \\sum_z q(z) \\log \\frac{p(x,z|\\theta)}{q(z)} \\quad \\text{(use Gibbs inequality)} \\\\\n",
" &= \\underbrace{\\sum_z q(z) \\log p(x,z|\\theta)}_{\\text{expected complete-data log-likelihood}} + \\underbrace{\\mathcal{H}\\left[ q\\right]}_{\\text{entropy of }q} \\\\\n",
"&\\triangleq \\mathrm{LB}(q,\\theta) \n",
"\\end{align*}$$"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- Hence, $\\mathrm{LB}(q,\\theta)$ is a **lower bound** on the log-likelihood $\\log p(x|\\theta)$. "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"- Technically, the Expectation-Maximization (EM) algorithm is defined by coordinate ascent on $\\mathrm{LB}(q,\\theta)$:\n",
"\n",
"$$\\begin{align*}\n",
" &\\textrm{ } \\\\\n",
" &\\textrm{Initialize }: \\theta^{(0)}\\\\\n",
" &\\textrm{for }m = 1,2,\\ldots \\textrm{until convergence}\\\\\n",
" &\\quad q^{(m+1)} = \\arg\\max_q \\mathrm{LB}(q,\\theta^{(m)}) &\\quad \\text{% update responsibilities} \\\\\n",
" &\\quad \\theta^{(m+1)} = \\arg\\max_\\theta \\mathrm{LB}(q^{(m+1)},\\theta) &\\quad \\text{% re-estimate parameters}\n",
"\\end{align*}$$"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- Since $\\mathrm{LB}(q,\\theta) \\leq \\mathrm{L}(\\theta)$ (for all choices of $q(z)$), maximizing the lower-bound $\\mathrm{LB}$ will also maximize the log-likelihood wrt $\\theta$. The _reason_ to maximize $\\mathrm{LB}$ rather than log-likelihood $\\mathrm{L}$ directly is that $\\arg\\max \\mathrm{LB}$ often leads to easier expressions. E.g., see this illustrative figure (Bishop p.453):\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### EM Details"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"##### Maximizing $\\mathrm{LB}$ w.r.t. $q$ \n",
"\n",
"- Note that\n",
"$$\n",
"\\mathrm{LB}(q,\\theta) = \\mathrm{L}(\\theta) - D_{\\text{KL}}\\left( q(z) \\parallel p(z|x,\\theta) \\right)\n",
"$$\n",
"and consequenty, maximizing $\\mathrm{LB}$ over $q$ leads to minimization of the KL-divergence. Specifically, it follows from Gibbs inequality that\n",
"$$\n",
" q^{(m+1)}(z) := p(z|x,\\theta^{(m)})\n",
"$$"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"##### Maximizing $\\mathrm{LB}$ w.r.t. $\\theta$\n",
"\n",
"- It also follows from (the last line of) the multi-line derivation above that maximizing $\\mathrm{LB}$ w.r.t. $\\theta$ amounts to maximization of the _expected complete-data log-likelihood_ (where the complete data set is defined as $\\{(x_i,z_i)\\}_{i=1}^N$). Hence, the EM algorithm comprises iterations of\n",
"$$\n",
"\\boxed{\\textbf{EM}:\\, \\theta^{(m+1)} := \\underbrace{\\arg\\max_\\theta}_{\\text{M-step}} \\underbrace{\\sum_z \\overbrace{p(z|x,\\theta^{(m)})}^{=q^{(m+1)}(z)} \\log p(x,z|\\theta)}_{\\text{E-step}} }\n",
"$$\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"\n",
"### Hoemwork exercise: EM for GMM Revisited\n",
"\n",
"##### E-step\n",
"- Write down the GMM generative model\n",
"- The complete-data set is $D_c=\\{x_1,z_1,x_2,z_2,\\ldots,x_n,z_n\\}$. Write down the _complete-data_ likelihood $p(D_c|\\theta)$\n",
"- Write down the complete-data _log_-likelihood $\\log p(D_c|\\theta)$\n",
"- Write down the _expected_ complete-data log-likelihood $\\mathrm{E}_Z\\left[ \\log p(D_c|\\theta) \\right]$\n",
"\n",
"\n",
"##### M-step\n",
"- Maximize $\\mathrm{E}_Z\\left[ \\log p(D_c|\\theta) \\right]$ w.r.t. $\\theta=\\{\\pi,\\mu,\\Sigma\\}$\n",
"\n",
"- Verify that your solution is the same as the 'intuitive' solution of the previous lesson. "
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true,
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Homework Exercise: EM for Three Coins problem\n",
"\n",
"- You have three coins in your pocket. For each coin, outcomes $\\in \\{\\mathrm{H},\\mathrm{T}\\}$.\n",
"$$\n",
"p(\\mathrm{H}) = \\begin{cases} \\lambda & \\text{for coin }0 \\\\\n",
" \\rho & \\text{for coin }1 \\\\\n",
" \\theta & \\text{for coin }2 \\end{cases}\n",
"$$\n",
"\n",
" \n",
"- **Scenario**. Toss coin $0$. If Head comes up, toss three times with coin $1$; otherwise, toss three times with coin $2$.\n",
"\n",
"- The observed sequences **after** each toss with coin $0$ were $\\langle \\mathrm{HHH}\\rangle$, $\\langle \\mathrm{HTH}\\rangle$, $\\langle \\mathrm{HHT}\\rangle$, and $\\langle\\mathrm{HTT}\\rangle$\n",
"\n",
"- **Task**. Use EM to estimate most the likely values for $\\lambda$, $\\rho$ and $\\theta$\n",
"\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Report Card on EM\n",
"\n",
"- EM is a general procedure for learning in the presence of unobserved variables."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"- In a sense, it is a **family of algorithms**. The update rules you will derive depend on the probability model assumed."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"- (Good!) **No tuning parameters** such a learning rate, unlike gradient descent-type algorithms"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"- (Bad). EM is an iterative procedure that is very sensitive to initial\n",
"conditions! EM converges to a **local optimum**."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"- Start from trash $\\rightarrow$ end up with trash. Hence, we need a good and fast initialization procedure (often used: K-Means, see Bishop p.424)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"- Also used to train HMMs, etc."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### (OPTIONAL SLIDE) The Free-Energy Principle\n",
"\n",
"- The negative lower-bound $\\mathrm{F}(q,\\theta) \\triangleq -\\mathrm{LB}(q,\\theta)$ also appears in various scientific disciplines. In statistical physics and variational calculus, $F$ is known as the **free energy** functional. Hence, the EM algorithm is a special case of **free energy minimization**. "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"- It follows from Eq.1 above that\n",
"\n",
"$$\\begin{align*}\n",
"\\mathrm{F}(q,\\theta) &\\triangleq \\sum_z q(z) \\log \\frac{q(z)}{p(x,z|\\theta)} \\\\\n",
" &= \\underbrace{- \\mathrm{L}(\\theta)}_{-\\text{log-likelihood}} + \\underbrace{D_{\\text{KL}}\\left( q(z) \\parallel p(z|x,\\theta) \n",
"\\right)}_{\\text{divergence}}\n",
"\\end{align*}$$ "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"- The **Free-Energy Principle** (FEP) is an influential [neuroscientific theory](https://en.wikipedia.org/wiki/Free_energy_principle) that claims that information processing in the brain is also an example of free-energy minimization, see [Friston, 2009](./files/Friston-2009-The-free-energy-principle-a-rough-guide-to-the-brain.pdf)."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"- According to FEP, the brain contains a generative model $p(x,z,a,\\theta)$ for its environment. Here, $x$ are the sensory signals (observations); $z$ corresponds to (hidden) environmental causes of the sensorium; $a$ represents our actions (control signals for movement) and $\\theta$ describes the fixed 'physical rules' of the world."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### (OPTIONAL SLIDE) The Free-Energy Principle, cont'd\n",
"\n",
"- Solely through free-energy minimization, the brain infers an approximate posterior $q(z,a,\\theta)$, thus inferring our **perception**, **actions**, and **learning** equations."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"- Free-energy can be interpreted as (a generalized notion of sum of) prediction errors. Free-energy minimization aims to minimize prediction errors through perception, actions, and learning. (The next picture is from a [2012 tutorial presentation](http://slideplayer.com/slide/9925565/) by Karl Friston) "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### (OPTIONAL SLIDE) The Free-Energy Principle, cont'd\n",
"\n",
"- $\\Rightarrow$ The brain is \"nothing but\" an approximate Bayesian agent that tries to build a model for its world. Actions (behavior) are selected to fulfill prior expectations about the world. "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"- In our [BIASlab research team](http://biaslab.org) in the Signal Processing Systems group (FLUX floor 7), we work on developing (approximately Bayesian) **artificial** intelligent agents that learn purposeful behavior through interactions with their environments, inspired by the free-energy principle. Applications include\n",
" - robotics\n",
" - self-learning of new signal processing algorithms, e.g., for hearing aids\n",
" - self-learning how to drive\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"- Let me know (email bert.de.vries@tue.nl) if you're interested in developing machine learning applications that are directly inspired by how the brain computes. We often have open traineeships or MSc thesis projects available. "
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"open(\"../../styles/aipstyle.html\") do f\n",
" display(\"text/html\", read(f, String))\n",
"end"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"anaconda-cloud": {},
"celltoolbar": "Slideshow",
"kernelspec": {
"display_name": "Julia 1.1.0",
"language": "julia",
"name": "julia-1.1"
},
"language_info": {
"file_extension": ".jl",
"mimetype": "application/julia",
"name": "julia",
"version": "1.1.0"
}
},
"nbformat": 4,
"nbformat_minor": 1
}