{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/fonnesbeck/Bios8366/blob/master/notebooks/Section3_1-Expectation-Maximization.ipynb)\n", "\n", "# Expectation Maximization Algorithm\n", "\n", "Expectation maximization (EM, Dempster et al. 1977) uses iterative optimization along with a latent variable model to obtain maximum likelihood estimates for models whose parameters are difficult to estimate directly. The algorithm was motivated by missing data imputation. However, the missing values may be deliberately introduced to the problem, as a conceptual ploy that simplifies the obtaining of a solution.\n", "\n", "It may not be intuitive how introducing latent (missing) elements to a problem will facilitate its solution, but it works essentially by breaking the optimization into two steps:\n", "\n", "1. generating an **expectation** over the missing variable(s) based on current estimates of parameters\n", "2. **maximizing** the log-likelihood from the expectation step, thereby generating updated estimates of parameters\n", "\n", "EM is particularly suited to estimating the parameters of *mixture models*, where we do not know from which component each observation is derived." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In general, suppose we have observed quantities $x = x_1,\\ldots,x_n$ and unobserved (latent) quantities $z = z_1,\\ldots,z_m$ that are derived from some joint model:\n", "\n", "$$y = (x,z) \\sim P(x,z|\\theta)$$\n", "\n", "We are interested in obtaining the MLE for the marginal distribution of $X$:\n", "\n", "$$x \\sim P(x|\\theta)$$\n", "\n", "However, it is difficult to marginalize over $Z$ and maximize. EM gets around this by iteratively improving an initial estimate $\\theta^{(0)}$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Example: Mixture of normals\n", "\n", "Consider a set of observations, each of which has been drawn from one of two populations:\n", "\n", "$$x^{(a)} \\sim N(\\mu_a, \\sigma^2_a)$$\n", "$$x^{(b)} \\sim N(\\mu_b, \\sigma^2_b)$$\n", "\n", "except we only observe the values for $x = [x^{(a)}, x^{(b)}]$, not the labels which identify which population they are derived from." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "np.random.seed(77)\n", "\n", "# True parameter values\n", "mu_true = np.array([2, 5])\n", "sigma_true = np.array([0.5, 1])\n", "psi_true = .6\n", "n = 100\n", "\n", "# Simulate from each distribution according to mixing proportion psi\n", "z = np.random.binomial(1, psi_true, n)\n", "x = np.random.normal(mu_true[z], sigma_true[z])\n", "\n", "_ = plt.hist(x, bins=20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The observed data is then a finite mixture of normal distributions:\n", "\n", "$$X = (1 - \\psi)X^{(a)} + \\psi X^{(b)}$$\n", "\n", "This is a generative representation of the data, whereby unobserved labels $z_i$ are generated according to probability $\\psi$. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "The joint distribution of $x$ and $z$ can be factorized into:\n", "\n", "$$P(x_i,z_i) = P(x_i \,|\, z_i)P(z_i)$$\n", "\n", "It is reasonable to model $z$ as:\n", "\n", "$$\{z_i\} \sim \text{Bernoulli}(\psi)$$\n", "\n", "where $\psi$ is the probability of membership in group \"b\" (hence, $1-\psi$ is the probability of group \"a\" membership). Note that this generalizes to $k$ components in the mixture, where $z_i \sim \text{Multinomial}(\psi)$ with $\psi$ of dimension $k-1$.\n", "\n", "Clearly, the distribution of $x$ conditional on $z$ is:\n", "\n", "$$(x_i | z_i = j) \sim N(\mu_j, \sigma_j^2)$$\n", "\n", "If we knew the $\{z_i\}$, then we could simply use MLE to obtain estimates for the parameters of the model. However, we do not know the labels, which makes this a form of **unsupervised learning**." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Algorithm\n", "\n", "Continuing with the mixture of normals model as our example, we can apply the EM algorithm to estimate $\theta = \{\mu, \sigma, \psi\}$.\n", "\n", "> **Initialize** $\theta_0 = \{\mu_0, \sigma_0, \psi_0\}$\n", "> \n", "> **Repeat until convergence:**\n", "> \n", "> - **E-step**: guess the values of $\{z_i\}$\n", "> \n", "> Compute probabilities of group membership: $w_{ij} = P(z_i = j | x_i, \theta)$ for each group $j=1,\ldots,k$. This is done via Bayes' formula:\n", "> \n", "> $$P(z_i = j | x_i) = \frac{P(x_i | z_i=j) P(z_i=j)}{\sum_{l=1}^k P(x_i | z_i=l) P(z_i=l)}$$\n", "> \n", "> where $\theta$ has been dropped from the conditioning for notational convenience.\n", "> \n", "> - **M-step**: update estimates of the parameters $\theta$\n", "> \n", "> $$\begin{aligned}\psi_j &= \frac{1}{n} \sum_i w_{ij} \\\n", " \mu_j &= \frac{\sum_i w_{ij} x_i}{\sum_i w_{ij}} \\\n", " \sigma_j^2 &= \frac{\sum_i w_{ij}(x_i - \mu_j)^2}{\sum_i w_{ij}}\n", " \end{aligned}$$" ] },
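{ "cell_type": "markdown", "metadata": {}, "source": [ "As an illustration of how these two steps look in code, here is a minimal, vectorized sketch of a single EM update for a general $k$-component univariate normal mixture (the two-component case is implemented step by step below). The function `em_iteration` and its argument names are illustrative rather than part of the algorithm statement above; it assumes the simulated `x` and treats `psi` as a length-$k$ vector of mixing proportions." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from scipy.stats import norm\n", "\n", "def em_iteration(x, mu, sigma, psi):\n", "    # E-step: posterior membership weights w[i, j] = P(z_i = j | x_i, theta)\n", "    dens = np.array([p * norm.pdf(x, m, s) for p, m, s in zip(psi, mu, sigma)]).T\n", "    w = dens / dens.sum(axis=1, keepdims=True)\n", "\n", "    # M-step: weighted updates of mixing proportions, means, and variances\n", "    psi_new = w.mean(axis=0)\n", "    mu_new = (w * x[:, None]).sum(axis=0) / w.sum(axis=0)\n", "    sigma_new = np.sqrt((w * (x[:, None] - mu_new)**2).sum(axis=0) / w.sum(axis=0))\n", "    return mu_new, sigma_new, psi_new\n", "\n", "# One update from a rough starting guess (k=2 here)\n", "em_iteration(x, mu=np.array([1., 4.]), sigma=np.array([1., 1.]), psi=np.array([0.5, 0.5]))" ] },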
{ "cell_type": "markdown", "metadata": {}, "source": [ "### General formulation\n", "\n", "Recall [**Jensen's inequality**](http://mathworld.wolfram.com/JensensInequality.html):\n", "\n", "> Let $f$ be a convex function (*i.e.* $f^{\prime\prime} \ge 0$) of a random variable $X$. Then:\n", "> $f(E[X]) \le E[f(X)]$\n", "\n", "And when $f$ is *strictly* convex, then:\n", "\n", "$$E[f(X)] = f(E[X]) \iff X = E[X]$$\n", "\n", "with probability 1.\n", "\n", "Consider again the joint density $P(x,z|\theta)$, where only $x$ is observed. We want to maximize:\n", "\n", "$$\begin{aligned}\n", "l(x \,|\, \theta) &= \sum_i \log P(x_i \,|\, \theta) \\\n", "&= \sum_i \log \sum_{z_i} P(x_i, z_i \,|\, \theta)\n", "\end{aligned}$$\n", "\n", "However, this is difficult to maximize when the $\{z_i\}$ are unobserved, because of the sum inside the logarithm.\n", "\n", "The EM algorithm iteratively calculates *lower bounds on the likelihood* for the current values of the parameters, then *maximizes the lower bound* to update the parameters.\n", "\n", "Since $z_i$ is a random variable, perhaps we can construct a density $Q_i$ for it and use it to re-express the marginal likelihood:\n", "\n", "$$\sum_i \log \sum_{z_i} P(x_i, z_i \,|\, \theta) = \sum_i \log \sum_{z_i} Q_i(z_i) \frac{P(x_i, z_i \,|\, \theta)}{Q_i(z_i)}$$\n", "\n", "This turns the inner summation into an expectation with respect to $Q_i$:\n", "\n", "$$\sum_i \log \sum_{z_i} Q_i(z_i) \frac{P(x_i, z_i \,|\, \theta)}{Q_i(z_i)} = \sum_i \log E_{Q_i} \left[ \frac{P(x_i, z_i \,|\, \theta)}{Q_i(z_i)} \right]$$\n", "\n", "Now, if we apply Jensen's inequality (note that the logarithm is a *concave* function, so the inequality is reversed):\n", "\n", "$$\begin{aligned}\n", "\sum_i \log E_{Q_i} \left[ \frac{P(x_i, z_i \,|\, \theta)}{Q_i(z_i)} \right] &\ge \sum_i E_{Q_i} \log \left[ \frac{P(x_i, z_i \,|\, \theta)}{Q_i(z_i)} \right] \\\n", "&= \sum_i \sum_{z_i} Q_i(z_i) \log \left[ \frac{P(x_i, z_i \,|\, \theta)}{Q_i(z_i)} \right]\n", "\end{aligned}$$\n", "\n", "We want this bound to be tight (to hold with equality) at the current value of $\theta$, which we can arrange by choosing $Q_i$ appropriately. Specifically, we want a $Q_i$ such that:\n", "\n", "$$\frac{P(x_i, z_i \,|\, \theta)}{Q_i(z_i)} = C$$\n", "\n", "a constant that does not depend on $z_i$, which implies:\n", "\n", "$$Q_i(z_i) \propto P(x_i, z_i \,|\, \theta)$$\n", "\n", "Since $Q_i$ is a density,\n", "\n", "$$\begin{aligned}\n", "Q_i(z_i) &= \frac{P(x_i, z_i \,|\, \theta)}{\sum_{z_i} P(x_i, z_i \,|\, \theta)} \\\n", "&= \frac{P(x_i, z_i \,|\, \theta)}{P(x_i \,|\, \theta)} \\\n", "&= P(z_i \,|\, x_i, \theta)\n", "\end{aligned}$$" ] },
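{ "cell_type": "markdown", "metadata": {}, "source": [ "As a quick numerical sanity check of this argument (illustrative only, using a made-up joint distribution for a single observation with binary $z$): any valid $Q$ gives a lower bound $\sum_z Q(z) \log\left[P(x,z)/Q(z)\right] \le \log P(x)$, and the posterior $Q(z) = P(z \,|\, x)$ attains it exactly." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Toy joint distribution P(x, z) for one observation x with binary z (arbitrary values)\n", "p_xz = np.array([0.10, 0.30])   # P(x, z=0), P(x, z=1)\n", "log_px = np.log(p_xz.sum())     # log P(x), the quantity being bounded\n", "\n", "def lower_bound(q):\n", "    # sum_z Q(z) log[P(x, z) / Q(z)]\n", "    return np.sum(q * np.log(p_xz / q))\n", "\n", "# An arbitrary Q gives a strict lower bound ...\n", "print(lower_bound(np.array([0.5, 0.5])), '<=', log_px)\n", "\n", "# ... while the posterior Q(z) = P(z | x) attains it exactly\n", "posterior = p_xz / p_xz.sum()\n", "print(lower_bound(posterior), '==', log_px)" ] },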
{ "cell_type": "markdown", "metadata": {}, "source": [ "Returning to our normal mixture example:\n", "\n", "For the **E-step** we need to identify $Q_i(z_i)$:\n", "\n", "$$Q_i(z_i) = P(z_i \,|\, x_i, \mu, \sigma, \psi)$$\n", "\n", "Via Bayes' formula:\n", "\n", "$$P(z_i=j \,|\, x_i) = \frac{P(x_i \,|\, z_i=j)P(z_i=j)}{\sum_l P(x_i \,|\, z_i=l)P(z_i=l)}$$\n", "\n", "where $P(x_i \,|\, z_i=j)$ is just the $j$th normal density of the mixture, and $P(z_i=j)$ is the corresponding mixing probability (Bernoulli here, multinomial in general).\n", "\n", "This gives us:\n", "\n", "$$P(z_i=1 \,|\, x_i) = \frac{\psi N(\mu_b, \sigma_b^2)}{(1-\psi) N(\mu_a, \sigma_a^2) + \psi N(\mu_b, \sigma_b^2)}$$\n", "\n", "(recall that we are encoding `a=0` and `b=1`, with $\psi$ the probability of membership in group `b`)\n", "\n", "This can be implemented easily in Python:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from scipy.stats.distributions import norm\n", "\n", "def e_step(x, mu, sigma, psi):\n", "    # Posterior probability of membership in group b (z=1)\n", "    a = (1. - psi) * norm.pdf(x, mu[0], sigma[0])\n", "    b = psi * norm.pdf(x, mu[1], sigma[1])\n", "    return b / (a + b)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "e_step(0.4, mu_true, sigma_true, psi_true)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x_range = np.linspace(-5, 5)\n", "plt.plot(x_range, e_step(x_range, mu_true, sigma_true, psi_true))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "For the **M-step** we need to maximize the lower bound with respect to the parameters:\n", "\n", "$$\begin{aligned}\n", "&\text{argmax}_{\mu, \sigma, \psi} \sum_i \sum_{z_i} Q_i(z_i) \log \left[ \frac{P(x_i, z_i \,|\, \theta)}{Q_i(z_i)} \right] \\\n", "&= \text{argmax}_{\mu, \sigma, \psi} \sum_i \sum_{j} w_{ij} \log \left[\frac{1}{\sqrt{2 \pi} \, \sigma_j \, w_{ij}} \exp\left(-\frac{(x_i - \mu_j)^2}{2 \sigma_j^2}\right) \psi_j\right]\n", "\end{aligned}$$\n", "\n", "Setting the derivative with respect to each parameter to zero (subject to $\sum_j \psi_j = 1$) yields the updates:\n", "\n", "$$\begin{aligned}\psi_j &= \frac{1}{n} \sum_i w_{ij} \\\n", "\mu_j &= \frac{\sum_i w_{ij} x_i}{\sum_i w_{ij}} \\\n", "\sigma_j^2 &= \frac{\sum_i w_{ij}(x_i - \mu_j)^2}{\sum_i w_{ij}}\n", "\end{aligned}$$\n", "\n", "This can be coded into Python as `m_step`:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def m_step(x, w):\n", "    # w contains the weights for group b; (1 - w) are the weights for group a\n", "    psi = np.mean(w)\n", "\n", "    mu = [np.sum((1 - w) * x) / np.sum(1 - w), np.sum(w * x) / np.sum(w)]\n", "\n", "    sigma = [np.sqrt(np.sum((1 - w) * (x - mu[0])**2) / np.sum(1 - w)),\n", "             np.sqrt(np.sum(w * (x - mu[1])**2) / np.sum(w))]\n", "\n", "    return mu, sigma, psi" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Initialize values\n", "mu = np.random.normal(size=2)\n", "sigma = np.random.uniform(0, 10, 2)\n", "psi = np.random.random()\n", "\n", "# Stopping criterion\n", "crit = 1e-4\n", "\n", "# Convergence flag\n", "converged = False\n", "\n", "# Loop until converged\n", "while not converged:\n", "\n", "    # E-step\n", "    w = e_step(x, mu, sigma, psi)\n", "    # M-step\n", "    mu_new, sigma_new, psi_new = m_step(x, w)\n", "\n", "    # Check convergence\n", "    converged = ((np.abs(psi_new - psi) < crit)\n", "                 & np.all(np.abs(np.array(mu_new) - np.array(mu)) < crit)\n", "                 & np.all(np.abs(np.array(sigma_new) - np.array(sigma)) < crit))\n", "    mu, sigma, psi = mu_new, sigma_new, psi_new\n", "\n", "print('A: N({0:.4f}, {1:.4f})\\nB: N({2:.4f}, {3:.4f})\\npsi: {4:.4f}'.format(\n", "    mu_new[0], sigma_new[0], mu_new[1], sigma_new[1], psi_new))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Example: Zero-inflated data\n", "\n", "Code the EM algorithm to estimate the parameters of a zero-inflated Poisson (ZIP) model.\n", "\n", "$$x \mid \theta, \psi \sim \left\{ \begin{array}{l}\n", "    \text{Poisson}(\theta) \text{ w.p. } \psi \\\n", "    0 \text{ w.p. } 1-\psi\n", "    \end{array} \right.$$\n", "\n", "$$f(x \mid \theta, \psi) = \left\{ \begin{array}{ll}\n", "    \psi \frac{e^{-\theta}\theta^x}{x!}, & \text{if } x=1,2,3,\ldots \\\n", "    (1-\psi) + \psi e^{-\theta}, & \text{if } x = 0\n", "    \end{array} \right.$$" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "# True parameter values\n", "mu_true = 1.5\n", "psi_true = .4\n", "n = 100\n", "\n", "# Simulate some data\n", "data = np.array([np.random.poisson(mu_true)*(np.random.random()