{ "cells": [ { "cell_type": "markdown", "metadata": { "toc": "true" }, "source": [ "# Table of Contents\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# EM Algorithm (KL Chapter 13)\n", "\n", "## Which statistical papers are most cited?\n", "\n", "| Paper | Citations (5/15/2017) | Per Year |\n", "|------------------------------------------|-----------:|----------:|\n", "| Kaplan-Meier ([Kaplan and Meier, 1958](https://scholar.google.com/scholar?q=Nonparametric+estimation+from+incomplete+observations&hl=en)) | 51361 | 871 |\n", "| EM algorithm ([Dempster et al., 1977](https://scholar.google.com/scholar?q=Maximum+likelihood+from+incomplete+data+via+the+EM+algorithm&hl=en)) | 50074 | **1252** |\n", "| Cox model ([Cox, 1972](https://scholar.google.com/scholar?q=Regression+models+and+life-tables&hl=en)) | 44995 | **1000** |\n", "| FDR ([Benjamini and Hochberg, 1995](https://scholar.google.com/scholar?q=Controlling+the+false+discovery+rate)) | 39323 | **1787** |\n", "| Metropolis algorithm ([Metropolis et al., 1953](https://scholar.google.com/scholar?q=Equation+of+state+calculations+by+fast+computing+machines)) | 34905 | 545 |\n", "| Unit root test ([Dickey and Fuller, 1979](https://scholar.google.com/scholar?q=Distribution+of+the+estimators+for+autoregressive+time+series+with+a+unit+root)) | 21406 | 563 |\n", "| lasso ([Tibshrani, 1996](https://scholar.google.com/scholar?q=Regression+shrinkage+and+selection+via+the+lasso)) | 19969 | 951 |\n", "| bootstrap ([Efron, 1979](https://scholar.google.com/scholar?q=Bootstrap+methods%3A+another+look+at+the+jackknife)) | 15033 | 396 |\n", "| FFT ([Cooley and Tukey, 1965](https://scholar.google.com/scholar?q=An+algorithm+for+the+machine+calculation+of+complex+Fourier+series)) | 12948 | 249 |\n", "| Gibbs sampler ([Gelfand and Smith, 1990](https://scholar.google.com/scholar?q=Sampling-based+approaches+to+calculating+marginal+densities)) | 7152 | 265 |\n", "\n", "* Citation counts from Google Scholar on May 15, 2017.\n", "\n", "* EM is one of the most influential statistical ideas, finding applications in various branches of science.\n", "\n", "* History: Landmark paper [Dempster et al., 1977](https://scholar.google.com/scholar?q=Maximum+likelihood+from+incomplete+data+via+the+EM+algorithm&hl=en) on EM algorithm. Same idea appears in parameter estimation in HMM (Baum-Welch algorithm) by [Baum et al., 1970](http://www.jstor.org/stable/2239727?seq=1#page_scan_tab_contents).\n", "\n", "## EM algorithm\n", "\n", "* Goal: maximize the log-likelihood of the observed data $\\ln g(\\mathbf{y}|\\theta)$ (optimization!)\n", "\n", "* Notations: \n", " * $\\mathbf{Y}$: observed data\n", " * $\\mathbf{Z}$: missing data\n", "\n", "* Idea: choose missing $\\mathbf{Z}$ such that MLE for the complete data is easy.\n", "\n", "* Let $f(\\mathbf{y},\\mathbf{z} | \\theta)$ be the density of complete data.\n", "\n", "* Iterative two-step procedure\n", " * **E step**: calculate the conditional expectation\n", "$$\n", "\tQ(\\theta|\\theta^{(t)}) = \\mathbf{E}_{\\mathbf{Z}|\\mathbf{Y}=\\mathbf{y},\\theta^{(t)}} [ \\ln f(\\mathbf{Y},\\mathbf{Z}|\\theta) \\mid \\mathbf{Y} = \\mathbf{y}, \\theta^{(t)}]\n", "$$\n", " * **M step**: maximize $Q(\\theta|\\theta^{(t)})$ to generate the next iterate \n", "$$\n", " \\theta^{(t+1)} = \\text{argmax}_{\\theta} \\, Q(\\theta|\\theta^{(t)})\n", "$$\n", "\n", "* (**Ascent property** of EM algorithm) By the information inequality,\n", "$$\n", "\\begin{eqnarray*}\n", "\t& & Q(\\theta \\mid \\theta^{(t)}) - \\ln g(\\mathbf{y}|\\theta) \\\\\n", "\t&=& \\mathbf{E} [\\ln f(\\mathbf{Y},\\mathbf{Z}|\\theta) | \\mathbf{Y} = \\mathbf{y}, \\theta^{(t)}] - \\ln g(\\mathbf{y}|\\theta) \\\\\n", "\t&=& \\mathbf{E} \\left\\{ \\ln \\left[ \\frac{f(\\mathbf{Y}, \\mathbf{Z} \\mid \\theta)}{g(\\mathbf{Y} \\mid \\theta)} \\right] \\mid \\mathbf{Y} = \\mathbf{y}, \\theta^{(t)} \\right\\} \\\\\n", "\t&\\le& \\mathbf{E} \\left\\{ \\ln \\left[ \\frac{f(\\mathbf{Y}, \\mathbf{Z} \\mid \\theta^{(t)})}{g(\\mathbf{Y} \\mid \\theta^{(t)})} \\right] \\mid \\mathbf{Y} = \\mathbf{y}, \\theta^{(t)} \\right\\} \\\\\n", "\t&=& Q(\\theta^{(t)} \\mid \\theta^{(t)}) - \\ln g(\\mathbf{y} |\\theta^{(t)}).\n", "\\end{eqnarray*}\n", "$$\n", "Rearranging shows that (**fundamental inequality of EM**)\n", "$$\n", "\\begin{eqnarray*}\n", "\t\\ln g(\\mathbf{y} \\mid \\theta) \\ge Q(\\theta \\mid \\theta^{(t)}) - Q(\\theta^{(t)} \\mid \\theta^{(t)}) + \\ln g(\\mathbf{y} \\mid \\theta^{(t)})\n", "\\end{eqnarray*}\n", "$$\n", "for all $\\theta$; in particular\n", "$$\n", "\\begin{eqnarray*}\n", "\t\\ln g(\\mathbf{y} \\mid \\theta^{(t+1)}) &\\ge& Q(\\theta^{(t+1)} \\mid \\theta^{(t)}) - Q(\\theta^{(t)} \\mid \\theta^{(t)}) + \\ln g(\\mathbf{y} \\mid \\theta^{(t)}) \\\\\n", "\t&\\ge& \\ln g(\\mathbf{y} \\mid \\theta^{(t)}).\n", "\\end{eqnarray*}\n", "$$\n", "Obviously we only need \n", "$$\n", " Q(\\theta^{(t+1)} \\mid \\theta^{(t)}) - Q(\\theta^{(t)} \\mid \\theta^{(t)}) \\ge 0\n", "$$\n", "for this ascent property to hold (**generalized EM**).\n", "\n", "* Intuition? Keep these pictures in mind\n", "\n", "\n", "\n", "\n", "\n", "* Under mild conditions, $\\theta^{(t)}$ converges to a stationary point of $\\ln g(\\mathbf{y} \\theta)$.\n", "\n", "* Numerous applications of EM: \n", "finite mixture model, HMM (Baum-Welch algorithm), factor analysis, variance components model aka linear mixed model (LMM), hyper-parameter estimation in empirical Bayes procedure $\\max_{\\alpha} \\int f(\\mathbf{y}|\\theta) \\pi(\\theta|\\alpha) \\, d \\theta$, missing data, group/censorized/truncated model, ...\n", "\n", "See Chapters 13 of [Numerical Analysis for Statisticians](http://ucla.worldcat.org/title/numerical-analysis-for-statisticians/oclc/793808354&referer=brief_results) by Kenneth Lange (2010) for more applications of EM." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## A canonical EM example: finite mixture models\n", "\n", "\n", "\n", "* Consider Gaussian finite mixture models with density \n", "$$\n", "h(\\mathbf{y}) = \\sum_{j=1}^k \\pi_j h_j(\\mathbf{y} \\mid \\mu_j, \\Omega_j), \\quad \\mathbf{y} \\in \\mathbf{R}^{d},\n", "$$\n", "where\n", "$$\n", "\th_j(\\mathbf{y} \\mid \\mu_j, \\Omega_j) = \\left( \\frac{1}{2\\pi} \\right)^{d/2} |\\det(\\Omega_j)|^{-1/2} e^{- \\frac 12 (\\mathbf{y} - \\mu_j)^T \\Omega_j^{-1} (\\mathbf{y} - \\mu_j)}\n", "$$ \n", "are multivariate normals $N_d(\\mu_j, \\Omega_j)$. \n", "\n", "* Given data $\\mathbf{y}_1,\\ldots,\\mathbf{y}_n$, want to estimate parameters\n", "$$\n", "\\theta = (\\pi_1, \\ldots, \\pi_k, \\mu_1, \\ldots, \\mu_k, \\Omega_1,\\ldots,\\Omega_k)\n", "$$\n", "subject to constraint $\\pi_j \\ge 0, \\sum_{j=1}^d \\pi_j = 1, \\Omega_j \\succeq \\mathbf{0}$. \n", "\n", "* (Incomplete) data log-likelihood is \n", "$$\n", "\\ln g(\\mathbf{y}_1,\\ldots,\\mathbf{y}_n|\\theta) = \\sum_{i=1}^n \\ln h(\\mathbf{y}_i) = \\sum_{i=1}^n \\ln \\sum_{j=1}^k \\pi_j h_j(\\mathbf{y}_i \\mid \\mu_j, \\Omega_j).\n", "$$\n", "\n", "* Let $z_{ij} = I \\{\\mathbf{y}_i$ comes from group $j \\}$. Complete data likelihood is\n", "$$\n", "\\begin{eqnarray*}\n", "\tf(\\mathbf{y},\\mathbf{z} | \\theta) = \\prod_{i=1}^n \\prod_{j=1}^k [\\pi_j h_j(\\mathbf{y}_i|\\mu_j,\\Omega_j)]^{z_{ij}}\n", "\\end{eqnarray*}\n", "$$\n", "and thus complete log-likelihood is\n", "$$\n", "\\begin{eqnarray*}\n", "\t\\ln f(\\mathbf{y},\\mathbf{z} | \\theta) = \\sum_{i=1}^n \\sum_{j=1}^k z_{ij} [\\ln \\pi_j + \\ln h_j(\\mathbf{y}_i|\\mu_j,\\Omega_j)].\n", "\\end{eqnarray*}\n", "$$\n", "\n", "* **E step**: need to evaluate conditional expectation\n", "$$\n", "\\begin{eqnarray*}\n", "\t& & Q(\\theta|\\theta^{(t)})\n", "\t= \\mathbf{E} \\left\\{ \\sum_{i=1}^n \\sum_{j=1}^k z_{ij} [\\ln \\pi_j + \\ln h_j(\\mathbf{y}_i|\\mu_j,\\Omega_j)] \\mid \\mathbf{Y}=\\mathbf{y}, \\pi^{(t)}, \\mu_1^{(t)}, \\ldots, \\mu_k^{(t)}, \\Omega_1^{(t)}, \\ldots, \\Omega_k^{(t)} ] \\right\\}.\n", "\\end{eqnarray*}\n", "$$\n", "By Bayes rule, we have\n", "$$\n", "\\begin{eqnarray*}\n", "\tw_{ij}^{(t)} &:=& \\mathbf{E} [z_{ij} \\mid \\mathbf{y}, \\pi^{(t)}, \\mu_1^{(t)}, \\ldots, \\mu_k^{(t)}, \\Omega_1^{(t)}, \\ldots, \\Omega_k^{(t)}] \\\\\n", "\t&=& \\frac{\\pi_j^{(t)} h_j(\\mathbf{y}_i|\\mu_j^{(t)},\\Omega_j^{(t)})}{\\sum_{j'=1}^k \\pi_{j'}^{(t)} h_{j'}(\\mathbf{y}_i|\\mu_{j'}^{(t)},\\Omega_{j'}^{(t)})}.\n", "\\end{eqnarray*}\n", "$$\n", "So the Q function becomes\n", "$$\n", "\\begin{eqnarray*}\n", "\t& & Q(\\theta|\\theta^{(t)})\n", "\t= \\sum_{i=1}^n \\sum_{j=1}^k w_{ij}^{(t)} \\ln \\pi_j + \\sum_{i=1}^n \\sum_{j=1}^k w_{ij}^{(t)} \\left[ - \\frac{1}{2} \\ln \\det \\Omega_j - \\frac{1}{2} (\\mathbf{y}_i - \\mu_j)^T \\Omega_j^{-1} (\\mathbf{y}_i - \\mu_j) \\right].\n", "\\end{eqnarray*}\n", "$$\n", "\n", "* **M step**: maximizer of the Q function gives the next iterate\n", "$$\n", "\\begin{eqnarray*}\n", "\t\\pi_j^{(t+1)} &=& \\frac{\\sum_i w_{ij}^{(t)}}{n} \\\\\n", "\t\\mu_j^{(t+1)} &=& \\frac{\\sum_{i=1}^n w_{ij}^{(t)} \\mathbf{y}_i}{\\sum_{i=1}^n w_{ij}^{(t)}} \\\\\n", "\t\\Omega_j^{(t+1)} &=& \\frac{\\sum_{i=1}^n w_{ij}^{(t)} (\\mathbf{y}_i - \\mu_j^{(t+1)}) (\\mathbf{y}_i - \\mu_j^{(t+1)})^T}{\\sum_i w_{ij}^{(t)}}.\n", "\\end{eqnarray*}\n", "$$\n", "See KL Example 11.3.1 for multinomial MLE. See KL Example 11.2.3 for multivariate normal MLE.\n", "\n", "* Compare these extremely simple updates to Newton type algorithms!\n", "\n", "* Also note the ease of parallel computing with this EM algorithm. See, e.g., \n", "**Suchard, M. A.**; Wang, Q.; Chan, C.; Frelinger, J.; Cron, A.; West, M. Understanding GPU programming for statistical computation: studies in massively parallel massive mixtures. _Journal of Computational and Graphical Statistics_, 2010, 19, 419-438." ] } ], "metadata": { "kernelspec": { "display_name": "Julia 0.6.2", "language": "julia", "name": "julia-0.6" }, "language_info": { "file_extension": ".jl", "mimetype": "application/julia", "name": "julia", "version": "0.6.3" }, "toc": { "colors": { "hover_highlight": "#DAA520", "running_highlight": "#FF0000", "selected_highlight": "#FFD700" }, "moveMenuLeft": true, "nav_menu": { "height": "85px", "width": "252px" }, "navigate_menu": true, "number_sections": true, "sideBar": true, "threshold": 4, "toc_cell": true, "toc_section_display": "block", "toc_window_display": false, "widenNotebook": false } }, "nbformat": 4, "nbformat_minor": 2 }