{ "cells": [ { "cell_type": "markdown", "metadata": { "toc": "true" }, "source": [ "# Table of Contents\n", "

1  EM Algorithm (KL Chapter 13)
1.1  Which statistical papers are most cited?
1.2  EM algorithm
1.3  A canonical EM example: finite mixture models
1.4  Generalizations of EM - handle difficulties in M step
1.4.1  Expectation Conditional Maximization (ECM)
1.4.2  ECM Either (ECME)
1.4.3  Alternating ECM (AECM)
1.4.4  PX-EM and efficient data augmentation
1.4.5  Example: t distribution
1.4.5.1  EM
1.4.5.2  ECM and ECME
1.4.5.3  Efficient data augmentation
1.5  Generalizations of EM - handle difficulties in E step
1.5.1  Monte Carlo EM (MCEM)
1.5.2  Stochastic EM (SEM) and Simulated Annealing EM (SAEM)
1.5.3  Data-augmentation (DA) algorithm
1.6  EM as a maximization-maximization procedure
1.7  Incremental EM
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# EM Algorithm (KL Chapter 13)\n", "\n", "## Which statistical papers are most cited?\n", "\n", "| Paper | Citations (5/20/2019) | Per Year |\n", "|------------------------------------------|-----------:|----------:|\n", "| Kaplan-Meier ([Kaplan and Meier, 1958](https://scholar.google.com/scholar?q=Nonparametric+estimation+from+incomplete+observations&hl=en)) | 55672 | 913 |\n", "| EM algorithm ([Dempster et al., 1977](https://scholar.google.com/scholar?q=Maximum+likelihood+from+incomplete+data+via+the+EM+algorithm&hl=en)) | 55195 | **1314** |\n", "| FDR ([Benjamini and Hochberg, 1995](https://scholar.google.com/scholar?q=Controlling+the+false+discovery+rate)) | 54787 | **2283** |\n", "| Cox model ([Cox, 1972](https://scholar.google.com/scholar?q=Regression+models+and+life-tables&hl=en)) | 49275 | 1048 |\n", "| Metropolis algorithm ([Metropolis et al., 1953](https://scholar.google.com/scholar?q=Equation+of+state+calculations+by+fast+computing+machines)) | 39181 | 594 |\n", "| lasso ([Tibshrani, 1996](https://scholar.google.com/scholar?q=Regression+shrinkage+and+selection+via+the+lasso)) | 28074 | **1221** |\n", "| Unit root test ([Dickey and Fuller, 1979](https://scholar.google.com/scholar?q=Distribution+of+the+estimators+for+autoregressive+time+series+with+a+unit+root)) | 24947 | 624 |\n", "| bootstrap ([Efron, 1979](https://scholar.google.com/scholar?q=Bootstrap+methods%3A+another+look+at+the+jackknife)) | 17626 | 441 |\n", "| FFT ([Cooley and Tukey, 1965](https://scholar.google.com/scholar?q=An+algorithm+for+the+machine+calculation+of+complex+Fourier+series)) | 14635 | 271 |\n", "| Gibbs sampler ([Gelfand and Smith, 1990](https://scholar.google.com/scholar?q=Sampling-based+approaches+to+calculating+marginal+densities)) | 7884 | 272 |\n", "\n", "* Citation counts from Google Scholar on May 20, 2019.\n", "\n", "* EM is one of the most influential statistical ideas, finding applications in various branches of science.\n", "\n", "* History: Landmark paper [Dempster et al., 1977](https://scholar.google.com/scholar?q=Maximum+likelihood+from+incomplete+data+via+the+EM+algorithm&hl=en) on EM algorithm. 
"The same idea appears earlier in parameter estimation for HMMs (the Baum-Welch algorithm) by [Baum et al., 1970](http://www.jstor.org/stable/2239727?seq=1#page_scan_tab_contents).\n", "\n", "## EM algorithm\n", "\n", "* Goal: maximize the log-likelihood of the observed data $\\ln g(\\mathbf{y}|\\theta)$ (optimization!)\n", "\n", "* Notation: \n", " * $\\mathbf{Y}$: observed data\n", " * $\\mathbf{Z}$: missing data\n", "\n", "* Idea: choose the missing data $\\mathbf{Z}$ such that the MLE for the complete data is easy.\n", "\n", "* Let $f(\\mathbf{y},\\mathbf{z} | \\theta)$ be the density of the complete data.\n", "\n", "* Iterative two-step procedure (a minimal Julia sketch of this loop is given at the end of this section):\n", " * **E step**: calculate the conditional expectation\n", "$$\n", "\tQ(\\theta|\\theta^{(t)}) = \\mathbf{E}_{\\mathbf{Z}|\\mathbf{Y}=\\mathbf{y},\\theta^{(t)}} [ \\ln f(\\mathbf{Y},\\mathbf{Z}|\\theta) \\mid \\mathbf{Y} = \\mathbf{y}, \\theta^{(t)}]\n", "$$\n", " * **M step**: maximize $Q(\\theta|\\theta^{(t)})$ to generate the next iterate \n", "$$\n", " \\theta^{(t+1)} = \\text{argmax}_{\\theta} \\, Q(\\theta|\\theta^{(t)})\n", "$$\n", "\n", "* (**Ascent property** of the EM algorithm) By the information inequality,\n", "$$\n", "\\begin{eqnarray*}\n", "\t& & Q(\\theta \\mid \\theta^{(t)}) - \\ln g(\\mathbf{y}|\\theta) \\\\\n", "\t&=& \\mathbf{E} [\\ln f(\\mathbf{Y},\\mathbf{Z}|\\theta) | \\mathbf{Y} = \\mathbf{y}, \\theta^{(t)}] - \\ln g(\\mathbf{y}|\\theta) \\\\\n", "\t&=& \\mathbf{E} \\left\\{ \\ln \\left[ \\frac{f(\\mathbf{Y}, \\mathbf{Z} \\mid \\theta)}{g(\\mathbf{Y} \\mid \\theta)} \\right] \\mid \\mathbf{Y} = \\mathbf{y}, \\theta^{(t)} \\right\\} \\\\\n", "\t&\\le& \\mathbf{E} \\left\\{ \\ln \\left[ \\frac{f(\\mathbf{Y}, \\mathbf{Z} \\mid \\theta^{(t)})}{g(\\mathbf{Y} \\mid \\theta^{(t)})} \\right] \\mid \\mathbf{Y} = \\mathbf{y}, \\theta^{(t)} \\right\\} \\\\\n", "\t&=& Q(\\theta^{(t)} \\mid \\theta^{(t)}) - \\ln g(\\mathbf{y} |\\theta^{(t)}).\n", "\\end{eqnarray*}\n", "$$\n", "Rearranging gives the **fundamental inequality of EM**\n", "$$\n", "\\begin{eqnarray*}\n", "\t\\ln g(\\mathbf{y} \\mid \\theta) \\ge Q(\\theta \\mid \\theta^{(t)}) - Q(\\theta^{(t)} \\mid \\theta^{(t)}) + \\ln g(\\mathbf{y} \\mid \\theta^{(t)})\n", "\\end{eqnarray*}\n", "$$\n", "for all $\\theta$; in particular\n", "$$\n", "\\begin{eqnarray*}\n", "\t\\ln g(\\mathbf{y} \\mid \\theta^{(t+1)}) &\\ge& Q(\\theta^{(t+1)} \\mid \\theta^{(t)}) - Q(\\theta^{(t)} \\mid \\theta^{(t)}) + \\ln g(\\mathbf{y} \\mid \\theta^{(t)}) \\\\\n", "\t&\\ge& \\ln g(\\mathbf{y} \\mid \\theta^{(t)}).\n", "\\end{eqnarray*}\n", "$$\n", "Note that we only need \n", "$$\n", " Q(\\theta^{(t+1)} \\mid \\theta^{(t)}) - Q(\\theta^{(t)} \\mid \\theta^{(t)}) \\ge 0\n", "$$\n", "for this ascent property to hold (**generalized EM**).\n", "\n", "* Intuition? Keep these pictures in mind\n", "\n", "\n", "\n", "\n", "\n", "* Under mild conditions, $\\theta^{(t)}$ converges to a stationary point of $\\ln g(\\mathbf{y} \\mid \\theta)$.\n", "\n", "* Numerous applications of EM: \n", "finite mixture models, HMM (the Baum-Welch algorithm), factor analysis, variance components models a.k.a. linear mixed models (LMM), hyperparameter estimation in the empirical Bayes procedure $\\max_{\\alpha} \\int f(\\mathbf{y}|\\theta) \\pi(\\theta|\\alpha) \\, d \\theta$, missing data, grouped/censored/truncated models, ...\n", "\n", "See Chapter 13 of [Numerical Analysis for Statisticians](http://ucla.worldcat.org/title/numerical-analysis-for-statisticians/oclc/793808354&referer=brief_results) by Kenneth Lange (2010) for more applications of EM."
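, "\n", "\n", "As a minimal illustration (not part of KL), the generic EM iteration can be written as a short Julia skeleton. The callbacks `estep`, `mstep`, and `loglik` below are hypothetical, model-specific functions: `estep` returns whatever conditional expectations the M step needs, `mstep` returns the maximizer of $Q(\\theta \\mid \\theta^{(t)})$, and `loglik` evaluates $\\ln g(\\mathbf{y} \\mid \\theta)$ for monitoring the ascent property.\n", "\n", "```julia\n", "# Generic EM skeleton (sketch only); estep, mstep, loglik are user-supplied callbacks:\n", "#   estep(θ)     -> conditional expectations (sufficient statistics) needed by the M step\n", "#   mstep(stats) -> new θ maximizing Q(θ | θ(t)), i.e. the complete-data MLE from stats\n", "#   loglik(θ)    -> observed-data log-likelihood ln g(y | θ), used only for monitoring\n", "function em(θ, estep, mstep, loglik; maxiter = 500, tol = 1e-8)\n", "    obj = loglik(θ)\n", "    for t in 1:maxiter\n", "        θ = mstep(estep(θ))              # E step followed by M step\n", "        objnew = loglik(θ)\n", "        @assert objnew > obj - 1e-10     # ascent property (up to roundoff)\n", "        converged = abs(objnew - obj) < tol * (abs(obj) + 1)\n", "        obj = objnew\n", "        converged && break\n", "    end\n", "    return θ, obj\n", "end\n", "```\n", "\n", "For the finite mixture model of the next section, for example, `estep` would compute the membership weights $w_{ij}^{(t)}$ and `mstep` the weighted multinomial/normal MLEs."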
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## A canonical EM example: finite mixture models\n", "\n", "\n", "\n", "* Consider Gaussian finite mixture models with density \n", "$$\n", "h(\\mathbf{y}) = \\sum_{j=1}^k \\pi_j h_j(\\mathbf{y} \\mid \\mu_j, \\Omega_j), \\quad \\mathbf{y} \\in \\mathbf{R}^{d},\n", "$$\n", "where\n", "$$\n", "\th_j(\\mathbf{y} \\mid \\mu_j, \\Omega_j) = \\left( \\frac{1}{2\\pi} \\right)^{d/2} \\det(\\Omega_j)^{-1/2} e^{- \\frac 12 (\\mathbf{y} - \\mu_j)^T \\Omega_j^{-1} (\\mathbf{y} - \\mu_j)}\n", "$$ \n", "are densities of multivariate normals $N_d(\\mu_j, \\Omega_j)$. \n", "\n", "* Given iid data points $\\mathbf{y}_1,\\ldots,\\mathbf{y}_n$, want to estimate parameters\n", "$$\n", "\\theta = (\\pi_1, \\ldots, \\pi_k, \\mu_1, \\ldots, \\mu_k, \\Omega_1,\\ldots,\\Omega_k)\n", "$$\n", "subject to constraint $\\pi_j \\ge 0, \\sum_{j=1}^d \\pi_j = 1, \\Omega_j \\succeq \\mathbf{0}$. \n", "\n", "* (Incomplete) data log-likelihood is \n", "\\begin{eqnarray*}\n", " \\ln g(\\mathbf{y}_1,\\ldots,\\mathbf{y}_n|\\theta) &=& \\sum_{i=1}^n \\ln h(\\mathbf{y}_i) = \\sum_{i=1}^n \\ln \\sum_{j=1}^k \\pi_j h_j(\\mathbf{y}_i \\mid \\mu_j, \\Omega_j) \\\\\n", " &=& \\sum_{i=1}^n \\ln \\left[ \\sum_{j=1}^k \\pi_j \\left( \\frac{1}{2\\pi} \\right)^{d/2} \\det(\\Omega_j)^{-1/2} e^{- \\frac 12 (\\mathbf{y}_i - \\mu_j)^T \\Omega_j^{-1} (\\mathbf{y}_i - \\mu_j)} \\right].\n", "\\end{eqnarray*}\n", "\n", "* Let $z_{ij} = I \\{\\mathbf{y}_i$ comes from group $j \\}$ be the missing data. The complete data likelihood is\n", "$$\n", "\\begin{eqnarray*}\n", "\tf(\\mathbf{y},\\mathbf{z} | \\theta) = \\prod_{i=1}^n \\prod_{j=1}^k [\\pi_j h_j(\\mathbf{y}_i|\\mu_j,\\Omega_j)]^{z_{ij}}\n", "\\end{eqnarray*}\n", "$$\n", "and thus the complete log-likelihood is\n", "$$\n", "\\begin{eqnarray*}\n", "\t\\ln f(\\mathbf{y},\\mathbf{z} | \\theta) = \\sum_{i=1}^n \\sum_{j=1}^k z_{ij} [\\ln \\pi_j + \\ln h_j(\\mathbf{y}_i|\\mu_j,\\Omega_j)].\n", "\\end{eqnarray*}\n", "$$\n", "\n", "* **E step**: need to evaluate conditional expectation\n", "$$\n", "\\begin{eqnarray*}\n", "\t& & Q(\\theta|\\theta^{(t)})\n", "\t= \\mathbf{E} \\left\\{ \\sum_{i=1}^n \\sum_{j=1}^k z_{ij} [\\ln \\pi_j + \\ln h_j(\\mathbf{y}_i|\\mu_j,\\Omega_j)] \\mid \\mathbf{Y}=\\mathbf{y}, \\pi^{(t)}, \\mu_1^{(t)}, \\ldots, \\mu_k^{(t)}, \\Omega_1^{(t)}, \\ldots, \\Omega_k^{(t)} ] \\right\\}.\n", "\\end{eqnarray*}\n", "$$\n", "By Bayes rule, we have\n", "$$\n", "\\begin{eqnarray*}\n", "\tw_{ij}^{(t)} &:=& \\mathbf{E} [z_{ij} \\mid \\mathbf{y}, \\pi^{(t)}, \\mu_1^{(t)}, \\ldots, \\mu_k^{(t)}, \\Omega_1^{(t)}, \\ldots, \\Omega_k^{(t)}] \\\\\n", "\t&=& \\frac{\\pi_j^{(t)} h_j(\\mathbf{y}_i|\\mu_j^{(t)},\\Omega_j^{(t)})}{\\sum_{j'=1}^k \\pi_{j'}^{(t)} h_{j'}(\\mathbf{y}_i|\\mu_{j'}^{(t)},\\Omega_{j'}^{(t)})}.\n", "\\end{eqnarray*}\n", "$$\n", "So the Q function becomes\n", "$$\n", "\\begin{eqnarray*}\n", "\t& & Q(\\theta|\\theta^{(t)})\n", "\t= \\sum_{i=1}^n \\sum_{j=1}^k w_{ij}^{(t)} \\ln \\pi_j + \\sum_{i=1}^n \\sum_{j=1}^k w_{ij}^{(t)} \\left[ - \\frac{1}{2} \\ln \\det \\Omega_j - \\frac{1}{2} (\\mathbf{y}_i - \\mu_j)^T \\Omega_j^{-1} (\\mathbf{y}_i - \\mu_j) \\right].\n", "\\end{eqnarray*}\n", "$$\n", "\n", "* **M step**: maximizer of the Q function gives the next iterate\n", "$$\n", "\\begin{eqnarray*}\n", "\t\\pi_j^{(t+1)} &=& \\frac{\\sum_i w_{ij}^{(t)}}{n} \\\\\n", "\t\\mu_j^{(t+1)} &=& \\frac{\\sum_{i=1}^n w_{ij}^{(t)} \\mathbf{y}_i}{\\sum_{i=1}^n w_{ij}^{(t)}} \\\\\n", "\t\\Omega_j^{(t+1)} &=& \\frac{\\sum_{i=1}^n 
"See KL Example 11.3.1 for the multinomial MLE and KL Example 11.2.3 for the multivariate normal MLE. A Julia sketch of one E/M iteration for this model is given just before the $t$-distribution example below.\n", "\n", "* Compare these extremely simple updates to Newton-type algorithms!\n", "\n", "* Also note the ease of parallel computing with this EM algorithm. See, e.g., \n", "**Suchard, M. A.**; Wang, Q.; Chan, C.; Frelinger, J.; Cron, A.; West, M. Understanding GPU programming for statistical computation: studies in massively parallel massive mixtures. _Journal of Computational and Graphical Statistics_, 2010, 19, 419-438." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generalizations of EM - handle difficulties in M step\n", "\n", "### Expectation Conditional Maximization (ECM) \n", "\n", "* [Meng and Rubin (1993)](https://scholar.google.com/scholar?q=Maximum+likelihood+estimation+via+the+ECM+algorithm:+a+general+framework&hl=en&as_sdt=0&as_vis=1&oi=scholart).\n", "\n", "* In some problems the M step is difficult (no analytic solution).\n", "\n", "* Conditional maximization is easy (block ascent).\n", " * Partition the parameter vector into blocks $\\theta = (\\theta_1, \\ldots, \\theta_B)$. \n", " * Alternately update $\\theta_b$, $b=1,\\ldots,B$.\n", "\n", "\n", "\n", "* Ascent property still holds. Why?\n", "\n", "* ECM may converge more slowly than EM (more iterations), but the total computing time may be shorter because each CM step is cheap." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### ECM Either (ECME)\n", "\n", "* [Liu and Rubin (1994)](https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&as_vis=1&q=The+ECME+algorithm%3A+a+simple+extension+of+EM+and+ECM+with+faster+monotone+convergence&btnG=).\n", "\n", "* Each CM step maximizes either the $Q$ function or the original observed-data log-likelihood.\n", "\n", "* Ascent property still holds. Why?\n", "\n", "* Faster convergence than ECM." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Alternating ECM (AECM)\n", "\n", "* [Meng and van Dyk (1997)](https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=The+EM+algorithm+an+old+folk-song+sung+to+a+fast+new+tune&btnG=).\n", "\n", "* The specification of the complete data is allowed to differ at each CM step.\n", "\n", "* Ascent property still holds. Why?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### PX-EM and efficient data augmentation\n", "\n", "* Parameter-Expanded EM. [Liu, Rubin, and Wu (1998)](https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=Parameter+expansion+to+accelerate+EM%3A+the+px-em+algorithm&btnG=), [Liu and Wu (1999)](https://scholar.google.com/scholar?cluster=15342818142955984734&hl=en&as_sdt=0,5).\n", "\n", "* Efficient data augmentation. [Meng and van Dyk (1997)](https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=The+EM+algorithm+an+old+folk-song+sung+to+a+fast+new+tune&btnG=).\n", "\n", "* Idea: speed up the convergence of the EM algorithm by augmenting the observed data efficiently, i.e., by introducing a working parameter into the specification of the complete data."
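, "\n", "\n", "Before turning to the $t$-distribution example, here is a minimal Julia sketch (not part of KL) of one EM iteration for the Gaussian finite mixture model of the canonical example above: the E step computes the membership weights $w_{ij}^{(t)}$ by Bayes' rule and the M step applies the closed-form updates for $(\\pi_j, \\mu_j, \\Omega_j)$. The function name `gmm_em_step` and the data layout (a $d \\times n$ matrix `Y`) are illustrative choices, not a fixed API.\n", "\n", "```julia\n", "using LinearAlgebra\n", "\n", "# One EM iteration for a d-variate Gaussian mixture with k components (sketch).\n", "# Y is a d×n data matrix; prob, mu, Om hold the current mixing proportions,\n", "# mean vectors, and covariance matrices. Returns the updated (prob, mu, Om).\n", "function gmm_em_step(Y, prob, mu, Om)\n", "    d, n = size(Y)\n", "    k = length(prob)\n", "    W = zeros(n, k)                        # membership weights w_ij\n", "    for j in 1:k\n", "        C = cholesky(Symmetric(Om[j]))\n", "        for i in 1:n\n", "            r = C.L \\ (Y[:, i] - mu[j])    # whitened residual\n", "            logh = -0.5 * d * log(2π) - sum(log, diag(C.L)) - 0.5 * dot(r, r)\n", "            W[i, j] = prob[j] * exp(logh)  # π_j h_j(y_i | μ_j, Ω_j)\n", "        end\n", "    end\n", "    W ./= sum(W, dims = 2)                 # E step: normalize over components\n", "    probnew = vec(sum(W, dims = 1)) ./ n   # M step: mixing proportions\n", "    munew = [Y * W[:, j] ./ sum(W[:, j]) for j in 1:k]\n", "    Omnew = [((Y .- munew[j]) .* W[:, j]') * (Y .- munew[j])' ./ sum(W[:, j]) for j in 1:k]\n", "    return probnew, munew, Omnew\n", "end\n", "```\n", "\n", "Iterating `gmm_em_step` from any valid starting value gives the monotone increase of the observed-data log-likelihood guaranteed by the ascent property; in a production implementation the weights would be computed on the log scale (log-sum-exp) to avoid underflow."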
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Example: t distribution\n", "\n", "#### EM\n", "\n", "* $\\mathbf{W} \\in \\mathbb{R}^{p}$ is a multivariate $t$-distribution $t_p(\\mu,\\Sigma,\\nu)$ if $\\mathbf{W} \\sim N(\\mu, \\Sigma/u)$ and $u \\sim \\text{gamma}\\left(\\frac{\\nu}{2}, \\frac{\\nu}{2}\\right)$.\n", "\n", "* Widely used for robust modeling.\n", "\n", "* Recall Gamma($\\alpha,\\beta$) has density\n", "\\begin{eqnarray*}\n", "\tf(u \\mid \\alpha,\\beta) = \\frac{\\beta^{\\alpha} u^{\\alpha-1}}{\\Gamma(\\alpha)} e^{-\\beta u}, \\quad u \\ge 0,\n", "\\end{eqnarray*}\n", "with mean $\\mathbf{E}(U)=\\alpha/\\beta$, and $\\mathbf{E}(\\log U) = \\psi(\\alpha) - \\log \\beta$.\n", "\n", "* Density of $\\mathbf{W}$ is\n", "\\begin{eqnarray*}\n", "\tf_p(\\mathbf{w} \\mid \\mu, \\Sigma, \\nu) = \\frac{\\Gamma \\left(\\frac{\\nu+p}{2}\\right) \\text{det}(\\Sigma)^{-1/2}}{(\\pi \\nu)^{p/2} \\Gamma(\\nu/2) [1 + \\delta (\\mathbf{w},\\mu;\\Sigma)/\\nu]^{(\\nu+p)/2}},\n", "\\end{eqnarray*}\n", "where\n", "\\begin{eqnarray*}\n", " \\delta (\\mathbf{w},\\mu;\\Sigma) = (\\mathbf{w} - \\mu)^T \\Sigma^{-1} (\\mathbf{w} - \\mu)\n", "\\end{eqnarray*}\n", "is the Mahalanobis squared distance between $\\mathbf{w}$ and $\\mu$.\n", "\n", "* Given iid data $\\mathbf{w}_1, \\ldots, \\mathbf{w}_n$, the log-likelihood is\n", "\\begin{eqnarray*}\n", "\tL(\\mu, \\Sigma, \\nu) &=& - \\frac{np}{2} \\log (\\pi \\nu) + n \\left[ \\log \\Gamma \\left(\\frac{\\nu+p}{2}\\right) - \\log \\Gamma \\left(\\frac{\\nu}{2}\\right) \\right]\t\\\\\n", "\t&& - \\frac n2 \\log \\det \\Sigma + \\frac n2 (\\nu + p) \\log \\nu \\\\\n", "\t&& - \\frac{\\nu + p}{2} \\sum_{j=1}^n \\log [\\nu + \\delta (\\mathbf{w}_j,\\mu;\\Sigma) ]\n", "\\end{eqnarray*}\n", "How to compute the MLE $(\\hat \\mu, \\hat \\Sigma, \\hat \\nu)$?\n", "\n", "* $\\mathbf{W}_j | u_j$ independent $N(\\mu,\\Sigma/u_j)$ and $U_j$ iid gamma $\\left(\\frac{\\nu}{2},\\frac{\\nu}{2}\\right)$.\n", "\n", "* Missing data: $\\mathbf{z}=(u_1,\\ldots,u_n)^T$.\n", "\n", "* Log-likelihood of complete data is\n", "\\begin{eqnarray*}\n", "\tL_c (\\mu, \\Sigma, \\nu) &=& - \\frac{np}{2} \\ln (2\\pi) - \\frac{n}{2} \\log \\det \\Sigma \\\\\n", "\t& & - \\frac 12 \\sum_{j=1}^n u_j (\\mathbf{w}_j - \\mu)^T \\Sigma^{-1} (\\mathbf{w}_j - \\mu) \\\\\n", "\t& & - n \\log \\Gamma \\left(\\frac{\\nu}{2}\\right) + \\frac{n \\nu}{2} \\log \\left(\\frac{\\nu}{2}\\right) \\\\\n", "\t& & + \\frac{\\nu}{2} \\sum_{j=1}^n (\\log u_j - u_j) - \\sum_{j=1}^n \\log u_j.\n", "\\end{eqnarray*}\n", "\n", "* Since gamma is the conjugate prior for normal-gamma model, conditional distribution of $U$ given $\\mathbf{W} = \\mathbf{w}$ is gamma$((\\nu+p)/2, (\\nu + \\delta(\\mathbf{w},\\mu; \\Sigma))/2)$. 
Thus\n", "\\begin{eqnarray*}\n", "\t\\mathbf{E} (U_j \\mid \\mathbf{w}_j, \\mu^{(t)},\\Sigma^{(t)},\\nu^{(t)}) &=& \\frac{\\nu^{(t)} + p}{\\nu^{(t)} + \\delta(\\mathbf{w}_j,\\mu^{(t)};\\Sigma^{(t)})}\t =: u_j^{(t)} \\\\\n", "\t\\mathbf{E} (\\log U_j \\mid \\mathbf{w}_j, \\mu^{(t)},\\Sigma^{(t)},\\nu^{(t)}) &=& \\log u_j^{(t)} + \\left[ \\psi \\left(\\frac{\\nu^{(t)}+p}{2}\\right) - \\log \\left(\\frac{\\nu^{(t)}+p}{2}\\right) \\right].\n", "\\end{eqnarray*}\n", "\n", "* Overall $Q$ function (up to an additive constant) takes the form\n", "\\begin{eqnarray*}\n", "\t%Q(\\mubf,\\Sigmabf,\\nu \\mid \\mubf^{(t)},\\Sigmabf^{(t)},\\nu^{(t)}) = \n", "\t& & - \\frac n2 \\log \\det \\Sigma - \\frac 12 \\sum_{j=1}^n u_j^{(t)} (\\mathbf{w}_j - \\mu)^T \\Sigma^{-1} (\\mathbf{w}_j - \\mu) \\\\\n", "\t& & - n \\log \\Gamma \\left(\\frac{\\nu}{2}\\right) + \\frac{n \\nu}{2} \\log \\left(\\frac{\\nu}{2}\\right) \\\\\n", "\t& & + \\frac{n \\nu}{2} \\left[ \\frac 1n \\sum_{j=1}^n \\left(\\log u_j^{(t)} - u_j^{(t)}\\right) + \\psi\\left(\\frac{\\nu^{(t)}+p}{2}\\right) - \\log \\left(\\frac{\\nu^{(t)}+p}{2}\\right) \\right].\n", "\\end{eqnarray*}\n", "\n", "* Maximization over $(\\mu, \\Sigma)$ is simply a weighted multivariate normal problem\n", "\\begin{eqnarray*}\n", "\t\\mu^{(t+1)} &=& \\frac{\\sum_{j=1}^n u_j^{(t)} \\mathbf{w}_j}{\\sum_{j=1}^n u_j^{(t)}} \\\\\n", "\t\\Sigma^{(t+1)} &=& \\frac 1n \\sum_{j=1}^n u_j^{(t)} (\\mathbf{w}_j - \\mu^{(t+1)}) (\\mathbf{w}_j - \\mu^{(t+1)})^T.\n", "\\end{eqnarray*}\n", "Notice down-weighting of outliers is obvious in the update.\n", "\n", "* Maximization over $\\nu$ is a univariate problem - Brent method, golden section, or bisection." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### ECM and ECME\n", "\n", "Partition parameter $(\\mu,\\Sigma,\\nu)$ into two blocks $(\\mu,\\Sigma)$ and $\\nu$: \n", "\n", "* ECM = EM for this example. Why?\n", "\n", "* ECME: In the second CM step, maximize over $\\nu$ in terms of the original log-likelihood function instead of the $Q$ function. They have similar difficulty since both are univariate optimization problems!\n", "\n", "* An example from [Liu and Rubin (1994)](https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&as_vis=1&q=The+ECME+algorithm%3A+a+simple+extension+of+EM+and+ECM+with+faster+monotone+convergence&btnG=). $n=79$, $p=2$, with missing entries. EM=ECM: solid line. 
"\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Efficient data augmentation\n", "\n", "Assume $\\nu$ known:\n", "\n", "* Write $\\mathbf{W}_j = \\mu + \\mathbf{C}_j / U_j^{1/2}$, where $\\mathbf{C}_j \\sim N(\\mathbf{0},\\Sigma)$ is independent of $U_j \\sim \\text{gamma}\\left(\\frac{\\nu}{2},\\frac{\\nu}{2}\\right)$.\n", "\n", "* Equivalently, $\\mathbf{W}_j = \\mu + \\det (\\Sigma)^{-a/2} \\mathbf{C}_j / U_j(a)^{1/2}$, where $U_j(a) = \\det(\\Sigma)^{-a} U_j$.\n", "\n", "* The complete data are then $(\\mathbf{w}, \\mathbf{z}(a))$, with $a$ the working parameter.\n", "\n", "* $a=0$ corresponds to the vanilla EM.\n", "\n", "* Meng and van Dyk (1997) recommend using $a_{\\text{opt}} = 1/(\\nu+p)$ to maximize the convergence rate.\n", "\n", "* Exercise: work out the EM update for this special case.\n", "\n", "* The only change to the vanilla EM is to replace the denominator $n$ in the update of $\\Sigma$ by $\\sum_{j=1}^n u_j^{(t)}$.\n", "\n", "* PX-EM (Liu, Rubin, and Wu, 1998) leads to the same update.\n", "\n", "\n", "\n", "Assume $\\nu$ unknown:\n", "\n", "* Version 1: use $a=a_{\\text{opt}}$ in updating both $(\\mu,\\Sigma)$ and $\\nu$. \n", "\n", "* Version 2: use $a=a_{\\text{opt}}$ for updating $(\\mu,\\Sigma)$ and take the observed data as the complete data for updating $\\nu$.\n", "\n", "Conclusion in Meng and van Dyk (1997): Version 1 is roughly 8 to 12 times faster than EM = ECM or ECME, and Version 2 is only slightly more efficient than Version 1." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generalizations of EM - handle difficulties in E step\n", "\n", "### Monte Carlo EM (MCEM)\n", "\n", "* [Wei and Tanner (1990)](https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=A+Monte+Carlo+implementation+of+the+EM+algorithm+and+the+poor+man’s+data+augmentation+algorithms&btnG=).\n", "\n", "* Hard to calculate the $Q$ function? Simulate!\n", "\\begin{eqnarray*}\n", "\tQ(\\theta \\mid \\theta^{(t)}) \\approx \\frac 1m \\sum_{j=1}^m \\ln f(\\mathbf{y}, \\mathbf{z}_j \\mid \\theta),\n", "\\end{eqnarray*}\n", "where the $\\mathbf{z}_j$ are iid draws from the conditional distribution of the missing data given $\\mathbf{y}$ and the previous iterate $\\theta^{(t)}$. (A toy Julia illustration is given after the SAEM subsection below.)\n", "\n", "* The ascent property may be lost due to Monte Carlo error.\n", "\n", "* Examples: capture-recapture models, generalized linear mixed models (GLMMs)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Stochastic EM (SEM) and Simulated Annealing EM (SAEM)\n", "\n", "SEM\n", "\n", "* Celeux and Diebolt (1985).\n", "\n", "* Same as MCEM with $m=1$: a single draw of the missing data $\\mathbf{z}$ from the conditional distribution.\n", "\n", "* $\\theta^{(t)}$ forms a Markov chain that converges to a stationary distribution. There is no definite relation between this stationary distribution and the MLE.\n", "\n", "* In some specific cases, it can be shown that the stationary distribution concentrates around the MLE with a variance inversely proportional to the sample size.\n", "\n", "* Advantage: it can escape the attraction of inferior modes/saddle points in some mixture model problems.\n", "\n", "SAEM\n", "\n", "* Celeux and Diebolt (1989).\n", "\n", "* Increase $m$ with the iterations, ending up with an EM algorithm."
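, "\n", "\n", "As a toy illustration of the Monte Carlo E step referenced in the MCEM subsection above (this example is not from KL or the cited papers), consider right-censored exponential data: the complete data are $x_1, \\ldots, x_n$ iid exponential with rate $\\lambda$, but we only observe $y_i = \\min(x_i, c)$ and a censoring indicator. Given $\\lambda^{(t)}$, the unobserved part of a censored lifetime satisfies $X_i - y_i \\sim \\text{Exponential}(\\lambda^{(t)})$ by memorylessness, and the complete-data MLE is $n / \\sum_i x_i$, so the MCEM update simply replaces each censored $x_i$ by a Monte Carlo average. (The exact E step, $y_i + 1/\\lambda^{(t)}$, is available here, which makes it easy to check the Monte Carlo version.) The function name `mcem_exp` is hypothetical.\n", "\n", "```julia\n", "using Random\n", "\n", "# Monte Carlo EM for right-censored Exponential(rate λ) data: a toy sketch.\n", "# y[i] = min(x[i], c) is the observed time; obs[i] is true when the event was observed.\n", "function mcem_exp(y, obs; λ = 1.0, m = 100, iters = 50)\n", "    n = length(y)\n", "    for t in 1:iters\n", "        # Monte Carlo E step: for a censored unit, X - y ~ Exponential(λ) by\n", "        # memorylessness, so approximate E[X | X > y, λ] by an average of m draws.\n", "        xbar = [obs[i] ? y[i] : y[i] + sum(randexp(m)) / (m * λ) for i in 1:n]\n", "        λ = n / sum(xbar)                  # M step: complete-data MLE of the rate\n", "    end\n", "    return λ\n", "end\n", "\n", "# Quick check on simulated data: true rate 2, censoring at c = 0.5.\n", "Random.seed!(123)\n", "x = randexp(1000) ./ 2\n", "c = 0.5\n", "y = min.(x, c)\n", "obs = x .<= c\n", "println(mcem_exp(y, obs; m = 200, iters = 100))   # MCEM estimate\n", "println(sum(obs) / sum(y))                        # censored-data MLE, for comparison\n", "```\n", "\n", "With a fixed $m$ the iterates fluctuate around the censored-data MLE because of Monte Carlo error; increasing $m$ along the iterations (as in SAEM) or averaging the trailing iterates removes this noise."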
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Data-augmentation (DA) algorithm\n", "\n", "* [Tanner and Wong (1987)](https://scholar.google.com/scholar?q=The+calculation+of+posterior+distributions+by+data+augmentation&hl=en&as_sdt=0&as_vis=1&oi=scholart).\n", "\n", "* Aim for sampling from $p(\\theta|\\mathbf{y})$ instead of maximization.\n", "\n", "* Idea: incomplete data posterior density is complicated, but the complete-data posterior density is relatively easy to sample.\n", "\n", "* Data augmentation algorithm:\n", " * draw $\\mathbf{z}^{(t+1)}$ conditional on $(\\theta^{(t)},\\mathbf{y})$\n", " * draw $\\theta^{(t+1)}$ conditional on $(\\mathbf{z}^{(t+1)},\\mathbf{y})$\n", "\n", "* A special case of Gibbs sampler.\n", "\n", "* $\\theta^{(t)}$ converges to the distribution $p(\\theta|\\mathbf{y})$ under general conditions.\n", "\n", "* Ergodic mean converges to the posterior mean $\\mathbf{E}(\\theta|\\mathbf{y})$, which may perform better than MLE in finite sample." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## EM as a maximization-maximization procedure\n", "\n", "* [Neal and Hinton (1999)](https://doi.org/10.1007/978-94-011-5014-9_12).\n", "\n", "* Consider the objective function\n", "\\begin{eqnarray*}\n", "\tF(\\theta, q) = \\mathbf{E}_q [\\ln f(\\mathbf{Z},\\mathbf{y} \\mid \\theta)] + H(q)\n", "\\end{eqnarray*}\n", "over $\\Theta \\times {\\cal Q}$, where ${\\cal Q}$ is the set of all conditional pdfs of the missing data $\\{q(\\mathbf{z}) = p(\\mathbf{z} \\mid \\mathbf{y}, \\theta), \\theta \\in \\Theta \\}$ and $H(q) = -\\mathbf{E}_q \\log q$ is the entropy of $q$.\n", "\n", "* EM is essentially performing coordinate ascent for maximizing $F$\n", "\n", " * E step: At current iterate $\\theta^{(t)}$, \n", "\\begin{eqnarray*}\n", "\tF(\\theta^{(t)}, q) = \\mathbf{E}_q [\\log p(\\mathbf{Z} \\mid \\mathbf{y}, \\theta^{(t)})] - \\mathbf{E}_q \\log q + \\ln g(\\mathbf{y} \\mid \\theta^{(t)}).\n", "\\end{eqnarray*}\n", "The maximizing conditional pdf is given by $q = \\log f(\\mathbf{Z} \\mid \\mathbf{y}, \\theta^{(t)})$ \n", "\n", " * M step: Substitute $q = \\log f(\\mathbf{Z} \\mid \\mathbf{y}, \\theta^{(t)})$ into $F$ and maximize over $\\theta$\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Incremental EM\n", "\n", "* Assume iid observations. Then the objective function is\n", "\\begin{eqnarray*}\n", "\tF(\\theta,q) = \\sum_i \\left[ \\mathbf{E}_{q_i} \\log p(\\mathbf{Z}_i,\\mathbf{y}_i \\mid \\theta) + H(q_i) \\right],\n", "\\end{eqnarray*}\n", "where we search for $q$ under factored form $q(\\mathbf{z}) = \\prod_i q_i (\\mathbf{z})$.\n", "\n", "* Maximizing $F$ over $q$ is equivalent to maximizing the contribution of each data with respect to $q_i$.\n", "\n", "* Update $\\theta$ by visiting data items **sequentially** rather than from a global E step.\n", "\n", "* Finite mixture example: for each data point $i$\n", " * evaluate membership variables $w_{ij}$, $j=1,\\ldots,k$. 
\n", " * update parameter values $(\\pi^{(t+1)}, \\mu_j^{(t+1)}, \\Sigma_j^{(t+1)})$ (only need to keep sufficient statistics!)\n", "\n", "* Faster convergence than vanilla EM in some examples.\n", "\n", "" ] } ], "metadata": { "@webio": { "lastCommId": null, "lastKernelId": null }, "kernelspec": { "display_name": "Julia 1.1.0", "language": "julia", "name": "julia-1.1" }, "language_info": { "file_extension": ".jl", "mimetype": "application/julia", "name": "julia", "version": "1.1.0" }, "toc": { "colors": { "hover_highlight": "#DAA520", "running_highlight": "#FF0000", "selected_highlight": "#FFD700" }, "moveMenuLeft": true, "nav_menu": { "height": "85px", "width": "252px" }, "navigate_menu": true, "number_sections": true, "sideBar": true, "skip_h1_title": true, "threshold": 4, "toc_cell": true, "toc_position": { "height": "369.3333435058594px", "left": "0px", "right": "814px", "top": "111.33333587646484px", "width": "156.1875px" }, "toc_section_display": "block", "toc_window_display": true, "widenNotebook": false } }, "nbformat": 4, "nbformat_minor": 2 }