---
title: EM Algorithm (KL Chapter 13)
subtitle: Biostat/Biomath M257
author: Dr. Hua Zhou @ UCLA
date: today
format:
  html:
    theme: cosmo
    embed-resources: true
    number-sections: true
    toc: true
    toc-depth: 4
    toc-location: left
    code-fold: false
jupyter:
  jupytext:
    formats: 'ipynb,qmd'
    text_representation:
      extension: .qmd
      format_name: quarto
      format_version: '1.0'
      jupytext_version: 1.14.5
  kernelspec:
    display_name: Julia 1.9.0
    language: julia
    name: julia-1.9
---

## Which statistical papers are most cited?

| Paper | Citations (5/10/2022) | Per Year |
|------------------------------------------|-----------:|----------:|
| FDR ([Benjamini and Hochberg, 1995](https://scholar.google.com/scholar?q=Controlling+the+false+discovery+rate)) | 85002 | **3148** |
| EM algorithm ([Dempster et al., 1977](https://scholar.google.com/scholar?q=Maximum+likelihood+from+incomplete+data+via+the+EM+algorithm&hl=en)) | 66596 | **1480** |
| Kaplan-Meier ([Kaplan and Meier, 1958](https://scholar.google.com/scholar?q=Nonparametric+estimation+from+incomplete+observations&hl=en)) | 62107 | 970 |
| Cox model ([Cox, 1972](https://scholar.google.com/scholar?q=Regression+models+and+life-tables&hl=en)) | 57383 | **1148** |
| Metropolis algorithm ([Metropolis et al., 1953](https://scholar.google.com/scholar?q=Equation+of+state+calculations+by+fast+computing+machines)) | 47851 | 694 |
| Lasso ([Tibshirani, 1996](https://scholar.google.com/scholar?q=Regression+shrinkage+and+selection+via+the+lasso)) | 46220 | **1778** |
| Unit root test ([Dickey and Fuller, 1979](https://scholar.google.com/scholar?q=Distribution+of+the+estimators+for+autoregressive+time+series+with+a+unit+root)) | 33783 | 786 |
| Compressed sensing ([Donoho, 2004](https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=compressed+sensing&btnG=)) | 30714 | **1706** |
| Propensity score ([Rosenbaum and Rubin, 1983](https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=rosenbaum+rubin+1983&btnG=)) | 31926 | 819 |
| Bootstrap ([Efron, 1979](https://scholar.google.com/scholar?q=Bootstrap+methods%3A+another+look+at+the+jackknife)) | 22789 | 530 |
| FFT ([Cooley and Tukey, 1965](https://scholar.google.com/scholar?q=An+algorithm+for+the+machine+calculation+of+complex+Fourier+series)) | 17817 | 313 |
| Inference and missing data ([Rubin, 1976](https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=Inference+and+missing+data&btnG=)) | 11672 | 254 |
| Gibbs sampler ([Gelfand and Smith, 1990](https://scholar.google.com/scholar?q=Sampling-based+approaches+to+calculating+marginal+densities)) | 9334 | 292 |

* Citation counts from Google Scholar on May 10, 2022.

* EM is one of the most influential statistical ideas, finding applications in various branches of science.

* History: the landmark paper on the EM algorithm is [Dempster et al., 1977](https://scholar.google.com/scholar?q=Maximum+likelihood+from+incomplete+data+via+the+EM+algorithm&hl=en). The same idea appears in parameter estimation for HMMs (the Baum-Welch algorithm) by [Baum et al., 1970](http://www.jstor.org/stable/2239727?seq=1#page_scan_tab_contents).

## EM algorithm

* Goal: maximize the log-likelihood of the observed data $\ln g(\mathbf{y}|\theta)$ (optimization!).

* Notations:
    * $\mathbf{Y}$: observed data
    * $\mathbf{Z}$: missing data

* Idea: choose the missing $\mathbf{Z}$ such that the MLE for the complete data is easy.

* Let $f(\mathbf{y},\mathbf{z} | \theta)$ be the density of the complete data.
* Iterative two-step procedure:
    * **E step**: calculate the conditional expectation
    $$
    Q(\theta|\theta^{(t)}) = \mathbf{E}_{\mathbf{Z}|\mathbf{Y}=\mathbf{y},\theta^{(t)}} [ \ln f(\mathbf{Y},\mathbf{Z}|\theta) \mid \mathbf{Y} = \mathbf{y}, \theta^{(t)}]
    $$
    * **M step**: maximize $Q(\theta|\theta^{(t)})$ to generate the next iterate
    $$
    \theta^{(t+1)} = \text{argmax}_{\theta} \, Q(\theta|\theta^{(t)})
    $$

* (**Ascent property** of the EM algorithm) By the information inequality,
$$
\begin{eqnarray*}
    & & Q(\theta \mid \theta^{(t)}) - \ln g(\mathbf{y}|\theta) \\
    &=& \mathbf{E} [\ln f(\mathbf{Y},\mathbf{Z}|\theta) \mid \mathbf{Y} = \mathbf{y}, \theta^{(t)}] - \ln g(\mathbf{y}|\theta) \\
    &=& \mathbf{E} \left\{ \ln \left[ \frac{f(\mathbf{Y}, \mathbf{Z} \mid \theta)}{g(\mathbf{Y} \mid \theta)} \right] \mid \mathbf{Y} = \mathbf{y}, \theta^{(t)} \right\} \\
    &\le& \mathbf{E} \left\{ \ln \left[ \frac{f(\mathbf{Y}, \mathbf{Z} \mid \theta^{(t)})}{g(\mathbf{Y} \mid \theta^{(t)})} \right] \mid \mathbf{Y} = \mathbf{y}, \theta^{(t)} \right\} \\
    &=& Q(\theta^{(t)} \mid \theta^{(t)}) - \ln g(\mathbf{y} |\theta^{(t)}).
\end{eqnarray*}
$$
    Rearranging shows that (**fundamental inequality of EM**)
$$
\begin{eqnarray*}
    \ln g(\mathbf{y} \mid \theta) \ge Q(\theta \mid \theta^{(t)}) - Q(\theta^{(t)} \mid \theta^{(t)}) + \ln g(\mathbf{y} \mid \theta^{(t)})
\end{eqnarray*}
$$
    for all $\theta$; in particular
$$
\begin{eqnarray*}
    \ln g(\mathbf{y} \mid \theta^{(t+1)}) &\ge& Q(\theta^{(t+1)} \mid \theta^{(t)}) - Q(\theta^{(t)} \mid \theta^{(t)}) + \ln g(\mathbf{y} \mid \theta^{(t)}) \\
    &\ge& \ln g(\mathbf{y} \mid \theta^{(t)}).
\end{eqnarray*}
$$
    Obviously we only need
$$
Q(\theta^{(t+1)} \mid \theta^{(t)}) - Q(\theta^{(t)} \mid \theta^{(t)}) \ge 0
$$
    for this ascent property to hold (**generalized EM**).

* Intuition? Keep these pictures in mind.

* Under mild conditions, $\theta^{(t)}$ converges to a stationary point of $\ln g(\mathbf{y} \mid \theta)$.

* Numerous applications of EM: finite mixture models, HMM (Baum-Welch algorithm), factor analysis, variance components model aka linear mixed model (LMM), hyper-parameter estimation in empirical Bayes procedures $\max_{\alpha} \int f(\mathbf{y}|\theta) \pi(\theta|\alpha) \, d \theta$, missing data, grouped/censored/truncated models, ...

    See Chapter 13 of [Numerical Analysis for Statisticians](http://ucla.worldcat.org/title/numerical-analysis-for-statisticians/oclc/793808354&referer=brief_results) by Kenneth Lange (2010) for more applications of EM.
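These two steps, together with the ascent property, fit a simple computational template: alternate the E and M steps and track the observed-data log-likelihood, which should never decrease. Below is a minimal generic Julia sketch of this loop; the arguments `estep`, `mstep`, and `loglik` are illustrative placeholders to be supplied for a specific model, not code from these notes.

```julia
# Generic EM template: alternate E and M steps and monitor the observed-data
# log-likelihood ln g(y | θ), which the ascent property guarantees is non-decreasing.
function em_fit(estep, mstep, loglik, θ0; maxiter = 1000, tol = 1e-8)
    θ   = θ0
    obj = loglik(θ)                       # ln g(y | θ⁽⁰⁾)
    for iter in 1:maxiter
        e      = estep(θ)                 # E step: conditional expectations given θ⁽ᵗ⁾
        θ      = mstep(e)                 # M step: maximize Q(θ | θ⁽ᵗ⁾)
        objnew = loglik(θ)
        objnew < obj - 1e-10 && @warn "log-likelihood decreased at iteration $iter"
        abs(objnew - obj) < tol * (abs(obj) + 1) && return θ, objnew, iter
        obj = objnew
    end
    return θ, obj, maxiter
end
```

Checking monotonicity of `loglik` at every iteration is a cheap and effective way to catch bugs in either step.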
## A canonical EM example: finite mixture models

* Consider Gaussian finite mixture models with density
$$
h(\mathbf{y}) = \sum_{j=1}^k \pi_j h_j(\mathbf{y} \mid \mu_j, \Omega_j), \quad \mathbf{y} \in \mathbf{R}^{d},
$$
    where
$$
h_j(\mathbf{y} \mid \mu_j, \Omega_j) = \left( \frac{1}{2\pi} \right)^{d/2} \det(\Omega_j)^{-1/2} e^{- \frac 12 (\mathbf{y} - \mu_j)^T \Omega_j^{-1} (\mathbf{y} - \mu_j)}
$$
    are densities of multivariate normals $N_d(\mu_j, \Omega_j)$.

* Given iid data points $\mathbf{y}_1,\ldots,\mathbf{y}_n$, we want to estimate the parameters
$$
\theta = (\pi_1, \ldots, \pi_k, \mu_1, \ldots, \mu_k, \Omega_1,\ldots,\Omega_k)
$$
    subject to the constraints $\pi_j \ge 0$, $\sum_{j=1}^k \pi_j = 1$, $\Omega_j \succeq \mathbf{0}$.

* The (incomplete) data log-likelihood is
\begin{eqnarray*}
    \ln g(\mathbf{y}_1,\ldots,\mathbf{y}_n|\theta) &=& \sum_{i=1}^n \ln h(\mathbf{y}_i) = \sum_{i=1}^n \ln \sum_{j=1}^k \pi_j h_j(\mathbf{y}_i \mid \mu_j, \Omega_j) \\
    &=& \sum_{i=1}^n \ln \left[ \sum_{j=1}^k \pi_j \left( \frac{1}{2\pi} \right)^{d/2} \det(\Omega_j)^{-1/2} e^{- \frac 12 (\mathbf{y}_i - \mu_j)^T \Omega_j^{-1} (\mathbf{y}_i - \mu_j)} \right].
\end{eqnarray*}

* Let $z_{ij} = I \{\mathbf{y}_i$ comes from group $j \}$ be the missing data. The complete data likelihood is
$$
\begin{eqnarray*}
    f(\mathbf{y},\mathbf{z} | \theta) = \prod_{i=1}^n \prod_{j=1}^k [\pi_j h_j(\mathbf{y}_i|\mu_j,\Omega_j)]^{z_{ij}}
\end{eqnarray*}
$$
    and thus the complete log-likelihood is
$$
\begin{eqnarray*}
    \ln f(\mathbf{y},\mathbf{z} | \theta) = \sum_{i=1}^n \sum_{j=1}^k z_{ij} [\ln \pi_j + \ln h_j(\mathbf{y}_i|\mu_j,\Omega_j)].
\end{eqnarray*}
$$

* **E step**: need to evaluate the conditional expectation
$$
\begin{eqnarray*}
    Q(\theta|\theta^{(t)}) = \mathbf{E} \left\{ \sum_{i=1}^n \sum_{j=1}^k z_{ij} [\ln \pi_j + \ln h_j(\mathbf{y}_i|\mu_j,\Omega_j)] \mid \mathbf{Y}=\mathbf{y}, \pi^{(t)}, \mu_1^{(t)}, \ldots, \mu_k^{(t)}, \Omega_1^{(t)}, \ldots, \Omega_k^{(t)} \right\}.
\end{eqnarray*}
$$
    By Bayes rule, we have
$$
\begin{eqnarray*}
    w_{ij}^{(t)} &:=& \mathbf{E} [z_{ij} \mid \mathbf{y}, \pi^{(t)}, \mu_1^{(t)}, \ldots, \mu_k^{(t)}, \Omega_1^{(t)}, \ldots, \Omega_k^{(t)}] \\
    &=& \frac{\pi_j^{(t)} h_j(\mathbf{y}_i|\mu_j^{(t)},\Omega_j^{(t)})}{\sum_{j'=1}^k \pi_{j'}^{(t)} h_{j'}(\mathbf{y}_i|\mu_{j'}^{(t)},\Omega_{j'}^{(t)})}.
\end{eqnarray*}
$$
    So the Q function becomes
$$
\begin{eqnarray*}
    Q(\theta|\theta^{(t)}) = \sum_{i=1}^n \sum_{j=1}^k w_{ij}^{(t)} \ln \pi_j + \sum_{i=1}^n \sum_{j=1}^k w_{ij}^{(t)} \left[ - \frac{1}{2} \ln \det \Omega_j - \frac{1}{2} (\mathbf{y}_i - \mu_j)^T \Omega_j^{-1} (\mathbf{y}_i - \mu_j) \right].
\end{eqnarray*}
$$

* **M step**: the maximizer of the Q function gives the next iterate
$$
\begin{eqnarray*}
    \pi_j^{(t+1)} &=& \frac{\sum_i w_{ij}^{(t)}}{n} \\
    \mu_j^{(t+1)} &=& \frac{\sum_{i=1}^n w_{ij}^{(t)} \mathbf{y}_i}{\sum_{i=1}^n w_{ij}^{(t)}} \\
    \Omega_j^{(t+1)} &=& \frac{\sum_{i=1}^n w_{ij}^{(t)} (\mathbf{y}_i - \mu_j^{(t+1)}) (\mathbf{y}_i - \mu_j^{(t+1)})^T}{\sum_i w_{ij}^{(t)}}.
\end{eqnarray*}
$$
    See KL Example 11.3.1 for the multinomial MLE and KL Example 11.2.3 for the multivariate normal MLE.

* Compare these extremely simple updates to Newton type algorithms!

* Also note the ease of parallel computing with this EM algorithm. See, e.g.,

    **Suchard, M. A.**; Wang, Q.; Chan, C.; Frelinger, J.; Cron, A.; West, M. Understanding GPU programming for statistical computation: studies in massively parallel massive mixtures. _Journal of Computational and Graphical Statistics_, 2010, 19, 419-438.
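To make these updates concrete, here is a minimal Julia sketch of one EM iteration for a Gaussian mixture. This is a didactic implementation (dense linear algebra, assuming well-conditioned $\Omega_j$), not an excerpt from an existing package; the function and variable names are illustrative.

```julia
using LinearAlgebra

# log of the N(μ, Ω) density evaluated at y
function logmvnormal(y, μ, Ω)
    C = cholesky(Symmetric(Ω))
    r = C.L \ (y - μ)
    -0.5 * (length(y) * log(2π) + logdet(C) + dot(r, r))
end

# One EM iteration for a k-component Gaussian mixture.
# Y: d × n data matrix (observations in columns); p, μ, Ω: current mixture
# proportions, mean vectors, and covariance matrices.
function gmm_em_step(Y, p, μ, Ω)
    d, n = size(Y)
    k = length(p)
    W = zeros(n, k)                         # responsibilities w[i, j]
    # E step: posterior membership probabilities by Bayes rule
    for i in 1:n
        logw = [log(p[j]) + logmvnormal(Y[:, i], μ[j], Ω[j]) for j in 1:k]
        logw .-= maximum(logw)              # stabilize before exponentiating
        w = exp.(logw)
        W[i, :] = w / sum(w)
    end
    # M step: weighted multinomial and multivariate normal MLEs
    pnew = vec(sum(W, dims = 1)) / n
    μnew = [Y * W[:, j] / sum(W[:, j]) for j in 1:k]
    Ωnew = [begin
                R = Y .- μnew[j]            # centered data for component j
                (R .* W[:, j]') * R' / sum(W[:, j])
            end for j in 1:k]
    pnew, μnew, Ωnew, W
end
```

A real implementation would factor each $\Omega_j$ once per iteration rather than once per data point, and would also accumulate the observed-data log-likelihood (a log-sum-exp of the per-component terms) to verify the ascent property.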
## Generalizations of EM - handle difficulties in M step

### Expectation Conditional Maximization (ECM)

* [Meng and Rubin (1993)](https://scholar.google.com/scholar?q=Maximum+likelihood+estimation+via+the+ECM+algorithm:+a+general+framework&hl=en&as_sdt=0&as_vis=1&oi=scholart).

* In some problems the M step is difficult (no analytic solution).

* Conditional maximization is easy (block ascent):
    * partition the parameter vector into blocks $\theta = (\theta_1, \ldots, \theta_B)$;
    * alternately update $\theta_b$, $b=1,\ldots,B$.

* Ascent property still holds. Why?

* ECM may converge slower than EM (more iterations), but the total computer time may be shorter due to the ease of the CM steps.

### ECM Either (ECME)

* [Liu and Rubin (1994)](https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&as_vis=1&q=The+ECME+algorithm%3A+a+simple+extension+of+EM+and+ECM+with+faster+monotone+convergence&btnG=).

* Each CM step maximizes either the $Q$ function or the original incomplete-data (observed) log-likelihood.

* Ascent property still holds. Why?

* Faster convergence than ECM.

### Alternating ECM (AECM)

* [Meng and van Dyk (1997)](https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=The+EM+algorithm+an+old+folk-song+sung+to+a+fast+new+tune&btnG=).

* The specification of the complete data is allowed to differ on each CM step.

* Ascent property still holds. Why?

### PX-EM and efficient data augmentation

* Parameter-Expanded EM. [Liu, Rubin, and Wu (1998)](https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=Parameter+expansion+to+accelerate+EM%3A+the+px-em+algorithm&btnG=), [Liu and Wu (1999)](https://scholar.google.com/scholar?cluster=15342818142955984734&hl=en&as_sdt=0,5).

* Efficient data augmentation. [Meng and van Dyk (1997)](https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=The+EM+algorithm+an+old+folk-song+sung+to+a+fast+new+tune&btnG=).

* Idea: speed up the convergence of the EM algorithm by efficiently augmenting the observed data (introducing a working parameter into the specification of the complete data).

### Example: t distribution

#### EM

* $\mathbf{W} \in \mathbb{R}^{p}$ follows a multivariate $t$-distribution $t_p(\mu,\Sigma,\nu)$ if $\mathbf{W} \mid u \sim N(\mu, \Sigma/u)$ and $U \sim \text{gamma}\left(\frac{\nu}{2}, \frac{\nu}{2}\right)$.

* Widely used for robust modeling.

* Recall that the Gamma($\alpha,\beta$) distribution has density
\begin{eqnarray*}
    f(u \mid \alpha,\beta) = \frac{\beta^{\alpha} u^{\alpha-1}}{\Gamma(\alpha)} e^{-\beta u}, \quad u \ge 0,
\end{eqnarray*}
    with mean $\mathbf{E}(U)=\alpha/\beta$ and $\mathbf{E}(\log U) = \psi(\alpha) - \log \beta$.

* The density of $\mathbf{W}$ is
\begin{eqnarray*}
    f_p(\mathbf{w} \mid \mu, \Sigma, \nu) = \frac{\Gamma \left(\frac{\nu+p}{2}\right) \text{det}(\Sigma)^{-1/2}}{(\pi \nu)^{p/2} \Gamma(\nu/2) [1 + \delta (\mathbf{w},\mu;\Sigma)/\nu]^{(\nu+p)/2}},
\end{eqnarray*}
    where
\begin{eqnarray*}
    \delta (\mathbf{w},\mu;\Sigma) = (\mathbf{w} - \mu)^T \Sigma^{-1} (\mathbf{w} - \mu)
\end{eqnarray*}
    is the Mahalanobis squared distance between $\mathbf{w}$ and $\mu$.

* Given iid data $\mathbf{w}_1, \ldots, \mathbf{w}_n$, the log-likelihood is
\begin{eqnarray*}
    L(\mu, \Sigma, \nu) &=& - \frac{np}{2} \log (\pi \nu) + n \left[ \log \Gamma \left(\frac{\nu+p}{2}\right) - \log \Gamma \left(\frac{\nu}{2}\right) \right] \\
    && - \frac n2 \log \det \Sigma + \frac n2 (\nu + p) \log \nu \\
    && - \frac{\nu + p}{2} \sum_{j=1}^n \log [\nu + \delta (\mathbf{w}_j,\mu;\Sigma) ].
\end{eqnarray*}
    How do we compute the MLE $(\hat \mu, \hat \Sigma, \hat \nu)$?

* $\mathbf{W}_j \mid u_j$ are independent $N(\mu,\Sigma/u_j)$, and the $U_j$ are iid gamma $\left(\frac{\nu}{2},\frac{\nu}{2}\right)$.

* Missing data: $\mathbf{z}=(u_1,\ldots,u_n)^T$.

* The log-likelihood of the complete data is
\begin{eqnarray*}
    L_c (\mu, \Sigma, \nu) &=& - \frac{np}{2} \ln (2\pi) - \frac{n}{2} \log \det \Sigma \\
    & & - \frac 12 \sum_{j=1}^n u_j (\mathbf{w}_j - \mu)^T \Sigma^{-1} (\mathbf{w}_j - \mu) \\
    & & - n \log \Gamma \left(\frac{\nu}{2}\right) + \frac{n \nu}{2} \log \left(\frac{\nu}{2}\right) \\
    & & + \frac{\nu}{2} \sum_{j=1}^n (\log u_j - u_j) - \sum_{j=1}^n \log u_j.
\end{eqnarray*}

* By conjugacy in this normal-gamma model, the conditional distribution of $U$ given $\mathbf{W} = \mathbf{w}$ is gamma$\left(\frac{\nu+p}{2}, \frac{\nu + \delta(\mathbf{w},\mu; \Sigma)}{2}\right)$. Thus
\begin{eqnarray*}
    \mathbf{E} (U_j \mid \mathbf{w}_j, \mu^{(t)},\Sigma^{(t)},\nu^{(t)}) &=& \frac{\nu^{(t)} + p}{\nu^{(t)} + \delta(\mathbf{w}_j,\mu^{(t)};\Sigma^{(t)})} =: u_j^{(t)} \\
    \mathbf{E} (\log U_j \mid \mathbf{w}_j, \mu^{(t)},\Sigma^{(t)},\nu^{(t)}) &=& \log u_j^{(t)} + \left[ \psi \left(\frac{\nu^{(t)}+p}{2}\right) - \log \left(\frac{\nu^{(t)}+p}{2}\right) \right].
\end{eqnarray*}

* The overall $Q$ function (up to an additive constant) takes the form
\begin{eqnarray*}
    & & - \frac n2 \log \det \Sigma - \frac 12 \sum_{j=1}^n u_j^{(t)} (\mathbf{w}_j - \mu)^T \Sigma^{-1} (\mathbf{w}_j - \mu) \\
    & & - n \log \Gamma \left(\frac{\nu}{2}\right) + \frac{n \nu}{2} \log \left(\frac{\nu}{2}\right) \\
    & & + \frac{n \nu}{2} \left[ \frac 1n \sum_{j=1}^n \left(\log u_j^{(t)} - u_j^{(t)}\right) + \psi\left(\frac{\nu^{(t)}+p}{2}\right) - \log \left(\frac{\nu^{(t)}+p}{2}\right) \right].
\end{eqnarray*}

* Maximization over $(\mu, \Sigma)$ is simply a weighted multivariate normal problem:
\begin{eqnarray*}
    \mu^{(t+1)} &=& \frac{\sum_{j=1}^n u_j^{(t)} \mathbf{w}_j}{\sum_{j=1}^n u_j^{(t)}} \\
    \Sigma^{(t+1)} &=& \frac 1n \sum_{j=1}^n u_j^{(t)} (\mathbf{w}_j - \mu^{(t+1)}) (\mathbf{w}_j - \mu^{(t+1)})^T.
\end{eqnarray*}
    Notice that the down-weighting of outliers is obvious in these updates.

* Maximization over $\nu$ is a univariate problem: Brent's method, golden section search, or bisection.

#### ECM and ECME

Partition the parameter $(\mu,\Sigma,\nu)$ into two blocks, $(\mu,\Sigma)$ and $\nu$:

* ECM = EM for this example. Why?

* ECME: in the second CM step, maximize over $\nu$ in terms of the original log-likelihood function instead of the $Q$ function. The two choices have similar difficulty, since both are univariate optimization problems!

* An example from [Liu and Rubin (1994)](https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&as_vis=1&q=The+ECME+algorithm%3A+a+simple+extension+of+EM+and+ECM+with+faster+monotone+convergence&btnG=): $n=79$, $p=2$, with missing entries. In their figure, ECME (dashed line) converges markedly faster than EM=ECM (solid line).

#### Efficient data augmentation

Assume $\nu$ is known:

* Write $\mathbf{W}_j = \mu + \mathbf{C}_j / U_j^{1/2}$, where $\mathbf{C}_j$ is $N(\mathbf{0},\Sigma)$ independent of $U_j$, which is gamma$\left(\frac{\nu}{2},\frac{\nu}{2}\right)$.

* Equivalently, $\mathbf{W}_j = \mu + \det (\Sigma)^{-a/2} \mathbf{C}_j / U_j(a)^{1/2}$, where $U_j(a) = \det(\Sigma)^{-a} U_j$.

* Then the complete data are $(\mathbf{w}, \mathbf{z}(a))$, with $a$ the working parameter.

* $a=0$ corresponds to the vanilla EM.

* Meng and van Dyk (1997) recommend $a_{\text{opt}} = 1/(\nu+p)$ to maximize the convergence rate.

* Exercise: work out the EM update for this special case.

* The only change to the vanilla EM is to replace the denominator $n$ in the update of $\Sigma$ by $\sum_{j=1}^n u_j^{(t)}$.

* PX-EM (Liu, Rubin, and Wu, 1998) leads to the same update.

Assume $\nu$ is unknown:

* Version 1: $a=a_{\text{opt}}$ in the updates of both $(\mu,\Sigma)$ and $\nu$.

* Version 2: $a=a_{\text{opt}}$ for updating $(\mu,\Sigma)$, and take the observed data as the complete data for updating $\nu$.

Conclusion in Meng and van Dyk (1997): Version 1 is $8 \sim 12$ times faster than EM=ECM or ECME. Version 2 is only slightly more efficient than Version 1.
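For fixed $\nu$, the E and M steps above reduce to computing the weights $u_j^{(t)}$ and then a weighted mean and scatter update. Here is a minimal Julia sketch of one such iteration; it is illustrative code (names and layout are assumptions, not from the notes), with the data vectors stored as the columns of `W`.

```julia
using LinearAlgebra

# One EM iteration for the multivariate t model t_p(μ, Σ, ν) with ν held fixed.
# W: p × n data matrix (observations in columns); μ, Σ: current parameters.
function t_em_step(W, μ, Σ, ν)
    p, n = size(W)
    C = cholesky(Symmetric(Σ))
    # E step: u_j = (ν + p) / (ν + δ_j), δ_j the Mahalanobis squared distance
    u = zeros(n)
    for j in 1:n
        r = C.L \ (W[:, j] - μ)
        u[j] = (ν + p) / (ν + dot(r, r))
    end
    # M step: weighted mean and scatter (outliers receive small weights u_j)
    μnew = W * u / sum(u)
    R = W .- μnew
    Σnew = (R .* u') * R' / n       # vanilla EM divides by n
    μnew, Σnew, u
end
```

The efficient data augmentation (or PX-EM) variant mentioned above changes only the last update, dividing by `sum(u)` instead of `n`.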
## Generalizations of EM - handle difficulties in E step

### Monte Carlo EM (MCEM)

* [Wei and Tanner (1990)](https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=A+Monte+Carlo+implementation+of+the+EM+algorithm+and+the+poor+man’s+data+augmentation+algorithms&btnG=).

* Hard to calculate the $Q$ function? Simulate!
\begin{eqnarray*}
    Q(\theta \mid \theta^{(t)}) \approx \frac 1m \sum_{j=1}^m \ln f(\mathbf{y}, \mathbf{z}_j \mid \theta),
\end{eqnarray*}
    where the $\mathbf{z}_j$ are iid draws from the conditional distribution of the missing data given $\mathbf{y}$ and the previous iterate $\theta^{(t)}$.

* The ascent property may be lost due to Monte Carlo errors.

* Examples: capture-recapture models, generalized linear mixed models (GLMM).

### Stochastic EM (SEM) and Simulated Annealing EM (SAEM)

SEM

* Celeux and Diebolt (1985).

* Same as MCEM with $m=1$: a single draw of the missing data $\mathbf{z}$ from the conditional distribution.

* $\theta^{(t)}$ forms a Markov chain which converges to a stationary distribution. There is no definite relation between this stationary distribution and the MLE.

* In some specific cases, it can be shown that the stationary distribution concentrates around the MLE with a variance inversely proportional to the sample size.

* Advantage: can escape the attraction to an inferior mode/saddle point in some mixture model problems.

SAEM

* Celeux and Diebolt (1989).

* Increase $m$ with the iterations, ending up with an EM algorithm.

### Data-augmentation (DA) algorithm

* [Tanner and Wong (1987)](https://scholar.google.com/scholar?q=The+calculation+of+posterior+distributions+by+data+augmentation&hl=en&as_sdt=0&as_vis=1&oi=scholart).

* Aim: sample from $p(\theta|\mathbf{y})$ instead of maximizing.

* Idea: the incomplete-data posterior density is complicated, but the complete-data posterior density is relatively easy to sample from.

* Data augmentation algorithm:
    * draw $\mathbf{z}^{(t+1)}$ conditional on $(\theta^{(t)},\mathbf{y})$;
    * draw $\theta^{(t+1)}$ conditional on $(\mathbf{z}^{(t+1)},\mathbf{y})$.

* A special case of the Gibbs sampler.

* $\theta^{(t)}$ converges to the distribution $p(\theta|\mathbf{y})$ under general conditions.

* The ergodic mean converges to the posterior mean $\mathbf{E}(\theta|\mathbf{y})$, which may perform better than the MLE in finite samples.

## EM as a maximization-maximization procedure

* [Neal and Hinton (1999)](https://doi.org/10.1007/978-94-011-5014-9_12).

* Consider the objective function
\begin{eqnarray*}
    F(\theta, q) = \mathbf{E}_q [\ln f(\mathbf{Z},\mathbf{y} \mid \theta)] + H(q)
\end{eqnarray*}
    over $\Theta \times {\cal Q}$, where ${\cal Q}$ is the set of all conditional pdfs of the missing data $\{q(\mathbf{z}) = p(\mathbf{z} \mid \mathbf{y}, \theta), \theta \in \Theta \}$ and $H(q) = -\mathbf{E}_q \log q$ is the entropy of $q$.

* EM is essentially performing coordinate ascent for maximizing $F$:
    * E step: at the current iterate $\theta^{(t)}$,
    \begin{eqnarray*}
        F(\theta^{(t)}, q) = \mathbf{E}_q [\log p(\mathbf{Z} \mid \mathbf{y}, \theta^{(t)})] - \mathbf{E}_q \log q + \ln g(\mathbf{y} \mid \theta^{(t)}).
    \end{eqnarray*}
    The first two terms equal the negative Kullback-Leibler divergence $- \mathbf{E}_q \log [q(\mathbf{Z}) / p(\mathbf{Z} \mid \mathbf{y}, \theta^{(t)})]$, so $F(\theta^{(t)}, q)$ is maximized by the conditional pdf $q(\mathbf{z}) = p(\mathbf{z} \mid \mathbf{y}, \theta^{(t)})$.
    * M step: substitute $q(\mathbf{z}) = p(\mathbf{z} \mid \mathbf{y}, \theta^{(t)})$ into $F$ and maximize over $\theta$; since $H(q)$ does not depend on $\theta$, this is the usual M step of maximizing $Q(\theta \mid \theta^{(t)})$.

## Incremental EM

* Assume iid observations.
Then the objective function is
\begin{eqnarray*}
    F(\theta,q) = \sum_i \left[ \mathbf{E}_{q_i} \log p(\mathbf{Z}_i,\mathbf{y}_i \mid \theta) + H(q_i) \right],
\end{eqnarray*}
where we search for $q$ in the factored form $q(\mathbf{z}) = \prod_i q_i (\mathbf{z}_i)$.

* Maximizing $F$ over $q$ is equivalent to maximizing each data point's contribution with respect to its $q_i$.

* Update $\theta$ by visiting the data items **sequentially** rather than through a global E step.

* Finite mixture example: for each data point $i$,
    * evaluate the membership variables $w_{ij}$, $j=1,\ldots,k$;
    * update the parameter values $(\pi^{(t+1)}, \mu_j^{(t+1)}, \Omega_j^{(t+1)})$ (we only need to keep the sufficient statistics!).

* Faster convergence than vanilla EM in some examples.
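Below is a minimal Julia sketch of this incremental scheme for the Gaussian mixture, maintaining the sufficient statistics $\sum_i w_{ij}$, $\sum_i w_{ij} \mathbf{y}_i$, and $\sum_i w_{ij} \mathbf{y}_i \mathbf{y}_i^T$ and refreshing one data point at a time. The function and argument names are illustrative, not from the notes; the `logmvnormal` helper repeats the one in the mixture sketch above so that this block is self-contained.

```julia
using LinearAlgebra

# log N(μ, Ω) density at y (same helper as in the mixture sketch above)
function logmvnormal(y, μ, Ω)
    C = cholesky(Symmetric(Ω))
    r = C.L \ (y - μ)
    -0.5 * (length(y) * log(2π) + logdet(C) + dot(r, r))
end

# responsibilities w_{ij} for a single data point y given current parameters
function responsibilities(y, p, μ, Ω)
    logw = [log(p[j]) + logmvnormal(y, μ[j], Ω[j]) for j in eachindex(p)]
    w = exp.(logw .- maximum(logw))
    w / sum(w)
end

# Incremental EM for a k-component Gaussian mixture (a didactic sketch).
# Y: d × n data; W: n × k matrix of current responsibilities, e.g., from one
# full E step of the vanilla EM; p, μ, Ω are mutated in place.
function incremental_em!(Y, W, p, μ, Ω; sweeps = 10)
    d, n = size(Y)
    k = length(p)
    # sufficient statistics: s0[j] = Σ_i w_ij, s1[j] = Σ_i w_ij y_i,
    # s2[j] = Σ_i w_ij y_i y_iᵀ
    s0 = vec(sum(W, dims = 1))
    s1 = [Y * W[:, j] for j in 1:k]
    s2 = [(Y .* W[:, j]') * Y' for j in 1:k]
    for sweep in 1:sweeps, i in 1:n
        y = Y[:, i]
        wnew = responsibilities(y, p, μ, Ω)   # incremental E step for point i
        for j in 1:k                          # swap point i's old contribution for the new one
            Δ = wnew[j] - W[i, j]
            s0[j] += Δ
            s1[j] += Δ * y
            s2[j] += Δ * y * y'
        end
        W[i, :] = wnew
        for j in 1:k                          # M step directly from the sufficient statistics
            p[j] = s0[j] / n
            μ[j] = s1[j] / s0[j]
            Ω[j] = s2[j] / s0[j] - μ[j] * μ[j]'
        end
    end
    return p, μ, Ω
end
```

Because the parameters are refreshed after every visited observation rather than once per full pass over the data, this scheme can make progress faster than vanilla EM, consistent with the remark above.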