{ "cells": [ { "cell_type": "markdown", "metadata": { "toc": "true" }, "source": [ "# Table of Contents\n", "

1  EM Algorithm (KL Chapter 13)
1.1  Which statistical papers are most cited?
1.2  EM algorithm
1.3  A canonical EM example: finite mixture models
1.4  Generalizations of EM - handle difficulties in M step
1.4.1  Expectation Conditional Maximization (ECM)
1.4.2  ECM Either (ECME)
1.4.3  Alternating ECM (AECM)
1.4.4  PX-EM and efficient data augmentation
1.4.5  Example: t distribution
1.4.5.1  EM
1.4.5.2  ECM and ECME
1.4.5.3  Efficient data augmentation
1.5  Generalizations of EM - handle difficulties in E step
1.5.1  Monte Carlo EM (MCEM)
1.5.2  Stochastic EM (SEM) and Simulated Annealing EM (SAEM)
1.5.3  Data-augmentation (DA) algorithm
1.6  EM as a maximization-maximization procedure
1.7  Incremental EM
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# EM Algorithm (KL Chapter 13)\n", "\n", "## Which statistical papers are most cited?\n", "\n", "| Paper | Citations (5/20/2019) | Per Year |\n", "|------------------------------------------|-----------:|----------:|\n", "| Kaplan-Meier ([Kaplan and Meier, 1958](https://scholar.google.com/scholar?q=Nonparametric+estimation+from+incomplete+observations&hl=en)) | 55672 | 913 |\n", "| EM algorithm ([Dempster et al., 1977](https://scholar.google.com/scholar?q=Maximum+likelihood+from+incomplete+data+via+the+EM+algorithm&hl=en)) | 55195 | **1314** |\n", "| FDR ([Benjamini and Hochberg, 1995](https://scholar.google.com/scholar?q=Controlling+the+false+discovery+rate)) | 54787 | **2283** |\n", "| Cox model ([Cox, 1972](https://scholar.google.com/scholar?q=Regression+models+and+life-tables&hl=en)) | 49275 | 1048 |\n", "| Metropolis algorithm ([Metropolis et al., 1953](https://scholar.google.com/scholar?q=Equation+of+state+calculations+by+fast+computing+machines)) | 39181 | 594 |\n", "| lasso ([Tibshrani, 1996](https://scholar.google.com/scholar?q=Regression+shrinkage+and+selection+via+the+lasso)) | 28074 | **1221** |\n", "| Unit root test ([Dickey and Fuller, 1979](https://scholar.google.com/scholar?q=Distribution+of+the+estimators+for+autoregressive+time+series+with+a+unit+root)) | 24947 | 624 |\n", "| bootstrap ([Efron, 1979](https://scholar.google.com/scholar?q=Bootstrap+methods%3A+another+look+at+the+jackknife)) | 17626 | 441 |\n", "| FFT ([Cooley and Tukey, 1965](https://scholar.google.com/scholar?q=An+algorithm+for+the+machine+calculation+of+complex+Fourier+series)) | 14635 | 271 |\n", "| Gibbs sampler ([Gelfand and Smith, 1990](https://scholar.google.com/scholar?q=Sampling-based+approaches+to+calculating+marginal+densities)) | 7884 | 272 |\n", "\n", "* Citation counts from Google Scholar on May 20, 2019.\n", "\n", "* EM is one of the most influential statistical ideas, finding applications in various branches of science.\n", "\n", "* History: Landmark paper [Dempster et al., 1977](https://scholar.google.com/scholar?q=Maximum+likelihood+from+incomplete+data+via+the+EM+algorithm&hl=en) on EM algorithm. 
"The same idea appears earlier in parameter estimation for HMMs (the Baum-Welch algorithm) by [Baum et al., 1970](http://www.jstor.org/stable/2239727?seq=1#page_scan_tab_contents).\n", "\n", "## EM algorithm\n", "\n", "* Goal: maximize the log-likelihood of the observed data $\\ln g(\\mathbf{y}|\\theta)$ (optimization!)\n", "\n", "* Notation: \n", " * $\\mathbf{Y}$: observed data\n", " * $\\mathbf{Z}$: missing data\n", "\n", "* Idea: choose the missing data $\\mathbf{Z}$ such that the MLE for the complete data is easy.\n", "\n", "* Let $f(\\mathbf{y},\\mathbf{z} | \\theta)$ be the density of the complete data.\n", "\n", "* Iterative two-step procedure (a minimal Julia sketch of this loop is given at the end of this section):\n", " * **E step**: calculate the conditional expectation\n", "$$\n", "\tQ(\\theta|\\theta^{(t)}) = \\mathbf{E}_{\\mathbf{Z}|\\mathbf{Y}=\\mathbf{y},\\theta^{(t)}} [ \\ln f(\\mathbf{Y},\\mathbf{Z}|\\theta) \\mid \\mathbf{Y} = \\mathbf{y}, \\theta^{(t)}]\n", "$$\n", " * **M step**: maximize $Q(\\theta|\\theta^{(t)})$ to generate the next iterate \n", "$$\n", " \\theta^{(t+1)} = \\text{argmax}_{\\theta} \\, Q(\\theta|\\theta^{(t)})\n", "$$\n", "\n", "* (**Ascent property** of the EM algorithm) By the information inequality,\n", "$$\n", "\\begin{eqnarray*}\n", "\t& & Q(\\theta \\mid \\theta^{(t)}) - \\ln g(\\mathbf{y}|\\theta) \\\\\n", "\t&=& \\mathbf{E} [\\ln f(\\mathbf{Y},\\mathbf{Z}|\\theta) | \\mathbf{Y} = \\mathbf{y}, \\theta^{(t)}] - \\ln g(\\mathbf{y}|\\theta) \\\\\n", "\t&=& \\mathbf{E} \\left\\{ \\ln \\left[ \\frac{f(\\mathbf{Y}, \\mathbf{Z} \\mid \\theta)}{g(\\mathbf{Y} \\mid \\theta)} \\right] \\mid \\mathbf{Y} = \\mathbf{y}, \\theta^{(t)} \\right\\} \\\\\n", "\t&\\le& \\mathbf{E} \\left\\{ \\ln \\left[ \\frac{f(\\mathbf{Y}, \\mathbf{Z} \\mid \\theta^{(t)})}{g(\\mathbf{Y} \\mid \\theta^{(t)})} \\right] \\mid \\mathbf{Y} = \\mathbf{y}, \\theta^{(t)} \\right\\} \\\\\n", "\t&=& Q(\\theta^{(t)} \\mid \\theta^{(t)}) - \\ln g(\\mathbf{y} |\\theta^{(t)}).\n", "\\end{eqnarray*}\n", "$$\n", "Rearranging gives the **fundamental inequality of EM**\n", "$$\n", "\\begin{eqnarray*}\n", "\t\\ln g(\\mathbf{y} \\mid \\theta) \\ge Q(\\theta \\mid \\theta^{(t)}) - Q(\\theta^{(t)} \\mid \\theta^{(t)}) + \\ln g(\\mathbf{y} \\mid \\theta^{(t)})\n", "\\end{eqnarray*}\n", "$$\n", "for all $\\theta$; in particular\n", "$$\n", "\\begin{eqnarray*}\n", "\t\\ln g(\\mathbf{y} \\mid \\theta^{(t+1)}) &\\ge& Q(\\theta^{(t+1)} \\mid \\theta^{(t)}) - Q(\\theta^{(t)} \\mid \\theta^{(t)}) + \\ln g(\\mathbf{y} \\mid \\theta^{(t)}) \\\\\n", "\t&\\ge& \\ln g(\\mathbf{y} \\mid \\theta^{(t)}).\n", "\\end{eqnarray*}\n", "$$\n", "Note that we only need \n", "$$\n", " Q(\\theta^{(t+1)} \\mid \\theta^{(t)}) - Q(\\theta^{(t)} \\mid \\theta^{(t)}) \\ge 0\n", "$$\n", "for this ascent property to hold (**generalized EM**).\n", "\n", "* Intuition? Keep these pictures in mind\n", "\n", "\n", "\n", "\n", "\n", "* Under mild conditions, $\\theta^{(t)}$ converges to a stationary point of $\\ln g(\\mathbf{y} \\mid \\theta)$.\n", "\n", "* Numerous applications of EM: \n", "finite mixture models, HMM (the Baum-Welch algorithm), factor analysis, variance components models a.k.a. linear mixed models (LMM), hyperparameter estimation in the empirical Bayes procedure $\\max_{\\alpha} \\int f(\\mathbf{y}|\\theta) \\pi(\\theta|\\alpha) \\, d \\theta$, missing data, grouped/censored/truncated models, ...\n", "\n", "See Chapter 13 of [Numerical Analysis for Statisticians](http://ucla.worldcat.org/title/numerical-analysis-for-statisticians/oclc/793808354&referer=brief_results) by Kenneth Lange (2010) for more applications of EM."
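, "\n", "\n", "As a minimal illustration (not part of KL), the generic EM iteration can be written as a short Julia skeleton. The callbacks `estep`, `mstep`, and `loglik` below are hypothetical, model-specific functions: `estep` returns whatever conditional expectations the M step needs, `mstep` returns the maximizer of $Q(\\theta \\mid \\theta^{(t)})$, and `loglik` evaluates $\\ln g(\\mathbf{y} \\mid \\theta)$ for monitoring the ascent property.\n", "\n", "```julia\n", "# Generic EM skeleton (sketch only); estep, mstep, loglik are user-supplied callbacks:\n", "#   estep(θ)     -> conditional expectations (sufficient statistics) needed by the M step\n", "#   mstep(stats) -> new θ maximizing Q(θ | θ(t)), i.e. the complete-data MLE from stats\n", "#   loglik(θ)    -> observed-data log-likelihood ln g(y | θ), used only for monitoring\n", "function em(θ, estep, mstep, loglik; maxiter = 500, tol = 1e-8)\n", "    obj = loglik(θ)\n", "    for t in 1:maxiter\n", "        θ = mstep(estep(θ))              # E step followed by M step\n", "        objnew = loglik(θ)\n", "        @assert objnew > obj - 1e-10     # ascent property (up to roundoff)\n", "        converged = abs(objnew - obj) < tol * (abs(obj) + 1)\n", "        obj = objnew\n", "        converged && break\n", "    end\n", "    return θ, obj\n", "end\n", "```\n", "\n", "For the finite mixture model of the next section, for example, `estep` would compute the membership weights $w_{ij}^{(t)}$ and `mstep` the weighted multinomial/normal MLEs."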
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## A canonical EM example: finite mixture models\n", "\n", "\n", "\n", "* Consider Gaussian finite mixture models with density \n", "$$\n", "h(\\mathbf{y}) = \\sum_{j=1}^k \\pi_j h_j(\\mathbf{y} \\mid \\mu_j, \\Omega_j), \\quad \\mathbf{y} \\in \\mathbf{R}^{d},\n", "$$\n", "where\n", "$$\n", "\th_j(\\mathbf{y} \\mid \\mu_j, \\Omega_j) = \\left( \\frac{1}{2\\pi} \\right)^{d/2} \\det(\\Omega_j)^{-1/2} e^{- \\frac 12 (\\mathbf{y} - \\mu_j)^T \\Omega_j^{-1} (\\mathbf{y} - \\mu_j)}\n", "$$ \n", "are densities of multivariate normals $N_d(\\mu_j, \\Omega_j)$. \n", "\n", "* Given iid data points $\\mathbf{y}_1,\\ldots,\\mathbf{y}_n$, want to estimate parameters\n", "$$\n", "\\theta = (\\pi_1, \\ldots, \\pi_k, \\mu_1, \\ldots, \\mu_k, \\Omega_1,\\ldots,\\Omega_k)\n", "$$\n", "subject to constraint $\\pi_j \\ge 0, \\sum_{j=1}^d \\pi_j = 1, \\Omega_j \\succeq \\mathbf{0}$. \n", "\n", "* (Incomplete) data log-likelihood is \n", "\\begin{eqnarray*}\n", " \\ln g(\\mathbf{y}_1,\\ldots,\\mathbf{y}_n|\\theta) &=& \\sum_{i=1}^n \\ln h(\\mathbf{y}_i) = \\sum_{i=1}^n \\ln \\sum_{j=1}^k \\pi_j h_j(\\mathbf{y}_i \\mid \\mu_j, \\Omega_j) \\\\\n", " &=& \\sum_{i=1}^n \\ln \\left[ \\sum_{j=1}^k \\pi_j \\left( \\frac{1}{2\\pi} \\right)^{d/2} \\det(\\Omega_j)^{-1/2} e^{- \\frac 12 (\\mathbf{y}_i - \\mu_j)^T \\Omega_j^{-1} (\\mathbf{y}_i - \\mu_j)} \\right].\n", "\\end{eqnarray*}\n", "\n", "* Let $z_{ij} = I \\{\\mathbf{y}_i$ comes from group $j \\}$ be the missing data. The complete data likelihood is\n", "$$\n", "\\begin{eqnarray*}\n", "\tf(\\mathbf{y},\\mathbf{z} | \\theta) = \\prod_{i=1}^n \\prod_{j=1}^k [\\pi_j h_j(\\mathbf{y}_i|\\mu_j,\\Omega_j)]^{z_{ij}}\n", "\\end{eqnarray*}\n", "$$\n", "and thus the complete log-likelihood is\n", "$$\n", "\\begin{eqnarray*}\n", "\t\\ln f(\\mathbf{y},\\mathbf{z} | \\theta) = \\sum_{i=1}^n \\sum_{j=1}^k z_{ij} [\\ln \\pi_j + \\ln h_j(\\mathbf{y}_i|\\mu_j,\\Omega_j)].\n", "\\end{eqnarray*}\n", "$$\n", "\n", "* **E step**: need to evaluate conditional expectation\n", "$$\n", "\\begin{eqnarray*}\n", "\t& & Q(\\theta|\\theta^{(t)})\n", "\t= \\mathbf{E} \\left\\{ \\sum_{i=1}^n \\sum_{j=1}^k z_{ij} [\\ln \\pi_j + \\ln h_j(\\mathbf{y}_i|\\mu_j,\\Omega_j)] \\mid \\mathbf{Y}=\\mathbf{y}, \\pi^{(t)}, \\mu_1^{(t)}, \\ldots, \\mu_k^{(t)}, \\Omega_1^{(t)}, \\ldots, \\Omega_k^{(t)} ] \\right\\}.\n", "\\end{eqnarray*}\n", "$$\n", "By Bayes rule, we have\n", "$$\n", "\\begin{eqnarray*}\n", "\tw_{ij}^{(t)} &:=& \\mathbf{E} [z_{ij} \\mid \\mathbf{y}, \\pi^{(t)}, \\mu_1^{(t)}, \\ldots, \\mu_k^{(t)}, \\Omega_1^{(t)}, \\ldots, \\Omega_k^{(t)}] \\\\\n", "\t&=& \\frac{\\pi_j^{(t)} h_j(\\mathbf{y}_i|\\mu_j^{(t)},\\Omega_j^{(t)})}{\\sum_{j'=1}^k \\pi_{j'}^{(t)} h_{j'}(\\mathbf{y}_i|\\mu_{j'}^{(t)},\\Omega_{j'}^{(t)})}.\n", "\\end{eqnarray*}\n", "$$\n", "So the Q function becomes\n", "$$\n", "\\begin{eqnarray*}\n", "\t& & Q(\\theta|\\theta^{(t)})\n", "\t= \\sum_{i=1}^n \\sum_{j=1}^k w_{ij}^{(t)} \\ln \\pi_j + \\sum_{i=1}^n \\sum_{j=1}^k w_{ij}^{(t)} \\left[ - \\frac{1}{2} \\ln \\det \\Omega_j - \\frac{1}{2} (\\mathbf{y}_i - \\mu_j)^T \\Omega_j^{-1} (\\mathbf{y}_i - \\mu_j) \\right].\n", "\\end{eqnarray*}\n", "$$\n", "\n", "* **M step**: maximizer of the Q function gives the next iterate\n", "$$\n", "\\begin{eqnarray*}\n", "\t\\pi_j^{(t+1)} &=& \\frac{\\sum_i w_{ij}^{(t)}}{n} \\\\\n", "\t\\mu_j^{(t+1)} &=& \\frac{\\sum_{i=1}^n w_{ij}^{(t)} \\mathbf{y}_i}{\\sum_{i=1}^n w_{ij}^{(t)}} \\\\\n", "\t\\Omega_j^{(t+1)} &=& \\frac{\\sum_{i=1}^n 
"See KL Example 11.3.1 for the multinomial MLE and KL Example 11.2.3 for the multivariate normal MLE. A Julia sketch of one E/M iteration for this model is given just before the $t$-distribution example below.\n", "\n", "* Compare these extremely simple updates to Newton-type algorithms!\n", "\n", "* Also note the ease of parallel computing with this EM algorithm. See, e.g., \n", "**Suchard, M. A.**; Wang, Q.; Chan, C.; Frelinger, J.; Cron, A.; West, M. Understanding GPU programming for statistical computation: studies in massively parallel massive mixtures. _Journal of Computational and Graphical Statistics_, 2010, 19, 419-438." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generalizations of EM - handle difficulties in M step\n", "\n", "### Expectation Conditional Maximization (ECM) \n", "\n", "* [Meng and Rubin (1993)](https://scholar.google.com/scholar?q=Maximum+likelihood+estimation+via+the+ECM+algorithm:+a+general+framework&hl=en&as_sdt=0&as_vis=1&oi=scholart).\n", "\n", "* In some problems the M step is difficult (no analytic solution).\n", "\n", "* Conditional maximization is easy (block ascent).\n", " * Partition the parameter vector into blocks $\\theta = (\\theta_1, \\ldots, \\theta_B)$. \n", " * Alternately update $\\theta_b$, $b=1,\\ldots,B$.\n", "\n", "\n", "\n", "* Ascent property still holds. Why?\n", "\n", "* ECM may converge more slowly than EM (more iterations), but the total computing time may be shorter because each CM step is cheap." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### ECM Either (ECME)\n", "\n", "* [Liu and Rubin (1994)](https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&as_vis=1&q=The+ECME+algorithm%3A+a+simple+extension+of+EM+and+ECM+with+faster+monotone+convergence&btnG=).\n", "\n", "* Each CM step maximizes either the $Q$ function or the original observed-data log-likelihood.\n", "\n", "* Ascent property still holds. Why?\n", "\n", "* Faster convergence than ECM." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Alternating ECM (AECM)\n", "\n", "* [Meng and van Dyk (1997)](https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=The+EM+algorithm+an+old+folk-song+sung+to+a+fast+new+tune&btnG=).\n", "\n", "* The specification of the complete data is allowed to differ at each CM step.\n", "\n", "* Ascent property still holds. Why?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### PX-EM and efficient data augmentation\n", "\n", "* Parameter-Expanded EM. [Liu, Rubin, and Wu (1998)](https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=Parameter+expansion+to+accelerate+EM%3A+the+px-em+algorithm&btnG=), [Liu and Wu (1999)](https://scholar.google.com/scholar?cluster=15342818142955984734&hl=en&as_sdt=0,5).\n", "\n", "* Efficient data augmentation. [Meng and van Dyk (1997)](https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=The+EM+algorithm+an+old+folk-song+sung+to+a+fast+new+tune&btnG=).\n", "\n", "* Idea: speed up the convergence of the EM algorithm by augmenting the observed data efficiently, i.e., by introducing a working parameter into the specification of the complete data."
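, "\n", "\n", "Before turning to the $t$-distribution example, here is a minimal Julia sketch (not part of KL) of one EM iteration for the Gaussian finite mixture model of the canonical example above: the E step computes the membership weights $w_{ij}^{(t)}$ by Bayes' rule and the M step applies the closed-form updates for $(\\pi_j, \\mu_j, \\Omega_j)$. The function name `gmm_em_step` and the data layout (a $d \\times n$ matrix `Y`) are illustrative choices, not a fixed API.\n", "\n", "```julia\n", "using LinearAlgebra\n", "\n", "# One EM iteration for a d-variate Gaussian mixture with k components (sketch).\n", "# Y is a d×n data matrix; prob, mu, Om hold the current mixing proportions,\n", "# mean vectors, and covariance matrices. Returns the updated (prob, mu, Om).\n", "function gmm_em_step(Y, prob, mu, Om)\n", "    d, n = size(Y)\n", "    k = length(prob)\n", "    W = zeros(n, k)                        # membership weights w_ij\n", "    for j in 1:k\n", "        C = cholesky(Symmetric(Om[j]))\n", "        for i in 1:n\n", "            r = C.L \\ (Y[:, i] - mu[j])    # whitened residual\n", "            logh = -0.5 * d * log(2π) - sum(log, diag(C.L)) - 0.5 * dot(r, r)\n", "            W[i, j] = prob[j] * exp(logh)  # π_j h_j(y_i | μ_j, Ω_j)\n", "        end\n", "    end\n", "    W ./= sum(W, dims = 2)                 # E step: normalize over components\n", "    probnew = vec(sum(W, dims = 1)) ./ n   # M step: mixing proportions\n", "    munew = [Y * W[:, j] ./ sum(W[:, j]) for j in 1:k]\n", "    Omnew = [((Y .- munew[j]) .* W[:, j]') * (Y .- munew[j])' ./ sum(W[:, j]) for j in 1:k]\n", "    return probnew, munew, Omnew\n", "end\n", "```\n", "\n", "Iterating `gmm_em_step` from any valid starting value gives the monotone increase of the observed-data log-likelihood guaranteed by the ascent property; in a production implementation the weights would be computed on the log scale (log-sum-exp) to avoid underflow."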
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Example: t distribution\n", "\n", "#### EM\n", "\n", "* $\\mathbf{W} \\in \\mathbb{R}^{p}$ is a multivariate $t$-distribution $t_p(\\mu,\\Sigma,\\nu)$ if $\\mathbf{W} \\sim N(\\mu, \\Sigma/u)$ and $u \\sim \\text{gamma}\\left(\\frac{\\nu}{2}, \\frac{\\nu}{2}\\right)$.\n", "\n", "* Widely used for robust modeling.\n", "\n", "* Recall Gamma($\\alpha,\\beta$) has density\n", "\\begin{eqnarray*}\n", "\tf(u \\mid \\alpha,\\beta) = \\frac{\\beta^{\\alpha} u^{\\alpha-1}}{\\Gamma(\\alpha)} e^{-\\beta u}, \\quad u \\ge 0,\n", "\\end{eqnarray*}\n", "with mean $\\mathbf{E}(U)=\\alpha/\\beta$, and $\\mathbf{E}(\\log U) = \\psi(\\alpha) - \\log \\beta$.\n", "\n", "* Density of $\\mathbf{W}$ is\n", "\\begin{eqnarray*}\n", "\tf_p(\\mathbf{w} \\mid \\mu, \\Sigma, \\nu) = \\frac{\\Gamma \\left(\\frac{\\nu+p}{2}\\right) \\text{det}(\\Sigma)^{-1/2}}{(\\pi \\nu)^{p/2} \\Gamma(\\nu/2) [1 + \\delta (\\mathbf{w},\\mu;\\Sigma)/\\nu]^{(\\nu+p)/2}},\n", "\\end{eqnarray*}\n", "where\n", "\\begin{eqnarray*}\n", " \\delta (\\mathbf{w},\\mu;\\Sigma) = (\\mathbf{w} - \\mu)^T \\Sigma^{-1} (\\mathbf{w} - \\mu)\n", "\\end{eqnarray*}\n", "is the Mahalanobis squared distance between $\\mathbf{w}$ and $\\mu$.\n", "\n", "* Given iid data $\\mathbf{w}_1, \\ldots, \\mathbf{w}_n$, the log-likelihood is\n", "\\begin{eqnarray*}\n", "\tL(\\mu, \\Sigma, \\nu) &=& - \\frac{np}{2} \\log (\\pi \\nu) + n \\left[ \\log \\Gamma \\left(\\frac{\\nu+p}{2}\\right) - \\log \\Gamma \\left(\\frac{\\nu}{2}\\right) \\right]\t\\\\\n", "\t&& - \\frac n2 \\log \\det \\Sigma + \\frac n2 (\\nu + p) \\log \\nu \\\\\n", "\t&& - \\frac{\\nu + p}{2} \\sum_{j=1}^n \\log [\\nu + \\delta (\\mathbf{w}_j,\\mu;\\Sigma) ]\n", "\\end{eqnarray*}\n", "How to compute the MLE $(\\hat \\mu, \\hat \\Sigma, \\hat \\nu)$?\n", "\n", "* $\\mathbf{W}_j | u_j$ independent $N(\\mu,\\Sigma/u_j)$ and $U_j$ iid gamma $\\left(\\frac{\\nu}{2},\\frac{\\nu}{2}\\right)$.\n", "\n", "* Missing data: $\\mathbf{z}=(u_1,\\ldots,u_n)^T$.\n", "\n", "* Log-likelihood of complete data is\n", "\\begin{eqnarray*}\n", "\tL_c (\\mu, \\Sigma, \\nu) &=& - \\frac{np}{2} \\ln (2\\pi) - \\frac{n}{2} \\log \\det \\Sigma \\\\\n", "\t& & - \\frac 12 \\sum_{j=1}^n u_j (\\mathbf{w}_j - \\mu)^T \\Sigma^{-1} (\\mathbf{w}_j - \\mu) \\\\\n", "\t& & - n \\log \\Gamma \\left(\\frac{\\nu}{2}\\right) + \\frac{n \\nu}{2} \\log \\left(\\frac{\\nu}{2}\\right) \\\\\n", "\t& & + \\frac{\\nu}{2} \\sum_{j=1}^n (\\log u_j - u_j) - \\sum_{j=1}^n \\log u_j.\n", "\\end{eqnarray*}\n", "\n", "* Since gamma is the conjugate prior for normal-gamma model, conditional distribution of $U$ given $\\mathbf{W} = \\mathbf{w}$ is gamma$((\\nu+p)/2, (\\nu + \\delta(\\mathbf{w},\\mu; \\Sigma))/2)$. 
Thus\n", "\\begin{eqnarray*}\n", "\t\\mathbf{E} (U_j \\mid \\mathbf{w}_j, \\mu^{(t)},\\Sigma^{(t)},\\nu^{(t)}) &=& \\frac{\\nu^{(t)} + p}{\\nu^{(t)} + \\delta(\\mathbf{w}_j,\\mu^{(t)};\\Sigma^{(t)})}\t =: u_j^{(t)} \\\\\n", "\t\\mathbf{E} (\\log U_j \\mid \\mathbf{w}_j, \\mu^{(t)},\\Sigma^{(t)},\\nu^{(t)}) &=& \\log u_j^{(t)} + \\left[ \\psi \\left(\\frac{\\nu^{(t)}+p}{2}\\right) - \\log \\left(\\frac{\\nu^{(t)}+p}{2}\\right) \\right].\n", "\\end{eqnarray*}\n", "\n", "* Overall $Q$ function (up to an additive constant) takes the form\n", "\\begin{eqnarray*}\n", "\t%Q(\\mubf,\\Sigmabf,\\nu \\mid \\mubf^{(t)},\\Sigmabf^{(t)},\\nu^{(t)}) = \n", "\t& & - \\frac n2 \\log \\det \\Sigma - \\frac 12 \\sum_{j=1}^n u_j^{(t)} (\\mathbf{w}_j - \\mu)^T \\Sigma^{-1} (\\mathbf{w}_j - \\mu) \\\\\n", "\t& & - n \\log \\Gamma \\left(\\frac{\\nu}{2}\\right) + \\frac{n \\nu}{2} \\log \\left(\\frac{\\nu}{2}\\right) \\\\\n", "\t& & + \\frac{n \\nu}{2} \\left[ \\frac 1n \\sum_{j=1}^n \\left(\\log u_j^{(t)} - u_j^{(t)}\\right) + \\psi\\left(\\frac{\\nu^{(t)}+p}{2}\\right) - \\log \\left(\\frac{\\nu^{(t)}+p}{2}\\right) \\right].\n", "\\end{eqnarray*}\n", "\n", "* Maximization over $(\\mu, \\Sigma)$ is simply a weighted multivariate normal problem\n", "\\begin{eqnarray*}\n", "\t\\mu^{(t+1)} &=& \\frac{\\sum_{j=1}^n u_j^{(t)} \\mathbf{w}_j}{\\sum_{j=1}^n u_j^{(t)}} \\\\\n", "\t\\Sigma^{(t+1)} &=& \\frac 1n \\sum_{j=1}^n u_j^{(t)} (\\mathbf{w}_j - \\mu^{(t+1)}) (\\mathbf{w}_j - \\mu^{(t+1)})^T.\n", "\\end{eqnarray*}\n", "Notice down-weighting of outliers is obvious in the update.\n", "\n", "* Maximization over $\\nu$ is a univariate problem - Brent method, golden section, or bisection." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### ECM and ECME\n", "\n", "Partition parameter $(\\mu,\\Sigma,\\nu)$ into two blocks $(\\mu,\\Sigma)$ and $\\nu$: \n", "\n", "* ECM = EM for this example. Why?\n", "\n", "* ECME: In the second CM step, maximize over $\\nu$ in terms of the original log-likelihood function instead of the $Q$ function. They have similar difficulty since both are univariate optimization problems!\n", "\n", "* An example from [Liu and Rubin (1994)](https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&as_vis=1&q=The+ECME+algorithm%3A+a+simple+extension+of+EM+and+ECM+with+faster+monotone+convergence&btnG=). $n=79$, $p=2$, with missing entries. EM=ECM: solid line. 
"\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Efficient data augmentation\n", "\n", "Assume $\\nu$ known:\n", "\n", "* Write $\\mathbf{W}_j = \\mu + \\mathbf{C}_j / U_j^{1/2}$, where $\\mathbf{C}_j \\sim N(\\mathbf{0},\\Sigma)$ is independent of $U_j \\sim \\text{gamma}\\left(\\frac{\\nu}{2},\\frac{\\nu}{2}\\right)$.\n", "\n", "* Equivalently, $\\mathbf{W}_j = \\mu + \\det (\\Sigma)^{-a/2} \\mathbf{C}_j / U_j(a)^{1/2}$, where $U_j(a) = \\det(\\Sigma)^{-a} U_j$.\n", "\n", "* The complete data are then $(\\mathbf{w}, \\mathbf{z}(a))$, with $a$ the working parameter.\n", "\n", "* $a=0$ corresponds to the vanilla EM.\n", "\n", "* Meng and van Dyk (1997) recommend using $a_{\\text{opt}} = 1/(\\nu+p)$ to maximize the convergence rate.\n", "\n", "* Exercise: work out the EM update for this special case.\n", "\n", "* The only change to the vanilla EM is to replace the denominator $n$ in the update of $\\Sigma$ by $\\sum_{j=1}^n u_j^{(t)}$.\n", "\n", "* PX-EM (Liu, Rubin, and Wu, 1998) leads to the same update.\n", "\n", "\n", "\n", "Assume $\\nu$ unknown:\n", "\n", "* Version 1: use $a=a_{\\text{opt}}$ in updating both $(\\mu,\\Sigma)$ and $\\nu$. \n", "\n", "* Version 2: use $a=a_{\\text{opt}}$ for updating $(\\mu,\\Sigma)$ and take the observed data as the complete data for updating $\\nu$.\n", "\n", "Conclusion in Meng and van Dyk (1997): Version 1 is roughly 8 to 12 times faster than EM = ECM or ECME, and Version 2 is only slightly more efficient than Version 1." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generalizations of EM - handle difficulties in E step\n", "\n", "### Monte Carlo EM (MCEM)\n", "\n", "* [Wei and Tanner (1990)](https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=A+Monte+Carlo+implementation+of+the+EM+algorithm+and+the+poor+man’s+data+augmentation+algorithms&btnG=).\n", "\n", "* Hard to calculate the $Q$ function? Simulate!\n", "\\begin{eqnarray*}\n", "\tQ(\\theta \\mid \\theta^{(t)}) \\approx \\frac 1m \\sum_{j=1}^m \\ln f(\\mathbf{y}, \\mathbf{z}_j \\mid \\theta),\n", "\\end{eqnarray*}\n", "where the $\\mathbf{z}_j$ are iid draws from the conditional distribution of the missing data given $\\mathbf{y}$ and the previous iterate $\\theta^{(t)}$. (A toy Julia illustration is given after the SAEM subsection below.)\n", "\n", "* The ascent property may be lost due to Monte Carlo error.\n", "\n", "* Examples: capture-recapture models, generalized linear mixed models (GLMMs)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Stochastic EM (SEM) and Simulated Annealing EM (SAEM)\n", "\n", "SEM\n", "\n", "* Celeux and Diebolt (1985).\n", "\n", "* Same as MCEM with $m=1$: a single draw of the missing data $\\mathbf{z}$ from the conditional distribution.\n", "\n", "* $\\theta^{(t)}$ forms a Markov chain that converges to a stationary distribution. There is no definite relation between this stationary distribution and the MLE.\n", "\n", "* In some specific cases, it can be shown that the stationary distribution concentrates around the MLE with a variance inversely proportional to the sample size.\n", "\n", "* Advantage: it can escape the attraction of inferior modes/saddle points in some mixture model problems.\n", "\n", "SAEM\n", "\n", "* Celeux and Diebolt (1989).\n", "\n", "* Increase $m$ with the iterations, ending up with an EM algorithm."
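, "\n", "\n", "As a toy illustration of the Monte Carlo E step referenced in the MCEM subsection above (this example is not from KL or the cited papers), consider right-censored exponential data: the complete data are $x_1, \\ldots, x_n$ iid exponential with rate $\\lambda$, but we only observe $y_i = \\min(x_i, c)$ and a censoring indicator. Given $\\lambda^{(t)}$, the unobserved part of a censored lifetime satisfies $X_i - y_i \\sim \\text{Exponential}(\\lambda^{(t)})$ by memorylessness, and the complete-data MLE is $n / \\sum_i x_i$, so the MCEM update simply replaces each censored $x_i$ by a Monte Carlo average. (The exact E step, $y_i + 1/\\lambda^{(t)}$, is available here, which makes it easy to check the Monte Carlo version.) The function name `mcem_exp` is hypothetical.\n", "\n", "```julia\n", "using Random\n", "\n", "# Monte Carlo EM for right-censored Exponential(rate λ) data: a toy sketch.\n", "# y[i] = min(x[i], c) is the observed time; obs[i] is true when the event was observed.\n", "function mcem_exp(y, obs; λ = 1.0, m = 100, iters = 50)\n", "    n = length(y)\n", "    for t in 1:iters\n", "        # Monte Carlo E step: for a censored unit, X - y ~ Exponential(λ) by\n", "        # memorylessness, so approximate E[X | X > y, λ] by an average of m draws.\n", "        xbar = [obs[i] ? y[i] : y[i] + sum(randexp(m)) / (m * λ) for i in 1:n]\n", "        λ = n / sum(xbar)                  # M step: complete-data MLE of the rate\n", "    end\n", "    return λ\n", "end\n", "\n", "# Quick check on simulated data: true rate 2, censoring at c = 0.5.\n", "Random.seed!(123)\n", "x = randexp(1000) ./ 2\n", "c = 0.5\n", "y = min.(x, c)\n", "obs = x .<= c\n", "println(mcem_exp(y, obs; m = 200, iters = 100))   # MCEM estimate\n", "println(sum(obs) / sum(y))                        # censored-data MLE, for comparison\n", "```\n", "\n", "With a fixed $m$ the iterates fluctuate around the censored-data MLE because of Monte Carlo error; increasing $m$ along the iterations (as in SAEM) or averaging the trailing iterates removes this noise."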
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Data-augmentation (DA) algorithm\n", "\n", "* [Tanner and Wong (1987)](https://scholar.google.com/scholar?q=The+calculation+of+posterior+distributions+by+data+augmentation&hl=en&as_sdt=0&as_vis=1&oi=scholart).\n", "\n", "* Aim for sampling from $p(\\theta|\\mathbf{y})$ instead of maximization.\n", "\n", "* Idea: incomplete data posterior density is complicated, but the complete-data posterior density is relatively easy to sample.\n", "\n", "* Data augmentation algorithm:\n", " * draw $\\mathbf{z}^{(t+1)}$ conditional on $(\\theta^{(t)},\\mathbf{y})$\n", " * draw $\\theta^{(t+1)}$ conditional on $(\\mathbf{z}^{(t+1)},\\mathbf{y})$\n", "\n", "* A special case of Gibbs sampler.\n", "\n", "* $\\theta^{(t)}$ converges to the distribution $p(\\theta|\\mathbf{y})$ under general conditions.\n", "\n", "* Ergodic mean converges to the posterior mean $\\mathbf{E}(\\theta|\\mathbf{y})$, which may perform better than MLE in finite sample." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## EM as a maximization-maximization procedure\n", "\n", "* [Neal and Hinton (1999)](https://doi.org/10.1007/978-94-011-5014-9_12).\n", "\n", "* Consider the objective function\n", "\\begin{eqnarray*}\n", "\tF(\\theta, q) = \\mathbf{E}_q [\\ln f(\\mathbf{Z},\\mathbf{y} \\mid \\theta)] + H(q)\n", "\\end{eqnarray*}\n", "over $\\Theta \\times {\\cal Q}$, where ${\\cal Q}$ is the set of all conditional pdfs of the missing data $\\{q(\\mathbf{z}) = p(\\mathbf{z} \\mid \\mathbf{y}, \\theta), \\theta \\in \\Theta \\}$ and $H(q) = -\\mathbf{E}_q \\log q$ is the entropy of $q$.\n", "\n", "* EM is essentially performing coordinate ascent for maximizing $F$\n", "\n", " * E step: At current iterate $\\theta^{(t)}$, \n", "\\begin{eqnarray*}\n", "\tF(\\theta^{(t)}, q) = \\mathbf{E}_q [\\log p(\\mathbf{Z} \\mid \\mathbf{y}, \\theta^{(t)})] - \\mathbf{E}_q \\log q + \\ln g(\\mathbf{y} \\mid \\theta^{(t)}).\n", "\\end{eqnarray*}\n", "The maximizing conditional pdf is given by $q = \\log f(\\mathbf{Z} \\mid \\mathbf{y}, \\theta^{(t)})$ \n", "\n", " * M step: Substitute $q = \\log f(\\mathbf{Z} \\mid \\mathbf{y}, \\theta^{(t)})$ into $F$ and maximize over $\\theta$\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Incremental EM\n", "\n", "* Assume iid observations. Then the objective function is\n", "\\begin{eqnarray*}\n", "\tF(\\theta,q) = \\sum_i \\left[ \\mathbf{E}_{q_i} \\log p(\\mathbf{Z}_i,\\mathbf{y}_i \\mid \\theta) + H(q_i) \\right],\n", "\\end{eqnarray*}\n", "where we search for $q$ under factored form $q(\\mathbf{z}) = \\prod_i q_i (\\mathbf{z})$.\n", "\n", "* Maximizing $F$ over $q$ is equivalent to maximizing the contribution of each data with respect to $q_i$.\n", "\n", "* Update $\\theta$ by visiting data items **sequentially** rather than from a global E step.\n", "\n", "* Finite mixture example: for each data point $i$\n", " * evaluate membership variables $w_{ij}$, $j=1,\\ldots,k$. 
\n", " * update parameter values $(\\pi^{(t+1)}, \\mu_j^{(t+1)}, \\Sigma_j^{(t+1)})$ (only need to keep sufficient statistics!)\n", "\n", "* Faster convergence than vanilla EM in some examples.\n", "\n", "" ] } ], "metadata": { "@webio": { "lastCommId": null, "lastKernelId": null }, "kernelspec": { "display_name": "Julia 1.1.0", "language": "julia", "name": "julia-1.1" }, "language_info": { "file_extension": ".jl", "mimetype": "application/julia", "name": "julia", "version": "1.1.0" }, "toc": { "colors": { "hover_highlight": "#DAA520", "running_highlight": "#FF0000", "selected_highlight": "#FFD700" }, "moveMenuLeft": true, "nav_menu": { "height": "85px", "width": "252px" }, "navigate_menu": true, "number_sections": true, "sideBar": true, "skip_h1_title": true, "threshold": 4, "toc_cell": true, "toc_position": { "height": "369.3333435058594px", "left": "0px", "right": "814px", "top": "111.33333587646484px", "width": "156.1875px" }, "toc_section_display": "block", "toc_window_display": true, "widenNotebook": false } }, "nbformat": 4, "nbformat_minor": 2 }