{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "toc": "true"
   },
   "source": [
    "# Table of Contents\n",
    " <p><div class=\"lev1 toc-item\"><a href=\"#EM-Algorithm-(KL-Chapter-13)\" data-toc-modified-id=\"EM-Algorithm-(KL-Chapter-13)-1\"><span class=\"toc-item-num\">1&nbsp;&nbsp;</span>EM Algorithm (KL Chapter 13)</a></div><div class=\"lev2 toc-item\"><a href=\"#Which-statistical-papers-are-most-cited?\" data-toc-modified-id=\"Which-statistical-papers-are-most-cited?-11\"><span class=\"toc-item-num\">1.1&nbsp;&nbsp;</span>Which statistical papers are most cited?</a></div><div class=\"lev2 toc-item\"><a href=\"#EM-algorithm\" data-toc-modified-id=\"EM-algorithm-12\"><span class=\"toc-item-num\">1.2&nbsp;&nbsp;</span>EM algorithm</a></div><div class=\"lev2 toc-item\"><a href=\"#A-canonical-EM-example:-finite-mixture-models\" data-toc-modified-id=\"A-canonical-EM-example:-finite-mixture-models-13\"><span class=\"toc-item-num\">1.3&nbsp;&nbsp;</span>A canonical EM example: finite mixture models</a></div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# EM Algorithm (KL Chapter 13)\n",
    "\n",
    "## Which statistical papers are most cited?\n",
    "\n",
    "| Paper                                    | Citations (5/15/2017) | Per Year |\n",
    "|------------------------------------------|-----------:|----------:|\n",
    "| Kaplan-Meier ([Kaplan and Meier, 1958](https://scholar.google.com/scholar?q=Nonparametric+estimation+from+incomplete+observations&hl=en))    | 51361          |  871        |\n",
    "| EM algorithm ([Dempster et al., 1977](https://scholar.google.com/scholar?q=Maximum+likelihood+from+incomplete+data+via+the+EM+algorithm&hl=en))               | 50074          | **1252**         |\n",
    "| Cox model ([Cox, 1972](https://scholar.google.com/scholar?q=Regression+models+and+life-tables&hl=en))                    | 44995          |  **1000**        |\n",
    "| FDR ([Benjamini and Hochberg, 1995](https://scholar.google.com/scholar?q=Controlling+the+false+discovery+rate))       | 39323          |  **1787**        |\n",
    "| Metropolis algorithm ([Metropolis et al., 1953](https://scholar.google.com/scholar?q=Equation+of+state+calculations+by+fast+computing+machines))     | 34905          |  545        |\n",
    "| Unit root test ([Dickey and Fuller, 1979](https://scholar.google.com/scholar?q=Distribution+of+the+estimators+for+autoregressive+time+series+with+a+unit+root)) | 21406          |  563        |\n",
    "| lasso ([Tibshrani, 1996](https://scholar.google.com/scholar?q=Regression+shrinkage+and+selection+via+the+lasso))                  | 19969          |  951        |\n",
    "| bootstrap ([Efron, 1979](https://scholar.google.com/scholar?q=Bootstrap+methods%3A+another+look+at+the+jackknife))                  | 15033          |  396        |\n",
    "| FFT ([Cooley and Tukey, 1965](https://scholar.google.com/scholar?q=An+algorithm+for+the+machine+calculation+of+complex+Fourier+series))            | 12948          |  249        |\n",
    "| Gibbs sampler ([Gelfand and Smith, 1990](https://scholar.google.com/scholar?q=Sampling-based+approaches+to+calculating+marginal+densities))  |  7152         | 265         |\n",
    "\n",
    "* Citation counts from Google Scholar on May 15, 2017.\n",
    "\n",
    "* EM is one of the most influential statistical ideas, finding applications in various branches of science.\n",
    "\n",
    "* History: Landmark paper [Dempster et al., 1977](https://scholar.google.com/scholar?q=Maximum+likelihood+from+incomplete+data+via+the+EM+algorithm&hl=en) on EM algorithm. Same idea appears in parameter estimation in HMM (Baum-Welch algorithm) by [Baum et al., 1970](http://www.jstor.org/stable/2239727?seq=1#page_scan_tab_contents).\n",
    "\n",
    "## EM algorithm\n",
    "\n",
    "* Goal: maximize the log-likelihood of the observed data $\\ln g(\\mathbf{y}|\\theta)$ (optimization!)\n",
    "\n",
    "* Notations: \n",
    "    * $\\mathbf{Y}$: observed data\n",
    "    * $\\mathbf{Z}$: missing data\n",
    "\n",
    "* Idea: choose missing $\\mathbf{Z}$ such that MLE for the complete data is easy.\n",
    "\n",
    "* Let $f(\\mathbf{y},\\mathbf{z} | \\theta)$ be the density of complete data.\n",
    "\n",
    "* Iterative two-step procedure\n",
    "     * **E step**: calculate the conditional expectation\n",
    "$$\n",
    "\tQ(\\theta|\\theta^{(t)}) = \\mathbf{E}_{\\mathbf{Z}|\\mathbf{Y}=\\mathbf{y},\\theta^{(t)}} [ \\ln f(\\mathbf{Y},\\mathbf{Z}|\\theta) \\mid \\mathbf{Y} = \\mathbf{y}, \\theta^{(t)}]\n",
    "$$\n",
    "     * **M step**: maximize $Q(\\theta|\\theta^{(t)})$ to generate the next iterate \n",
    "$$\n",
    "    \\theta^{(t+1)} = \\text{argmax}_{\\theta} \\, Q(\\theta|\\theta^{(t)})\n",
    "$$\n",
    "\n",
    "* (**Ascent property** of EM algorithm) By the information inequality,\n",
    "$$\n",
    "\\begin{eqnarray*}\n",
    "\t& & Q(\\theta \\mid \\theta^{(t)}) - \\ln g(\\mathbf{y}|\\theta) \\\\\n",
    "\t&=& \\mathbf{E} [\\ln f(\\mathbf{Y},\\mathbf{Z}|\\theta) | \\mathbf{Y} = \\mathbf{y}, \\theta^{(t)}] - \\ln g(\\mathbf{y}|\\theta) \\\\\n",
    "\t&=& \\mathbf{E} \\left\\{ \\ln \\left[ \\frac{f(\\mathbf{Y}, \\mathbf{Z} \\mid \\theta)}{g(\\mathbf{Y} \\mid \\theta)} \\right] \\mid \\mathbf{Y} = \\mathbf{y}, \\theta^{(t)} \\right\\} \\\\\n",
    "\t&\\le& \\mathbf{E} \\left\\{ \\ln \\left[ \\frac{f(\\mathbf{Y}, \\mathbf{Z} \\mid \\theta^{(t)})}{g(\\mathbf{Y} \\mid \\theta^{(t)})} \\right] \\mid \\mathbf{Y} = \\mathbf{y}, \\theta^{(t)} \\right\\} \\\\\n",
    "\t&=& Q(\\theta^{(t)} \\mid \\theta^{(t)}) - \\ln g(\\mathbf{y} |\\theta^{(t)}).\n",
    "\\end{eqnarray*}\n",
    "$$\n",
    "Rearranging shows that (**fundamental inequality of EM**)\n",
    "$$\n",
    "\\begin{eqnarray*}\n",
    "\t\\ln g(\\mathbf{y} \\mid \\theta) \\ge Q(\\theta \\mid \\theta^{(t)}) - Q(\\theta^{(t)} \\mid \\theta^{(t)}) + \\ln g(\\mathbf{y} \\mid \\theta^{(t)})\n",
    "\\end{eqnarray*}\n",
    "$$\n",
    "for all $\\theta$; in particular\n",
    "$$\n",
    "\\begin{eqnarray*}\n",
    "\t\\ln g(\\mathbf{y} \\mid \\theta^{(t+1)}) &\\ge& Q(\\theta^{(t+1)} \\mid \\theta^{(t)}) - Q(\\theta^{(t)} \\mid \\theta^{(t)}) + \\ln g(\\mathbf{y} \\mid \\theta^{(t)}) \\\\\n",
    "\t&\\ge& \\ln g(\\mathbf{y} \\mid \\theta^{(t)}).\n",
    "\\end{eqnarray*}\n",
    "$$\n",
    "Obviously we only need \n",
    "$$\n",
    "    Q(\\theta^{(t+1)} \\mid \\theta^{(t)}) - Q(\\theta^{(t)} \\mid \\theta^{(t)}) \\ge 0\n",
    "$$\n",
    "for this ascent property to hold (**generalized EM**).\n",
    "\n",
    "* Intuition? Keep these pictures in mind\n",
    "\n",
    "<img src=\"./mm_illustration.pdf\" width=\"300\" align=\"center\"/>\n",
    "\n",
    "<img src=\"./mm_2Dgraph.pdf\" width=\"300\" align=\"center\"/>\n",
    "\n",
    "* Under mild conditions, $\\theta^{(t)}$ converges to a stationary point of $\\ln g(\\mathbf{y} \\theta)$.\n",
    "\n",
    "* Numerous applications of EM:  \n",
    "finite mixture model, HMM (Baum-Welch algorithm), factor analysis, variance components model aka linear mixed model (LMM), hyper-parameter estimation in empirical Bayes procedure $\\max_{\\alpha} \\int f(\\mathbf{y}|\\theta) \\pi(\\theta|\\alpha) \\, d \\theta$, missing data, group/censorized/truncated model, ...\n",
    "\n",
    "See Chapters 13 of [Numerical Analysis for Statisticians](http://ucla.worldcat.org/title/numerical-analysis-for-statisticians/oclc/793808354&referer=brief_results) by Kenneth Lange (2010) for more applications of EM."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## A canonical EM example: finite mixture models\n",
    "\n",
    "<img src=\"https://images-na.ssl-images-amazon.com/images/I/41ujVg7%2BveL._SX310_BO1,204,203,200_.jpg\" width=\"200\" align=\"center\"/>\n",
    "\n",
    "* Consider Gaussian finite mixture models with density \n",
    "$$\n",
    "h(\\mathbf{y}) = \\sum_{j=1}^k \\pi_j h_j(\\mathbf{y} \\mid \\mu_j, \\Omega_j), \\quad \\mathbf{y} \\in \\mathbf{R}^{d},\n",
    "$$\n",
    "where\n",
    "$$\n",
    "\th_j(\\mathbf{y} \\mid \\mu_j, \\Omega_j) = \\left( \\frac{1}{2\\pi} \\right)^{d/2} |\\det(\\Omega_j)|^{-1/2} e^{- \\frac 12 (\\mathbf{y} - \\mu_j)^T \\Omega_j^{-1} (\\mathbf{y} - \\mu_j)}\n",
    "$$ \n",
    "are multivariate normals $N_d(\\mu_j, \\Omega_j)$. \n",
    "\n",
    "* Given data $\\mathbf{y}_1,\\ldots,\\mathbf{y}_n$, want to estimate parameters\n",
    "$$\n",
    "\\theta = (\\pi_1, \\ldots, \\pi_k, \\mu_1, \\ldots, \\mu_k, \\Omega_1,\\ldots,\\Omega_k)\n",
    "$$\n",
    "subject to constraint $\\pi_j \\ge 0, \\sum_{j=1}^d \\pi_j = 1, \\Omega_j \\succeq \\mathbf{0}$. \n",
    "\n",
    "* (Incomplete) data log-likelihood is \n",
    "$$\n",
    "\\ln g(\\mathbf{y}_1,\\ldots,\\mathbf{y}_n|\\theta) = \\sum_{i=1}^n \\ln h(\\mathbf{y}_i) = \\sum_{i=1}^n \\ln \\sum_{j=1}^k \\pi_j h_j(\\mathbf{y}_i \\mid \\mu_j, \\Omega_j).\n",
    "$$\n",
    "\n",
    "* Let $z_{ij} = I \\{\\mathbf{y}_i$ comes from group $j \\}$. Complete data likelihood is\n",
    "$$\n",
    "\\begin{eqnarray*}\n",
    "\tf(\\mathbf{y},\\mathbf{z} | \\theta) = \\prod_{i=1}^n \\prod_{j=1}^k [\\pi_j h_j(\\mathbf{y}_i|\\mu_j,\\Omega_j)]^{z_{ij}}\n",
    "\\end{eqnarray*}\n",
    "$$\n",
    "and thus complete log-likelihood is\n",
    "$$\n",
    "\\begin{eqnarray*}\n",
    "\t\\ln f(\\mathbf{y},\\mathbf{z} | \\theta) = \\sum_{i=1}^n \\sum_{j=1}^k z_{ij} [\\ln \\pi_j + \\ln h_j(\\mathbf{y}_i|\\mu_j,\\Omega_j)].\n",
    "\\end{eqnarray*}\n",
    "$$\n",
    "\n",
    "* **E step**: need to evaluate conditional expectation\n",
    "$$\n",
    "\\begin{eqnarray*}\n",
    "\t& & Q(\\theta|\\theta^{(t)})\n",
    "\t= \\mathbf{E} \\left\\{ \\sum_{i=1}^n \\sum_{j=1}^k z_{ij} [\\ln \\pi_j + \\ln h_j(\\mathbf{y}_i|\\mu_j,\\Omega_j)] \\mid \\mathbf{Y}=\\mathbf{y}, \\pi^{(t)}, \\mu_1^{(t)}, \\ldots, \\mu_k^{(t)}, \\Omega_1^{(t)}, \\ldots, \\Omega_k^{(t)} ] \\right\\}.\n",
    "\\end{eqnarray*}\n",
    "$$\n",
    "By Bayes rule, we have\n",
    "$$\n",
    "\\begin{eqnarray*}\n",
    "\tw_{ij}^{(t)} &:=& \\mathbf{E} [z_{ij} \\mid \\mathbf{y}, \\pi^{(t)}, \\mu_1^{(t)}, \\ldots, \\mu_k^{(t)}, \\Omega_1^{(t)}, \\ldots, \\Omega_k^{(t)}] \\\\\n",
    "\t&=& \\frac{\\pi_j^{(t)} h_j(\\mathbf{y}_i|\\mu_j^{(t)},\\Omega_j^{(t)})}{\\sum_{j'=1}^k \\pi_{j'}^{(t)} h_{j'}(\\mathbf{y}_i|\\mu_{j'}^{(t)},\\Omega_{j'}^{(t)})}.\n",
    "\\end{eqnarray*}\n",
    "$$\n",
    "So the Q function becomes\n",
    "$$\n",
    "\\begin{eqnarray*}\n",
    "\t& & Q(\\theta|\\theta^{(t)})\n",
    "\t= \\sum_{i=1}^n \\sum_{j=1}^k w_{ij}^{(t)} \\ln \\pi_j + \\sum_{i=1}^n \\sum_{j=1}^k w_{ij}^{(t)} \\left[ - \\frac{1}{2} \\ln \\det \\Omega_j - \\frac{1}{2} (\\mathbf{y}_i - \\mu_j)^T \\Omega_j^{-1} (\\mathbf{y}_i - \\mu_j) \\right].\n",
    "\\end{eqnarray*}\n",
    "$$\n",
    "\n",
    "* **M step**: maximizer of the Q function gives the next iterate\n",
    "$$\n",
    "\\begin{eqnarray*}\n",
    "\t\\pi_j^{(t+1)} &=& \\frac{\\sum_i w_{ij}^{(t)}}{n} \\\\\n",
    "\t\\mu_j^{(t+1)} &=& \\frac{\\sum_{i=1}^n w_{ij}^{(t)} \\mathbf{y}_i}{\\sum_{i=1}^n w_{ij}^{(t)}} \\\\\n",
    "\t\\Omega_j^{(t+1)} &=& \\frac{\\sum_{i=1}^n w_{ij}^{(t)} (\\mathbf{y}_i - \\mu_j^{(t+1)}) (\\mathbf{y}_i - \\mu_j^{(t+1)})^T}{\\sum_i w_{ij}^{(t)}}.\n",
    "\\end{eqnarray*}\n",
    "$$\n",
    "See KL Example 11.3.1 for multinomial MLE. See KL Example 11.2.3 for multivariate normal MLE.\n",
    "\n",
    "* Compare these extremely simple updates to Newton type algorithms!\n",
    "\n",
    "* Also note the ease of parallel computing with this EM algorithm. See, e.g.,  \n",
    "**Suchard, M. A.**; Wang, Q.; Chan, C.; Frelinger, J.; Cron, A.; West, M. Understanding GPU programming for statistical computation: studies in massively parallel massive mixtures. _Journal of Computational and Graphical Statistics_, 2010, 19, 419-438."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Julia 0.6.2",
   "language": "julia",
   "name": "julia-0.6"
  },
  "language_info": {
   "file_extension": ".jl",
   "mimetype": "application/julia",
   "name": "julia",
   "version": "0.6.3"
  },
  "toc": {
   "colors": {
    "hover_highlight": "#DAA520",
    "running_highlight": "#FF0000",
    "selected_highlight": "#FFD700"
   },
   "moveMenuLeft": true,
   "nav_menu": {
    "height": "85px",
    "width": "252px"
   },
   "navigate_menu": true,
   "number_sections": true,
   "sideBar": true,
   "threshold": 4,
   "toc_cell": true,
   "toc_section_display": "block",
   "toc_window_display": false,
   "widenNotebook": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}