{ "cells": [ { "cell_type": "markdown", "metadata": { "toc": "true" }, "source": [ "# Table of Contents\n", "

1  MM Algorithm (KL Chapter 12)
1.1  MM algorithm
1.2  MM example: t-distribution revisited
1.3  MM example: non-negative matrix factorization (NNMF)
1.4  MM example: Netflix and matrix completion
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# MM Algorithm (KL Chapter 12)\n", "\n", "## MM algorithm\n", "\n", "* Recall our picture for understanding the ascent property of EM\n", "\n", "* EM as a minorization-maximization (MM) algorithm\n", " * The $Q$ function constitutes a **minorizing** function of the objective function up to an additive constant\n", "$$\n", "\\begin{eqnarray*}\n", "\tL(\\theta) &\\ge& Q(\\theta|\\theta^{(t)}) + c^{(t)} \\quad \\text{for all } \\theta \\\\\n", "\tL(\\theta^{(t)}) &=& Q(\\theta^{(t)}|\\theta^{(t)}) + c^{(t)}\n", "\\end{eqnarray*}\n", "$$\n", " * **Maximizing** the $Q$ function generates an ascent iterate $\\theta^{(t+1)}$\n", "\n", "* Questions:\n", " 1. Is the EM principle limited to maximizing likelihood models?\n", " 2. Can we flip the picture and apply the same principle to **minimization** problems?\n", " 3. Is there any other way to produce such a surrogate function?\n", "\n", "* These thoughts lead to a powerful tool - the MM algorithm. \n", "**Lange, K.**, Hunter, D. R., and Yang, I. (2000). Optimization transfer using surrogate objective functions. _J. Comput. Graph. Statist._, 9(1):1-59. With discussion, and a rejoinder by Hunter and Lange.\n", "\n", "* For maximization of $f(\\theta)$ - **minorization-maximization (MM)** algorithm\n", " * **M**inorization step: Construct a surrogate function $g(\\theta|\\theta^{(t)})$ that satisfies\n", "$$\n", "\\begin{eqnarray*}\n", "\tf(\\theta) &\\ge& g(\\theta|\\theta^{(t)}) \\quad \\text{(dominance condition)} \\\\\n", "\tf(\\theta^{(t)}) &=& g(\\theta^{(t)}|\\theta^{(t)}) \\quad \\text{(tangent condition)}.\n", "\\end{eqnarray*}\n", "$$\n", " * **M**aximization step: \n", "$$\n", "\t\\theta^{(t+1)} = \\text{argmax} \\, g(\\theta|\\theta^{(t)}).\n", "$$\n", "\n", "* _Ascent_ property of the minorization-maximization algorithm\n", "$$\n", "f(\\theta^{(t)}) = g(\\theta^{(t)}|\\theta^{(t)}) \\le g(\\theta^{(t+1)}|\\theta^{(t)}) \\le f(\\theta^{(t+1)}).\n", "$$\n", "\n", "* EM is a special case of the minorization-maximization (MM) algorithm.\n", "\n", "* For minimization of $f(\\theta)$ - **majorization-minimization (MM)** algorithm\n", " * **M**ajorization step: Construct a surrogate function $g(\\theta|\\theta^{(t)})$ that satisfies\n", "$$\n", "\\begin{eqnarray*}\n", "\tf(\\theta) &\\le& g(\\theta|\\theta^{(t)}) \\quad \\text{(dominance condition)} \\\\\n", "\tf(\\theta^{(t)}) &=& g(\\theta^{(t)}|\\theta^{(t)}) \\quad \\text{(tangent condition)}.\n", "\\end{eqnarray*}\n", "$$\n", " * **M**inimization step: \n", "$$\n", "\t\\theta^{(t+1)} = \\text{argmin} \\, g(\\theta|\\theta^{(t)}).\n", "$$\n", "\n", "* Similarly we have the _descent_ property of the majorization-minimization algorithm.\n", "\n", "* Same convergence theory as EM.\n", "\n", "* Rationale of the MM principle for developing optimization algorithms\n", " * Free the derivation from the missing data structure.\n", " * Avoid matrix inversion.\n", " * Linearize an optimization problem.\n", " * Deal gracefully with certain equality and inequality constraints.\n", " * Turn a non-differentiable problem into a smooth problem.\n", " * Separate the parameters of a problem (perfect for massive, _fine-scale_ parallelization).\n", "\n", "* Generic methods for majorization and minorization - **inequalities**\n", " * Jensen's (information) inequality - EM algorithms\n", " * Supporting hyperplane property of a convex function\n", " * The Cauchy-Schwarz inequality - multidimensional scaling\n", " * Arithmetic-geometric mean inequality\n", " * Quadratic upper bound principle - Böhning and Lindsay\n", " * ...\n", "\n", "* Numerous examples in KL Chapter 12, or take Kenneth Lange's optimization course BIOMATH 210.\n", "\n", "* History: the name _MM algorithm_ originates from the discussion (by Xiaoli Meng) and rejoinder of the Lange, Hunter, Yang (2000) paper." ] }
, { "cell_type": "markdown", "metadata": {}, "source": [ "## MM example: t-distribution revisited\n", "\n", "* Recall that given iid data $\\mathbf{w}_1, \\ldots, \\mathbf{w}_n$ from the multivariate $t$-distribution $t_p(\\mu,\\Sigma,\\nu)$, the log-likelihood is\n", "\\begin{eqnarray*}\n", "\tL(\\mu, \\Sigma, \\nu) &=& - \\frac{np}{2} \\log (\\pi \\nu) + n \\left[ \\log \\Gamma \\left(\\frac{\\nu+p}{2}\\right) - \\log \\Gamma \\left(\\frac{\\nu}{2}\\right) \\right] \\\\\n", "\t&& - \\frac n2 \\log \\det \\Sigma + \\frac n2 (\\nu + p) \\log \\nu \\\\\n", "\t&& - \\frac{\\nu + p}{2} \\sum_{j=1}^n \\log [\\nu + \\delta (\\mathbf{w}_j,\\mu;\\Sigma) ].\n", "\\end{eqnarray*}\n", "\n", "* Early on we discussed the EM algorithm and its variants for maximizing this log-likelihood.\n", "\n", "* Since $t \\mapsto - \\log t$ is a convex function, we can invoke the supporting hyperplane inequality to minorize the terms $- \\log [\\nu + \\delta (\\mathbf{w}_j,\\mu;\\Sigma)]$:\n", "\\begin{eqnarray*}\n", " - \\log [\\nu + \\delta (\\mathbf{w}_j,\\mu;\\Sigma)] &\\ge& - \\log [\\nu^{(t)} + \\delta (\\mathbf{w}_j,\\mu^{(t)};\\Sigma^{(t)})] - \\frac{\\nu + \\delta (\\mathbf{w}_j,\\mu;\\Sigma) - \\nu^{(t)} - \\delta (\\mathbf{w}_j,\\mu^{(t)};\\Sigma^{(t)})}{\\nu^{(t)} + \\delta (\\mathbf{w}_j,\\mu^{(t)};\\Sigma^{(t)})} \\\\\n", " &=& - \\frac{\\nu + \\delta (\\mathbf{w}_j,\\mu;\\Sigma)}{\\nu^{(t)} + \\delta (\\mathbf{w}_j,\\mu^{(t)};\\Sigma^{(t)})} + c^{(t)},\n", "\\end{eqnarray*}\n", "where $c^{(t)}$ is a constant irrelevant to optimization.\n", "\n", "* This yields the minorizing function\n", "\\begin{eqnarray*}\n", " g(\\mu, \\Sigma, \\nu) &=& - \\frac{np}{2} \\log (\\pi \\nu) + n \\left[ \\log \\Gamma \\left(\\frac{\\nu+p}{2}\\right) - \\log \\Gamma \\left(\\frac{\\nu}{2}\\right) \\right] \\\\\n", "\t&& - \\frac n2 \\log \\det \\Sigma + \\frac n2 (\\nu + p) \\log \\nu \\\\\n", "\t&& - \\frac{\\nu + p}{2} \\sum_{j=1}^n \\frac{\\nu + \\delta (\\mathbf{w}_j,\\mu;\\Sigma)}{\\nu^{(t)} + \\delta (\\mathbf{w}_j,\\mu^{(t)};\\Sigma^{(t)})} + c^{(t)}.\n", "\\end{eqnarray*}\n", "Now we can deploy the block ascent strategy.\n", " * Fixing $\\nu$, the updates of $\\mu$ and $\\Sigma$ are the same as in the EM algorithm. \n", " * Fixing $\\mu$ and $\\Sigma$, we can update $\\nu$ by working on the observed log-likelihood directly. \n", "\n", "Overall this MM algorithm coincides with ECME. A quick numerical check of the minorization above is given below.\n", "\n", "* An alternative MM derivation leads to the optimal EM algorithm derived in Meng and van Dyk (1997). See [Wu and Lange (2010)](https://doi.org/10.1214/08-STS264)." ] }
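, { "cell_type": "markdown", "metadata": {}, "source": [ "A quick numerical sanity check of the supporting-hyperplane minorization above (a sketch with arbitrary made-up numbers), writing $u = \\nu + \\delta(\\mathbf{w}_j,\\mu;\\Sigma)$ and $u^{(t)} = \\nu^{(t)} + \\delta(\\mathbf{w}_j,\\mu^{(t)};\\Sigma^{(t)})$: the affine surrogate lies below $-\\log u$ everywhere (dominance) and touches it at $u^{(t)}$ (tangency).\n",
 "\n",
 "```julia\n",
 "# Supporting-hyperplane minorization of -log(u) at u_t (a made-up value):\n",
 "#   -log(u) >= -log(u_t) - (u - u_t) / u_t   for all u > 0, with equality at u = u_t.\n",
 "u_t = 5.0                                    # stands in for ν^(t) + δ^(t)\n",
 "minorizer(u) = -log(u_t) - (u - u_t) / u_t   # affine surrogate, tangent at u_t\n",
 "\n",
 "us = 0.1:0.01:20.0\n",
 "@show maximum(minorizer.(us) .+ log.(us))    # <= 0 up to roundoff (dominance condition)\n",
 "@show minorizer(u_t) + log(u_t)              # exactly 0 (tangent condition)\n",
 "```" ] }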
, { "cell_type": "markdown", "metadata": {}, "source": [ "## MM example: non-negative matrix factorization (NNMF)\n", "\n", "* Nonnegative matrix factorization (NNMF) was introduced by Lee and Seung ([1999](http://www.nature.com/nature/journal/v401/n6755/abs/401788a0.html), [2001](https://papers.nips.cc/paper/1861-algorithms-for-non-negative-matrix-factorization.pdf)) as an analog of principal components and vector quantization, with applications in data compression and clustering.\n", "\n", "* In mathematical terms, one approximates a data matrix $\\mathbf{X} \\in \\mathbb{R}^{m \\times n}$ with nonnegative entries $x_{ij}$ by a product of two low-rank matrices $\\mathbf{V} \\in \\mathbb{R}^{m \\times r}$ and $\\mathbf{W} \\in \\mathbb{R}^{r \\times n}$ with nonnegative entries $v_{ik}$ and $w_{kj}$. \n", "\n", "* Consider minimization of the squared Frobenius norm\n", "$$\n", "\tL(\\mathbf{V}, \\mathbf{W}) = \\|\\mathbf{X} - \\mathbf{V} \\mathbf{W}\\|_{\\text{F}}^2 = \\sum_i \\sum_j \\left( x_{ij} - \\sum_k v_{ik} w_{kj} \\right)^2, \\quad v_{ik} \\ge 0, w_{kj} \\ge 0,\n", "$$\n", "which should lead to a good factorization.\n", "\n", "* $L(\\mathbf{V}, \\mathbf{W})$ is not convex, but bi-convex. The strategy is to alternately update $\\mathbf{V}$ and $\\mathbf{W}$.\n", "\n", "* The key is the majorization, via convexity of the function $(x_{ij} - x)^2$, \n", "$$\n", "\t\\left( x_{ij} - \\sum_k v_{ik} w_{kj} \\right)^2 \\le \\sum_k \\frac{a_{ikj}^{(t)}}{b_{ij}^{(t)}} \\left( x_{ij} - \\frac{b_{ij}^{(t)}}{a_{ikj}^{(t)}} v_{ik} w_{kj} \\right)^2,\n", "$$\n", "where\n", "$$\n", "\ta_{ikj}^{(t)} = v_{ik}^{(t)} w_{kj}^{(t)}, \\quad b_{ij}^{(t)} = \\sum_k v_{ik}^{(t)} w_{kj}^{(t)}.\n", "$$\n", "\n", "* This suggests the alternating multiplicative updates\n", "$$\n", "\\begin{eqnarray*}\n", "\tv_{ik}^{(t+1)} &=& v_{ik}^{(t)} \\frac{\\sum_j x_{ij} w_{kj}^{(t)}}{\\sum_j b_{ij}^{(t)} w_{kj}^{(t)}} \\\\\n", "\tb_{ij}^{(t+1/2)} &=& \\sum_k v_{ik}^{(t+1)} w_{kj}^{(t)} \\\\\n", "\tw_{kj}^{(t+1)} &=& w_{kj}^{(t)} \\frac{\\sum_i x_{ij} v_{ik}^{(t+1)}}{\\sum_i b_{ij}^{(t+1/2)} v_{ik}^{(t+1)}}.\n", "\\end{eqnarray*}\n", "$$\n", "\n", "* The updates in matrix notation are extremely simple\n", "$$\n", "\\begin{eqnarray*}\n", "\t\\mathbf{B}^{(t)} &=& \\mathbf{V}^{(t)} \\mathbf{W}^{(t)} \\\\\n", "\t\\mathbf{V}^{(t+1)} &=& \\mathbf{V}^{(t)} \\odot (\\mathbf{X} \\mathbf{W}^{(t)T}) \\oslash (\\mathbf{B}^{(t)} \\mathbf{W}^{(t)T}) \\\\\n", "\t\\mathbf{B}^{(t+1/2)} &=& \\mathbf{V}^{(t+1)} \\mathbf{W}^{(t)} \\\\\n", "\t\\mathbf{W}^{(t+1)} &=& \\mathbf{W}^{(t)} \\odot (\\mathbf{V}^{(t+1)T} \\mathbf{X}) \\oslash (\\mathbf{V}^{(t+1)T} \\mathbf{B}^{(t+1/2)}),\n", "\\end{eqnarray*}\n", "$$\n", "where $\\odot$ denotes elementwise multiplication and $\\oslash$ denotes elementwise division. If we start with $v_{ik}, w_{kj} > 0$, the parameter iterates stay positive.\n", "\n", "* Monotonicity of the algorithm is free (from the MM principle). A small Julia sketch of these updates is given below.\n", "\n", "* Database #1 from the MIT Center for Biological and Computational Learning (CBCL, http://cbcl.mit.edu) reduces to a matrix X containing $m = 2,429$ gray-scale face images with $n = 19 \\times 19 = 361$ pixels per face. Each image (row) is scaled to have mean and standard deviation 0.25.\n", "\n", "* My implementation in around **2010**:\n", "\n", "CPU: Intel Core 2 @ 3.2GHz (4 cores), implemented using BLAS in the GNU Scientific Library (GSL) \n", "GPU: NVIDIA GeForce GTX 280 (240 cores), implemented using cuBLAS." ] }
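, { "cell_type": "markdown", "metadata": {}, "source": [ "A minimal sketch of the multiplicative updates above, run on a small random nonnegative matrix for illustration (not the CBCL face data; the rank, iteration limit, and tolerance are arbitrary choices). By the MM principle the objective decreases monotonically along the iterations.\n",
 "\n",
 "```julia\n",
 "using LinearAlgebra, Random\n",
 "\n",
 "# MM (multiplicative) updates for NNMF: minimize the squared Frobenius norm\n",
 "# of X - V * W over elementwise nonnegative V and W.\n",
 "function nnmf_mm(X, r; maxiter = 500, tol = 1e-6)\n",
 "    m, n = size(X)\n",
 "    V = rand(m, r)                      # strictly positive start keeps iterates positive\n",
 "    W = rand(r, n)\n",
 "    obj = norm(X - V * W)^2\n",
 "    for iter in 1:maxiter\n",
 "        B = V * W                       # B^(t)\n",
 "        V .*= (X * W') ./ (B * W')      # update V\n",
 "        B = V * W                       # B^(t+1/2)\n",
 "        W .*= (V' * X) ./ (V' * B)      # update W\n",
 "        objnew = norm(X - V * W)^2      # decreases monotonically (MM)\n",
 "        abs(obj - objnew) < tol * (abs(obj) + 1) && break\n",
 "        obj = objnew\n",
 "    end\n",
 "    return V, W, obj\n",
 "end\n",
 "\n",
 "Random.seed!(280)\n",
 "X = rand(100, 60)            # random nonnegative data, for illustration only\n",
 "V, W, obj = nnmf_mm(X, 5)\n",
 "@show obj\n",
 "```" ] }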
, { "cell_type": "markdown", "metadata": {}, "source": [ "## MM example: Netflix and matrix completion\n", "\n", "* Snapshot of the kind of data collected by Netflix: only 100,480,507 ratings (1.2% of the entries of the 480K-by-18K matrix) are observed.\n", "\n", "* Netflix challenge: impute the unobserved ratings for personalized recommendation.\n", "http://en.wikipedia.org/wiki/Netflix_Prize\n", "\n", "* **Matrix completion problem**. We observe a very sparse matrix $\\mathbf{Y} = (y_{ij})$ and want to impute all the missing entries. This is possible only when the matrix is structured, e.g., of _low rank_.\n", "\n", "* Let $\\Omega = \\{(i,j): \\text{observed entries}\\}$ index the set of observed entries and $P_\\Omega(\\mathbf{M})$ denote the projection of a matrix $\\mathbf{M}$ onto $\\Omega$. The problem\n", "$$\n", "\t\\min_{\\text{rank}(\\mathbf{X}) \\le r} \\frac{1}{2} \\| P_\\Omega(\\mathbf{Y}) - P_\\Omega(\\mathbf{X})\\|_{\\text{F}}^2 = \\frac{1}{2} \\sum_{(i,j) \\in \\Omega} (y_{ij} - x_{ij})^2\n", "$$\n", "unfortunately is non-convex and difficult.\n", "\n", "* _Convex relaxation_: [Mazumder, Hastie, Tibshirani (2010)](https://web.stanford.edu/~hastie/Papers/mazumder10a.pdf)\n", "$$\n", "\t\\min_{\\mathbf{X}} f(\\mathbf{X}) = \\frac{1}{2} \\| P_\\Omega(\\mathbf{Y}) - P_\\Omega(\\mathbf{X})\\|_{\\text{F}}^2 + \\lambda \\|\\mathbf{X}\\|_*,\n", "$$\n", "where $\\|\\mathbf{X}\\|_* = \\|\\sigma(\\mathbf{X})\\|_1 = \\sum_i \\sigma_i(\\mathbf{X})$ is the nuclear norm.\n", "\n", "* **Majorization step**:\n", "$$\n", "\\begin{eqnarray*}\n", "\tf(\\mathbf{X}) &=& \\frac{1}{2} \\sum_{(i,j) \\in \\Omega} (y_{ij} - x_{ij})^2 + \\frac 12 \\sum_{(i,j) \\notin \\Omega} 0 + \\lambda \\|\\mathbf{X}\\|_* \\\\\n", "\t&\\le& \\frac{1}{2} \\sum_{(i,j) \\in \\Omega} (y_{ij} - x_{ij})^2 + \\frac{1}{2} \\sum_{(i,j) \\notin \\Omega} (x_{ij}^{(t)} - x_{ij})^2 + \\lambda \\|\\mathbf{X}\\|_* \\\\\n", "\t&=& \\frac{1}{2} \\| \\mathbf{X} - \\mathbf{Z}^{(t)}\\|_{\\text{F}}^2 + \\lambda \\|\\mathbf{X}\\|_* \\\\\n", "\t&=& g(\\mathbf{X}|\\mathbf{X}^{(t)}),\n", "\\end{eqnarray*}\n", "$$\n", "where $\\mathbf{Z}^{(t)} = P_\\Omega(\\mathbf{Y}) + P_{\\Omega^\\perp}(\\mathbf{X}^{(t)})$, i.e., \"fill in the missing entries by the current imputation\".\n", "\n", "* **Minimization step**: Rewrite the surrogate function\n", "$$\n", "\\begin{eqnarray*}\n", "\tg(\\mathbf{X}|\\mathbf{X}^{(t)}) = \\frac{1}{2} \\|\\mathbf{X}\\|_{\\text{F}}^2 - \\text{tr}(\\mathbf{X}^T \\mathbf{Z}^{(t)}) + \\frac{1}{2} \\|\\mathbf{Z}^{(t)}\\|_{\\text{F}}^2 + \\lambda \\|\\mathbf{X}\\|_*.\n", "\\end{eqnarray*}\n", "$$\n", "Let $\\sigma_i$ be the (ordered) singular values of $\\mathbf{X}$ and $\\omega_i$ be the (ordered) singular values of $\\mathbf{Z}^{(t)}$. Observe\n", "$$\n", "\\begin{eqnarray*}\n", "\t\\|\\mathbf{X}\\|_{\\text{F}}^2 &=& \\|\\sigma(\\mathbf{X})\\|_2^2 = \\sum_{i} \\sigma_i^2 \\\\\n", "\t\\|\\mathbf{Z}^{(t)}\\|_{\\text{F}}^2 &=& \\|\\sigma(\\mathbf{Z}^{(t)})\\|_2^2 = \\sum_{i} \\omega_i^2\n", "\\end{eqnarray*}\n", "$$\n", "and by the [Fan-von Neumann inequality](https://en.wikipedia.org/wiki/Trace_inequalities#Von_Neumann.27s_trace_inequality)\n", "$$\n", "\\begin{eqnarray*}\n", "\t\\text{tr}(\\mathbf{X}^T \\mathbf{Z}^{(t)}) \\le \\sum_i \\sigma_i \\omega_i\n", "\\end{eqnarray*}\n", "$$\n", "with equality achieved if and only if the left and right singular vectors of the two matrices coincide. Thus we should choose $\\mathbf{X}$ to have the same singular vectors as $\\mathbf{Z}^{(t)}$, so that\n", "$$\n", "\\begin{eqnarray*}\n", "\tg(\\mathbf{X}|\\mathbf{X}^{(t)}) &=& \\frac{1}{2} \\sum_i \\sigma_i^2 - \\sum_i \\sigma_i \\omega_i + \\frac{1}{2} \\sum_i \\omega_i^2 + \\lambda \\sum_i \\sigma_i \\\\\n", "\t&=& \\frac{1}{2} \\sum_i (\\sigma_i - \\omega_i)^2 + \\lambda \\sum_i \\sigma_i,\n", "\\end{eqnarray*}\n", "$$\n", "with minimizer given by **singular value thresholding**\n", "$$\n", " \\sigma_i^{(t+1)} = \\max(0, \\omega_i - \\lambda) = (\\omega_i - \\lambda)_+.\n", "$$\n", "\n", "* Algorithm for matrix completion:\n", " * Initialize $\\mathbf{X}^{(0)} \\in \\mathbb{R}^{m \\times n}$\n", " * Repeat\n", " * $\\mathbf{Z}^{(t)} \\gets P_\\Omega(\\mathbf{Y}) + P_{\\Omega^\\perp}(\\mathbf{X}^{(t)})$ \n", " * Compute SVD: $\\mathbf{U} \\text{diag}(\\mathbf{w}) \\mathbf{V}^T \\gets \\mathbf{Z}^{(t)}$ \n", " * $\\mathbf{X}^{(t+1)} \\gets \\mathbf{U} \\text{diag}[(\\mathbf{w}-\\lambda)_+] \\mathbf{V}^T$\n", " * until the objective value converges\n", "\n", "* The Golub-Kahan-Reinsch algorithm for a full SVD takes $4m^2n + 8mn^2 + 9n^3$ flops for an $m \\ge n$ matrix and is **not** going to work for the 480K-by-18K Netflix matrix.\n", "\n", "* Notice that only the top singular values are needed, since low-rank solutions are obtained for large $\\lambda$. The Lanczos/Arnoldi algorithm is the way to go. Matrix-vector multiplication $\\mathbf{Z}^{(t)} \\mathbf{v}$ is fast (why?) A small Julia sketch of this algorithm on simulated data is given below." ] }
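, { "cell_type": "markdown", "metadata": {}, "source": [ "A minimal sketch of the algorithm above on a small simulated low-rank matrix (the dimensions, noise level, and $\\lambda$ are arbitrary). A dense `svd` from `LinearAlgebra` is used for clarity; at Netflix scale one would instead compute only the top singular triplets with a Lanczos/Arnoldi-type solver.\n",
 "\n",
 "```julia\n",
 "using LinearAlgebra, Random\n",
 "\n",
 "# MM for the nuclear norm regularized matrix completion problem above:\n",
 "# fill in the missing entries by the current imputation, then apply\n",
 "# singular value thresholding to the filled-in matrix Z.\n",
 "function matrix_complete(Y, Ω, λ; maxiter = 200, tol = 1e-6)\n",
 "    X = zeros(size(Y))                     # X^(0)\n",
 "    objold = Inf\n",
 "    for iter in 1:maxiter\n",
 "        Z = copy(X)\n",
 "        Z[Ω] .= Y[Ω]                       # observed entries from Y, current X elsewhere\n",
 "        F = svd(Z)\n",
 "        σ = max.(F.S .- λ, 0)              # soft-threshold the singular values\n",
 "        X = F.U * Diagonal(σ) * F.Vt\n",
 "        obj = 0.5 * sum(abs2, Y[Ω] .- X[Ω]) + λ * sum(σ)   # objective f(X)\n",
 "        abs(objold - obj) < tol * (abs(obj) + 1) && break\n",
 "        objold = obj\n",
 "    end\n",
 "    return X\n",
 "end\n",
 "\n",
 "Random.seed!(123)\n",
 "m, n, r = 100, 80, 3\n",
 "Xtrue = randn(m, r) * randn(r, n)          # low-rank truth\n",
 "Ω = findall(rand(m, n) .< 0.3)             # observe about 30% of the entries\n",
 "Y = zeros(m, n)\n",
 "Y[Ω] .= Xtrue[Ω] .+ 0.1 .* randn(length(Ω))\n",
 "Xhat = matrix_complete(Y, Ω, 5.0)\n",
 "@show rank(Xhat)\n",
 "@show norm(Xhat - Xtrue) / norm(Xtrue)     # relative recovery error\n",
 "```" ] }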
], "metadata": { "@webio": { "lastCommId": null, "lastKernelId": null }, "kernelspec": { "display_name": "Julia 1.1.0", "language": "julia", "name": "julia-1.1" }, "language_info": { "file_extension": ".jl", "mimetype": "application/julia", "name": "julia", "version": "1.1.0" }, "toc": { "colors": { "hover_highlight": "#DAA520", "running_highlight": "#FF0000", "selected_highlight": "#FFD700" }, "moveMenuLeft": true, "nav_menu": { "height": "82px", "width": "252px" }, "navigate_menu": true, "number_sections": true, "sideBar": true, "skip_h1_title": true, "threshold": 4, "toc_cell": true, "toc_position": { "height": "555.5999755859375px", "left": "0px", "right": "1069.199951171875px", "top": "109.4000015258789px", "width": "181.1999969482422px" }, "toc_section_display": "block", "toc_window_display": true, "widenNotebook": false } }, "nbformat": 4, "nbformat_minor": 2 }