{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Derivation of SGPR equations\n", "\n", "*James Hensman, March 2016. Corrections by Alex Matthews, December 2016*\n", "\n", "This notebook contains a derivation of the form of the equations for the marginal likelihood bound and predictions for the sparse Gaussian process regression model in GPflow, `gpflow.models.SGPR`.\n", "\n", "The primary reference for this work is Titsias 2009 [1], though other works (Hensman et al. 2013 [2], Matthews et al. 2016 [3]) are useful for clarifying the prediction density." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Marginal likelihood bound\n", "The bound on the marginal likelihood (Titsias 2009) is:\n", "\n", "$$\n", "\\log p(\\mathbf y) \\geq \\log \\mathcal N(\\mathbf y\\,|\\,\\mathbf 0,\\, \\mathbf Q_{ff} + \\sigma^2 \\mathbf I) - \\tfrac{1}{2} \\sigma^{-2}\\textrm{tr}(\\mathbf K_{ff} - \\mathbf Q_{ff}) \\triangleq \\mathcal L\n", "$$\n", "...where:\n", "$$\n", "\\mathbf Q_{ff} = \\mathbf K_{fu}\\mathbf K_{uu}^{-1}\\mathbf K_{uf}\n", "$$\n", "\n", "\n", "The kernel matrices $\\mathbf K_{ff}$, $\\mathbf K_{uu}$, $\\mathbf K_{fu}$ represent the kernel evaluated at the data points $\\mathbf X$, the inducing input points $\\mathbf Z$, and between the data and inducing points respectively. We refer to the value of the GP at the data points $\\mathbf X$ as $\\mathbf f$, at the inducing points $\\mathbf Z$ as $\\mathbf u$, and at the remainder of the function as $f^\\star$.\n", "\n", "To obtain an efficient and stable evaluation on the bound $\\mathcal L$, we first apply the Woodbury identity to the effective covariance matrix:\n", "\n", "$$\n", "[\\mathbf Q_{ff} + \\sigma^2 \\mathbf I ]^{-1} = \\sigma^{-2} \\mathbf I - \\sigma^{-4} \\mathbf K_{fu}[\\mathbf K_{uu} + \\mathbf K_{uf}\\mathbf K_{fu}\\sigma^{-2}]^{-1}\\mathbf K_{uf}\n", "$$\n", "\n", "Now, to obtain a better conditioned matrix for inversion, we rotate by $\\mathbf L$, where $\\mathbf L\\mathbf L^\\top = \\mathbf K_{uu}$:\n", "\n", "$$\n", "[\\mathbf Q_{ff} + \\sigma^2 \\mathbf I ]^{-1} = \\sigma^{-2} \\mathbf I - \\sigma^{-4} \\mathbf K_{fu}\\color{red}{\\mathbf L^{-\\top} \\mathbf L^\\top}[\\mathbf K_{uu} + \\mathbf K_{uf}K_{fu}\\sigma^{-2}]^{-1}\\color{red}{\\mathbf L \\mathbf L^{-1}}\\mathbf K_{uf}\n", "$$\n", "\n", "This matrix is better conditioned because, for many kernels, it has eigenvalues bounded above and below. For more details, see section 3.4.3 of [Gaussian Processes for Machine Learning](http://www.gaussianprocess.org/gpml/chapters/RW.pdf).\n", "\n", "$$\n", "\\phantom{[\\mathbf Q_{ff} + \\sigma^2 \\mathbf I ]^{-1}} = \\sigma^{-2} \\mathbf I - \\sigma^{-4} \\mathbf K_{fu}\\color{red}{\\mathbf L^{-\\top}} [\\color{red}{\\mathbf L^{-1}}\\mathbf (K_{uu} + \\mathbf K_{uf}K_{fu}\\sigma^{-2})\\color{red}{\\mathbf L^{-\\top}}]^{-1}\\color{red}{ \\mathbf L^{-1}}\\mathbf K_{uf}\n", "$$\n", "\n", "$$\n", "\\phantom{[\\mathbf Q_{ff} + \\sigma^2 \\mathbf I ]^{-1} }= \\sigma^{-2} \\mathbf I - \\sigma^{-4} \\mathbf K_{fu}\\color{red}{\\mathbf L^{-\\top}} [\\mathbf I + \\color{red}{\\mathbf L^{-1}}\\mathbf (\\mathbf K_{uf}K_{fu})\\color{red}{\\mathbf L^{-\\top}}\\sigma^{-2}]^{-1}\\color{red}{ \\mathbf L^{-1}}\\mathbf K_{uf}\n", "$$\n", "\n", "For notational convenience, we'll define $\\mathbf L^{-1}\\mathbf K_{uf}\\sigma^{-1} \\triangleq \\mathbf A$, and $[\\mathbf I + \\mathbf A\\mathbf A^\\top]\\triangleq \\mathbf B$:\n", "\n", "$$\n", "\\phantom{[\\mathbf Q_{ff} + \\sigma^2 \\mathbf I ]^{-1} }= \\sigma^{-2} \\mathbf I - \\sigma^{-2} \\mathbf A^{\\top} [\\mathbf I + \\mathbf A\\mathbf A^\\top]^{-1}\\mathbf A\n", "$$\n", "\n", "\\begin{equation}\n", "\\phantom{[\\mathbf Q_{ff} + \\sigma^2 \\mathbf I ]^{-1}}= \\sigma^{-2} \\mathbf I - \\sigma^{-2} \\mathbf A^{\\top} \\mathbf B^{-1}\\mathbf A\n", "\\end{equation}\n", "\n", "We also apply the [matrix determinant lemma](https://en.wikipedia.org/wiki/Matrix_determinant_lemma) to the same:\n", "\n", "$$\n", "|{\\mathbf Q_{ff}} + \\sigma^2 {\\mathbf I}| = |{\\mathbf K_{uu}} + \n", " \\mathbf K_{uf}\\mathbf K_{fu}\\sigma^{-2}| \\, |\\mathbf K_{uu}^{-1}| \\, |\\sigma^{2}\\mathbf I|\n", "$$\n", "\n", "Substituting $\\mathbf K_{uu} = {\\mathbf {L L}^\\top}$:\n", "$$\n", "|{\\mathbf Q_{ff}} + \\sigma^2 {\\mathbf I}| = |{\\mathbf {L L}^\\top} + \n", " \\mathbf K_{uf}\\mathbf K_{fu}\\sigma^{-2}| \\, |\\mathbf L^{-\\top}|\\,| \\mathbf L^{-1}| \\, |\\sigma^{2}\\mathbf I|\n", "$$\n", "\n", "$$\n", "|{\\mathbf Q_{ff}} + \\sigma^2 {\\mathbf I}| = |\\mathbf I + \n", " \\mathbf L^{-1}\\mathbf K_{uf}\\mathbf K_{fu} \\mathbf L^{-\\top}\\sigma^{-2}| \\, |\\sigma^{2}\\mathbf I|\n", "$$\n", "\n", "$$\n", "|{\\mathbf Q_{ff}} + \\sigma^2 {\\mathbf I}| = |\\mathbf B| \\, |\\sigma^{2}\\mathbf I|\n", "$$\n", "\n", "With these two definitions, we're ready to expand the bound:\n", "\n", "$$\n", "\\mathcal L = \\log \\mathcal N(\\mathbf y\\,|\\,\\mathbf 0,\\, \\mathbf Q_{ff} + \\sigma^2 \\mathbf I) - \\tfrac{1}{2} \\sigma^{-2}\\textrm{tr}(\\mathbf K_{ff} - \\mathbf Q_{ff})\n", "$$\n", "\n", "$$\n", "= -\\tfrac{N}{2}\\log{2\\pi} -\\tfrac{1}{2}\\log|\\mathbf Q_{ff}+\\sigma^2\\mathbf I| - \\tfrac{1}{2}\\mathbf y^\\top [ \\mathbf Q_{ff} + \\sigma^2 \\mathbf I]^{-1}\\mathbf y - \\tfrac{1}{2} \\sigma^{-2}\\textrm{tr}(\\mathbf K_{ff} - \\mathbf Q_{ff})\n", "$$\n", "\n", "$$\n", "= -\\tfrac{N}{2}\\log{2\\pi} -\\tfrac{1}{2}\\log(|\\mathbf B||\\sigma^{2}\\mathbf I|) - \\tfrac{1}{2}\\sigma^{-2}\\mathbf y^\\top (\\mathbf I - \\sigma^{-2} \\mathbf A^{\\top} \\mathbf B^{-1}\\mathbf A)\\mathbf y - \\tfrac{1}{2} \\sigma^{-2}\\textrm{tr}(\\mathbf K_{ff} - \\mathbf Q_{ff})\n", "$$\n", "\n", "$$\n", "= -\\tfrac{N}{2}\\log{2\\pi} \n", "-\\tfrac{1}{2}\\log|\\mathbf B|\n", "-\\tfrac{N}{2}\\log\\sigma^{2}\n", "-\\tfrac{1}{2}\\sigma^{-2}\\mathbf y^\\top\\mathbf y\n", "+\\tfrac{1}{2}\\sigma^{-2}\\mathbf y^\\top\\mathbf A^{\\top} \\mathbf B^{-1}\\mathbf A\\mathbf y\n", "-\\tfrac{1}{2}\\sigma^{-2}\\textrm{tr}(\\mathbf K_{ff})\n", "+ \\tfrac{1}{2}\\textrm{tr}(\\mathbf {AA}^\\top)\n", "$$\n", "\n", "...where $\\sigma^{-2}\\textrm{tr}(\\mathbf Q) = \\textrm{tr}(\\mathbf {AA}^\\top)$.\n", "\n", "Finally, we define $\\mathbf c \\triangleq \\mathbf L_{\\mathbf B}^{-1}\\mathbf A\\mathbf y \\sigma^{-1}$, with $\\mathbf {L_BL_B}^\\top = \\mathbf B$, so that:\n", "\n", "$$\n", "\\sigma^{-2}\\mathbf y^\\top\\mathbf A^{\\top} \\mathbf B^{-1}\\mathbf A\\mathbf y = \n", "\\mathbf c^\\top \\mathbf c\n", "$$\n", "\n", "The `SGPR` code implements this equation with small changes for multiple concurrent outputs (columns of the data matrix Y), and also a prior mean function.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Prediction\n", "At prediction time, we need to compute the mean and variance of the variational approximation at some new points $\\mathbf X^\\star$.\n", "\n", "Following Hensman et al. (2013), we know that all the information in the posterior approximation is contained in the Gaussian distribution $q(\\mathbf u)$, which represents the distribution of function values at the inducing points $\\mathbf Z$. Remember that:\n", "\n", "$$\n", "q(\\mathbf u) = \\mathcal N(\\mathbf u\\,|\\, \\mathbf m, \\mathbf \\Lambda)\n", "$$\n", "\n", "...with:\n", "\n", "$$\n", "\\mathbf \\Lambda = \\mathbf K_{uu}^{-1} + \\mathbf K_{uu}^{-1}\\mathbf K_{uf}\\mathbf K_{fu}\\mathbf K_{uu}^{-1} \\sigma^{-2}\n", "$$\n", "\n", "$$\n", "\\mathbf m = \\mathbf \\Lambda^{-1} \\mathbf K_{uu}^{-1}\\mathbf K_{uf}\\mathbf y\\sigma^{-2}\n", "$$\n", "\n", "To make a prediction, we need to integrate:\n", "\n", "$$\n", "p(\\mathbf f^\\star) = \\int p(\\mathbf f^\\star \\,|\\, \\mathbf u) q (\\mathbf u) \\textrm d \\mathbf u\n", "$$\n", "\n", "...with:\n", "\n", "$$\n", "p(\\mathbf f^\\star \\,|\\, \\mathbf u) = \\mathcal N(\\mathbf f^\\star\\,|\\, \\mathbf K_{\\star u}\\mathbf K_{uu}^{-1}\\mathbf u, \\,\\mathbf K_{\\star\\star} - \\mathbf K_{\\star u}\\mathbf K_{uu}^{-1}\\mathbf K_{u\\star})\n", "$$\n", "\n", "The integral results in:\n", "\n", "$$\n", "p(\\mathbf f^\\star) = \\mathcal N(\\mathbf f^\\star\\,|\\, \\mathbf K_{\\star u}\\mathbf K_{uu}^{-1}\\mathbf m, \\,\\mathbf K_{\\star\\star} - \\mathbf K_{\\star u}\\mathbf K_{uu}^{-1}\\mathbf K_{u\\star} + \\mathbf K_{\\star u}\\mathbf K_{uu}^{-1}\\mathbf \\Lambda^{-1}\\mathbf K_{uu}^{-1}\\mathbf K_{u\\star})\n", "$$\n", "\n", "Note from our above definitions we have:\n", "\n", "$$\n", "\\mathbf K_{uu}^{-1}\\mathbf \\Lambda^{-1}\\mathbf K_{uu}^{-1} = \n", "\\mathbf L^{-\\top}\\mathbf B^{-1}\\mathbf L^{-1}\n", "$$\n", "\n", "...and further:\n", "\n", "$$\n", "\\mathbf K_{uu}^{-1}\\mathbf m = \\mathbf L^{-\\top}\\mathbf L_{\\mathbf B}^{-\\top}\\mathbf c\n", "$$\n", "\n", "...substituting:\n", "\n", "$$\n", "p(\\mathbf f^\\star) = \\mathcal N(\\mathbf f^\\star\\,|\\, \\mathbf K_{\\star u}\\mathbf L^{-\\top}\\mathbf L_{\\mathbf B}^{-\\top}\\mathbf c, \\,\\mathbf K_{\\star\\star} - \\mathbf K_{\\star u}\\mathbf L^{-1}(\\mathbf I - \\mathbf B^{-1})\\mathbf L^{-1}\\mathbf K_{u\\star})\n", "$$\n", "\n", "The code in `SGPR` implements this equation, with an additional switch depending on whether the full covariance matrix is required.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## References\n", "\n", "[1] Titsias, M: Variational Learning of Inducing Variables in Sparse Gaussian Processes, PMLR 5:567-574, 2009\n", "\n", "[2] Hensman et al: Gaussian Processes for Big Data, UAI, 2013\n", "\n", "[3] Matthews et al: On Sparse Variational Methods and the Kullback-Leibler Divergence between Stochastic Processes, AISTATS, 2016\n" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 2 }