{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Derivation of SGPR equations\n",
    "\n",
    "*James Hensman, March 2016. Corrections by Alex Matthews, December 2016*\n",
    "\n",
    "This notebook contains a derivation of the form of the equations for the marginal likelihood bound and predictions for the sparse Gaussian process regression model in GPflow, `gpflow.models.SGPR`.\n",
    "\n",
    "The primary reference for this work is Titsias 2009 [1], though other works (Hensman et al. 2013 [2], Matthews et al. 2016 [3]) are useful for clarifying the prediction density."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Marginal likelihood bound\n",
    "The bound on the marginal likelihood (Titsias 2009) is:\n",
    "\n",
    "$$\n",
    "\\log p(\\mathbf y) \\geq \\log \\mathcal N(\\mathbf y\\,|\\,\\mathbf 0,\\, \\mathbf Q_{ff} + \\sigma^2 \\mathbf I) - \\tfrac{1}{2} \\sigma^{-2}\\textrm{tr}(\\mathbf K_{ff} - \\mathbf Q_{ff}) \\triangleq \\mathcal L\n",
    "$$\n",
    "...where:\n",
    "$$\n",
    "\\mathbf Q_{ff} = \\mathbf K_{fu}\\mathbf K_{uu}^{-1}\\mathbf K_{uf}\n",
    "$$\n",
    "\n",
    "\n",
    "The kernel matrices $\\mathbf K_{ff}$, $\\mathbf K_{uu}$, $\\mathbf K_{fu}$ represent the kernel evaluated at the data points $\\mathbf X$, the inducing input points $\\mathbf Z$, and between the data and inducing points respectively. We refer to the value of the GP at the data points $\\mathbf X$ as $\\mathbf f$, at the inducing points $\\mathbf Z$ as $\\mathbf u$, and at the remainder of the function as $f^\\star$.\n",
    "\n",
    "To obtain an efficient and stable evaluation on the bound $\\mathcal L$, we first apply the Woodbury identity to the effective covariance matrix:\n",
    "\n",
    "$$\n",
    "[\\mathbf Q_{ff} + \\sigma^2 \\mathbf I ]^{-1} = \\sigma^{-2} \\mathbf I - \\sigma^{-4} \\mathbf K_{fu}[\\mathbf K_{uu} + \\mathbf K_{uf}\\mathbf K_{fu}\\sigma^{-2}]^{-1}\\mathbf K_{uf}\n",
    "$$\n",
    "\n",
    "Now, to obtain a better conditioned matrix for inversion, we rotate by $\\mathbf L$, where $\\mathbf L\\mathbf L^\\top = \\mathbf K_{uu}$:\n",
    "\n",
    "$$\n",
    "[\\mathbf Q_{ff} + \\sigma^2 \\mathbf I ]^{-1} = \\sigma^{-2} \\mathbf I - \\sigma^{-4} \\mathbf K_{fu}\\color{red}{\\mathbf L^{-\\top} \\mathbf L^\\top}[\\mathbf K_{uu} + \\mathbf K_{uf}K_{fu}\\sigma^{-2}]^{-1}\\color{red}{\\mathbf L \\mathbf L^{-1}}\\mathbf K_{uf}\n",
    "$$\n",
    "\n",
    "This matrix is better conditioned because, for many kernels, it has eigenvalues bounded above and below. For more details, see section 3.4.3 of [Gaussian Processes for Machine Learning](http://www.gaussianprocess.org/gpml/chapters/RW.pdf).\n",
    "\n",
    "$$\n",
    "\\phantom{[\\mathbf Q_{ff} + \\sigma^2 \\mathbf I ]^{-1}} = \\sigma^{-2} \\mathbf I - \\sigma^{-4} \\mathbf K_{fu}\\color{red}{\\mathbf L^{-\\top}} [\\color{red}{\\mathbf L^{-1}}\\mathbf (K_{uu} + \\mathbf K_{uf}K_{fu}\\sigma^{-2})\\color{red}{\\mathbf L^{-\\top}}]^{-1}\\color{red}{ \\mathbf L^{-1}}\\mathbf K_{uf}\n",
    "$$\n",
    "\n",
    "$$\n",
    "\\phantom{[\\mathbf Q_{ff} + \\sigma^2 \\mathbf I ]^{-1} }= \\sigma^{-2} \\mathbf I - \\sigma^{-4} \\mathbf K_{fu}\\color{red}{\\mathbf L^{-\\top}} [\\mathbf I + \\color{red}{\\mathbf L^{-1}}\\mathbf (\\mathbf K_{uf}K_{fu})\\color{red}{\\mathbf L^{-\\top}}\\sigma^{-2}]^{-1}\\color{red}{ \\mathbf L^{-1}}\\mathbf K_{uf}\n",
    "$$\n",
    "\n",
    "For notational convenience, we'll define $\\mathbf L^{-1}\\mathbf K_{uf}\\sigma^{-1} \\triangleq \\mathbf A$, and  $[\\mathbf I + \\mathbf A\\mathbf A^\\top]\\triangleq \\mathbf B$:\n",
    "\n",
    "$$\n",
    "\\phantom{[\\mathbf Q_{ff} + \\sigma^2 \\mathbf I ]^{-1} }= \\sigma^{-2} \\mathbf I - \\sigma^{-2} \\mathbf A^{\\top} [\\mathbf I + \\mathbf A\\mathbf A^\\top]^{-1}\\mathbf A\n",
    "$$\n",
    "\n",
    "\\begin{equation}\n",
    "\\phantom{[\\mathbf Q_{ff} + \\sigma^2 \\mathbf I ]^{-1}}= \\sigma^{-2} \\mathbf I - \\sigma^{-2} \\mathbf A^{\\top} \\mathbf B^{-1}\\mathbf A\n",
    "\\end{equation}\n",
    "\n",
    "We also apply the [matrix determinant lemma](https://en.wikipedia.org/wiki/Matrix_determinant_lemma) to the same:\n",
    "\n",
    "$$\n",
    "|{\\mathbf Q_{ff}} + \\sigma^2 {\\mathbf I}| = |{\\mathbf K_{uu}} + \n",
    " \\mathbf K_{uf}\\mathbf K_{fu}\\sigma^{-2}| \\, |\\mathbf K_{uu}^{-1}| \\, |\\sigma^{2}\\mathbf I|\n",
    "$$\n",
    "\n",
    "Substituting $\\mathbf K_{uu} = {\\mathbf {L L}^\\top}$:\n",
    "$$\n",
    "|{\\mathbf Q_{ff}} + \\sigma^2 {\\mathbf I}| = |{\\mathbf {L L}^\\top} + \n",
    " \\mathbf K_{uf}\\mathbf K_{fu}\\sigma^{-2}| \\, |\\mathbf L^{-\\top}|\\,| \\mathbf L^{-1}| \\, |\\sigma^{2}\\mathbf I|\n",
    "$$\n",
    "\n",
    "$$\n",
    "|{\\mathbf Q_{ff}} + \\sigma^2 {\\mathbf I}| = |\\mathbf I + \n",
    " \\mathbf L^{-1}\\mathbf K_{uf}\\mathbf K_{fu} \\mathbf L^{-\\top}\\sigma^{-2}| \\, |\\sigma^{2}\\mathbf I|\n",
    "$$\n",
    "\n",
    "$$\n",
    "|{\\mathbf Q_{ff}} + \\sigma^2 {\\mathbf I}| = |\\mathbf B| \\, |\\sigma^{2}\\mathbf I|\n",
    "$$\n",
    "\n",
    "With these two definitions, we're ready to expand the bound:\n",
    "\n",
    "$$\n",
    "\\mathcal L = \\log \\mathcal N(\\mathbf y\\,|\\,\\mathbf 0,\\, \\mathbf Q_{ff} + \\sigma^2 \\mathbf I) - \\tfrac{1}{2} \\sigma^{-2}\\textrm{tr}(\\mathbf K_{ff} - \\mathbf Q_{ff})\n",
    "$$\n",
    "\n",
    "$$\n",
    "= -\\tfrac{N}{2}\\log{2\\pi} -\\tfrac{1}{2}\\log|\\mathbf Q_{ff}+\\sigma^2\\mathbf I| - \\tfrac{1}{2}\\mathbf y^\\top [ \\mathbf Q_{ff} + \\sigma^2 \\mathbf I]^{-1}\\mathbf y - \\tfrac{1}{2} \\sigma^{-2}\\textrm{tr}(\\mathbf K_{ff} - \\mathbf Q_{ff})\n",
    "$$\n",
    "\n",
    "$$\n",
    "= -\\tfrac{N}{2}\\log{2\\pi} -\\tfrac{1}{2}\\log(|\\mathbf B||\\sigma^{2}\\mathbf I|)  - \\tfrac{1}{2}\\sigma^{-2}\\mathbf y^\\top (\\mathbf I - \\sigma^{-2} \\mathbf A^{\\top} \\mathbf B^{-1}\\mathbf A)\\mathbf y - \\tfrac{1}{2} \\sigma^{-2}\\textrm{tr}(\\mathbf K_{ff} - \\mathbf Q_{ff})\n",
    "$$\n",
    "\n",
    "$$\n",
    "= -\\tfrac{N}{2}\\log{2\\pi} \n",
    "-\\tfrac{1}{2}\\log|\\mathbf B|\n",
    "-\\tfrac{N}{2}\\log\\sigma^{2}\n",
    "-\\tfrac{1}{2}\\sigma^{-2}\\mathbf y^\\top\\mathbf y\n",
    "+\\tfrac{1}{2}\\sigma^{-2}\\mathbf y^\\top\\mathbf A^{\\top} \\mathbf B^{-1}\\mathbf A\\mathbf y\n",
    "-\\tfrac{1}{2}\\sigma^{-2}\\textrm{tr}(\\mathbf K_{ff})\n",
    "+ \\tfrac{1}{2}\\textrm{tr}(\\mathbf {AA}^\\top)\n",
    "$$\n",
    "\n",
    "...where $\\sigma^{-2}\\textrm{tr}(\\mathbf Q) = \\textrm{tr}(\\mathbf {AA}^\\top)$.\n",
    "\n",
    "Finally, we define $\\mathbf c \\triangleq \\mathbf L_{\\mathbf B}^{-1}\\mathbf A\\mathbf y \\sigma^{-1}$, with $\\mathbf {L_BL_B}^\\top = \\mathbf B$, so that:\n",
    "\n",
    "$$\n",
    "\\sigma^{-2}\\mathbf y^\\top\\mathbf A^{\\top} \\mathbf B^{-1}\\mathbf A\\mathbf y = \n",
    "\\mathbf c^\\top \\mathbf c\n",
    "$$\n",
    "\n",
    "The `SGPR` code implements this equation with small changes for multiple concurrent outputs (columns of the data matrix Y), and also a prior mean function.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Prediction\n",
    "At prediction time, we need to compute the mean and variance of the variational approximation at some new points $\\mathbf X^\\star$.\n",
    "\n",
    "Following Hensman et al. (2013), we know that all the information in the posterior approximation is contained in the Gaussian distribution $q(\\mathbf u)$, which represents the distribution of function values at the inducing points $\\mathbf Z$. Remember that:\n",
    "\n",
    "$$\n",
    "q(\\mathbf u) = \\mathcal N(\\mathbf u\\,|\\,  \\mathbf m, \\mathbf \\Lambda)\n",
    "$$\n",
    "\n",
    "...with:\n",
    "\n",
    "$$\n",
    "\\mathbf \\Lambda = \\mathbf K_{uu}^{-1} + \\mathbf K_{uu}^{-1}\\mathbf K_{uf}\\mathbf K_{fu}\\mathbf K_{uu}^{-1} \\sigma^{-2}\n",
    "$$\n",
    "\n",
    "$$\n",
    "\\mathbf m = \\mathbf \\Lambda^{-1} \\mathbf K_{uu}^{-1}\\mathbf K_{uf}\\mathbf y\\sigma^{-2}\n",
    "$$\n",
    "\n",
    "To make a prediction, we need to integrate:\n",
    "\n",
    "$$\n",
    "p(\\mathbf f^\\star) = \\int p(\\mathbf f^\\star \\,|\\, \\mathbf u) q (\\mathbf u) \\textrm d \\mathbf u\n",
    "$$\n",
    "\n",
    "...with:\n",
    "\n",
    "$$\n",
    "p(\\mathbf f^\\star \\,|\\, \\mathbf u) = \\mathcal N(\\mathbf f^\\star\\,|\\, \\mathbf K_{\\star u}\\mathbf K_{uu}^{-1}\\mathbf u, \\,\\mathbf K_{\\star\\star} - \\mathbf K_{\\star u}\\mathbf K_{uu}^{-1}\\mathbf K_{u\\star})\n",
    "$$\n",
    "\n",
    "The integral results in:\n",
    "\n",
    "$$\n",
    "p(\\mathbf f^\\star) = \\mathcal N(\\mathbf f^\\star\\,|\\, \\mathbf K_{\\star u}\\mathbf K_{uu}^{-1}\\mathbf m, \\,\\mathbf K_{\\star\\star} - \\mathbf K_{\\star u}\\mathbf K_{uu}^{-1}\\mathbf K_{u\\star} + \\mathbf K_{\\star u}\\mathbf K_{uu}^{-1}\\mathbf \\Lambda^{-1}\\mathbf K_{uu}^{-1}\\mathbf K_{u\\star})\n",
    "$$\n",
    "\n",
    "Note from our above definitions we have:\n",
    "\n",
    "$$\n",
    "\\mathbf K_{uu}^{-1}\\mathbf \\Lambda^{-1}\\mathbf K_{uu}^{-1} = \n",
    "\\mathbf L^{-\\top}\\mathbf B^{-1}\\mathbf L^{-1}\n",
    "$$\n",
    "\n",
    "...and further:\n",
    "\n",
    "$$\n",
    "\\mathbf K_{uu}^{-1}\\mathbf m = \\mathbf L^{-\\top}\\mathbf L_{\\mathbf B}^{-\\top}\\mathbf c\n",
    "$$\n",
    "\n",
    "...substituting:\n",
    "\n",
    "$$\n",
    "p(\\mathbf f^\\star) = \\mathcal N(\\mathbf f^\\star\\,|\\, \\mathbf K_{\\star u}\\mathbf L^{-\\top}\\mathbf L_{\\mathbf B}^{-\\top}\\mathbf c, \\,\\mathbf K_{\\star\\star} - \\mathbf K_{\\star u}\\mathbf L^{-1}(\\mathbf I - \\mathbf B^{-1})\\mathbf L^{-1}\\mathbf K_{u\\star})\n",
    "$$\n",
    "\n",
    "The code in `SGPR` implements this equation, with an additional switch depending on whether the full covariance matrix is required.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## References\n",
    "\n",
    "[1] Titsias, M: Variational Learning of Inducing Variables in Sparse Gaussian Processes, PMLR 5:567-574, 2009\n",
    "\n",
    "[2] Hensman et al: Gaussian Processes for Big Data, UAI, 2013\n",
    "\n",
    "[3] Matthews et al: On Sparse Variational Methods and the Kullback-Leibler Divergence between Stochastic Processes, AISTATS, 2016\n"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}