{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Special Topics: Gaussian Processes\n", "### [Neil D. Lawrence](http://inverseprobability.com), University of Sheffield\n", "### 2015-12-15\n", "\n", "\n", "$$\n", "\\newcommand{\\tk}[1]{}\n", "%\\newcommand{\\tk}[1]{\\textbf{TK}: #1}\n", "\\newcommand{\\Amatrix}{\\mathbf{A}}\n", "\\newcommand{\\KL}[2]{\\text{KL}\\left( #1\\,\\|\\,#2 \\right)}\n", "\\newcommand{\\Kaast}{\\kernelMatrix_{\\mathbf{ \\ast}\\mathbf{ \\ast}}}\n", "\\newcommand{\\Kastu}{\\kernelMatrix_{\\mathbf{ \\ast} \\inducingVector}}\n", "\\newcommand{\\Kff}{\\kernelMatrix_{\\mappingFunctionVector \\mappingFunctionVector}}\n", "\\newcommand{\\Kfu}{\\kernelMatrix_{\\mappingFunctionVector \\inducingVector}}\n", "\\newcommand{\\Kuast}{\\kernelMatrix_{\\inducingVector \\bf\\ast}}\n", "\\newcommand{\\Kuf}{\\kernelMatrix_{\\inducingVector \\mappingFunctionVector}}\n", "\\newcommand{\\Kuu}{\\kernelMatrix_{\\inducingVector \\inducingVector}}\n", "\\newcommand{\\Kuui}{\\Kuu^{-1}}\n", "\\newcommand{\\Qaast}{\\mathbf{Q}_{\\bf \\ast \\ast}}\n", "\\newcommand{\\Qastf}{\\mathbf{Q}_{\\ast \\mappingFunction}}\n", "\\newcommand{\\Qfast}{\\mathbf{Q}_{\\mappingFunctionVector \\bf \\ast}}\n", "\\newcommand{\\Qff}{\\mathbf{Q}_{\\mappingFunctionVector \\mappingFunctionVector}}\n", "\\newcommand{\\aMatrix}{\\mathbf{A}}\n", "\\newcommand{\\aScalar}{a}\n", "\\newcommand{\\aVector}{\\mathbf{a}}\n", "\\newcommand{\\acceleration}{a}\n", "\\newcommand{\\bMatrix}{\\mathbf{B}}\n", "\\newcommand{\\bScalar}{b}\n", "\\newcommand{\\bVector}{\\mathbf{b}}\n", "\\newcommand{\\basisFunc}{\\phi}\n", "\\newcommand{\\basisFuncVector}{\\boldsymbol{ \\basisFunc}}\n", "\\newcommand{\\basisFunction}{\\phi}\n", "\\newcommand{\\basisLocation}{\\mu}\n", "\\newcommand{\\basisMatrix}{\\boldsymbol{ \\Phi}}\n", "\\newcommand{\\basisScalar}{\\basisFunction}\n", "\\newcommand{\\basisVector}{\\boldsymbol{ \\basisFunction}}\n", "\\newcommand{\\activationFunction}{\\phi}\n", "\\newcommand{\\activationMatrix}{\\boldsymbol{ \\Phi}}\n", "\\newcommand{\\activationScalar}{\\basisFunction}\n", "\\newcommand{\\activationVector}{\\boldsymbol{ \\basisFunction}}\n", "\\newcommand{\\bigO}{\\mathcal{O}}\n", "\\newcommand{\\binomProb}{\\pi}\n", "\\newcommand{\\cMatrix}{\\mathbf{C}}\n", "\\newcommand{\\cbasisMatrix}{\\hat{\\boldsymbol{ \\Phi}}}\n", "\\newcommand{\\cdataMatrix}{\\hat{\\dataMatrix}}\n", "\\newcommand{\\cdataScalar}{\\hat{\\dataScalar}}\n", "\\newcommand{\\cdataVector}{\\hat{\\dataVector}}\n", "\\newcommand{\\centeredKernelMatrix}{\\mathbf{ \\MakeUppercase{\\centeredKernelScalar}}}\n", "\\newcommand{\\centeredKernelScalar}{b}\n", "\\newcommand{\\centeredKernelVector}{\\centeredKernelScalar}\n", "\\newcommand{\\centeringMatrix}{\\mathbf{H}}\n", "\\newcommand{\\chiSquaredDist}[2]{\\chi_{#1}^{2}\\left(#2\\right)}\n", "\\newcommand{\\chiSquaredSamp}[1]{\\chi_{#1}^{2}}\n", "\\newcommand{\\conditionalCovariance}{\\boldsymbol{ \\Sigma}}\n", "\\newcommand{\\coregionalizationMatrix}{\\mathbf{B}}\n", "\\newcommand{\\coregionalizationScalar}{b}\n", "\\newcommand{\\coregionalizationVector}{\\mathbf{ \\coregionalizationScalar}}\n", "\\newcommand{\\covDist}[2]{\\text{cov}_{#2}\\left(#1\\right)}\n", "\\newcommand{\\covSamp}[1]{\\text{cov}\\left(#1\\right)}\n", "\\newcommand{\\covarianceScalar}{c}\n", "\\newcommand{\\covarianceVector}{\\mathbf{ \\covarianceScalar}}\n", "\\newcommand{\\covarianceMatrix}{\\mathbf{C}}\n", "\\newcommand{\\covarianceMatrixTwo}{\\boldsymbol{ \\Sigma}}\n", "\\newcommand{\\croupierScalar}{s}\n", 
"\\newcommand{\\croupierVector}{\\mathbf{ \\croupierScalar}}\n", "\\newcommand{\\croupierMatrix}{\\mathbf{ \\MakeUppercase{\\croupierScalar}}}\n", "\\newcommand{\\dataDim}{p}\n", "\\newcommand{\\dataIndex}{i}\n", "\\newcommand{\\dataIndexTwo}{j}\n", "\\newcommand{\\dataMatrix}{\\mathbf{Y}}\n", "\\newcommand{\\dataScalar}{y}\n", "\\newcommand{\\dataSet}{\\mathcal{D}}\n", "\\newcommand{\\dataStd}{\\sigma}\n", "\\newcommand{\\dataVector}{\\mathbf{ \\dataScalar}}\n", "\\newcommand{\\decayRate}{d}\n", "\\newcommand{\\degreeMatrix}{\\mathbf{ \\MakeUppercase{\\degreeScalar}}}\n", "\\newcommand{\\degreeScalar}{d}\n", "\\newcommand{\\degreeVector}{\\mathbf{ \\degreeScalar}}\n", "% Already defined by latex\n", "%\\newcommand{\\det}[1]{\\left|#1\\right|}\n", "\\newcommand{\\diag}[1]{\\text{diag}\\left(#1\\right)}\n", "\\newcommand{\\diagonalMatrix}{\\mathbf{D}}\n", "\\newcommand{\\diff}[2]{\\frac{\\text{d}#1}{\\text{d}#2}}\n", "\\newcommand{\\diffTwo}[2]{\\frac{\\text{d}^2#1}{\\text{d}#2^2}}\n", "\\newcommand{\\displacement}{x}\n", "\\newcommand{\\displacementVector}{\\textbf{\\displacement}}\n", "\\newcommand{\\distanceMatrix}{\\mathbf{ \\MakeUppercase{\\distanceScalar}}}\n", "\\newcommand{\\distanceScalar}{d}\n", "\\newcommand{\\distanceVector}{\\mathbf{ \\distanceScalar}}\n", "\\newcommand{\\eigenvaltwo}{\\ell}\n", "\\newcommand{\\eigenvaltwoMatrix}{\\mathbf{L}}\n", "\\newcommand{\\eigenvaltwoVector}{\\mathbf{l}}\n", "\\newcommand{\\eigenvalue}{\\lambda}\n", "\\newcommand{\\eigenvalueMatrix}{\\boldsymbol{ \\Lambda}}\n", "\\newcommand{\\eigenvalueVector}{\\boldsymbol{ \\lambda}}\n", "\\newcommand{\\eigenvector}{\\mathbf{ \\eigenvectorScalar}}\n", "\\newcommand{\\eigenvectorMatrix}{\\mathbf{U}}\n", "\\newcommand{\\eigenvectorScalar}{u}\n", "\\newcommand{\\eigenvectwo}{\\mathbf{v}}\n", "\\newcommand{\\eigenvectwoMatrix}{\\mathbf{V}}\n", "\\newcommand{\\eigenvectwoScalar}{v}\n", "\\newcommand{\\entropy}[1]{\\mathcal{H}\\left(#1\\right)}\n", "\\newcommand{\\errorFunction}{E}\n", "\\newcommand{\\expDist}[2]{\\left<#1\\right>_{#2}}\n", "\\newcommand{\\expSamp}[1]{\\left<#1\\right>}\n", "\\newcommand{\\expectation}[1]{\\left\\langle #1 \\right\\rangle }\n", "\\newcommand{\\expectationDist}[2]{\\left\\langle #1 \\right\\rangle _{#2}}\n", "\\newcommand{\\expectedDistanceMatrix}{\\mathcal{D}}\n", "\\newcommand{\\eye}{\\mathbf{I}}\n", "\\newcommand{\\fantasyDim}{r}\n", "\\newcommand{\\fantasyMatrix}{\\mathbf{ \\MakeUppercase{\\fantasyScalar}}}\n", "\\newcommand{\\fantasyScalar}{z}\n", "\\newcommand{\\fantasyVector}{\\mathbf{ \\fantasyScalar}}\n", "\\newcommand{\\featureStd}{\\varsigma}\n", "\\newcommand{\\gammaCdf}[3]{\\mathcal{GAMMA CDF}\\left(#1|#2,#3\\right)}\n", "\\newcommand{\\gammaDist}[3]{\\mathcal{G}\\left(#1|#2,#3\\right)}\n", "\\newcommand{\\gammaSamp}[2]{\\mathcal{G}\\left(#1,#2\\right)}\n", "\\newcommand{\\gaussianDist}[3]{\\mathcal{N}\\left(#1|#2,#3\\right)}\n", "\\newcommand{\\gaussianSamp}[2]{\\mathcal{N}\\left(#1,#2\\right)}\n", "\\newcommand{\\given}{|}\n", "\\newcommand{\\half}{\\frac{1}{2}}\n", "\\newcommand{\\heaviside}{H}\n", "\\newcommand{\\hiddenMatrix}{\\mathbf{ \\MakeUppercase{\\hiddenScalar}}}\n", "\\newcommand{\\hiddenScalar}{h}\n", "\\newcommand{\\hiddenVector}{\\mathbf{ \\hiddenScalar}}\n", "\\newcommand{\\identityMatrix}{\\eye}\n", "\\newcommand{\\inducingInputScalar}{z}\n", "\\newcommand{\\inducingInputVector}{\\mathbf{ \\inducingInputScalar}}\n", "\\newcommand{\\inducingInputMatrix}{\\mathbf{Z}}\n", "\\newcommand{\\inducingScalar}{u}\n", 
"\\newcommand{\\inducingVector}{\\mathbf{ \\inducingScalar}}\n", "\\newcommand{\\inducingMatrix}{\\mathbf{U}}\n", "\\newcommand{\\inlineDiff}[2]{\\text{d}#1/\\text{d}#2}\n", "\\newcommand{\\inputDim}{q}\n", "\\newcommand{\\inputMatrix}{\\mathbf{X}}\n", "\\newcommand{\\inputScalar}{x}\n", "\\newcommand{\\inputSpace}{\\mathcal{X}}\n", "\\newcommand{\\inputVals}{\\inputVector}\n", "\\newcommand{\\inputVector}{\\mathbf{ \\inputScalar}}\n", "\\newcommand{\\iterNum}{k}\n", "\\newcommand{\\kernel}{\\kernelScalar}\n", "\\newcommand{\\kernelMatrix}{\\mathbf{K}}\n", "\\newcommand{\\kernelScalar}{k}\n", "\\newcommand{\\kernelVector}{\\mathbf{ \\kernelScalar}}\n", "\\newcommand{\\kff}{\\kernelScalar_{\\mappingFunction \\mappingFunction}}\n", "\\newcommand{\\kfu}{\\kernelVector_{\\mappingFunction \\inducingScalar}}\n", "\\newcommand{\\kuf}{\\kernelVector_{\\inducingScalar \\mappingFunction}}\n", "\\newcommand{\\kuu}{\\kernelVector_{\\inducingScalar \\inducingScalar}}\n", "\\newcommand{\\lagrangeMultiplier}{\\lambda}\n", "\\newcommand{\\lagrangeMultiplierMatrix}{\\boldsymbol{ \\Lambda}}\n", "\\newcommand{\\lagrangian}{L}\n", "\\newcommand{\\laplacianFactor}{\\mathbf{ \\MakeUppercase{\\laplacianFactorScalar}}}\n", "\\newcommand{\\laplacianFactorScalar}{m}\n", "\\newcommand{\\laplacianFactorVector}{\\mathbf{ \\laplacianFactorScalar}}\n", "\\newcommand{\\laplacianMatrix}{\\mathbf{L}}\n", "\\newcommand{\\laplacianScalar}{\\ell}\n", "\\newcommand{\\laplacianVector}{\\mathbf{ \\ell}}\n", "\\newcommand{\\latentDim}{q}\n", "\\newcommand{\\latentDistanceMatrix}{\\boldsymbol{ \\Delta}}\n", "\\newcommand{\\latentDistanceScalar}{\\delta}\n", "\\newcommand{\\latentDistanceVector}{\\boldsymbol{ \\delta}}\n", "\\newcommand{\\latentForce}{f}\n", "\\newcommand{\\latentFunction}{u}\n", "\\newcommand{\\latentFunctionVector}{\\mathbf{ \\latentFunction}}\n", "\\newcommand{\\latentFunctionMatrix}{\\mathbf{ \\MakeUppercase{\\latentFunction}}}\n", "\\newcommand{\\latentIndex}{j}\n", "\\newcommand{\\latentScalar}{z}\n", "\\newcommand{\\latentVector}{\\mathbf{ \\latentScalar}}\n", "\\newcommand{\\latentMatrix}{\\mathbf{Z}}\n", "\\newcommand{\\learnRate}{\\eta}\n", "\\newcommand{\\lengthScale}{\\ell}\n", "\\newcommand{\\rbfWidth}{\\ell}\n", "\\newcommand{\\likelihoodBound}{\\mathcal{L}}\n", "\\newcommand{\\likelihoodFunction}{L}\n", "\\newcommand{\\locationScalar}{\\mu}\n", "\\newcommand{\\locationVector}{\\boldsymbol{ \\locationScalar}}\n", "\\newcommand{\\locationMatrix}{\\mathbf{M}}\n", "\\newcommand{\\variance}[1]{\\text{var}\\left( #1 \\right)}\n", "\\newcommand{\\mappingFunction}{f}\n", "\\newcommand{\\mappingFunctionMatrix}{\\mathbf{F}}\n", "\\newcommand{\\mappingFunctionTwo}{g}\n", "\\newcommand{\\mappingFunctionTwoMatrix}{\\mathbf{G}}\n", "\\newcommand{\\mappingFunctionTwoVector}{\\mathbf{ \\mappingFunctionTwo}}\n", "\\newcommand{\\mappingFunctionVector}{\\mathbf{ \\mappingFunction}}\n", "\\newcommand{\\scaleScalar}{s}\n", "\\newcommand{\\mappingScalar}{w}\n", "\\newcommand{\\mappingVector}{\\mathbf{ \\mappingScalar}}\n", "\\newcommand{\\mappingMatrix}{\\mathbf{W}}\n", "\\newcommand{\\mappingScalarTwo}{v}\n", "\\newcommand{\\mappingVectorTwo}{\\mathbf{ \\mappingScalarTwo}}\n", "\\newcommand{\\mappingMatrixTwo}{\\mathbf{V}}\n", "\\newcommand{\\maxIters}{K}\n", "\\newcommand{\\meanMatrix}{\\mathbf{M}}\n", "\\newcommand{\\meanScalar}{\\mu}\n", "\\newcommand{\\meanTwoMatrix}{\\mathbf{M}}\n", "\\newcommand{\\meanTwoScalar}{m}\n", "\\newcommand{\\meanTwoVector}{\\mathbf{ \\meanTwoScalar}}\n", 
"\\newcommand{\\meanVector}{\\boldsymbol{ \\meanScalar}}\n", "\\newcommand{\\mrnaConcentration}{m}\n", "\\newcommand{\\naturalFrequency}{\\omega}\n", "\\newcommand{\\neighborhood}[1]{\\mathcal{N}\\left( #1 \\right)}\n", "\\newcommand{\\neilurl}{http://inverseprobability.com/}\n", "\\newcommand{\\noiseMatrix}{\\boldsymbol{ E}}\n", "\\newcommand{\\noiseScalar}{\\epsilon}\n", "\\newcommand{\\noiseVector}{\\boldsymbol{ \\epsilon}}\n", "\\newcommand{\\norm}[1]{\\left\\Vert #1 \\right\\Vert}\n", "\\newcommand{\\normalizedLaplacianMatrix}{\\hat{\\mathbf{L}}}\n", "\\newcommand{\\normalizedLaplacianScalar}{\\hat{\\ell}}\n", "\\newcommand{\\normalizedLaplacianVector}{\\hat{\\mathbf{ \\ell}}}\n", "\\newcommand{\\numActive}{m}\n", "\\newcommand{\\numBasisFunc}{m}\n", "\\newcommand{\\numComponents}{m}\n", "\\newcommand{\\numComps}{K}\n", "\\newcommand{\\numData}{n}\n", "\\newcommand{\\numFeatures}{K}\n", "\\newcommand{\\numHidden}{h}\n", "\\newcommand{\\numInducing}{m}\n", "\\newcommand{\\numLayers}{\\ell}\n", "\\newcommand{\\numNeighbors}{K}\n", "\\newcommand{\\numSequences}{s}\n", "\\newcommand{\\numSuccess}{s}\n", "\\newcommand{\\numTasks}{m}\n", "\\newcommand{\\numTime}{T}\n", "\\newcommand{\\numTrials}{S}\n", "\\newcommand{\\outputIndex}{j}\n", "\\newcommand{\\paramVector}{\\boldsymbol{ \\theta}}\n", "\\newcommand{\\parameterMatrix}{\\boldsymbol{ \\Theta}}\n", "\\newcommand{\\parameterScalar}{\\theta}\n", "\\newcommand{\\parameterVector}{\\boldsymbol{ \\parameterScalar}}\n", "\\newcommand{\\partDiff}[2]{\\frac{\\partial#1}{\\partial#2}}\n", "\\newcommand{\\precisionScalar}{j}\n", "\\newcommand{\\precisionVector}{\\mathbf{ \\precisionScalar}}\n", "\\newcommand{\\precisionMatrix}{\\mathbf{J}}\n", "\\newcommand{\\pseudotargetScalar}{\\widetilde{y}}\n", "\\newcommand{\\pseudotargetVector}{\\mathbf{ \\pseudotargetScalar}}\n", "\\newcommand{\\pseudotargetMatrix}{\\mathbf{ \\widetilde{Y}}}\n", "\\newcommand{\\rank}[1]{\\text{rank}\\left(#1\\right)}\n", "\\newcommand{\\rayleighDist}[2]{\\mathcal{R}\\left(#1|#2\\right)}\n", "\\newcommand{\\rayleighSamp}[1]{\\mathcal{R}\\left(#1\\right)}\n", "\\newcommand{\\responsibility}{r}\n", "\\newcommand{\\rotationScalar}{r}\n", "\\newcommand{\\rotationVector}{\\mathbf{ \\rotationScalar}}\n", "\\newcommand{\\rotationMatrix}{\\mathbf{R}}\n", "\\newcommand{\\sampleCovScalar}{s}\n", "\\newcommand{\\sampleCovVector}{\\mathbf{ \\sampleCovScalar}}\n", "\\newcommand{\\sampleCovMatrix}{\\mathbf{s}}\n", "\\newcommand{\\scalarProduct}[2]{\\left\\langle{#1},{#2}\\right\\rangle}\n", "\\newcommand{\\sign}[1]{\\text{sign}\\left(#1\\right)}\n", "\\newcommand{\\sigmoid}[1]{\\sigma\\left(#1\\right)}\n", "\\newcommand{\\singularvalue}{\\ell}\n", "\\newcommand{\\singularvalueMatrix}{\\mathbf{L}}\n", "\\newcommand{\\singularvalueVector}{\\mathbf{l}}\n", "\\newcommand{\\sorth}{\\mathbf{u}}\n", "\\newcommand{\\spar}{\\lambda}\n", "\\newcommand{\\trace}[1]{\\text{tr}\\left(#1\\right)}\n", "\\newcommand{\\BasalRate}{B}\n", "\\newcommand{\\DampingCoefficient}{C}\n", "\\newcommand{\\DecayRate}{D}\n", "\\newcommand{\\Displacement}{X}\n", "\\newcommand{\\LatentForce}{F}\n", "\\newcommand{\\Mass}{M}\n", "\\newcommand{\\Sensitivity}{S}\n", "\\newcommand{\\basalRate}{b}\n", "\\newcommand{\\dampingCoefficient}{c}\n", "\\newcommand{\\mass}{m}\n", "\\newcommand{\\sensitivity}{s}\n", "\\newcommand{\\springScalar}{\\kappa}\n", "\\newcommand{\\springVector}{\\boldsymbol{ \\kappa}}\n", "\\newcommand{\\springMatrix}{\\boldsymbol{ \\mathcal{K}}}\n", "\\newcommand{\\tfConcentration}{p}\n", 
"\\newcommand{\\tfDecayRate}{\\delta}\n", "\\newcommand{\\tfMrnaConcentration}{f}\n", "\\newcommand{\\tfVector}{\\mathbf{ \\tfConcentration}}\n", "\\newcommand{\\velocity}{v}\n", "\\newcommand{\\sufficientStatsScalar}{g}\n", "\\newcommand{\\sufficientStatsVector}{\\mathbf{ \\sufficientStatsScalar}}\n", "\\newcommand{\\sufficientStatsMatrix}{\\mathbf{G}}\n", "\\newcommand{\\switchScalar}{s}\n", "\\newcommand{\\switchVector}{\\mathbf{ \\switchScalar}}\n", "\\newcommand{\\switchMatrix}{\\mathbf{S}}\n", "\\newcommand{\\tr}[1]{\\text{tr}\\left(#1\\right)}\n", "\\newcommand{\\loneNorm}[1]{\\left\\Vert #1 \\right\\Vert_1}\n", "\\newcommand{\\ltwoNorm}[1]{\\left\\Vert #1 \\right\\Vert_2}\n", "\\newcommand{\\onenorm}[1]{\\left\\vert#1\\right\\vert_1}\n", "\\newcommand{\\twonorm}[1]{\\left\\Vert #1 \\right\\Vert}\n", "\\newcommand{\\vScalar}{v}\n", "\\newcommand{\\vVector}{\\mathbf{v}}\n", "\\newcommand{\\vMatrix}{\\mathbf{V}}\n", "\\newcommand{\\varianceDist}[2]{\\text{var}_{#2}\\left( #1 \\right)}\n", "% Already defined by latex\n", "%\\newcommand{\\vec}{#1:}\n", "\\newcommand{\\vecb}[1]{\\left(#1\\right):}\n", "\\newcommand{\\weightScalar}{w}\n", "\\newcommand{\\weightVector}{\\mathbf{ \\weightScalar}}\n", "\\newcommand{\\weightMatrix}{\\mathbf{W}}\n", "\\newcommand{\\weightedAdjacencyMatrix}{\\mathbf{A}}\n", "\\newcommand{\\weightedAdjacencyScalar}{a}\n", "\\newcommand{\\weightedAdjacencyVector}{\\mathbf{ \\weightedAdjacencyScalar}}\n", "\\newcommand{\\onesVector}{\\mathbf{1}}\n", "\\newcommand{\\zerosVector}{\\mathbf{0}}\n", "$$\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "## Review\n", "\n", "- Last week: Logistic Regression and Generalised Linear Models\n", "- Introduced link functions and different transformations.\n", "- Showed examples in classification and mentioned possibilities for\n", " disease rate models.\n", "- This week:\n", " - Gaussian Processes: non parametric Bayesian modelling }\n", "\n", "Over the last two sessions we've begun considering classification models\n", "and logistic regresssion. In particular, for naive Bayes, we considered\n", "a set of assumptions that allowed us to build a joint model of our data\n", "set. In particular for naive Bayes we specified\n", "\n", "1. Data conditional independence.\n", "2. Feature conditional independence.\n", "3. Marginal likelihood of labels was Bernoulli distributed.\n", "\n", "This allowed us to specify the joint density of our labels and our input\n", "data, $p(\\dataVector, \\inputMatrix|\\boldsymbol{\\theta})$. And we\n", "conditioned on the training data to make predictions about the test\n", "data.\n", "\n", "## Generalized Linear Models\n", "\n", "Logistic regression is part of a wider class of models known as\n", "*generalized linear models*. In these models we determine that some\n", "characteristic of the model is speicified by a function that is liniear\n", "in the parameters. So we might suggest that $$\n", "\\log \\frac{p(\\inputVector)}{1-p(\\inputVector)} = \\mappingFunction(\\inputVector; \\mappingVector)\n", "$$ where $\\mappingFunction(\\inputVector; \\mappingVector)$ is a\n", "linear-in-the-parameters function (here the parameters are\n", "$\\mappingVector$, which is generally non-linear in the inputs. 
So far we\n", "have considered basis function models of the form $$\n", "\\mappingFunction(\\inputVector) =\n", "\\mappingVector^\\top \\basisVector(\\inputVector).\n", "$$ When we form a Gaussian process we do something that is slightly more\n", "akin to the naive Bayes approach, but actually is closely related to the\n", "generalized linear model approach.\n", "\n", "## Gaussian Processes \\[edit\\]\n", "\n", "Models where we model the entire joint distribution of our training\n", "data, $p(\\dataVector, \\inputMatrix)$, are sometimes described as\n", "*generative models*, because we can use sampling to generate data sets\n", "that represent all our assumptions. However, as we discussed in the\n", "sessions on \\refnotes{logistic regression}{logistic-regression} and\n", "\\refnotes{naive Bayes}{naive-bayes}, this can be a bad idea, because if\n", "our assumptions are wrong then we can make poor predictions. We can try\n", "to make more complex assumptions about the data to alleviate the problem,\n", "but then this typically leads to challenges for tractable application of\n", "the sum and product rules of probability that are needed to compute the relevant\n", "marginal and conditional densities. If we know the form of the question\n", "we wish to answer then we typically try and represent that directly,\n", "through $p(\\dataVector|\\inputMatrix)$. In practice, we also have been\n", "making assumptions of conditional independence given the model\n", "parameters, $$\n", "p(\\dataVector|\\inputMatrix, \\mappingVector) =\n", "\\prod_{i=1}^{\\numData} p(\\dataScalar_i | \\inputVector_i, \\mappingVector)\n", "$$ Gaussian processes are *not* normally considered to be *generative\n", "models*, but we will be much more interested in the principles of\n", "conditioning in Gaussian processes because we will use conditioning to\n", "make predictions between our test and training data. We will avoid the\n", "data conditional independence assumption in favour of a richer assumption\n", "about the data: in a Gaussian process we assume the data is *jointly\n", "Gaussian* with a particular mean and covariance, $$\n", "\\dataVector|\\inputMatrix \\sim \\gaussianSamp{\\mathbf{m}(\\inputMatrix)}{\\kernelMatrix(\\inputMatrix)},\n", "$$ where the conditioning is on the inputs $\\inputMatrix$ which are used\n", "for computing the mean and covariance. 
For this reason they are known as\n", "mean and covariance functions.\n", "\n", "## Prediction Across Two Points with GPs \\[edit\\]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "np.random.seed(4949)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import teaching_plots as plot\n", "import pods" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pods\n", "from ipywidgets import IntSlider" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pods.notebook.display_plots('two_point_sample{sample:0>3}.svg', \n", " '../slides/diagrams/gp', \n", " sample=IntSlider(9, 9, 12, 1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: The joint Gaussian over $\\mappingFunction_1$ and\n", "$\\mappingFunction_2$ along with the conditional distribution of\n", "$\\mappingFunction_2$ given $\\mappingFunction_1$\n", "\n", "- The single contour of the Gaussian density represents the\n", " joint distribution,\n", " $p(\\mappingFunction_1, \\mappingFunction_2)$\n", "\n", ". . .\n", "\n", "- We observe that $\\mappingFunction_1=?$\n", "\n", ". . .\n", "\n", "- Conditional density:\n", " $p(\\mappingFunction_2|\\mappingFunction_1=?)$\n", "\n", "- Prediction of $\\mappingFunction_2$ from $\\mappingFunction_1$\n", " requires *conditional density*.\n", "\n", "- Conditional density is *also* Gaussian. $$\n", " p(\\mappingFunction_2|\\mappingFunction_1) = {\\mathcal{N}\\left(\\mappingFunction_2|\\frac{\\kernelScalar_{1, 2}}{\\kernelScalar_{1, 1}}\\mappingFunction_1,\\kernelScalar_{2, 2} - \\frac{\\kernelScalar_{1,2}^2}{\\kernelScalar_{1,1}}\\right)}\n", " $$ where covariance of joint density is given by $$\n", " \\kernelMatrix= \\begin{bmatrix} \\kernelScalar_{1, 1} & \\kernelScalar_{1, 2}\\\\ \\kernelScalar_{2, 1} & \\kernelScalar_{2, 2}\\end{bmatrix}\n", " $$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pods\n", "from ipywidgets import IntSlider" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pods.notebook.display_plots('two_point_sample{sample:0>3}.svg', \n", " '../slides/diagrams/gp', \n", " sample=IntSlider(13, 13, 17, 1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: Sample from the joint Gaussian model, points indexed by 1 and\n", "8 highlighted.\n", "\n", "\n", "\n", "Figure: The joint Gaussian over $\\mappingFunction_1$ and\n", "$\\mappingFunction_8$ along with the conditional distribution of\n", "$\\mappingFunction_8$ given $\\mappingFunction_1$\n", "\n", "- The single contour of the Gaussian density represents the\n", " joint distribution,\n", " $p(\\mappingFunction_1, \\mappingFunction_8)$\n", "\n", ". . .\n", "\n", "- We observe a value for\n", " $\\mappingFunction_1=-?$\n", "\n", ". . .\n", "\n", "- Conditional density:\n", " $p(\\mappingFunction_5|\\mappingFunction_1=?)$.\n", "\n", "- Prediction of $\\mappingFunctionVector_*$ from\n", " $\\mappingFunctionVector$ requires multivariate *conditional\n", " density*.\n", "\n", "- Multivariate conditional density is *also* Gaussian. 
$$\n", " p(\\mappingFunctionVector_*|\\mappingFunctionVector) = {\\mathcal{N}\\left(\\mappingFunctionVector_*|\\kernelMatrix_{*,\\mappingFunctionVector}\\kernelMatrix_{\\mappingFunctionVector,\\mappingFunctionVector}^{-1}\\mappingFunctionVector,\\kernelMatrix_{*,*}-\\kernelMatrix_{*,\\mappingFunctionVector} \\kernelMatrix_{\\mappingFunctionVector,\\mappingFunctionVector}^{-1}\\kernelMatrix_{\\mappingFunctionVector,*}\\right)}\n", " $$ \n", "\n", "- Here covariance of joint density is given by $$\n", " \\kernelMatrix= \\begin{bmatrix} \\kernelMatrix_{\\mappingFunctionVector, \\mappingFunctionVector} & \\kernelMatrix_{*, \\mappingFunctionVector}\\\\ \\kernelMatrix_{\\mappingFunctionVector, *} & \\kernelMatrix_{*, *}\\end{bmatrix}\n", " $$\n", "\n", "- Prediction of $\\mappingFunctionVector_*$ from\n", " $\\mappingFunctionVector$ requires multivariate *conditional\n", " density*.\n", "\n", "- Multivariate conditional density is *also* Gaussian. $$\n", " p(\\mappingFunctionVector_*|\\mappingFunctionVector) = {\\mathcal{N}\\left(\\mappingFunctionVector_*|{\\boldsymbol{{\\mu}}},\\boldsymbol{\\Sigma}\\right)}\n", " $$ $$\n", " {\\boldsymbol{{\\mu}}}= \\kernelMatrix_{*,\\mappingFunctionVector}\\kernelMatrix_{\\mappingFunctionVector,\\mappingFunctionVector}^{-1}\\mappingFunctionVector\n", " $$\n", " $$\\boldsymbol{\\Sigma} = \\kernelMatrix_{*,*}-\\kernelMatrix_{*,\\mappingFunctionVector} \\kernelMatrix_{\\mappingFunctionVector,\\mappingFunctionVector}^{-1}\\kernelMatrix_{\\mappingFunctionVector,*}\n", " $$ \n", "\n", "- Here covariance of joint density is given by $$\n", " \\kernelMatrix= \\begin{bmatrix} \\kernelMatrix_{\\mappingFunctionVector, \\mappingFunctionVector} & \\kernelMatrix_{*, \\mappingFunctionVector}\\\\ \\kernelMatrix_{\\mappingFunctionVector, *} & \\kernelMatrix_{*, *}\\end{bmatrix}\n", " $$\n", "\n", "## Marginal Likelihood \\[edit\\]\n", "\n", "To understand the Gaussian process we're going to build on our\n", "understanding of the marginal likelihood for Bayesian regression. In the\n", "session on \\refnotes{Bayesian regression}{bayesian-regression} we\n", "sampled directly from the weight vector, $\\mappingVector$ and applied it\n", "to the basis matrix $\\basisMatrix$ to obtain a sample from the prior and\n", "a sample from the posterior. It is often helpful to think of modeling\n", "techniques as *generative* models. To give some thought as to what the\n", "process for obtaining data from the model is. From the perspective of\n", "Gaussian processes, we want to start by thinking of basis function\n", "models, where the parameters are sampled from a prior, but move to\n", "thinking about sampling from the marginal likelihood directly.\n", "\n", "## Sampling from the Prior\n", "\n", "The first thing we'll do is to set up the parameters of the model, these\n", "include the parameters of the prior, the parameters of the basis\n", "functions and the noise level." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# set prior variance on w\n", "alpha = 4.\n", "# set the order of the polynomial basis set\n", "degree = 5\n", "# set the noise variance\n", "sigma2 = 0.01" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we have the variance, we can sample from the prior distribution to\n", "see what form we are imposing on the functions *a priori*.\n", "\n", "Let's now compute a range of values to make predictions at, spanning the\n", "*new* space of inputs," ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def polynomial(x, degree, loc, scale):\n", " degrees = np.arange(degree+1)\n", " return ((x-loc)/scale)**degrees" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "now let's build the basis matrices. First we load in the data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pods" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = pods.datasets.olympic_marathon_men()\n", "x = data['X']\n", "y = data['Y']" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "loc = 1950.\n", "scale = 100.\n", "num_data = x.shape[0]\n", "num_pred_data = 100 # how many points to use for plotting predictions\n", "x_pred = np.linspace(1880, 2030, num_pred_data)[:, None] # input locations for predictions\n", "Phi_pred = polynomial(x_pred, degree=degree, loc=loc, scale=scale)\n", "Phi = polynomial(x, degree=degree, loc=loc, scale=scale)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Weight Space View\n", "\n", "To generate typical functional predictions from the model, we need a set\n", "of model parameters. 
We assume that the parameters are drawn\n", "independently from a Gaussian density, $$\n", "\\weightVector \\sim \\gaussianSamp{\\zerosVector}{\\alpha\\eye},\n", "$$ then we can combine this with the definition of our prediction\n", "function $\\mappingFunction(\\inputVector)$, $$\n", "\\mappingFunction(\\inputVector) =\n", "\\weightVector^\\top \\basisVector(\\inputVector).\n", "$$ We can now sample from the prior density to obtain a vector\n", "$\\weightVector$ using the function `np.random.normal` and combine these\n", "parameters with our basis to create some samples of what\n", "$\\mappingFunction(\\inputVector)$ looks like," ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "num_samples = 10\n", "K = degree+1\n", "for i in range(num_samples):\n", "    z_vec = np.random.normal(size=(K, 1))\n", "    w_sample = z_vec*np.sqrt(alpha)\n", "    f_sample = np.dot(Phi_pred, w_sample)\n", "    plt.plot(x_pred, f_sample)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Function Space View\n", "\n", "The process we have used to generate the samples is a two-stage process.\n", "To obtain each function, we first generated a sample from the prior, $$\n", "\\weightVector \\sim \\gaussianSamp{\\zerosVector}{\\alpha \\eye}\n", "$$ then if we compose our basis matrix, $\\basisMatrix$, from the basis\n", "functions associated with each row then we get, $$\n", "\\basisMatrix = \\begin{bmatrix}\\basisVector(\\inputVector_1) \\\\ \\vdots \\\\\n", "\\basisVector(\\inputVector_n)\\end{bmatrix}\n", "$$ then we can write down the vector of function values, as evaluated at\n", "those inputs,\n", "$$\n", "\\mappingFunctionVector = \\begin{bmatrix} \\mappingFunction_1\n", "\\\\ \\vdots \\\\ \\mappingFunction_n\\end{bmatrix}\n", "$$ in the form $$\n", "\\mappingFunctionVector = \\basisMatrix\n", "\\weightVector.\n", "$$\n", "\n", "Now we can use standard properties of multivariate Gaussians to write\n", "down the probability density that is implied over\n", "$\\mappingFunctionVector$. In particular we know that if $\\weightVector$\n", "is sampled from a multivariate normal (or multivariate Gaussian) with\n", "covariance $\\alpha \\eye$ and zero mean, then assuming that\n", "$\\basisMatrix$ is a deterministic matrix (i.e. it is not sampled from a\n", "probability density) then the vector $\\mappingFunctionVector$ will also\n", "be distributed according to a zero mean multivariate normal as follows,\n", "$$\n", "\\mappingFunctionVector \\sim \\gaussianSamp{\\zerosVector}{\\alpha \\basisMatrix\\basisMatrix^\\top}.\n", "$$\n", "\n", "The question now is, what happens if we sample $\\mappingFunctionVector$\n", "directly from this density, rather than first sampling $\\weightVector$\n", "and then multiplying by $\\basisMatrix$? Let's try this. 
First of all we\n", "define the covariance as $$\n", "\\kernelMatrix = \\alpha\n", "\\basisMatrix\\basisMatrix^\\top.\n", "$$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "K = alpha*np.dot(Phi_pred, Phi_pred.T)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can use the `np.random.multivariate_normal` command for sampling\n", "from a multivariate normal with covariance given by $\\kernelMatrix$ and\n", "zero mean," ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for i in np.arange(10):\n", "    f_sample = np.random.multivariate_normal(mean=np.zeros(x_pred.size), cov=K)\n", "    plt.plot(x_pred.flatten(), f_sample.flatten())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The samples appear very similar to those which we obtained indirectly.\n", "That is no surprise because they are effectively drawn from the same\n", "multivariate normal density. However, when sampling\n", "$\\mappingFunctionVector$ directly we created the covariance for\n", "$\\mappingFunctionVector$. We can visualise the form of this covariance\n", "as an image in Python with a colorbar to show scale." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=(8,8))\n", "im = ax.imshow(K, interpolation='none')\n", "fig.colorbar(im)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This image shows the covariance expressed between different points on the\n", "function. In regression we normally also add independent Gaussian noise\n", "to obtain our observations $\\dataVector$, $$\n", "\\dataVector = \\mappingFunctionVector + \\boldsymbol{\\epsilon}\n", "$$ where the noise is sampled from an independent Gaussian distribution\n", "with variance $\\dataStd^2$, $$\n", "\\boldsymbol{\\epsilon} \\sim \\gaussianSamp{\\zerosVector}{\\dataStd^2\\eye}.\n", "$$ We can use properties of Gaussian variables, i.e. 
the fact that the sum\n", "of two Gaussian variables is also Gaussian, and that its covariance is\n", "given by the sum of the two covariances, whilst the mean is given by the\n", "sum of the means, to write down the marginal likelihood, $$\n", "\\dataVector \\sim \\gaussianSamp{\\zerosVector}{\\basisMatrix\\basisMatrix^\\top +\\dataStd^2\\eye}.\n", "$$ Sampling directly from this density gives us the noise-corrupted\n", "functions," ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "K = alpha*np.dot(Phi_pred, Phi_pred.T) + sigma2*np.eye(x_pred.size)\n", "for i in range(10):\n", "    y_sample = np.random.multivariate_normal(mean=np.zeros(x_pred.size), cov=K)\n", "    plt.plot(x_pred.flatten(), y_sample.flatten())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "where the effect of our noise term is to roughen the sampled functions.\n", "We can also increase the variance of the noise to see a different\n", "effect," ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sigma2 = 1.\n", "K = alpha*np.dot(Phi_pred, Phi_pred.T) + sigma2*np.eye(x_pred.size)\n", "for i in range(10):\n", "    y_sample = np.random.multivariate_normal(mean=np.zeros(x_pred.size), cov=K)\n", "    plt.plot(x_pred.flatten(), y_sample.flatten())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise\n", "\n", "**Function Space Reflection** How do you include the noise term when\n", "sampling in the weight space point of view?\n", "\n", "### Write your answer to Exercise here" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Use this box for any code you need for the exercise\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Non-degenerate Gaussian Processes \\[edit\\]\n", "\n", "The process described above is degenerate. The covariance function is of\n", "rank at most $\\numHidden$ and since the theoretical amount of data could\n", "always increase $\\numData \\rightarrow \\infty$, the covariance function\n", "is not full rank. This means as we increase the amount of data to\n", "infinity, there will come a point where we can't normalize the process\n", "because the multivariate Gaussian has the form, $$\n", "\\gaussianDist{\\mappingFunctionVector}{\\zerosVector}{\\kernelMatrix} = \\frac{1}{\\left(2\\pi\\right)^{\\frac{\\numData}{2}}\\det{\\kernelMatrix}^\\frac{1}{2}} \\exp\\left(-\\frac{\\mappingFunctionVector^\\top\\kernelMatrix^{-1} \\mappingFunctionVector}{2}\\right)\n", "$$ and a degenerate covariance matrix leads to $\\det{\\kernelMatrix} = 0$\n", "defeating the normalization (it's equivalent to finding a projection in\n", "the high dimensional Gaussian where the variance of the resulting\n", "univariate Gaussian is zero, i.e. there is a null space on the\n", "covariance, or alternatively you can imagine there are one or more\n", "directions where the Gaussian has become the delta function).\n", "\n", "In the machine learning field, it was Radford Neal [@Neal:bayesian94]\n", "who realized the potential of the next step. In his 1994 thesis, he was\n", "considering Bayesian neural networks, of the type we described above,\n", "and considered what would happen if you took the number of hidden\n", "nodes, or neurons, to infinity, i.e. $\\numHidden \\rightarrow \\infty$.\n",
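"\n",
"As a loose numerical illustration of this limit (a sketch added for this notebook rather than part of Neal's argument: the tanh activation, standard Gaussian weight priors and the $1/\\sqrt{\\numHidden}$ output scaling are all assumptions made for the demonstration), we can sample single-hidden-layer networks with increasing numbers of hidden units and watch the output at a fixed input behave more and more like a Gaussian,\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def sample_network_outputs(x_point, num_hidden, num_networks=2000):\n",
"    # one-hidden-layer networks: Gaussian weights and biases, tanh activations,\n",
"    # output weights scaled by 1/sqrt(num_hidden) so the output variance stays finite\n",
"    w1 = np.random.normal(size=(num_networks, num_hidden))\n",
"    b1 = np.random.normal(size=(num_networks, num_hidden))\n",
"    w2 = np.random.normal(size=(num_networks, num_hidden))/np.sqrt(num_hidden)\n",
"    return (w2*np.tanh(w1*x_point + b1)).sum(axis=1)\n",
"\n",
"for num_hidden in [1, 3, 100]:\n",
"    f_samples = sample_network_outputs(0.5, num_hidden)\n",
"    # as num_hidden grows the histogram of f_samples approaches a Gaussian\n",
"    print(num_hidden, f_samples.mean(), f_samples.var())\n",
"```\n",
"\n",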
"\n", "\n", "\n", "Figure: Page 37 of [Radford Neal's 1994\n", "thesis](http://www.cs.toronto.edu/~radford/ftp/thesis.pdf)\n", "\n", "In loose terms, what Radford considers is what happens to the elements\n", "of the covariance function, $$\n", " \\begin{align*}\n", " \\kernel_\\mappingFunction\\left(\\inputVector_i, \\inputVector_j\\right) & = \\alpha \\activationVector\\left(\\mappingMatrix_1, \\inputVector_i\\right)^\\top \\activationVector\\left(\\mappingMatrix_1, \\inputVector_j\\right)\\\\\n", " & = \\alpha \\sum_k \\activationScalar\\left(\\mappingVector^{(1)}_k, \\inputVector_i\\right) \\activationScalar\\left(\\mappingVector^{(1)}_k, \\inputVector_j\\right)\n", " \\end{align*}\n", " $$ if instead of considering a finite number you sample infinitely\n", "many of these activation functions, sampling parameters from a prior\n", "density, $p(\\mappingVectorTwo)$, for each one, $$\n", "\\kernel_\\mappingFunction\\left(\\inputVector_i, \\inputVector_j\\right) = \\alpha \\int \\activationScalar\\left(\\mappingVector^{(1)}, \\inputVector_i\\right) \\activationScalar\\left(\\mappingVector^{(1)}, \\inputVector_j\\right) p(\\mappingVector^{(1)}) \\text{d}\\mappingVector^{(1)}\n", "$$ And that's not *only* for Gaussian $p(\\mappingVectorTwo)$. In fact\n", "this result holds for a range of activations, and a range of prior\n", "densities because of the *central limit theorem*.\n", "\n", "To put it in the form of a probabilistic program: as long as the\n", "distribution for $\\phi_i$ implied by this short probabilistic program,\n", "$$\n", " \\begin{align*}\n", " \\mappingVectorTwo & \\sim p(\\cdot)\\\\\n", " \\phi_i & = \\activationScalar\\left(\\mappingVectorTwo, \\inputVector_i\\right), \n", " \\end{align*}\n", " $$ has finite variance, the result of taking the number of hidden\n", "units to infinity, with appropriate scaling, is also a Gaussian process.\n", "\n", "## Further Reading\n", "\n", "To understand this argument in more detail, I highly recommend reading\n", "chapter 2 of Neal's thesis [@Neal:bayesian94], which remains easy to\n", "read and clear today. Indeed, for readers interested in Bayesian neural\n", "networks, both Radford Neal's and David MacKay's PhD theses\n", "[@MacKay:bayesian92] remain essential reading. Both theses embody a\n", "clarity of thought, and an ability to weave together threads from\n", "different fields that was the business of machine learning in the 1990s.\n", "Radford and David were also pioneers in making their software widely\n", "available and publishing material on the web.\n", "\n", "## Gaussian Process\n", "\n", "In our \\refnotes{session on Bayesian regression}{bayesian-regression} we\n", "sampled from the prior over parameters. Through the properties of\n", "multivariate Gaussian densities this prior over parameters implies a\n", "particular density for our data observations, $\\dataVector$. In this\n", "session we sampled directly from this distribution for our data,\n", "avoiding the intermediate weight-space representation. This is the\n", "approach taken by *Gaussian processes*. In a Gaussian process you\n", "specify the *covariance function* directly, rather than *implicitly*\n", "through a basis matrix and a prior over parameters. Gaussian processes\n", "have the advantage that they can be *nonparametric*, which in simple\n", "terms means that they can have *infinite* basis functions. 
In the\n", "lectures we introduced the *exponentiated quadratic* covariance, also\n", "known as the RBF or the Gaussian or the squared exponential covariance\n", "function. This covariance function is specified by $$\n", "\\kernelScalar(\\inputVector, \\inputVector^\\prime) = \\alpha \\exp\\left( -\\frac{\\left\\Vert \\inputVector-\\inputVector^\\prime\\right\\Vert^2}{2\\ell^2}\\right),\n", "$$ where $\\left\\Vert\\inputVector - \\inputVector^\\prime\\right\\Vert^2$ is\n", "the squared distance between the two input vectors, $$\n", "\\left\\Vert\\inputVector - \\inputVector^\\prime\\right\\Vert^2 = (\\inputVector - \\inputVector^\\prime)^\\top (\\inputVector - \\inputVector^\\prime) \n", "$$ Let's build a covariance matrix based on this function. First we\n", "define the form of the covariance function," ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%load -s eq_cov mlai.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use this to compute *directly* the covariance for\n", "$\\mappingFunctionVector$ at the points given by `x_pred`. Let's define a\n", "new function `K()` which does this," ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%load -s Kernel mlai.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can compute the resulting covariance," ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "kernel = Kernel(function=eq_cov, variance=1., lengthscale=10.)\n", "K = kernel.K(x_pred, x_pred)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To visualise the covariance between the points we can use the `imshow`\n", "function in matplotlib." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=(8,8))\n", "im = ax.imshow(K, interpolation='none')\n", "fig.colorbar(im)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we can sample functions from the marginal likelihood." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=(8, 5))\n", "for i in range(10):\n", "    y_sample = np.random.multivariate_normal(mean=np.zeros(x_pred.size), cov=K)\n", "    ax.plot(x_pred.flatten(), y_sample.flatten())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise\n", "\n", "**Moving Parameters** Have a play with the parameters for this\n", "covariance function (the lengthscale and the variance) and see what\n", "effects the parameters have on the types of functions you observe.\n", "\n", "### Write your answer to Exercise here" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Use this box for any code you need for the exercise\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Bayesian Inference by Rejection Sampling \\[edit\\]\n", "\n", "One view of Bayesian inference is to assume we are given a mechanism for\n", "generating samples, where we assume that mechanism represents an\n", "accurate view of the way we believe the world works.\n", "\n", "This mechanism is known as our *prior* belief.\n", "\n", "We combine our prior belief with our observations of the real world by\n", "discarding all those samples that are inconsistent with our observations. The\n", "*likelihood* defines mathematically what we mean by inconsistent with\n", "the observations. 
The higher the noise level in the likelihood, the looser the\n", "notion of consistent.\n", "\n", "The samples that remain are considered to be samples from the\n", "*posterior*.\n", "\n", "This approach to Bayesian inference is closely related to two sampling\n", "techniques known as *rejection sampling* and *importance sampling*. It\n", "is realized in practice in an approach known as *approximate Bayesian\n", "computation* (ABC) or likelihood-free inference.\n", "\n", "In practice, the algorithm is often too slow to be practical, because\n", "most samples will be inconsistent with the data and as a result the\n", "mechanism has to be operated many times to obtain a few posterior\n", "samples.\n", "\n", "However, in the Gaussian process case, when the likelihood also assumes\n", "Gaussian noise, we can operate this mechanism mathematically, and obtain\n", "the posterior density *analytically*. This is the benefit of Gaussian\n", "processes." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pods\n", "from ipywidgets import IntSlider" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pods.notebook.display_plots('gp_rejection_sample{sample:0>3}.png', \n", " directory='../slides/diagrams/gp', \n", " sample=IntSlider(1,1,5,1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n", "\n", "Figure: One view of Bayesian inference is we have a machine for\n", "generating samples (the *prior*), and we discard all samples\n", "inconsistent with our data, leaving the samples of interest (the\n", "*posterior*). The Gaussian process allows us to do this\n", "analytically.\n", "\n", "## Gaussian Process\n", "\n", "The Gaussian process perspective takes the marginal likelihood of the\n", "data to be a joint Gaussian density with a covariance given by\n", "$\\kernelMatrix$. So the model likelihood is of the form, $$\n", "p(\\dataVector|\\inputMatrix) =\n", "\\frac{1}{(2\\pi)^{\\frac{\\numData}{2}}|\\kernelMatrix|^{\\frac{1}{2}}}\n", "\\exp\\left(-\\frac{1}{2}\\dataVector^\\top \\left(\\kernelMatrix+\\dataStd^2\n", "\\eye\\right)^{-1}\\dataVector\\right)\n", "$$ where the input data, $\\inputMatrix$, influences the density through\n", "the covariance matrix, $\\kernelMatrix$ whose elements are computed\n", "through the covariance function,\n", "$\\kernelScalar(\\inputVector, \\inputVector^\\prime)$.\n", "\n", "This means that the negative log likelihood (the objective function) is\n", "given by, $$\n", "\\errorFunction(\\boldsymbol{\\theta}) = \\frac{1}{2} \\log |\\kernelMatrix|\n", "+ \\frac{1}{2} \\dataVector^\\top \\left(\\kernelMatrix +\n", "\\dataStd^2\\eye\\right)^{-1}\\dataVector\n", "$$ where the *parameters* of the model are also embedded in the\n", "covariance function, they include the parameters of the kernel (such as\n", "lengthscale and variance), and the noise variance, $\\dataStd^2$. Let's\n", "create a class in python for storing these variables." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%load -s GP mlai.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Making Predictions\n", "\n", "We now have a probability density that represents functions. How do we\n", "make predictions with this density? The density is known as a process\n", "because it is *consistent*. 
By consistency, here, we mean that the model\n", "makes predictions for $\\mappingFunctionVector$ that are unaffected by\n", "future values of $\\mappingFunctionVector^*$ that are currently\n", "unobserved (such as test points). If we think of\n", "$\\mappingFunctionVector^*$ as test points, we can still write down a\n", "joint probability density over the training observations,\n", "$\\mappingFunctionVector$, and the test observations,\n", "$\\mappingFunctionVector^*$. This joint probability density will be\n", "Gaussian, with a covariance matrix given by our covariance function,\n", "$\\kernelScalar(\\inputVector_i, \\inputVector_j)$. $$\n", "\\begin{bmatrix}\\mappingFunctionVector \\\\ \\mappingFunctionVector^*\\end{bmatrix} \\sim \\gaussianSamp{\\zerosVector}{\\begin{bmatrix} \\kernelMatrix & \\kernelMatrix_\\ast \\\\\n", "\\kernelMatrix_\\ast^\\top & \\kernelMatrix_{\\ast,\\ast}\\end{bmatrix}}\n", "$$ where here $\\kernelMatrix$ is the covariance computed between all the\n", "training points, $\\kernelMatrix_\\ast$ is the covariance matrix computed\n", "between the training points and the test points, and\n", "$\\kernelMatrix_{\\ast,\\ast}$ is the covariance matrix computed between all\n", "the test points and themselves. To be clear, let's compute these now\n", "for our example, using `x` and `y` for the training data (although `y`\n", "doesn't enter the covariance) and `x_pred` as the test locations." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# set covariance function parameters\n", "variance = 16.0\n", "lengthscale = 8\n", "# set noise variance\n", "sigma2 = 0.05\n", "\n", "kernel = Kernel(eq_cov, variance=variance, lengthscale=lengthscale)\n", "K = kernel.K(x, x)\n", "K_star = kernel.K(x, x_pred)\n", "K_starstar = kernel.K(x_pred, x_pred)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we use this structure to visualise the covariance between test data\n", "and training data. This structure is how information is passed between\n", "test and training data. Unlike the maximum likelihood formalisms we've\n", "been considering so far, the structure expresses *correlation* between\n", "our different data points. However, just like the\n", "\\refnotes{naive Bayes approach}{naive-bayes}, we now have a *joint\n", "density* between some variables of interest. In particular we have the\n", "joint density over\n", "$p(\\mappingFunctionVector, \\mappingFunctionVector^*)$. The joint density\n", "is *Gaussian* and *zero mean*. It is specified entirely by the\n", "*covariance matrix*, $\\kernelMatrix$. That covariance matrix is, in\n", "turn, defined by a covariance function. Now we will visualise that\n", "covariance in the form of the matrix, $$\n", "\\begin{bmatrix} \\kernelMatrix & \\kernelMatrix_\\ast \\\\ \\kernelMatrix_\\ast^\\top\n", "& \\kernelMatrix_{\\ast,\\ast}\\end{bmatrix}\n", "$$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=(8,8))\n", "im = ax.imshow(np.vstack([np.hstack([K, K_star]), np.hstack([K_star.T, K_starstar])]), interpolation='none')\n", "# Add lines for separating training and test data\n", "ax.axvline(x.shape[0]-1, color='w')\n", "ax.axhline(x.shape[0]-1, color='w')\n", "fig.colorbar(im)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are four blocks to this color plot. The upper left block is the\n", "covariance of the training data with itself, $\\kernelMatrix$. 
We see\n", "some structure here due to the missing data from the First and Second\n", "World Wars. Alongside this covariance (to the right and below) we see\n", "the cross covariance between the training and the test data\n", "($\\kernelMatrix_*$ and $\\kernelMatrix_*^\\top$). This is giving us the\n", "covariation between our training and our test data. Finally, the lower\n", "right block is the covariance of the test data with itself,\n", "$\\kernelMatrix_{*,*}$. The banded structure we now observe is because some of the\n", "training points are near to some of the test points. This is how we\n", "obtain 'communication' between our training data and our test data. If\n", "there is no structure in $\\kernelMatrix_*$ then our belief about the\n", "test data simply matches our prior.\n", "\n", "## Conditional Density\n", "\n", "Just as in naive Bayes, we first defined the joint density (although\n", "there it was over both the labels and the inputs,\n", "$p(\\dataVector, \\inputMatrix)$) and now we need to define *conditional*\n", "distributions that answer particular questions of interest. In\n", "particular we might be interested in finding out the values of the\n", "prediction function at the test data given those at the\n", "training data, $p(\\mappingFunctionVector_*|\\mappingFunctionVector)$. Or\n", "if we include noise in the training observations then we are interested\n", "in the conditional density for the prediction function at the test\n", "locations given the training observations,\n", "$p(\\mappingFunctionVector^*|\\dataVector)$.\n", "\n", "As ever, all the various questions we could ask about this density can be\n", "answered using the *sum rule* and the *product rule*. For the\n", "multivariate normal density the mathematics involved is that of *linear\n", "algebra*, with a particular emphasis on the *partitioned inverse* or\n", "[*block matrix\n", "inverse*](http://en.wikipedia.org/wiki/Invertible_matrix#Blockwise_inversion),\n", "but the details are beyond the scope of this course, so you don't need to worry\n", "about remembering them or rederiving them. We are simply writing them\n", "here because it is this *conditional* density that is necessary for\n", "making predictions.\n", "\n", "The conditional density is also a multivariate normal, $$\n", "\\mappingFunctionVector^* | \\mappingFunctionVector \\sim \\gaussianSamp{\\meanVector_\\mappingFunction}{\\mathbf{C}_\\mappingFunction}\n", "$$ with a mean given by $$\n", "\\meanVector_\\mappingFunction = \\kernelMatrix_*^\\top \\left[\\kernelMatrix + \\dataStd^2\n", "\\eye\\right]^{-1} \\dataVector\n", "$$ and a covariance given by $$\n", "\\mathbf{C}_\\mappingFunction\n", "= \\kernelMatrix_{*,*} - \\kernelMatrix_*^\\top \\left[\\kernelMatrix + \\dataStd^2\n", "\\eye\\right]^{-1} \\kernelMatrix_\\ast.\n", "$$ Let's compute what those posterior predictions are for the Olympic\n", "marathon data."
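,
"\n",
"Before using the pre-built helper from `mlai.py`, here is a minimal direct sketch of that conditional (it reuses the `K`, `K_star`, `K_starstar`, `y` and `sigma2` computed above; the `posterior_f` function loaded in the next cell is assumed to perform an equivalent computation),\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# direct computation of the conditional (posterior) mean and covariance\n",
"A = np.linalg.solve(K + sigma2*np.eye(x.shape[0]), K_star)  # A = [K + sigma^2 I]^{-1} K_*\n",
"mu_f_direct = np.dot(A.T, y)                                # posterior mean\n",
"C_f_direct = K_starstar - np.dot(K_star.T, A)               # posterior covariance\n",
"```"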
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%load -s posterior_f mlai.py" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# attach the new method to class GP():\n", "GP.posterior_f = posterior_f" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# use the eq_cov covariance function loaded above\n", "model = GP(x, y, sigma2, eq_cov, variance=variance, lengthscale=lengthscale)\n", "mu_f, C_f = model.posterior_f(x_pred)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "where for convenience we've defined\n", "\n", "$$\n", "\\mathbf{A} = \\left[\\kernelMatrix + \\dataStd^2\\eye\\right]^{-1}\\kernelMatrix_*.\n", "$$\n", "\n", "We can visualize the covariance of the *conditional*," ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=(8,8))\n", "im = ax.imshow(C_f, interpolation='none')\n", "fig.colorbar(im)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "and we can plot the mean of the conditional" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.plot(x, y, 'rx')\n", "plt.plot(x_pred, mu_f, 'b-')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "as well as the associated error bars. These are given (similarly to the\n", "Bayesian parametric model from the last lab) by the standard deviations\n", "of the marginal posterior densities. The marginal posterior variances\n", "are given by the diagonal elements of the posterior covariance," ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "var_f = np.diag(C_f)[:, None]\n", "std_f = np.sqrt(var_f)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "They can be added to the underlying mean function to give the error\n", "bars," ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.plot(x, y, 'rx')\n", "plt.plot(x_pred, mu_f, 'b-')\n", "plt.plot(x_pred, mu_f+2*std_f, 'b--')\n", "plt.plot(x_pred, mu_f-2*std_f, 'b--')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This gives us a prediction from the Gaussian process. Remember machine\n", "learning is $$\n", "\\text{data} + \\text{model} \\rightarrow \\text{prediction}.\n", "$$ Here our data is from the Olympics, and our model is a Gaussian\n", "process with two parameters. The assumptions about the world are encoded\n", "entirely into our Gaussian process covariance. The GP covariance assumes\n", "that the function is highly smooth, and that correlation falls off with\n", "distance (scaled according to the length scale, $\\ell$). The model\n", "sustains the uncertainty about the function; this means we see an\n", "increase in the size of the error bars during periods like the First and\n", "Second World Wars when no Olympic marathon was held.\n", "\n", "### Exercise\n", "\n", "Now try changing the parameters of the covariance function (and the\n", "noise) to see how the predictions change.\n", "\n", "Then try sampling from this conditional density to see what your\n", "predictions look like. What happens if you sample from the conditional\n", "density in regions a long way into the future or the past? 
How does this\n", "compare with the results from the polynomial model?\n", "\n", "### Write your answer to Exercise here" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Use this box for any code you need for the exercise\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The Importance of the Covariance Function\n", "\n", "The covariance function encapsulates our assumptions about the data. The\n", "equations for the distribution of the prediction function, given the\n", "training observations, are highly sensitive to the covariation between\n", "the test locations and the training locations as expressed by the matrix\n", "$\\kernelMatrix_*$. We defined a matrix $\\mathbf{A}$ which allowed us to\n", "express our conditional mean in the form, $$\n", "\\meanVector_\\mappingFunction = \\mathbf{A}^\\top \\dataVector,\n", "$$ where $\\dataVector$ were our *training observations*. In other words\n", "our mean predictions are always a linear weighted combination of our\n", "*training data*. The weights are given by computing the covariation\n", "between the training and the test data ($\\kernelMatrix_*$) and scaling\n", "it by the inverse covariance of the training data observations,\n", "$\\left[\\kernelMatrix + \\dataStd^2 \\eye\\right]^{-1}$. This inverse is the\n", "main computational object that needs to be resolved for a Gaussian\n", "process. It has a computational burden which is $O(\\numData^3)$ and a\n", "storage burden which is $O(\\numData^2)$. This makes working with\n", "Gaussian processes computationally intensive for the situation where\n", "$\\numData>10,000$." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from IPython.lib.display import YouTubeVideo\n", "YouTubeVideo('ewJ3AxKclOg')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Improving the Numerics\n", "\n", "In practice we shouldn't use the matrix inverse directly to solve the\n", "GP system. A more stable way is to compute the *Cholesky\n", "decomposition* of the kernel matrix. The log determinant of the\n", "covariance can also be derived from the Cholesky decomposition." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%load -s update_inverse mlai.py" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "GP.update_inverse = update_inverse" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Capacity Control\n", "\n", "Gaussian processes are sometimes seen as part of a wider family of\n", "methods known as kernel methods. Kernel methods are also based around\n", "covariance functions, but in the field they are known as Mercer kernels.\n", "Mercer kernels have interpretations as inner products in potentially\n", "infinite dimensional Hilbert spaces. This interpretation arises because,\n", "if we take $\\alpha=1$, then the kernel can be expressed as $$\n", "\\kernelMatrix = \\basisMatrix\\basisMatrix^\\top \n", "$$ which implies the elements of the kernel are given by, $$\n", "\\kernelScalar(\\inputVector, \\inputVector^\\prime) = \\basisVector(\\inputVector)^\\top \\basisVector(\\inputVector^\\prime).\n", "$$ So we see that the kernel function is developed from an inner product\n", "between the basis functions. 
Mercer's theorem tells us that any valid\n",
    "*positive definite function* can be expressed as this inner product but\n",
    "with the caveat that the inner product could be *infinite length*. This\n",
    "idea has been used quite widely to *kernelize* algorithms that depend on\n",
    "inner products. The kernel functions are equivalent to covariance\n",
    "functions and they are parameterized accordingly. In the kernel modeling\n",
    "community it is generally accepted that kernel parameter estimation is a\n",
    "difficult problem and the normal solution is to cross-validate to obtain\n",
    "parameters. This can cause difficulties when a large number of kernel\n",
    "parameters need to be estimated. In Gaussian process modelling, kernel\n",
    "parameter estimation (in the simplest case) proceeds by maximum\n",
    "likelihood. This involves taking gradients of the likelihood with\n",
    "respect to the parameters of the covariance function.\n",
    "\n",
    "## Gradients of the Likelihood\n",
    "\n",
    "The easiest conceptual way to obtain the gradients is a two-step\n",
    "process. The first step involves taking the gradient of the likelihood\n",
    "with respect to the covariance function; the second step involves\n",
    "considering the gradient of the covariance function with respect to its\n",
    "parameters.\n",
    "\n",
    "## Overall Process Scale\n",
    "\n",
    "In general we won't be able to find parameters of the covariance\n",
    "function through fixed point equations; we will need to do gradient\n",
    "based optimization.\n",
    "\n",
    "## Capacity Control and Data Fit\n",
    "\n",
    "The objective function can be decomposed into two terms, a capacity\n",
    "control term, and a data fit term. The capacity control term is the log\n",
    "determinant of the covariance. The data fit term is the matrix inner\n",
    "product between the data and the inverse covariance.\n",
    "\n",
    "Can we determine covariance parameters from the data?\n",
    "\n",
    "$$\n",
    "\gaussianDist{\dataVector}{\mathbf{0}}{\kernelMatrix}=\frac{1}{(2\pi)^{\frac{\numData}{2}}\det{\kernelMatrix}^{\frac{1}{2}}}\exp\left(-\frac{\dataVector^{\top}\kernelMatrix^{-1}\dataVector}{2}\right)\n",
    "$$\n",
    "\n",
    "Taking the logarithm gives\n",
    "\n",
    "$$\n",
    "\begin{aligned}\n",
    "  \log \gaussianDist{\dataVector}{\mathbf{0}}{\kernelMatrix}=&-\frac{1}{2}\log\det{\kernelMatrix}-\frac{\dataVector^{\top}\kernelMatrix^{-1}\dataVector}{2} \\ &-\frac{\numData}{2}\log2\pi\n",
    "\end{aligned}\n",
    "$$\n",
    "\n",
    "so the negative log likelihood, viewed as a function of the covariance\n",
    "parameters $\parameterVector$, is\n",
    "\n",
    "$$\n",
    "\errorFunction(\parameterVector) = \frac{1}{2}\log\det{\kernelMatrix} + \frac{\dataVector^{\top}\kernelMatrix^{-1}\dataVector}{2}\n",
    "$$\n",
    "\n",
    "The parameters are *inside* the covariance function (matrix).\n",
    "$$\kernelScalar_{i, j} = \kernelScalar(\inputVals_i, \inputVals_j; \parameterVector)$$\n",
    "\n",
    "$$\kernelMatrix = \rotationMatrix \eigenvalueMatrix^2 \rotationMatrix^\top$$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gpoptimizePlot1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
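We can check this representation numerically. The sketch below uses an\n",
    "arbitrary small covariance matrix (not the data above) to confirm that\n",
    "$\rotationMatrix$ is a rotation and that the determinant of\n",
    "$\kernelMatrix$ is the product of the squared eigenvalues.\n",
    "\n",
    "``` {.python}\n",
    "import numpy as np\n",
    "\n",
    "# a small positive definite covariance matrix for illustration\n",
    "X = np.linspace(0, 1, 5)[:, None]\n",
    "K = np.exp(-0.5*(X - X.T)**2/0.3**2) + 1e-6*np.eye(5)\n",
    "\n",
    "# eigendecomposition K = R Lambda^2 R^T\n",
    "eigvals, R = np.linalg.eigh(K)   # eigvals holds the diagonal of Lambda^2\n",
    "print(np.allclose(R@np.diag(eigvals)@R.T, K))                      # reconstruction\n",
    "print(np.allclose(R.T@R, np.eye(5)))                                # R is a rotation\n",
    "print(np.allclose(np.linalg.slogdet(K)[1], np.log(eigvals).sum()))  # determinant identity\n",
    "```\n",
    "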
\n", "\n", "\n", "$\\eigenvalueMatrix$ represents distance on axes. $\\rotationMatrix$ gives\n", "rotation.\n", "
\n", "- $\\eigenvalueMatrix$ is *diagonal*,\n", " $\\rotationMatrix^\\top\\rotationMatrix = \\eye$.\n", "- Useful representation since\n", " $\\det{\\kernelMatrix} = \\det{\\eigenvalueMatrix^2} = \\det{\\eigenvalueMatrix}^2$.\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
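The objective can be computed directly from these two terms. Below is a\n",
    "minimal sketch, assuming `y` is the centred vector of observations and\n",
    "`K` is the covariance matrix with the noise variance already added to\n",
    "its diagonal; it uses the Cholesky decomposition discussed above rather\n",
    "than a direct matrix inverse.\n",
    "\n",
    "``` {.python}\n",
    "import numpy as np\n",
    "import scipy.linalg as la\n",
    "\n",
    "def gp_objective(K, y):\n",
    "    # E(theta) = 0.5*log det K + 0.5*y^T K^{-1} y (constant in 2*pi dropped)\n",
    "    L = np.linalg.cholesky(K)\n",
    "    alpha = la.cho_solve((L, True), y)             # K^{-1} y without forming the inverse\n",
    "    data_fit = 0.5*np.dot(y.ravel(), alpha.ravel())\n",
    "    capacity = np.sum(np.log(np.diag(L)))          # 0.5*log det K from the Cholesky factor\n",
    "    return data_fit + capacity\n",
    "```\n",
    "\n",
    "Evaluating `gp_objective` across a grid of lengthscales (rebuilding `K`\n",
    "each time) traces out the data fit and capacity curves illustrated in\n",
    "this section.\n",
    "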
\n",
    "Figure: Variation in the data fit term, the capacity term and the\n",
    "negative log likelihood for different lengthscales.\n",
    "\n",
    "## Exponentiated Quadratic Covariance \\[edit\\]\n",
    "\n",
    "The exponentiated quadratic covariance is also known as the Gaussian\n",
    "covariance, the RBF covariance or the squared exponential covariance.\n",
    "The covariance between two points is related to the negative exponential\n",
    "of the squared distance between those points. This covariance function\n",
    "can be derived in a few different ways: as the infinite limit of a\n",
    "radial basis function neural network, as diffusion in the heat equation,\n",
    "as a Gaussian filter in *Fourier space* or as the composition of a\n",
    "series of linear filters applied to a base function.\n",
    "\n",
    "The covariance takes the following form, $$\n",
    "\kernelScalar(\inputVector, \inputVector^\prime) = \alpha \exp\left(-\frac{\ltwoNorm{\inputVector-\inputVector^\prime}^2}{2\lengthScale^2}\right)\n",
    "$$ where $\ell$ is the *length scale* or *time scale* of the process and\n",
    "$\alpha$ represents the overall process variance.\n",
    "\n",
    "
\n", "$$\\kernelScalar(\\inputVector, \\inputVector^\\prime) = \\alpha \\exp\\left(-\\frac{\\ltwoNorm{\\inputVector-\\inputVector^\\prime}^2}{2\\lengthScale^2}\\right)$$\n", "
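A minimal standalone implementation is sketched below (the\n",
    "`exponentiated_quadratic` function used when constructing the `GP` model\n",
    "above plays the same role in this notebook):\n",
    "\n",
    "``` {.python}\n",
    "import numpy as np\n",
    "\n",
    "def eq_cov(x, x_prime, variance=1., lengthscale=1.):\n",
    "    # exponentiated quadratic covariance between two input vectors\n",
    "    r = np.linalg.norm(np.asarray(x) - np.asarray(x_prime))\n",
    "    return variance*np.exp(-0.5*r**2/lengthscale**2)\n",
    "```\n",
    "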
\n", "Figure: The exponentiated quadratic covariance function.\n", "\n", "## Olympic Marathon Data \\[edit\\]\n", "\n", "\n", "\n", "\n", "\n", "\n", "
\n",
    "- Gold medal times for the Olympic Marathon since 1896.\n",
    "- Marathons before 1924 didn't have a standardised distance.\n",
    "- Present results using pace per km.\n",
    "- In 1904 the Marathon was badly organised, leading to very slow times.\n",
    "\n",
    "Image from Wikimedia Commons \n",
    "
\n",
    "The first thing we will do is load a standard data set for regression\n",
    "modelling. The data consists of the pace of Olympic Gold Medal Marathon\n",
    "winners for the Olympics from 1896 to present. First we load in the data\n",
    "and plot.\n",
    "\n",
    "``` {.python}\n",
    "import numpy as np\n",
    "import pods\n",
    "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = pods.datasets.olympic_marathon_men()\n",
    "x = data['X']\n",
    "y = data['Y']\n",
    "\n",
    "offset = y.mean()\n",
    "scale = np.sqrt(y.var())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n",
    "import teaching_plots as plot\n",
    "import mlai" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n",
    "xlim = (1875,2030)\n",
    "ylim = (2.5, 6.5)\n",
    "yhat = (y-offset)/scale\n",
    "\n",
    "fig, ax = plt.subplots(figsize=plot.big_wide_figsize)\n",
    "_ = ax.plot(x, y, 'r.',markersize=10)\n",
    "ax.set_xlabel('year', fontsize=20)\n",
    "ax.set_ylabel('pace min/km', fontsize=20)\n",
    "ax.set_xlim(xlim)\n",
    "ax.set_ylim(ylim)\n",
    "\n",
    "mlai.write_figure(figure=fig, \n",
    "                  filename='../slides/diagrams/datasets/olympic-marathon.svg', \n",
    "                  transparent=True, \n",
    "                  frameon=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n",
    "\n",
    "Figure: Olympic marathon pace times since 1896.\n",
    "\n",
    "Things to notice about the data include the outlier in 1904: in this\n",
    "year the Olympics was in St Louis, USA. Organizational problems and\n",
    "challenges with dust kicked up by the cars following the race meant that\n",
    "participants got lost, and only very few participants completed.\n",
    "\n",
    "More recent years see more consistently quick marathons.\n",
    "\n",
    "## Alan Turing \\[edit\\]\n",
    "\n",
    "
\n", "Figure: Alan Turing, in 1946 he was only 11 minutes slower than the\n", "winner of the 1948 games. Would he have won a hypothetical games held in\n", "1946? Source: [Alan Turing Internet\n", "Scrapbook](http://www.turing.org.uk/scrapbook/run.html).\n", "\n", "If we had to summarise the objectives of machine learning in one word, a\n", "very good candidate for that word would be *generalization*. What is\n", "generalization? From a human perspective it might be summarised as the\n", "ability to take lessons learned in one domain and apply them to another\n", "domain. If we accept the definition given in the first session for\n", "machine learning, $$\n", "\\text{data} + \\text{model} \\xrightarrow{\\text{compute}} \\text{prediction}\n", "$$ then we see that without a model we can't generalise: we only have\n", "data. Data is fine for answering very specific questions, like \"Who won\n", "the Olympic Marathon in 2012?\", because we have that answer stored,\n", "however, we are not given the answer to many other questions. For\n", "example, Alan Turing was a formidable marathon runner, in 1946 he ran a\n", "time 2 hours 46 minutes (just under four minutes per kilometer, faster\n", "than I and most of the other [Endcliffe Park\n", "Run](http://www.parkrun.org.uk/sheffieldhallam/) runners can do 5 km).\n", "What is the probability he would have won an Olympics if one had been\n", "held in 1946?\n", "\n", "To answer this question we need to generalize, but before we formalize\n", "the concept of generalization let's introduce some formal representation\n", "of what it means to generalize in machine learning.\n", "\n", "Our first objective will be to perform a Gaussian process fit to the\n", "data, we'll do this using the [GPy\n", "software](https://github.com/SheffieldML/GPy)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import GPy" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "m_full = GPy.models.GPRegression(x,yhat)\n", "_ = m_full.optimize() # Optimize parameters of covariance function" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first command sets up the model, then `m_full.optimize()` optimizes\n", "the parameters of the covariance function and the noise level of the\n", "model. Once the fit is complete, we'll try creating some test points,\n", "and computing the output of the GP model in terms of the mean and\n", "standard deviation of the posterior functions between 1870 and 2030. We\n", "plot the mean function and the standard deviation at 200 locations. We\n", "can obtain the predictions using `y_mean, y_var = m_full.predict(xt)`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "xt = np.linspace(1870,2030,200)[:,np.newaxis]\n", "yt_mean, yt_var = m_full.predict(xt)\n", "yt_sd=np.sqrt(yt_var)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we plot the results using the helper function in `teaching_plots`." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import teaching_plots as plot" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=plot.big_wide_figsize)\n",
    "plot.model_output(m_full, scale=scale, offset=offset, ax=ax, xlabel='year', ylabel='pace min/km', fontsize=20, portion=0.2)\n",
    "ax.set_xlim(xlim)\n",
    "ax.set_ylim(ylim)\n",
    "mlai.write_figure(figure=fig,\n",
    "                  filename='../slides/diagrams/gp/olympic-marathon-gp.svg', \n",
    "                  transparent=True, frameon=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n",
    "\n",
    "Figure: Gaussian process fit to the Olympic Marathon data. The error\n",
    "bars are too large, perhaps due to the outlier from 1904.\n",
    "\n",
    "## Fit Quality\n",
    "\n",
    "In the fit we see that the error bars (coming mainly from the noise\n",
    "variance) are quite large. This is likely due to the outlier point in\n",
    "1904; ignoring that point we can see that a tighter fit is obtained. To\n",
    "see this we make a version of the model, `m_clean`, where that point is\n",
    "removed." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x_clean=np.vstack((x[0:2, :], x[3:, :]))\n",
    "y_clean=np.vstack((y[0:2, :], y[3:, :]))\n",
    "\n",
    "m_clean = GPy.models.GPRegression(x_clean,y_clean)\n",
    "_ = m_clean.optimize()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Gene Expression Example \\[edit\\]\n",
    "\n",
    "We now consider an example in gene expression. Gene expression is the\n",
    "measurement of mRNA levels expressed in cells. These mRNA levels show\n",
    "which genes are 'switched on' and producing data. In the example we will\n",
    "use a Gaussian process to determine whether a given gene is active, or\n",
    "we are merely observing a noise response.\n",
    "\n",
    "## Della Gatta Gene Data \\[edit\\]\n",
    "\n",
    "- Given expression levels in the form of a time series from\n",
    " @DellaGatta:direct08." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n",
    "import pods" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = pods.datasets.della_gatta_TRP63_gene_expression(data_set='della_gatta',gene_number=937)\n",
    "\n",
    "x = data['X']\n",
    "y = data['Y']\n",
    "\n",
    "offset = y.mean()\n",
    "scale = np.sqrt(y.var())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n",
    "import teaching_plots as plot\n",
    "import mlai" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n",
    "xlim = (-20,260)\n",
    "ylim = (5, 7.5)\n",
    "yhat = (y-offset)/scale\n",
    "\n",
    "fig, ax = plt.subplots(figsize=plot.big_wide_figsize)\n",
    "_ = ax.plot(x, y, 'r.',markersize=10)\n",
    "ax.set_xlabel('time/min', fontsize=20)\n",
    "ax.set_ylabel('expression', fontsize=20)\n",
    "ax.set_xlim(xlim)\n",
    "ax.set_ylim(ylim)\n",
    "\n",
    "mlai.write_figure(figure=fig, \n",
    "                  filename='../slides/diagrams/datasets/della-gatta-gene.svg', \n",
    "                  transparent=True, \n",
    "                  frameon=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n",
    "\n",
    "Figure: Gene expression levels over time for a gene from data\n",
    "provided by @DellaGatta:direct08. 
We would like to understand whether\n",
    "there is signal in the data, or we are only observing noise.\n",
    "\n",
    "- Want to detect if a gene is expressed or not, fit a GP to each gene\n",
    " @Kalaitzis:simple11.\n",
    "\n",
    "\n",
    "\n",
    "Figure: The example is taken from the paper \"A Simple Approach to\n",
    "Ranking Differentially Expressed Gene Expression Time Courses through\n",
    "Gaussian Process Regression.\" @Kalaitzis:simple11.\n",
    "\n",
    "
\n", "Our first objective will be to perform a Gaussian process fit to the\n", "data, we'll do this using the [GPy\n", "software](https://github.com/SheffieldML/GPy)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import GPy" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "m_full = GPy.models.GPRegression(x,yhat)\n", "m_full.kern.lengthscale=50\n", "_ = m_full.optimize() # Optimize parameters of covariance function" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Initialize the length scale parameter (which here actually represents a\n", "*time scale* of the covariance function) to a reasonable value. Default\n", "would be 1, but here we set it to 50 minutes, given points are arriving\n", "across zero to 250 minutes." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "xt = np.linspace(-20,260,200)[:,np.newaxis]\n", "yt_mean, yt_var = m_full.predict(xt)\n", "yt_sd=np.sqrt(yt_var)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we plot the results using the helper function in `teaching_plots`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import teaching_plots as plot" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=plot.big_wide_figsize)\n", "plot.model_output(m_full, scale=scale, offset=offset, ax=ax, xlabel='time/min', ylabel='expression', fontsize=20, portion=0.2)\n", "ax.set_xlim(xlim)\n", "ax.set_ylim(ylim)\n", "ax.set_title('log likelihood: {ll:.3}'.format(ll=m_full.log_likelihood()), fontsize=20)\n", "mlai.write_figure(figure=fig,\n", " filename='../slides/diagrams/gp/della-gatta-gene-gp.svg', \n", " transparent=True, frameon=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: Result of the fit of the Gaussian process model with the time\n", "scale parameter initialized to 50 minutes.\n", "\n", "Now we try a model initialized with a longer length scale." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "m_full2 = GPy.models.GPRegression(x,yhat)\n", "m_full2.kern.lengthscale=2000\n", "_ = m_full2.optimize() # Optimize parameters of covariance function" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import teaching_plots as plot" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=plot.big_wide_figsize)\n", "plot.model_output(m_full2, scale=scale, offset=offset, ax=ax, xlabel='time/min', ylabel='expression', fontsize=20, portion=0.2)\n", "ax.set_xlim(xlim)\n", "ax.set_ylim(ylim)\n", "ax.set_title('log likelihood: {ll:.3}'.format(ll=m_full2.log_likelihood()), fontsize=20)\n", "mlai.write_figure(figure=fig,\n", " filename='../slides/diagrams/gp/della-gatta-gene-gp2.svg', \n", " transparent=True, frameon=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: Result of the fit of the Gaussian process model with the time\n", "scale parameter initialized to 2000 minutes.\n", "\n", "Now we try a model initialized with a lower noise." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "m_full3 = GPy.models.GPRegression(x,yhat)\n",
    "m_full3.kern.lengthscale=20\n",
    "m_full3.likelihood.variance=0.001\n",
    "_ = m_full3.optimize() # Optimize parameters of covariance function" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import teaching_plots as plot" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=plot.big_wide_figsize)\n",
    "plot.model_output(m_full3, scale=scale, offset=offset, ax=ax, xlabel='time/min', ylabel='expression', fontsize=20, portion=0.2)\n",
    "ax.set_xlim(xlim)\n",
    "ax.set_ylim(ylim)\n",
    "ax.set_title('log likelihood: {ll:.3}'.format(ll=m_full3.log_likelihood()), fontsize=20)\n",
    "mlai.write_figure(figure=fig,\n",
    "                  filename='../slides/diagrams/gp/della-gatta-gene-gp3.svg', \n",
    "                  transparent=True, frameon=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n",
    "\n",
    "Figure: Result of the fit of the Gaussian process model with the\n",
    "noise variance initialized low (0.001) and the time scale parameter\n",
    "initialized to 20 minutes.\n",
    "\n",
    "## Example: Prediction of Malaria Incidence in Uganda \\[edit\\]\n",
    "\n",
    "As an example of using Gaussian process models within the full pipeline\n",
    "from data to decision, we'll consider the prediction of malaria incidence\n",
    "in Uganda. For the purposes of this study malaria reports come in two\n",
    "forms, HMIS reports from health centres and Sentinel data, which is\n",
    "curated by the WHO. There are limited sentinel sites and many HMIS\n",
    "sites.\n",
    "\n",
    "The work is from Ricardo Andrade Pacheco's PhD thesis, completed in\n",
    "collaboration with John Quinn and Martin Mubangizi\n",
    "[@Andrade:consistent14; @Mubangizi:malaria14]. John and Martin were\n",
    "initially from the AI-DEV group at the University of Makerere in\n",
    "Kampala and latterly they were based at UN Global Pulse in Kampala.\n",
    "\n",
    "Malaria data is spatial data. Uganda is split into districts, and health\n",
    "reports can be found for each district. This suggests that models such\n",
    "as conditional random fields could be used for spatial modelling, but\n",
    "there are two complexities with this. First of all, occasionally\n",
    "districts split into two. Secondly, sentinel sites are a specific\n",
    "location within a district, such as Nagongera which is a sentinel site\n",
    "based in the Tororo district.\n",
    "\n",
    "\n",
    "\n",
    "Figure: Ugandan districts. Data SRTM/NASA from\n",
    ".\n",
    "\n",
    "[[@Andrade:consistent14; @Mubangizi:malaria14]]{style=\"text-align:right\"}\n",
    "\n",
    "\n",
    "\n",
    "Figure: The Kapchorwa District, home district of Stephen\n",
    "Kiprotich.\n",
    "\n",
    "Stephen Kiprotich, the 2012 gold medal winner from the London Olympics,\n",
    "comes from Kapchorwa district, in eastern Uganda, near the border with\n",
    "Kenya.\n",
    "\n",
    "The common standard for collecting health data on the African continent\n",
    "is the Health Management Information System (HMIS). 
However, this\n",
    "data suffers from missing values [@Gething:hmis06] and diagnosis of\n",
    "diseases like typhoid and malaria may be confounded.\n",
    "\n",
    "\n",
    "\n",
    "Figure: The Tororo district, where the sentinel site, Nagongera, is\n",
    "located.\n",
    "\n",
    "[World Health Organization Sentinel Surveillance\n",
    "systems](https://www.who.int/immunization/monitoring_surveillance/burden/vpd/surveillance_type/sentinel/en/)\n",
    "are set up \"when high-quality data are needed about a particular disease\n",
    "that cannot be obtained through a passive system\". Several sentinel\n",
    "sites give accurate assessment of malaria disease levels in Uganda,\n",
    "including a site in Nagongera.\n",
    "\n",
    "\n",
    "\n",
    "Figure: Sentinel and HMIS data along with rainfall and temperature\n",
    "for the Nagongera sentinel station in the Tororo district.\n",
    "\n",
    "In collaboration with the AI Research Group at Makerere we chose to\n",
    "investigate whether Gaussian process models could be used to assimilate\n",
    "information from these two different sources of disease information.\n",
    "Further, we were interested in whether local information on rainfall and\n",
    "temperature could be used to improve malaria estimates.\n",
    "\n",
    "The aim of the project was to use WHO Sentinel sites, alongside rainfall\n",
    "and temperature, to improve predictions from HMIS data of levels of\n",
    "malaria.\n",
    "\n",
    "\n",
    "\n",
    "Figure: The Mubende District.\n",
    "\n",
    "\n",
    "\n",
    "Figure: Prediction of malaria incidence in Mubende.\n",
    "\n",
    "\n",
    "\n",
    "Figure: The project arose out of the Gaussian process summer school\n",
    "held at Makerere in Kampala in 2013. The school led, in turn, to the\n",
    "Data Science Africa initiative.\n",
    "\n",
    "## Early Warning Systems\n",
    "\n",
    "\n",
    "\n",
    "Figure: The Kabarole district in Uganda.\n",
    "\n",
    "\n",
    "\n",
    "Figure: Estimate of the current disease situation in the Kabarole\n",
    "district over time. The estimate is constructed with a Gaussian process\n",
    "with an additive covariance function.\n",
    "\n",
    "Health monitoring system for the Kabarole district. Here we have fitted\n",
    "the reports with a Gaussian process with an additive covariance\n",
    "function. It has two components: one is a long time scale component (in\n",
    "red above), the other is a short time scale component (in blue).\n",
    "\n",
    "Monitoring proceeds by considering two aspects of the curve. Is the blue\n",
    "line (the short term report signal) above the red (which represents the\n",
    "long term trend)? If so we have higher than expected reports. If this is\n",
    "the case *and* the gradient is still positive (i.e. reports are going\n",
    "up) we encode this with a *red* color. If it is the case and the\n",
    "gradient of the blue line is negative (i.e. reports are going down) we\n",
    "encode this with an *amber* color. Conversely, if the blue line is below\n",
    "the red *and* decreasing, we color *green*. On the other hand if it is\n",
    "below red but increasing, we color *yellow*.\n",
    "\n",
    "This gives us an early warning system for disease. Red is a bad\n",
    "situation getting worse, amber is bad, but improving. 
Green is good and\n",
    "getting better, and yellow is good but degrading.\n",
    "\n",
    "Finally, there is a gray region which represents when the scale of the\n",
    "effect is small.\n",
    "\n",
    "\n",
    "\n",
    "Figure: The map of Ugandan districts with an overview of the malaria\n",
    "situation in each district.\n",
    "\n",
    "These colors can now be observed directly on a spatial map of the\n",
    "districts to give an immediate impression of the current status of the\n",
    "disease across the country.\n",
    "\n",
    "## Additive Covariance \\[edit\\]\n",
    "\n",
    "An additive covariance function is derived from considering the result\n",
    "of summing two Gaussian processes together. If the first Gaussian\n",
    "process is $g(\cdot)$, governed by covariance\n",
    "$\kernelScalar_g(\cdot, \cdot)$, and the second process is $h(\cdot)$,\n",
    "governed by covariance $\kernelScalar_h(\cdot, \cdot)$, then the combined\n",
    "process $f(\cdot) = g(\cdot) + h(\cdot)$ is governed by a covariance\n",
    "function, $$\n",
    "\kernelScalar_f(\inputVector, \inputVector^\prime) = \kernelScalar_g(\inputVector, \inputVector^\prime) + \kernelScalar_h(\inputVector, \inputVector^\prime)\n",
    "$$\n",
    "\n",
    "
\n", "$$\\kernelScalar_f(\\inputVector, \\inputVector^\\prime) = \\kernelScalar_g(\\inputVector, \\inputVector^\\prime) + \\kernelScalar_h(\\inputVector, \\inputVector^\\prime)$$\n", "
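In GPy, covariance functions can be summed directly with `+`. A minimal\n",
    "sketch of building such an additive covariance (the lengthscales here\n",
    "are arbitrary choices for illustration):\n",
    "\n",
    "``` {.python}\n",
    "import GPy\n",
    "\n",
    "# long lengthscale trend plus short lengthscale variation\n",
    "k_long = GPy.kern.RBF(1, variance=1., lengthscale=40.)\n",
    "k_short = GPy.kern.RBF(1, variance=0.5, lengthscale=2.)\n",
    "k_additive = k_long + k_short\n",
    "\n",
    "# the combined kernel can be passed to a model in the usual way,\n",
    "# e.g. GPy.models.GPRegression(x, yhat, k_additive)\n",
    "```\n",
    "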
\n",
    "Figure: An additive covariance function formed by combining two\n",
    "exponentiated quadratic covariance functions.\n",
    "\n",
    "## Analysis of US Birth Rates \\[edit\\]\n",
    "\n",
    "\n",
    "\n",
    "Figure: A retrospective analysis of US births by Aki Vehtari,\n",
    "illustrating the challenges of forecasting: even with seasonal and\n",
    "weekly effects removed there are significant effects on holidays,\n",
    "weekends, etc.\n",
    "\n",
    "There's a nice analysis of US birth rates by Gaussian processes with\n",
    "additive covariances in @Gelman:bayesian13. A combination of covariance\n",
    "functions is used to take account of weekly and yearly trends. The\n",
    "analysis is summarized on the cover of the book.\n",
    "\n",
    "
\n",
    "Figure: Two different editions of Bayesian Data Analysis\n",
    "[@Gelman:bayesian13].\n",
    "\n",
    "## Basis Function Covariance \\[edit\\]\n",
    "\n",
    "The fixed basis function covariance just comes from the properties of a\n",
    "multivariate Gaussian. If we decide $$\n",
    "\mappingFunctionVector=\basisMatrix\mappingVector\n",
    "$$ and then we assume $$\n",
    "\mappingVector \sim \gaussianSamp{\zerosVector}{\alpha\eye}\n",
    "$$ then it follows from the properties of a multivariate Gaussian that\n",
    "$$\n",
    "\mappingFunctionVector \sim \gaussianSamp{\zerosVector}{\alpha\basisMatrix\basisMatrix^\top}\n",
    "$$ meaning that the vector of observations from the function is jointly\n",
    "distributed as a Gaussian process and the covariance matrix is\n",
    "$\kernelMatrix = \alpha\basisMatrix \basisMatrix^\top$. Each element of\n",
    "the covariance matrix can then be found as the inner product between two\n",
    "rows of the basis function matrix." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%load -s basis_cov mlai.py" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%load -s radial mlai.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n",
    "$$\kernelScalar(\inputVector, \inputVector^\prime) = \alpha \basisVector(\inputVector)^\top \basisVector(\inputVector^\prime)$$\n",
    "
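A sketch of the same construction in plain numpy. The `radial` and\n",
    "`basis_cov` functions loaded above are the notebook's own versions; the\n",
    "code below is a simplified stand-in for illustration only:\n",
    "\n",
    "``` {.python}\n",
    "import numpy as np\n",
    "\n",
    "def radial_basis(x, centres, width=1.):\n",
    "    # vector of radial basis functions evaluated at a scalar input x\n",
    "    return np.exp(-0.5*(x - centres)**2/width**2)\n",
    "\n",
    "def basis_covariance(x, x_prime, centres, width=1., alpha=1.):\n",
    "    # inner product between the two basis vectors, scaled by alpha\n",
    "    return alpha*np.dot(radial_basis(x, centres, width),\n",
    "                        radial_basis(x_prime, centres, width))\n",
    "```\n",
    "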
\n", "Figure: A covariance function based on a non-linear basis given by\n", "$\\basisVector(\\inputVector)$.\n", "\n", "## Brownian Covariance \\[edit\\]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%load -s brownian_cov mlai.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Brownian motion is also a Gaussian process. It follows a Gaussian random\n", "walk, with diffusion occuring at each time point driven by a Gaussian\n", "input. This implies it is both Markov and Gaussian. The covariance\n", "function for Brownian motion has the form $$\n", "\\kernelScalar(t, t^\\prime)=\\alpha \\min(t, t^\\prime)\n", "$$\n", "\n", "
\n", "$$\\kernelScalar(t, t^\\prime)=\\alpha \\min(t, t^\\prime)$$\n", "
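A minimal sketch of this covariance and a sample path drawn from the\n",
    "implied Gaussian process (the `brownian_cov` loaded above is the\n",
    "notebook's own version; this standalone one is just for illustration):\n",
    "\n",
    "``` {.python}\n",
    "import numpy as np\n",
    "\n",
    "def brownian(t, t_prime, alpha=1.):\n",
    "    # Brownian motion covariance, defined for t, t_prime >= 0\n",
    "    return alpha*min(t, t_prime)\n",
    "\n",
    "t = np.linspace(0.01, 2., 100)\n",
    "K = np.array([[brownian(ti, tj) for tj in t] for ti in t])\n",
    "sample = np.random.multivariate_normal(np.zeros(len(t)), K + 1e-10*np.eye(len(t)))\n",
    "```\n",
    "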
\n", "Figure: Brownian motion covariance function.\n", "\n", "## MLP Covariance \\[edit\\]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%load -s mlp_cov mlai.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The multi-layer perceptron (MLP) covariance, also known as the neural\n", "network covariance or the arcsin covariance, is derived by considering\n", "the infinite limit of a neural network.\n", "\n", "
\n", "$$\\kernelScalar(\\inputVector, \\inputVector^\\prime) = \\alpha \\arcsin\\left(\\frac{w \\inputVector^\\top \\inputVector^\\prime + b}{\\sqrt{\\left(w \\inputVector^\\top \\inputVector + b + 1\\right)\\left(w \\left.\\inputVector^\\prime\\right.^\\top \\inputVector^\\prime + b + 1\\right)}}\\right)$$\n", "
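Following the formula above, a minimal sketch of this covariance (the\n",
    "`mlp_cov` loaded above is the notebook's own version; this standalone\n",
    "one is just for illustration):\n",
    "\n",
    "``` {.python}\n",
    "import numpy as np\n",
    "\n",
    "def mlp(x, x_prime, alpha=1., w=1., b=1.):\n",
    "    # arcsin (neural network) covariance between two input vectors\n",
    "    num = w*np.dot(x, x_prime) + b\n",
    "    den = np.sqrt((w*np.dot(x, x) + b + 1.)*(w*np.dot(x_prime, x_prime) + b + 1.))\n",
    "    return alpha*np.arcsin(num/den)\n",
    "```\n",
    "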
\n", "Figure: The multi-layer perceptron covariance function. This is\n", "derived by considering the infinite limit of a neural network with\n", "probit activation functions.\n", "\n", "## GPSS: Gaussian Process Summer School \\[edit\\]\n", "\n", "::: {style=\"width:1.5cm;text-align:center\"}\n", "\\includesvgclass{../slides/diagrams/logo/gpss-logo.svg}\n", ":::\n", "\n", "If you're interested in finding out more about Gaussian processes, you\n", "can attend the Gaussian process summer school, or view the lectures and\n", "material on line. Details of the school, future events and past events\n", "can be found at the website .\n", "\n", "## GPy: A Gaussian Process Framework in Python \\[edit\\]\n", "\n", "\n", "\n", "Figure: GPy is a BSD licensed software code base for implementing\n", "Gaussian process models in Python. It is designed for teaching and\n", "modelling. We welcome contributions which can be made through the Github\n", "repository \n", "\n", "GPy is a BSD licensed software code base for implementing Gaussian\n", "process models in python. This allows GPs to be combined with a wide\n", "variety of software libraries.\n", "\n", "The software itself is available on\n", "[GitHub](https://github.com/SheffieldML/GPy) and the team welcomes\n", "contributions.\n", "\n", "The aim for GPy is to be a probabilistic-style programming language,\n", "i.e. you specify the model rather than the algorithm. As well as a large\n", "range of covariance functions the software allows for non-Gaussian\n", "likelihoods, multivariate outputs, dimensionality reduction and\n", "approximations for larger data sets.\n", "\n", "# References {#references .unnumbered}" ] } ], "metadata": {}, "nbformat": 4, "nbformat_minor": 2 }