{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Gaussian Processes\n", "### [Neil D. Lawrence](http://inverseprobability.com), Amazon Cambridge and University of Sheffield\n", "### 2019-01-09\n", "\n", "**Abstract**: Classical machine learning and statistical approaches to learning, such\n", "as neural networks and linear regression, assume a parametric form for\n", "functions. Gaussian process models are an alternative approach that\n", "assumes a probabilistic prior over functions. This brings benefits, in\n", "that uncertainty of function estimation is sustained throughout\n", "inference, and some challenges: algorithms for fitting Gaussian\n", "processes tend to be more complex than parametric models. In this\n", "sessions I will introduce Gaussian processes and explain why sustaining\n", "uncertainty is important.\n", "\n", "$$\n", "\\newcommand{\\Amatrix}{\\mathbf{A}}\n", "\\newcommand{\\KL}{\\text{KL}\\left( #1\\,\\|\\,#2 \\right)}\n", "\\newcommand{\\Kaast}{\\kernelMatrix_{\\mathbf{ \\ast}\\mathbf{ \\ast}}}\n", "\\newcommand{\\Kastu}{\\kernelMatrix_{\\mathbf{ \\ast} \\inducingVector}}\n", "\\newcommand{\\Kff}{\\kernelMatrix_{\\mappingFunctionVector \\mappingFunctionVector}}\n", "\\newcommand{\\Kfu}{\\kernelMatrix_{\\mappingFunctionVector \\inducingVector}}\n", "\\newcommand{\\Kuast}{\\kernelMatrix_{\\inducingVector \\bf\\ast}}\n", "\\newcommand{\\Kuf}{\\kernelMatrix_{\\inducingVector \\mappingFunctionVector}}\n", "\\newcommand{\\Kuu}{\\kernelMatrix_{\\inducingVector \\inducingVector}}\n", "\\newcommand{\\Kuui}{\\Kuu^{-1}}\n", "\\newcommand{\\Qaast}{\\mathbf{Q}_{\\bf \\ast \\ast}}\n", "\\newcommand{\\Qastf}{\\mathbf{Q}_{\\ast \\mappingFunction}}\n", "\\newcommand{\\Qfast}{\\mathbf{Q}_{\\mappingFunctionVector \\bf \\ast}}\n", "\\newcommand{\\Qff}{\\mathbf{Q}_{\\mappingFunctionVector \\mappingFunctionVector}}\n", "\\newcommand{\\aMatrix}{\\mathbf{A}}\n", "\\newcommand{\\aScalar}{a}\n", "\\newcommand{\\aVector}{\\mathbf{a}}\n", "\\newcommand{\\acceleration}{a}\n", "\\newcommand{\\bMatrix}{\\mathbf{B}}\n", "\\newcommand{\\bScalar}{b}\n", "\\newcommand{\\bVector}{\\mathbf{b}}\n", "\\newcommand{\\basisFunc}{\\phi}\n", "\\newcommand{\\basisFuncVector}{\\boldsymbol{ \\basisFunc}}\n", "\\newcommand{\\basisFunction}{\\phi}\n", "\\newcommand{\\basisLocation}{\\mu}\n", "\\newcommand{\\basisMatrix}{\\boldsymbol{ \\Phi}}\n", "\\newcommand{\\basisScalar}{\\basisFunction}\n", "\\newcommand{\\basisVector}{\\boldsymbol{ \\basisFunction}}\n", "\\newcommand{\\activationFunction}{\\phi}\n", "\\newcommand{\\activationMatrix}{\\boldsymbol{ \\Phi}}\n", "\\newcommand{\\activationScalar}{\\basisFunction}\n", "\\newcommand{\\activationVector}{\\boldsymbol{ \\basisFunction}}\n", "\\newcommand{\\bigO}{\\mathcal{O}}\n", "\\newcommand{\\binomProb}{\\pi}\n", "\\newcommand{\\cMatrix}{\\mathbf{C}}\n", "\\newcommand{\\cbasisMatrix}{\\hat{\\boldsymbol{ \\Phi}}}\n", "\\newcommand{\\cdataMatrix}{\\hat{\\dataMatrix}}\n", "\\newcommand{\\cdataScalar}{\\hat{\\dataScalar}}\n", "\\newcommand{\\cdataVector}{\\hat{\\dataVector}}\n", "\\newcommand{\\centeredKernelMatrix}{\\mathbf{ \\MakeUppercase{\\centeredKernelScalar}}}\n", "\\newcommand{\\centeredKernelScalar}{b}\n", "\\newcommand{\\centeredKernelVector}{\\centeredKernelScalar}\n", "\\newcommand{\\centeringMatrix}{\\mathbf{H}}\n", "\\newcommand{\\chiSquaredDist}{\\chi_{#1}^{2}\\left(#2\\right)}\n", "\\newcommand{\\chiSquaredSamp}{\\chi_{#1}^{2}}\n", "\\newcommand{\\conditionalCovariance}{\\boldsymbol{ \\Sigma}}\n", "\\newcommand{\\coregionalizationMatrix}{\\mathbf{B}}\n", "\\newcommand{\\coregionalizationScalar}{b}\n", "\\newcommand{\\coregionalizationVector}{\\mathbf{ \\coregionalizationScalar}}\n", "\\newcommand{\\covDist}{\\text{cov}_{#2}\\left(#1\\right)}\n", "\\newcommand{\\covSamp}{\\text{cov}\\left(#1\\right)}\n", "\\newcommand{\\covarianceScalar}{c}\n", "\\newcommand{\\covarianceVector}{\\mathbf{ \\covarianceScalar}}\n", "\\newcommand{\\covarianceMatrix}{\\mathbf{C}}\n", "\\newcommand{\\covarianceMatrixTwo}{\\boldsymbol{ \\Sigma}}\n", "\\newcommand{\\croupierScalar}{s}\n", "\\newcommand{\\croupierVector}{\\mathbf{ \\croupierScalar}}\n", "\\newcommand{\\croupierMatrix}{\\mathbf{ \\MakeUppercase{\\croupierScalar}}}\n", "\\newcommand{\\dataDim}{p}\n", "\\newcommand{\\dataIndex}{i}\n", "\\newcommand{\\dataIndexTwo}{j}\n", "\\newcommand{\\dataMatrix}{\\mathbf{Y}}\n", "\\newcommand{\\dataScalar}{y}\n", "\\newcommand{\\dataSet}{\\mathcal{D}}\n", "\\newcommand{\\dataStd}{\\sigma}\n", "\\newcommand{\\dataVector}{\\mathbf{ \\dataScalar}}\n", "\\newcommand{\\decayRate}{d}\n", "\\newcommand{\\degreeMatrix}{\\mathbf{ \\MakeUppercase{\\degreeScalar}}}\n", "\\newcommand{\\degreeScalar}{d}\n", "\\newcommand{\\degreeVector}{\\mathbf{ \\degreeScalar}}\n", "% Already defined by latex\n", "%\\newcommand{\\det}{\\left|#1\\right|}\n", "\\newcommand{\\diag}{\\text{diag}\\left(#1\\right)}\n", "\\newcommand{\\diagonalMatrix}{\\mathbf{D}}\n", "\\newcommand{\\diff}{\\frac{\\text{d}#1}{\\text{d}#2}}\n", "\\newcommand{\\diffTwo}{\\frac{\\text{d}^2#1}{\\text{d}#2^2}}\n", "\\newcommand{\\displacement}{x}\n", "\\newcommand{\\displacementVector}{\\textbf{\\displacement}}\n", "\\newcommand{\\distanceMatrix}{\\mathbf{ \\MakeUppercase{\\distanceScalar}}}\n", "\\newcommand{\\distanceScalar}{d}\n", "\\newcommand{\\distanceVector}{\\mathbf{ \\distanceScalar}}\n", "\\newcommand{\\eigenvaltwo}{\\ell}\n", "\\newcommand{\\eigenvaltwoMatrix}{\\mathbf{L}}\n", "\\newcommand{\\eigenvaltwoVector}{\\mathbf{l}}\n", "\\newcommand{\\eigenvalue}{\\lambda}\n", "\\newcommand{\\eigenvalueMatrix}{\\boldsymbol{ \\Lambda}}\n", "\\newcommand{\\eigenvalueVector}{\\boldsymbol{ \\lambda}}\n", "\\newcommand{\\eigenvector}{\\mathbf{ \\eigenvectorScalar}}\n", "\\newcommand{\\eigenvectorMatrix}{\\mathbf{U}}\n", "\\newcommand{\\eigenvectorScalar}{u}\n", "\\newcommand{\\eigenvectwo}{\\mathbf{v}}\n", "\\newcommand{\\eigenvectwoMatrix}{\\mathbf{V}}\n", "\\newcommand{\\eigenvectwoScalar}{v}\n", "\\newcommand{\\entropy}{\\mathcal{H}\\left(#1\\right)}\n", "\\newcommand{\\errorFunction}{E}\n", "\\newcommand{\\expDist}{\\left<#1\\right>_{#2}}\n", "\\newcommand{\\expSamp}{\\left<#1\\right>}\n", "\\newcommand{\\expectation}{\\left\\langle #1 \\right\\rangle }\n", "\\newcommand{\\expectationDist}{\\left\\langle #1 \\right\\rangle _{#2}}\n", "\\newcommand{\\expectedDistanceMatrix}{\\mathcal{D}}\n", "\\newcommand{\\eye}{\\mathbf{I}}\n", "\\newcommand{\\fantasyDim}{r}\n", "\\newcommand{\\fantasyMatrix}{\\mathbf{ \\MakeUppercase{\\fantasyScalar}}}\n", "\\newcommand{\\fantasyScalar}{z}\n", "\\newcommand{\\fantasyVector}{\\mathbf{ \\fantasyScalar}}\n", "\\newcommand{\\featureStd}{\\varsigma}\n", "\\newcommand{\\gammaCdf}{\\mathcal{GAMMA CDF}\\left(#1|#2,#3\\right)}\n", "\\newcommand{\\gammaDist}{\\mathcal{G}\\left(#1|#2,#3\\right)}\n", "\\newcommand{\\gammaSamp}{\\mathcal{G}\\left(#1,#2\\right)}\n", "\\newcommand{\\gaussianDist}{\\mathcal{N}\\left(#1|#2,#3\\right)}\n", "\\newcommand{\\gaussianSamp}{\\mathcal{N}\\left(#1,#2\\right)}\n", "\\newcommand{\\given}{|}\n", "\\newcommand{\\half}{\\frac{1}{2}}\n", "\\newcommand{\\heaviside}{H}\n", "\\newcommand{\\hiddenMatrix}{\\mathbf{ \\MakeUppercase{\\hiddenScalar}}}\n", "\\newcommand{\\hiddenScalar}{h}\n", "\\newcommand{\\hiddenVector}{\\mathbf{ \\hiddenScalar}}\n", "\\newcommand{\\identityMatrix}{\\eye}\n", "\\newcommand{\\inducingInputScalar}{z}\n", "\\newcommand{\\inducingInputVector}{\\mathbf{ \\inducingInputScalar}}\n", "\\newcommand{\\inducingInputMatrix}{\\mathbf{Z}}\n", "\\newcommand{\\inducingScalar}{u}\n", "\\newcommand{\\inducingVector}{\\mathbf{ \\inducingScalar}}\n", "\\newcommand{\\inducingMatrix}{\\mathbf{U}}\n", "\\newcommand{\\inlineDiff}{\\text{d}#1/\\text{d}#2}\n", "\\newcommand{\\inputDim}{q}\n", "\\newcommand{\\inputMatrix}{\\mathbf{X}}\n", "\\newcommand{\\inputScalar}{x}\n", "\\newcommand{\\inputSpace}{\\mathcal{X}}\n", "\\newcommand{\\inputVals}{\\inputVector}\n", "\\newcommand{\\inputVector}{\\mathbf{ \\inputScalar}}\n", "\\newcommand{\\iterNum}{k}\n", "\\newcommand{\\kernel}{\\kernelScalar}\n", "\\newcommand{\\kernelMatrix}{\\mathbf{K}}\n", "\\newcommand{\\kernelScalar}{k}\n", "\\newcommand{\\kernelVector}{\\mathbf{ \\kernelScalar}}\n", "\\newcommand{\\kff}{\\kernelScalar_{\\mappingFunction \\mappingFunction}}\n", "\\newcommand{\\kfu}{\\kernelVector_{\\mappingFunction \\inducingScalar}}\n", "\\newcommand{\\kuf}{\\kernelVector_{\\inducingScalar \\mappingFunction}}\n", "\\newcommand{\\kuu}{\\kernelVector_{\\inducingScalar \\inducingScalar}}\n", "\\newcommand{\\lagrangeMultiplier}{\\lambda}\n", "\\newcommand{\\lagrangeMultiplierMatrix}{\\boldsymbol{ \\Lambda}}\n", "\\newcommand{\\lagrangian}{L}\n", "\\newcommand{\\laplacianFactor}{\\mathbf{ \\MakeUppercase{\\laplacianFactorScalar}}}\n", "\\newcommand{\\laplacianFactorScalar}{m}\n", "\\newcommand{\\laplacianFactorVector}{\\mathbf{ \\laplacianFactorScalar}}\n", "\\newcommand{\\laplacianMatrix}{\\mathbf{L}}\n", "\\newcommand{\\laplacianScalar}{\\ell}\n", "\\newcommand{\\laplacianVector}{\\mathbf{ \\ell}}\n", "\\newcommand{\\latentDim}{q}\n", "\\newcommand{\\latentDistanceMatrix}{\\boldsymbol{ \\Delta}}\n", "\\newcommand{\\latentDistanceScalar}{\\delta}\n", "\\newcommand{\\latentDistanceVector}{\\boldsymbol{ \\delta}}\n", "\\newcommand{\\latentForce}{f}\n", "\\newcommand{\\latentFunction}{u}\n", "\\newcommand{\\latentFunctionVector}{\\mathbf{ \\latentFunction}}\n", "\\newcommand{\\latentFunctionMatrix}{\\mathbf{ \\MakeUppercase{\\latentFunction}}}\n", "\\newcommand{\\latentIndex}{j}\n", "\\newcommand{\\latentScalar}{z}\n", "\\newcommand{\\latentVector}{\\mathbf{ \\latentScalar}}\n", "\\newcommand{\\latentMatrix}{\\mathbf{Z}}\n", "\\newcommand{\\learnRate}{\\eta}\n", "\\newcommand{\\lengthScale}{\\ell}\n", "\\newcommand{\\rbfWidth}{\\ell}\n", "\\newcommand{\\likelihoodBound}{\\mathcal{L}}\n", "\\newcommand{\\likelihoodFunction}{L}\n", "\\newcommand{\\locationScalar}{\\mu}\n", "\\newcommand{\\locationVector}{\\boldsymbol{ \\locationScalar}}\n", "\\newcommand{\\locationMatrix}{\\mathbf{M}}\n", "\\newcommand{\\variance}{\\text{var}\\left( #1 \\right)}\n", "\\newcommand{\\mappingFunction}{f}\n", "\\newcommand{\\mappingFunctionMatrix}{\\mathbf{F}}\n", "\\newcommand{\\mappingFunctionTwo}{g}\n", "\\newcommand{\\mappingFunctionTwoMatrix}{\\mathbf{G}}\n", "\\newcommand{\\mappingFunctionTwoVector}{\\mathbf{ \\mappingFunctionTwo}}\n", "\\newcommand{\\mappingFunctionVector}{\\mathbf{ \\mappingFunction}}\n", "\\newcommand{\\scaleScalar}{s}\n", "\\newcommand{\\mappingScalar}{w}\n", "\\newcommand{\\mappingVector}{\\mathbf{ \\mappingScalar}}\n", "\\newcommand{\\mappingMatrix}{\\mathbf{W}}\n", "\\newcommand{\\mappingScalarTwo}{v}\n", "\\newcommand{\\mappingVectorTwo}{\\mathbf{ \\mappingScalarTwo}}\n", "\\newcommand{\\mappingMatrixTwo}{\\mathbf{V}}\n", "\\newcommand{\\maxIters}{K}\n", "\\newcommand{\\meanMatrix}{\\mathbf{M}}\n", "\\newcommand{\\meanScalar}{\\mu}\n", "\\newcommand{\\meanTwoMatrix}{\\mathbf{M}}\n", "\\newcommand{\\meanTwoScalar}{m}\n", "\\newcommand{\\meanTwoVector}{\\mathbf{ \\meanTwoScalar}}\n", "\\newcommand{\\meanVector}{\\boldsymbol{ \\meanScalar}}\n", "\\newcommand{\\mrnaConcentration}{m}\n", "\\newcommand{\\naturalFrequency}{\\omega}\n", "\\newcommand{\\neighborhood}{\\mathcal{N}\\left( #1 \\right)}\n", "\\newcommand{\\neilurl}{http://inverseprobability.com/}\n", "\\newcommand{\\noiseMatrix}{\\boldsymbol{ E}}\n", "\\newcommand{\\noiseScalar}{\\epsilon}\n", "\\newcommand{\\noiseVector}{\\boldsymbol{ \\epsilon}}\n", "\\newcommand{\\norm}{\\left\\Vert #1 \\right\\Vert}\n", "\\newcommand{\\normalizedLaplacianMatrix}{\\hat{\\mathbf{L}}}\n", "\\newcommand{\\normalizedLaplacianScalar}{\\hat{\\ell}}\n", "\\newcommand{\\normalizedLaplacianVector}{\\hat{\\mathbf{ \\ell}}}\n", "\\newcommand{\\numActive}{m}\n", "\\newcommand{\\numBasisFunc}{m}\n", "\\newcommand{\\numComponents}{m}\n", "\\newcommand{\\numComps}{K}\n", "\\newcommand{\\numData}{n}\n", "\\newcommand{\\numFeatures}{K}\n", "\\newcommand{\\numHidden}{h}\n", "\\newcommand{\\numInducing}{m}\n", "\\newcommand{\\numLayers}{\\ell}\n", "\\newcommand{\\numNeighbors}{K}\n", "\\newcommand{\\numSequences}{s}\n", "\\newcommand{\\numSuccess}{s}\n", "\\newcommand{\\numTasks}{m}\n", "\\newcommand{\\numTime}{T}\n", "\\newcommand{\\numTrials}{S}\n", "\\newcommand{\\outputIndex}{j}\n", "\\newcommand{\\paramVector}{\\boldsymbol{ \\theta}}\n", "\\newcommand{\\parameterMatrix}{\\boldsymbol{ \\Theta}}\n", "\\newcommand{\\parameterScalar}{\\theta}\n", "\\newcommand{\\parameterVector}{\\boldsymbol{ \\parameterScalar}}\n", "\\newcommand{\\partDiff}{\\frac{\\partial#1}{\\partial#2}}\n", "\\newcommand{\\precisionScalar}{j}\n", "\\newcommand{\\precisionVector}{\\mathbf{ \\precisionScalar}}\n", "\\newcommand{\\precisionMatrix}{\\mathbf{J}}\n", "\\newcommand{\\pseudotargetScalar}{\\widetilde{y}}\n", "\\newcommand{\\pseudotargetVector}{\\mathbf{ \\pseudotargetScalar}}\n", "\\newcommand{\\pseudotargetMatrix}{\\mathbf{ \\widetilde{Y}}}\n", "\\newcommand{\\rank}{\\text{rank}\\left(#1\\right)}\n", "\\newcommand{\\rayleighDist}{\\mathcal{R}\\left(#1|#2\\right)}\n", "\\newcommand{\\rayleighSamp}{\\mathcal{R}\\left(#1\\right)}\n", "\\newcommand{\\responsibility}{r}\n", "\\newcommand{\\rotationScalar}{r}\n", "\\newcommand{\\rotationVector}{\\mathbf{ \\rotationScalar}}\n", "\\newcommand{\\rotationMatrix}{\\mathbf{R}}\n", "\\newcommand{\\sampleCovScalar}{s}\n", "\\newcommand{\\sampleCovVector}{\\mathbf{ \\sampleCovScalar}}\n", "\\newcommand{\\sampleCovMatrix}{\\mathbf{s}}\n", "\\newcommand{\\scalarProduct}{\\left\\langle{#1},{#2}\\right\\rangle}\n", "\\newcommand{\\sign}{\\text{sign}\\left(#1\\right)}\n", "\\newcommand{\\sigmoid}{\\sigma\\left(#1\\right)}\n", "\\newcommand{\\singularvalue}{\\ell}\n", "\\newcommand{\\singularvalueMatrix}{\\mathbf{L}}\n", "\\newcommand{\\singularvalueVector}{\\mathbf{l}}\n", "\\newcommand{\\sorth}{\\mathbf{u}}\n", "\\newcommand{\\spar}{\\lambda}\n", "\\newcommand{\\trace}{\\text{tr}\\left(#1\\right)}\n", "\\newcommand{\\BasalRate}{B}\n", "\\newcommand{\\DampingCoefficient}{C}\n", "\\newcommand{\\DecayRate}{D}\n", "\\newcommand{\\Displacement}{X}\n", "\\newcommand{\\LatentForce}{F}\n", "\\newcommand{\\Mass}{M}\n", "\\newcommand{\\Sensitivity}{S}\n", "\\newcommand{\\basalRate}{b}\n", "\\newcommand{\\dampingCoefficient}{c}\n", "\\newcommand{\\mass}{m}\n", "\\newcommand{\\sensitivity}{s}\n", "\\newcommand{\\springScalar}{\\kappa}\n", "\\newcommand{\\springVector}{\\boldsymbol{ \\kappa}}\n", "\\newcommand{\\springMatrix}{\\boldsymbol{ \\mathcal{K}}}\n", "\\newcommand{\\tfConcentration}{p}\n", "\\newcommand{\\tfDecayRate}{\\delta}\n", "\\newcommand{\\tfMrnaConcentration}{f}\n", "\\newcommand{\\tfVector}{\\mathbf{ \\tfConcentration}}\n", "\\newcommand{\\velocity}{v}\n", "\\newcommand{\\sufficientStatsScalar}{g}\n", "\\newcommand{\\sufficientStatsVector}{\\mathbf{ \\sufficientStatsScalar}}\n", "\\newcommand{\\sufficientStatsMatrix}{\\mathbf{G}}\n", "\\newcommand{\\switchScalar}{s}\n", "\\newcommand{\\switchVector}{\\mathbf{ \\switchScalar}}\n", "\\newcommand{\\switchMatrix}{\\mathbf{S}}\n", "\\newcommand{\\tr}{\\text{tr}\\left(#1\\right)}\n", "\\newcommand{\\loneNorm}{\\left\\Vert #1 \\right\\Vert_1}\n", "\\newcommand{\\ltwoNorm}{\\left\\Vert #1 \\right\\Vert_2}\n", "\\newcommand{\\onenorm}{\\left\\vert#1\\right\\vert_1}\n", "\\newcommand{\\twonorm}{\\left\\Vert #1 \\right\\Vert}\n", "\\newcommand{\\vScalar}{v}\n", "\\newcommand{\\vVector}{\\mathbf{v}}\n", "\\newcommand{\\vMatrix}{\\mathbf{V}}\n", "\\newcommand{\\varianceDist}{\\text{var}_{#2}\\left( #1 \\right)}\n", "% Already defined by latex\n", "%\\newcommand{\\vec}{#1:}\n", "\\newcommand{\\vecb}{\\left(#1\\right):}\n", "\\newcommand{\\weightScalar}{w}\n", "\\newcommand{\\weightVector}{\\mathbf{ \\weightScalar}}\n", "\\newcommand{\\weightMatrix}{\\mathbf{W}}\n", "\\newcommand{\\weightedAdjacencyMatrix}{\\mathbf{A}}\n", "\\newcommand{\\weightedAdjacencyScalar}{a}\n", "\\newcommand{\\weightedAdjacencyVector}{\\mathbf{ \\weightedAdjacencyScalar}}\n", "\\newcommand{\\onesVector}{\\mathbf{1}}\n", "\\newcommand{\\zerosVector}{\\mathbf{0}}\n", "$$\n", "\n", "\n", ".\n", "\n", "\n", "\n", "
\n", "\n", " \n", "\n", "
\n", "\n", "@Rasmussen:book06 is still one of the most important references on\n", "Gaussian process models. It is [available freely\n", "online](http://www.gaussianprocess.org/gpml/).\n", "\n", "# What is Machine Learning? [\${.editsection style=\"\"}\$]{.editsection-bracket style=\"\"}\n", "\n", "What is machine learning? At its most basic level machine learning is a\n", "combination of\n", "\n", "$$\\text{data} + \\text{model} \\xrightarrow{\\text{compute}} \\text{prediction}$$\n", "\n", "where *data* is our observations. They can be actively or passively\n", "acquired (meta-data). The *model* contains our assumptions, based on\n", "previous experience. That experience can be other data, it can come from\n", "transfer learning, or it can merely be our beliefs about the\n", "regularities of the universe. In humans our models include our inductive\n", "biases. The *prediction* is an action to be taken or a categorization or\n", "a quality score. The reason that machine learning has become a mainstay\n", "of artificial intelligence is the importance of predictions in\n", "artificial intelligence. The data and the model are combined through\n", "computation.\n", "\n", "In practice we normally perform machine learning using two functions. To\n", "combine data with a model we typically make use of:\n", "\n", "**a prediction function** a function which is used to make the\n", "predictions. It includes our beliefs about the regularities of the\n", "universe, our assumptions about how the world works, e.g. smoothness,\n", "spatial similarities, temporal similarities.\n", "\n", "**an objective function** a function which defines the cost of\n", "misprediction. Typically it includes knowledge about the world's\n", "generating processes (probabilistic objectives) or the costs we pay for\n", "mispredictions (empiricial risk minimization).\n", "\n", "The combination of data and model through the prediction function and\n", "the objectie function leads to a *learning algorithm*. The class of\n", "prediction functions and objective functions we can make use of is\n", "restricted by the algorithms they lead to. If the prediction function or\n", "the objective function are too complex, then it can be difficult to find\n", "an appropriate learning algorithm. Much of the acdemic field of machine\n", "learning is the quest for new learning algorithms that allow us to bring\n", "different types of models and data together.\n", "\n", "A useful reference for state of the art in machine learning is the UK\n", "Royal Society Report, [Machine Learning: Power and Promise of Computers\n", "that Learn by\n", "Example](https://royalsociety.org/~/media/policy/projects/machine-learning/publications/machine-learning-report.pdf).\n", "\n", "You can also check my blog post on [\"What is Machine\n", "Learning?\"](http://inverseprobability.com/2017/07/17/what-is-machine-learning)\n", "\n", "In practice, we normally also have uncertainty associated with these\n", "functions. Uncertainty in the prediction function arises from\n", "\n", "1. scarcity of training data and\n", "2. mismatch between the set of prediction functions we choose and all\n", " possible prediction functions.\n", "\n", "There are also challenges around specification of the objective\n", "function, but for we will save those for another day. For the moment,\n", "let us focus on the prediction function.\n", "\n", "## Neural Networks and Prediction Functions [\${.editsection style=\"\"}\$]{.editsection-bracket style=\"\"}\n", "\n", "Neural networks are adaptive non-linear function models. Originally,\n", "they were studied (by McCulloch and Pitts [@McCulloch:neuron43]) as\n", "simple models for neurons, but over the last decade they have become\n", "popular because they are a flexible approach to modelling complex data.\n", "A particular characteristic of neural network models is that they can be\n", "composed to form highly complex functions which encode many of our\n", "expectations of the real world. They allow us to encode our assumptions\n", "about how the world works.\n", "\n", "We will return to composition later, but for the moment, let's focus on\n", "a one hidden layer neural network. We are interested in the prediction\n", "function, so we'll ignore the objective function (which is often called\n", "an error function) for the moment, and just describe the mathematical\n", "object of interest\n", "\n", "$$\n", "\\mappingFunction(\\inputVector) = \\mappingMatrix^\\top \\activationVector(\\mappingMatrixTwo, \\inputVector)\n", "$$\n", "\n", "Where in this case $\\mappingFunction(\\cdot)$ is a scalar function with\n", "vector inputs, and $\\activationVector(\\cdot)$ is a vector function with\n", "vector inputs. The dimensionality of the vector function is known as the\n", "number of hidden units, or the number of neurons. The elements of this\n", "vector function are known as the *activation* function of the neural\n", "network and $\\mappingMatrixTwo$ are the parameters of the activation\n", "functions.\n", "\n", "## Relations with Classical Statistics\n", "\n", "In statistics activation functions are traditionally known as *basis\n", "functions*. And we would think of this as a *linear model*. It's doesn't\n", "make linear predictions, but it's linear because in statistics\n", "estimation focuses on the parameters, $\\mappingMatrix$, not the\n", "parameters, $\\mappingMatrixTwo$. The linear model terminology refers to\n", "the fact that the model is *linear in the parameters*, but it is *not*\n", "linear in the data unless the activation functions are chosen to be\n", "linear.\n", "\n", "## Adaptive Basis Functions\n", "\n", "The first difference in the (early) neural network literature to the\n", "classical statistical literature is the decision to optimize these\n", "parameters, $\\mappingMatrixTwo$, as well as the parameters,\n", "$\\mappingMatrix$ (which would normally be denoted in statistics by\n", "$\\boldsymbol{\\beta}$)[^1].\n", "\n", "In this tutorial, we're going to go revisit that decision, and follow\n", "the path of Radford Neal [@Neal:bayesian94] who, inspired by work of\n", "David MacKay [@MacKay:bayesian92] and others did his PhD thesis on\n", "Bayesian Neural Networks. If we take a Bayesian approach to parameter\n", "inference (note I am using inference here in the classical sense, not in\n", "the sense of prediction of test data, which seems to be a newer usage),\n", "then we don't wish to fit parameters at all, rather we wish to integrate\n", "them away and understand the family of functions that the model\n", "describes.\n", "\n", "## Probabilistic Modelling [\${.editsection style=\"\"}\$]{.editsection-bracket style=\"\"}\n", "\n", "This Bayesian approach is designed to deal with uncertainty arising from\n", "fitting our prediction function to the data we have, a reduced data set.\n", "\n", "The Bayesian approach can be derived from a broader understanding of\n", "what our objective is. If we accept that we can jointly represent all\n", "things that happen in the world with a probability distribution, then we\n", "can interogate that probability to make predictions. So, if we are\n", "interested in predictions, $\\dataScalar_*$ at future points input\n", "locations of interest, $\\inputVector_*$ given previously training data,\n", "$\\dataVector$ and corresponding inputs, $\\inputMatrix$, then we are\n", "really interogating the following probability density, $$\n", "p(\\dataScalar_*|\\dataVector, \\inputMatrix, \\inputVector_*),\n", "$$ there is nothing controversial here, as long as you accept that you\n", "have a good joint model of the world around you that relates test data\n", "to training data,\n", "$p(\\dataScalar_*, \\dataVector, \\inputMatrix, \\inputVector_*)$ then this\n", "conditional distribution can be recovered through standard rules of\n", "probability\n", "($\\text{data} + \\text{model} \\rightarrow \\text{prediction}$).\n", "\n", "We can construct this joint density through the use of the following\n", "decomposition: $$\n", "p(\\dataScalar_*|\\dataVector, \\inputMatrix, \\inputVector_*) = \\int p(\\dataScalar_*|\\inputVector_*, \\mappingMatrix) p(\\mappingMatrix | \\dataVector, \\inputMatrix) \\text{d} \\mappingMatrix\n", "$$\n", "\n", "where, for convenience, we are assuming *all* the parameters of the\n", "model are now represented by $\\parameterVector$ (which contains\n", "$\\mappingMatrix$ and $\\mappingMatrixTwo$) and\n", "$p(\\parameterVector | \\dataVector, \\inputMatrix)$ is recognised as the\n", "posterior density of the parameters given data and\n", "$p(\\dataScalar_*|\\inputVector_*, \\parameterVector)$ is the *likelihood*\n", "of an individual test data point given the parameters.\n", "\n", "The likelihood of the data is normally assumed to be independent across\n", "the parameters, $$\n", "p(\\dataVector|\\inputMatrix, \\mappingMatrix) = \\prod_{i=1}^\\numData p(\\dataScalar_i|\\inputVector_i, \\mappingMatrix),$$\n", "\n", "and if that is so, it is easy to extend our predictions across all\n", "future, potential, locations, $$\n", "p(\\dataVector_*|\\dataVector, \\inputMatrix, \\inputMatrix_*) = \\int p(\\dataVector_*|\\inputMatrix_*, \\parameterVector) p(\\parameterVector | \\dataVector, \\inputMatrix) \\text{d} \\parameterVector.\n", "$$\n", "\n", "The likelihood is also where the *prediction function* is incorporated.\n", "For example in the regression case, we consider an objective based\n", "around the Gaussian density, $$\n", "p(\\dataScalar_i | \\mappingFunction(\\inputVector_i)) = \\frac{1}{\\sqrt{2\\pi \\dataStd^2}} \\exp\\left(-\\frac{\\left(\\dataScalar_i - \\mappingFunction(\\inputVector_i)\\right)^2}{2\\dataStd^2}\\right)\n", "$$\n", "\n", "In short, that is the classical approach to probabilistic inference, and\n", "all approaches to Bayesian neural networks fall within this path. For a\n", "deep probabilistic model, we can simply take this one stage further and\n", "place a probability distribution over the input locations, $$\n", "p(\\dataVector_*|\\dataVector) = \\int p(\\dataVector_*|\\inputMatrix_*, \\parameterVector) p(\\parameterVector | \\dataVector, \\inputMatrix) p(\\inputMatrix) p(\\inputMatrix_*) \\text{d} \\parameterVector \\text{d} \\inputMatrix \\text{d}\\inputMatrix_*\n", "$$ and we have *unsupervised learning* (from where we can get deep\n", "generative models).\n", "\n", "## Graphical Models [\${.editsection style=\"\"}\$]{.editsection-bracket style=\"\"}\n", "\n", "One way of representing a joint distribution is to consider conditional\n", "dependencies between data. Conditional dependencies allow us to\n", "factorize the distribution. For example, a Markov chain is a\n", "factorization of a distribution into components that represent the\n", "conditional relationships between points that are neighboring, often in\n", "time or space. It can be decomposed in the following form.\n", "$$p(\\dataVector) = p(\\dataScalar_\\numData | \\dataScalar_{\\numData-1}) p(\\dataScalar_{\\numData-1}|\\dataScalar_{\\numData-2}) \\dots p(\\dataScalar_{2} | \\dataScalar_{1})$$\n", "\n", " \n", "\n", "By specifying conditional independencies we can reduce the\n", "parameterization required for our data, instead of directly specifying\n", "the parameters of the joint distribution, we can specify each set of\n", "parameters of the conditonal independently. This can also give an\n", "advantage in terms of interpretability. Understanding a conditional\n", "independence structure gives a structured understanding of data. If\n", "developed correctly, according to causal methodology, it can even inform\n", "how we should intervene in the system to drive a desired result\n", "[@Pearl:causality95].\n", "\n", "However, a challenge arises when the data becomes more complex. Consider\n", "the graphical model shown below, used to predict the perioperative risk\n", "of *C Difficile* infection following colon surgery\n", "[@Steele:predictive12].\n", "\n", "
\n", "\n", " \n", "\n", "
\n", "\n", "To capture the complexity in the interelationship between the data, the\n", "graph itself becomes more complex, and less interpretable.\n", "\n", "## Performing Inference\n", "\n", "As far as combining our data and our model to form our prediction, the\n", "devil is in the detail. While everything is easy to write in terms of\n", "probability densities, as we move from $\\text{data}$ and $\\text{model}$\n", "to $\\text{prediction}$ there is that simple\n", "$\\xrightarrow{\\text{compute}}$ sign, which is now burying a wealth of\n", "difficulties. Each integral sign above is a high dimensional integral\n", "which will typically need approximation. Approximations also come with\n", "computational demands. As we consider more complex classes of functions,\n", "the challenges around the integrals become harder and prediction of\n", "future test data given our model and the data becomes so involved as to\n", "be impractical or impossible.\n", "\n", "Statisticians realized these challenges early on, indeed, so early that\n", "they were actually physicists, both Laplace and Gauss worked on models\n", "such as this, in Gauss's case he made his career on prediction of the\n", "location of the lost planet (later reclassified as a asteroid, then\n", "dwarf planet), Ceres. Gauss and Laplace made use of maximum a posteriori\n", "estimates for simplifying their computations and Laplace developed\n", "Laplace's method (and invented the Gaussian density) to expand around\n", "that mode. But classical statistics needs better guarantees around model\n", "performance and interpretation, and as a result has focussed more on the\n", "*linear* model implied by $$\n", " \\mappingFunction(\\inputVector) = \\left.\\mappingVector^{(2)}\\right.^\\top \\activationVector(\\mappingMatrix_1, \\inputVector)\n", "$$\n", "\n", "$$\n", " \\mappingVector^{(2)} \\sim \\gaussianSamp{\\zerosVector}{\\covarianceMatrix}.\n", "$$\n", "\n", "The Gaussian likelihood given above implies that the data observation is\n", "related to the function by noise corruption so we have, $$\n", " \\dataScalar_i = \\mappingFunction(\\inputVector_i) + \\noiseScalar_i,\n", "$$ where $$\n", " \\noiseScalar_i \\sim \\gaussianSamp{0}{\\dataStd^2}\n", "$$\n", "\n", "and while normally integrating over high dimensional parameter vectors\n", "is highly complex, here it is *trivial*. That is because of a property\n", "of the multivariate Gaussian.\n", "\n", "Gaussian processes are initially of interest because\n", "\n", "1. linear Gaussian models are easier to deal with\n", "2. Even the parameters *within* the process can be handled, by\n", " considering a particular limit.\n", "\n", "Let's first of all review the properties of the multivariate Gaussian\n", "distribution that make linear Gaussian models easier to deal with. We'll\n", "return to the, perhaps surprising, result on the parameters within the\n", "nonlinearity, $\\parameterVector$, shortly.\n", "\n", "To work with linear Gaussian models, to find the marginal likelihood all\n", "you need to know is the following rules. If $$\n", "\\dataVector = \\mappingMatrix \\inputVector + \\noiseVector,\n", "$$ where $\\dataVector$, $\\inputVector$ and $\\noiseVector$ are vectors\n", "and we assume that $\\inputVector$ and $\\noiseVector$ are drawn from\n", "multivariate Gaussians, \n", "\\begin{align}\n", "\\inputVector & \\sim \\gaussianSamp{\\meanVector}{\\covarianceMatrix}\\\\\n", "\\noiseVector & \\sim \\gaussianSamp{\\zerosVector}{\\covarianceMatrixTwo}\n", "\\end{align}\n", " then we know that $\\dataVector$ is also drawn from a multivariate\n", "Gaussian with, $$\n", "\\dataVector \\sim \\gaussianSamp{\\mappingMatrix\\meanVector}{\\mappingMatrix\\covarianceMatrix\\mappingMatrix^\\top + \\covarianceMatrixTwo}.\n", "$$\n", "\n", "With apprioriately defined covariance, $\\covarianceMatrixTwo$, this is\n", "actually the marginal likelihood for Factor Analysis, or Probabilistic\n", "Principal Component Analysis [@Tipping:probpca99], because we integrated\n", "out the inputs (or *latent* variables they would be called in that\n", "case).\n", "\n", "However, we are focussing on what happens in models which are non-linear\n", "in the inputs, whereas the above would be *linear* in the inputs. To\n", "consider these, we introduce a matrix, called the design matrix. We set\n", "each activation function computed at each data point to be $$\n", "\\activationScalar_{i,j} = \\activationScalar(\\mappingVector^{(1)}_{j}, \\inputVector_{i})\n", "$$ and define the matrix of activations (known as the *design matrix* in\n", "statistics) to be, $$\n", "\\activationMatrix = \n", "\\begin{bmatrix}\n", "\\activationScalar_{1, 1} & \\activationScalar_{1, 2} & \\dots & \\activationScalar_{1, \\numHidden} \\\\\n", "\\activationScalar_{1, 2} & \\activationScalar_{1, 2} & \\dots & \\activationScalar_{1, \\numData} \\\\\n", "\\vdots & \\vdots & \\ddots & \\vdots \\\\\n", "\\activationScalar_{\\numData, 1} & \\activationScalar_{\\numData, 2} & \\dots & \\activationScalar_{\\numData, \\numHidden}\n", "\\end{bmatrix}.\n", "$$ By convention this matrix always has $\\numData$ rows and $\\numHidden$\n", "columns, now if we define the vector of all noise corruptions,\n", "$\\noiseVector = \\left[\\noiseScalar_1, \\dots \\noiseScalar_\\numData\\right]^\\top$.\n", "\n", "If we define the prior distribution over the vector $\\mappingVector$ to\n", "be Gaussian, $$\n", "\\mappingVector \\sim \\gaussianSamp{\\zerosVector}{\\alpha\\eye},\n", "$$\n", "\n", "then we can use rules of multivariate Gaussians to see that, $$\n", "\\dataVector \\sim \\gaussianSamp{\\zerosVector}{\\alpha \\activationMatrix \\activationMatrix^\\top + \\dataStd^2 \\eye}.\n", "$$\n", "\n", "In other words, our training data is distributed as a multivariate\n", "Gaussian, with zero mean and a covariance given by $$\n", "\\kernelMatrix = \\alpha \\activationMatrix \\activationMatrix^\\top + \\dataStd^2 \\eye.\n", "$$\n", "\n", "This is an $\\numData \\times \\numData$ size matrix. Its elements are in\n", "the form of a function. The maths shows that any element, index by $i$\n", "and $j$, is a function *only* of inputs associated with data points $i$\n", "and $j$, $\\dataVector_i$, $\\dataVector_j$.\n", "$\\kernel_{i,j} = \\kernel\\left(\\inputVector_i, \\inputVector_j\\right)$\n", "\n", "If we look at the portion of this function associated only with\n", "$\\mappingFunction(\\cdot)$, i.e. we remove the noise, then we can write\n", "down the covariance associated with our neural network, $$\n", "\\kernel_\\mappingFunction\\left(\\inputVector_i, \\inputVector_j\\right) = \\alpha \\activationVector\\left(\\mappingMatrix_1, \\inputVector_i\\right)^\\top \\activationVector\\left(\\mappingMatrix_1, \\inputVector_j\\right)\n", "$$ so the elements of the covariance or *kernel* matrix are formed by\n", "inner products of the rows of the *design matrix*.\n", "\n", "## Gaussian Process [\${.editsection style=\"\"}\$]{.editsection-bracket style=\"\"}\n", "\n", "This is the essence of a Gaussian process. Instead of making assumptions\n", "about our density over each data point, $\\dataScalar_i$ as i.i.d. we\n", "make a joint Gaussian assumption over our data. The covariance matrix is\n", "now a function of both the parameters of the activation function,\n", "$\\mappingMatrixTwo$, and the input variables, $\\inputMatrix$. This comes\n", "about through integrating out the parameters of the model,\n", "$\\mappingVector$.\n", "\n", "## Basis Functions\n", "\n", "We can basically put anything inside the basis functions, and many\n", "people do. These can be deep kernels [@Cho:deep09] or we can learn the\n", "parameters of a convolutional neural network inside there.\n", "\n", "Viewing a neural network in this way is also what allows us to beform\n", "sensible *batch* normalizations [@Ioffe:batch15].\n", "\n", "## Non-degenerate Gaussian Processes [\${.editsection style=\"\"}\$]{.editsection-bracket style=\"\"}\n", "\n", "The process described above is degenerate. The covariance function is of\n", "rank at most $\\numHidden$ and since the theoretical amount of data could\n", "always increase $\\numData \\rightarrow \\infty$, the covariance function\n", "is not full rank. This means as we increase the amount of data to\n", "infinity, there will come a point where we can't normalize the process\n", "because the multivariate Gaussian has the form, $$\n", "\\gaussianDist{\\mappingFunctionVector}{\\zerosVector}{\\kernelMatrix} = \\frac{1}{\\left(2\\pi\\right)^{\\frac{\\numData}{2}}\\det{\\kernelMatrix}^\\frac{1}{2}} \\exp\\left(-\\frac{\\mappingFunctionVector^\\top\\kernelMatrix \\mappingFunctionVector}{2}\\right)\n", "$$ and a non-degenerate kernel matrix leads to $\\det{\\kernelMatrix} = 0$\n", "defeating the normalization (it's equivalent to finding a projection in\n", "the high dimensional Gaussian where the variance of the the resulting\n", "univariate Gaussian is zero, i.e. there is a null space on the\n", "covariance, or alternatively you can imagine there are one or more\n", "directions where the Gaussian has become the delta function).\n", "\n", "In the machine learning field, it was Radford Neal [@Neal:bayesian94]\n", "that realized the potential of the next step. In his 1994 thesis, he was\n", "considering Bayesian neural networks, of the type we described above,\n", "and in considered what would happen if you took the number of hidden\n", "nodes, or neurons, to infinity, i.e. $\\numHidden \\rightarrow \\infty$.\n", "\n", "
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "*Page 37 of [Radford Neal's 1994\n", "thesis](http://www.cs.toronto.edu/~radford/ftp/thesis.pdf)*\n", "
\n", "In loose terms, what Radford considers is what happens to the elements\n", "of the covariance function, \n", " \\begin{align*}\n", " \\kernel_\\mappingFunction\\left(\\inputVector_i, \\inputVector_j\\right) & = \\alpha \\activationVector\\left(\\mappingMatrix_1, \\inputVector_i\\right)^\\top \\activationVector\\left(\\mappingMatrix_1, \\inputVector_j\\right)\\\\\n", " & = \\alpha \\sum_k \\activationScalar\\left(\\mappingVector^{(1)}_k, \\inputVector_i\\right) \\activationScalar\\left(\\mappingVector^{(1)}_k, \\inputVector_j\\right)\n", " \\end{align*}\n", " if instead of considering a finite number you sample infinitely\n", "many of these activation functions, sampling parameters from a prior\n", "density, $p(\\mappingVectorTwo)$, for each one, $$\n", "\\kernel_\\mappingFunction\\left(\\inputVector_i, \\inputVector_j\\right) = \\alpha \\int \\activationScalar\\left(\\mappingVector^{(1)}, \\inputVector_i\\right) \\activationScalar\\left(\\mappingVector^{(1)}, \\inputVector_j\\right) p(\\mappingVector^{(1)}) \\text{d}\\mappingVector^{(1)}\n", "$$ And that's not *only* for Gaussian $p(\\mappingVectorTwo)$. In fact\n", "this result holds for a range of activations, and a range of prior\n", "densities because of the *central limit theorem*.\n", "\n", "To write it in the form of a probabilistic program, as long as the\n", "distribution for $\\phi_i$ implied by this short probabilistic program,\n", "\n", " \\begin{align*}\n", " \\mappingVectorTwo & \\sim p(\\cdot)\\\\\n", " \\phi_i & = \\activationScalar\\left(\\mappingVectorTwo, \\inputVector_i\\right), \n", " \\end{align*}\n", " has finite variance, then the result of taking the number of hidden\n", "units to infinity, with appropriate scaling, is also a Gaussian process.\n", "\n", "## Further Reading\n", "\n", "To understand this argument in more detail, I highly recommend reading\n", "chapter 2 of Neal's thesis [@Neal:bayesian94], which remains easy to\n", "read and clear today. Indeed, for readers interested in Bayesian neural\n", "networks, both Raford Neal's and David MacKay's PhD thesis\n", "[@MacKay:bayesian92] remain essential reading. Both theses embody a\n", "clarity of thought, and an ability to weave together threads from\n", "different fields that was the business of machine learning in the 1990s.\n", "Radford and David were also pioneers in making their software widely\n", "available and publishing material on the web.\n", "\n", "## Bayesian Inference by Rejection Sampling [\${.editsection style=\"\"}\$]{.editsection-bracket style=\"\"}\n", "\n", "One view of Bayesian inference is to assume we are given a mechanism for\n", "generating samples, where we assume that mechanism is representing on\n", "accurate view on the way we believe the world works.\n", "\n", "This mechanism is known as our *prior* belief.\n", "\n", "We combine our prior belief with our observations of the real world by\n", "discarding all those samples that are inconsistent with our prior. The\n", "*likelihood* defines mathematically what we mean by inconsistent with\n", "the prior. The higher the noise level in the likelihood, the looser the\n", "notion of consistent.\n", "\n", "The samples that remain are considered to be samples from the\n", "*posterior*.\n", "\n", "This approach to Bayesian inference is closely related to two sampling\n", "techniques known as *rejection sampling* and *importance sampling*. It\n", "is realized in practice in an approach known as *approximate Bayesian\n", "computation* (ABC) or likelihood-free inference.\n", "\n", "In practice, the algorithm is often too slow to be practical, because\n", "most samples will be inconsistent with the data and as a result the\n", "mechanism has to be operated many times to obtain a few posterior\n", "samples.\n", "\n", "However, in the Gaussian process case, when the likelihood also assumes\n", "Gaussian noise, we can operate this mechanims mathematically, and obtain\n", "the posterior density *analytically*. This is the benefit of Gaussian\n", "processes." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pods\n", "from ipywidgets import IntSlider" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pods.notebook.display_plots('gp_rejection_sample{sample:0>3}.png', \n", " directory='../slides/diagrams/gp', \n", " sample=IntSlider(1,1,5,1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\\includesvg{../slides/diagrams/gp/gp_rejection_sample003}\n", "
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "*One view of Bayesian inference is we have a machine for generating\n", "samples (the *prior*), and we discard all samples inconsistent with our\n", "data, leaving the samples of interest (the *posterior*). The Gaussian\n", "process allows us to do this analytically.*\n", "
\n", "\n", "\n", "## Sampling a Function [\${.editsection style=\"\"}\$]{.editsection-bracket style=\"\"}\n", "\n", "We will consider a Gaussian distribution with a particular structure of\n", "covariance matrix. We will generate *one* sample from a 25-dimensional\n", "Gaussian density. $$\n", "\\mappingFunctionVector=\\left[\\mappingFunction_{1},\\mappingFunction_{2}\\dots \\mappingFunction_{25}\\right].\n", "$$ in the figure below we plot these data on the $y$-axis against their\n", "*indices* on the $x$-axis." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%load -s Kernel mlai.py" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%load -s polynomial_cov mlai.py" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%load -s exponentiated_quadratic mlai.py" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pods\n", "from ipywidgets import IntSlider" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pods.notebook.display_plots('two_point_sample{sample:0>3}.svg', '../slides/diagrams/gp', sample=IntSlider(0, 0, 8, 1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " \n", "
\n", "*A 25 dimensional correlated random variable (values ploted against\n", "index)*\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pods\n", "from ipywidgets import IntSlider" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pods.notebook.display_plots('two_point_sample{sample:0>3}.svg', '../slides/diagrams/gp', sample=IntSlider(9, 9, 12, 1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " \n", "
\n", "*The joint Gaussian over $\\mappingFunction_1$ and $\\mappingFunction_2$\n", "along with the conditional distribution of $\\mappingFunction_2$ given\n", "$\\mappingFunction_1$*\n", "
\n", "## Uluru [\${.editsection style=\"\"}\$]{.editsection-bracket style=\"\"}\n", "\n", "
\n", "\n", " \n", "\n", "
\n", "\n", "When viewing these contour plots, I sometimes find it helpful to think\n", "of Uluru, the prominent rock formation in Australia. The rock rises\n", "above the surface of the plane, just like a probability density rising\n", "above the zero line. The rock is three dimensional, but when we view\n", "Uluru from the classical position, we are looking at one side of it.\n", "This is equivalent to viewing the marginal density.\n", "\n", "The joint density can be viewed from above, using contours. The\n", "conditional density is equivalent to *slicing* the rock. Uluru is a holy\n", "rock, so this has to be an imaginary slice. Imagine we cut down a\n", "vertical plane orthogonal to our view point (e.g. coming across our view\n", "point). This would give a profile of the rock, which when renormalized,\n", "would give us the conditional distribution, the value of conditioning\n", "would be the location of the slice in the direction we are facing.\n", "\n", "## Prediction with Correlated Gaussians\n", "\n", "Of course in practice, rather than manipulating mountains physically,\n", "the advantage of the Gaussian density is that we can perform these\n", "manipulations mathematically.\n", "\n", "Prediction of $\\mappingFunction_2$ given $\\mappingFunction_1$ requires\n", "the *conditional density*,\n", "$p(\\mappingFunction_2|\\mappingFunction_1)$.Another remarkable property\n", "of the Gaussian density is that this conditional distribution is *also*\n", "guaranteed to be a Gaussian density. It has the form, $$\n", " p(\\mappingFunction_2|\\mappingFunction_1) = \\gaussianDist{\\mappingFunction_2}{\\frac{\\kernelScalar_{1, 2}}{\\kernelScalar_{1, 1}}\\mappingFunction_1}{ \\kernelScalar_{2, 2} - \\frac{\\kernelScalar_{1,2}^2}{\\kernelScalar_{1,1}}}\n", "$$where we have assumed that the covariance of the original joint\n", "density was given by $$\n", " \\kernelMatrix = \\begin{bmatrix} \\kernelScalar_{1, 1} & \\kernelScalar_{1, 2}\\\\ \\kernelScalar_{2, 1} & \\kernelScalar_{2, 2}.\\end{bmatrix}\n", "$$\n", "\n", "Using these formulae we can determine the conditional density for any of\n", "the elements of our vector $\\mappingFunctionVector$. For example, the\n", "variable $\\mappingFunction_8$ is less correlated with\n", "$\\mappingFunction_1$ than $\\mappingFunction_2$. If we consider this\n", "variable we see the conditional density is more diffuse." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pods\n", "from ipywidgets import IntSlider" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pods.notebook.display_plots('two_point_sample{sample:0>3}.svg', '../slides/diagrams/gp', sample=IntSlider(13, 13, 17, 1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " \n", "\n", " \n", "
\n", "*The joint Gaussian over $\\mappingFunction_1$ and $\\mappingFunction_8$\n", "along with the conditional distribution of $\\mappingFunction_8$ given\n", "$\\mappingFunction_1$*\n", "
\n", "- Covariance function, $\\kernelMatrix$\n", "- Determines properties of samples.\n", "- Function of $\\inputMatrix$,\n", " $$\\kernelScalar_{i,j} = \\kernelScalar(\\inputVector_i, \\inputVector_j)$$\n", "\n", "- Posterior mean\n", " $$\\mappingFunction_D(\\inputVector_*) = \\kernelVector(\\inputVector_*, \\inputMatrix) \\kernelMatrix^{-1}\n", " \\mathbf{y}$$\n", "\n", "- Posterior covariance\n", " $$\\mathbf{C}_* = \\kernelMatrix_{*,*} - \\kernelMatrix_{*,\\mappingFunctionVector}\n", " \\kernelMatrix^{-1} \\kernelMatrix_{\\mappingFunctionVector, *}$$\n", "\n", "- Posterior mean" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "$$\\mappingFunction_D(\\inputVector_*) = \\kernelVector(\\inputVector_*, \\inputMatrix) \\boldsymbol{\\alpha}$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Posterior covariance\n", " $$\\covarianceMatrix_* = \\kernelMatrix_{*,*} - \\kernelMatrix_{*,\\mappingFunctionVector}\n", " \\kernelMatrix^{-1} \\kernelMatrix_{\\mappingFunctionVector, *}$$\n", "\n", "## Exponentiated Quadratic Covariance [\${.editsection style=\"\"}\$]{.editsection-bracket style=\"\"}\n", "\n", "The exponentiated quadratic covariance, also known as the Gaussian\n", "covariance or the RBF covariance and the squared exponential. Covariance\n", "between two points is related to the negative exponential of the squared\n", "distnace between those points. This covariance function can be derived\n", "in a few different ways: as the infinite limit of a radial basis\n", "function neural network, as diffusion in the heat equation, as a\n", "Gaussian filter in *Fourier space* or as the composition as a series of\n", "linear filters applied to a base function.\n", "\n", "The covariance takes the following form, $$\n", "\\kernelScalar(\\inputVector, \\inputVector^\\prime) = \\alpha \\exp\\left(-\\frac{\\ltwoNorm{\\inputVector-\\inputVector^\\prime}^2}{2\\lengthScale^2}\\right)\n", "$$ where $\\ell$ is the *length scale* or *time scale* of the process and\n", "$\\alpha$ represents the overall process variance.\n", "\n", "
\n", "$$\\kernelScalar(\\inputVector, \\inputVector^\\prime) = \\alpha \\exp\\left(-\\frac{\\ltwoNorm{\\inputVector-\\inputVector^\\prime}^2}{2\\lengthScale^2}\\right)$$\n", "
\n", "\n", "\n", "\n", "\n", "\n", "
 \n", " \n", " \n", "\n", "\n", " \n", "\n", "\n", "\n", "
\n", "
\n", "*The exponentiated quadratic covariance function.*\n", "
\n", "## Olympic Marathon Data [\${.editsection style=\"\"}\$]{.editsection-bracket style=\"\"}\n", "\n", "\n", "\n", "\n", "\n", "\n", "
 \n", "- Gold medal times for Olympic Marathon since 1896.\n", "- Marathons before 1924 didn’t have a standardised distance.\n", "- Present results using pace per km.\n", "- In 1904 Marathon was badly organised leading to very slow times.\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", "\n", "Image from Wikimedia Commons \n", "
\n", "The first thing we will do is load a standard data set for regression\n", "modelling. The data consists of the pace of Olympic Gold Medal Marathon\n", "winners for the Olympics from 1896 to present. First we load in the data\n", "and plot." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pods" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = pods.datasets.olympic_marathon_men()\n", "x = data['X']\n", "y = data['Y']\n", "\n", "offset = y.mean()\n", "scale = np.sqrt(y.var())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import teaching_plots as plot\n", "import mlai" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "xlim = (1875,2030)\n", "ylim = (2.5, 6.5)\n", "yhat = (y-offset)/scale\n", "\n", "fig, ax = plt.subplots(figsize=plot.big_wide_figsize)\n", "_ = ax.plot(x, y, 'r.',markersize=10)\n", "ax.set_xlabel('year', fontsize=20)\n", "ax.set_ylabel('pace min/km', fontsize=20)\n", "ax.set_xlim(xlim)\n", "ax.set_ylim(ylim)\n", "\n", "mlai.write_figure(figure=fig, \n", " filename='../slides/diagrams/datasets/olympic-marathon.svg', \n", " transparent=True, \n", " frameon=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", " \n", "\n", "
\n", "\n", "Things to notice about the data include the outlier in 1904, in this\n", "year, the olympics was in St Louis, USA. Organizational problems and\n", "challenges with dust kicked up by the cars following the race meant that\n", "participants got lost, and only very few participants completed.\n", "\n", "More recent years see more consistently quick marathons.\n", "\n", "## Alan Turing [\${.editsection style=\"\"}\$]{.editsection-bracket style=\"\"}\n", "\n", "\n", "\n", "\n", "\n", "\n", "
 \n", "\n", "\n", " \n", "\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", "\n", "
\n", "
\n", "*Alan Turing, in 1946 he was only 11 minutes slower than the winner of\n", "the 1948 games. Would he have won a hypothetical games held in 1946?\n", "Source: [Alan Turing Internet\n", "Scrapbook](http://www.turing.org.uk/scrapbook/run.html).*\n", "
\n", "If we had to summarise the objectives of machine learning in one word, a\n", "very good candidate for that word would be *generalization*. What is\n", "generalization? From a human perspective it might be summarised as the\n", "ability to take lessons learned in one domain and apply them to another\n", "domain. If we accept the definition given in the first session for\n", "machine learning, $$\n", "\\text{data} + \\text{model} \\xrightarrow{\\text{compute}} \\text{prediction}\n", "$$ then we see that without a model we can't generalise: we only have\n", "data. Data is fine for answering very specific questions, like \"Who won\n", "the Olympic Marathon in 2012?\", because we have that answer stored,\n", "however, we are not given the answer to many other questions. For\n", "example, Alan Turing was a formidable marathon runner, in 1946 he ran a\n", "time 2 hours 46 minutes (just under four minutes per kilometer, faster\n", "than I and most of the other [Endcliffe Park\n", "Run](http://www.parkrun.org.uk/sheffieldhallam/) runners can do 5 km).\n", "What is the probability he would have won an Olympics if one had been\n", "held in 1946?\n", "\n", "To answer this question we need to generalize, but before we formalize\n", "the concept of generalization let's introduce some formal representation\n", "of what it means to generalize in machine learning.\n", "\n", "\n", "Our first objective will be to perform a Gaussian process fit to the\n", "data, we'll do this using the [GPy\n", "software](https://github.com/SheffieldML/GPy)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import GPy" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "m_full = GPy.models.GPRegression(x,yhat)\n", "_ = m_full.optimize() # Optimize parameters of covariance function" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first command sets up the model, then m_full.optimize() optimizes\n", "the parameters of the covariance function and the noise level of the\n", "model. Once the fit is complete, we'll try creating some test points,\n", "and computing the output of the GP model in terms of the mean and\n", "standard deviation of the posterior functions between 1870 and 2030. We\n", "plot the mean function and the standard deviation at 200 locations. We\n", "can obtain the predictions using y_mean, y_var = m_full.predict(xt)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "xt = np.linspace(1870,2030,200)[:,np.newaxis]\n", "yt_mean, yt_var = m_full.predict(xt)\n", "yt_sd=np.sqrt(yt_var)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we plot the results using the helper function in teaching_plots." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import teaching_plots as plot" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=plot.big_wide_figsize)\n", "plot.model_output(m_full, scale=scale, offset=offset, ax=ax, xlabel='year', ylabel='pace min/km', fontsize=20, portion=0.2)\n", "ax.set_xlim(xlim)\n", "ax.set_ylim(ylim)\n", "mlai.write_figure(figure=fig,\n", " filename='../slides/diagrams/gp/olympic-marathon-gp.svg', \n", " transparent=True, frameon=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " \n", "\n", "## Fit Quality\n", "\n", "In the fit we see that the error bars (coming mainly from the noise\n", "variance) are quite large. This is likely due to the outlier point in\n", "1904, ignoring that point we can see that a tighter fit is obtained. To\n", "see this making a version of the model, m_clean, where that point is\n", "removed." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x_clean=np.vstack((x[0:2, :], x[3:, :]))\n", "y_clean=np.vstack((y[0:2, :], y[3:, :]))\n", "\n", "m_clean = GPy.models.GPRegression(x_clean,y_clean)\n", "_ = m_clean.optimize()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Can we determine covariance parameters from the data?\n", "\n", "$$\\gaussianDist{\\dataVector}{\\mathbf{0}}{\\kernelMatrix}=\\frac{1}{(2\\pi)^\\frac{\\numData}{2}{\\det{\\kernelMatrix}^{\\frac{1}{2}}}}{\\exp\\left(-\\frac{\\dataVector^{\\top}\\kernelMatrix^{-1}\\dataVector}{2}\\right)}$$\n", "\n", "\\begin{aligned}\n", " \\gaussianDist{\\dataVector}{\\mathbf{0}}{\\kernelMatrix}=\\frac{1}{(2\\pi)^\\frac{\\numData}{2}{\\color{black} \\det{\\kernelMatrix}^{\\frac{1}{2}}}}{\\color{black}\\exp\\left(-\\frac{\\dataVector^{\\top}\\kernelMatrix^{-1}\\dataVector}{2}\\right)}\n", "\\end{aligned}\n", "\n", "\n", "\n", "\\begin{aligned}\n", " \\log \\gaussianDist{\\dataVector}{\\mathbf{0}}{\\kernelMatrix}=&{\\color{black}-\\frac{1}{2}\\log\\det{\\kernelMatrix}}{\\color{black}-\\frac{\\dataVector^{\\top}\\kernelMatrix^{-1}\\dataVector}{2}} \\\\ &-\\frac{\\numData}{2}\\log2\\pi\n", "\\end{aligned}\n", "\n", "\n", "$$\n", "\\errorFunction(\\parameterVector) = {\\color{black} \\frac{1}{2}\\log\\det{\\kernelMatrix}} + {\\color{black} \\frac{\\dataVector^{\\top}\\kernelMatrix^{-1}\\dataVector}{2}}\n", "$$\n", "\n", "The parameters are *inside* the covariance function (matrix).\n", "\\normalsize\n", "$$\\kernelScalar_{i, j} = \\kernelScalar(\\inputVals_i, \\inputVals_j; \\parameterVector)$$\n", "\n", "[\\Large\n", "$$\\kernelMatrix = \\rotationMatrix \\eigenvalueMatrix^2 \\rotationMatrix^\\top$$]{}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gpoptimizePlot1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n", "\n", "\n", "
 \n", "\n", "\n", " \n", "\n", "\n", "\n", " \n", "$\\eigenvalueMatrix$ represents distance on axes. $\\rotationMatrix$ gives\n", "rotation.\n", "
\n", "- $\\eigenvalueMatrix$ is *diagonal*,\n", " $\\rotationMatrix^\\top\\rotationMatrix = \\eye$.\n", "- Useful representation since\n", " $\\det{\\kernelMatrix} = \\det{\\eigenvalueMatrix^2} = \\det{\\eigenvalueMatrix}^2$.\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
 \n", " \n", " \n", "\\includesvg{../slides/diagrams/gp/gp-optimise010}\n", "
\n", "\n", "\n", "\n", "\n", "\n", "
 \n", " \n", " \n", "\\includesvg{../slides/diagrams/gp/gp-optimise021}\n", "
\n", "
\n", "*Variation in the data fit term, the capacity term and the negative log\n", "likelihood for different lengthscales.*\n", "
\n", "## Della Gatta Gene Data [\${.editsection style=\"\"}\$]{.editsection-bracket style=\"\"}\n", "\n", "- Given given expression levels in the form of a time series from\n", " @DellaGatta:direct08.\n", "\n", " {.python}\n", "import numpy as np\n", "import pods" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = pods.datasets.della_gatta_TRP63_gene_expression(data_set='della_gatta',gene_number=937)\n", "\n", "x = data['X']\n", "y = data['Y']\n", "\n", "offset = y.mean()\n", "scale = np.sqrt(y.var())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import teaching_plots as plot\n", "import mlai" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "xlim = (-20,260)\n", "ylim = (5, 7.5)\n", "yhat = (y-offset)/scale\n", "\n", "fig, ax = plt.subplots(figsize=plot.big_wide_figsize)\n", "_ = ax.plot(x, y, 'r.',markersize=10)\n", "ax.set_xlabel('time/min', fontsize=20)\n", "ax.set_ylabel('expression', fontsize=20)\n", "ax.set_xlim(xlim)\n", "ax.set_ylim(ylim)\n", "\n", "mlai.write_figure(figure=fig, \n", " filename='../slides/diagrams/datasets/della-gatta-gene.svg', \n", " transparent=True, \n", " frameon=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", " \n", "\n", "
\n", "\n", "- Want to detect if a gene is expressed or not, fit a GP to each gene\n", " @Kalaitzis:simple11.\n", "\n", "
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "\n", "
\n", "Our first objective will be to perform a Gaussian process fit to the\n", "data, we'll do this using the [GPy\n", "software](https://github.com/SheffieldML/GPy)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import GPy" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "m_full = GPy.models.GPRegression(x,yhat)\n", "m_full.kern.lengthscale=50\n", "_ = m_full.optimize() # Optimize parameters of covariance function" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Initialize the length scale parameter (which here actually represents a\n", "*time scale* of the covariance function to a reasonable value. Default\n", "would be 1, but here we set it to 50 minutes, given points are arriving\n", "across zero to 250 minutes." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "xt = np.linspace(-20,260,200)[:,np.newaxis]\n", "yt_mean, yt_var = m_full.predict(xt)\n", "yt_sd=np.sqrt(yt_var)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we plot the results using the helper function in teaching_plots`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import teaching_plots as plot" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=plot.big_wide_figsize)\n", "plot.model_output(m_full, scale=scale, offset=offset, ax=ax, xlabel='time/min', ylabel='expression', fontsize=20, portion=0.2)\n", "ax.set_xlim(xlim)\n", "ax.set_ylim(ylim)\n", "ax.set_title('log likelihood: {ll:.3}'.format(ll=m_full.log_likelihood()), fontsize=20)\n", "mlai.write_figure(figure=fig,\n", " filename='../slides/diagrams/gp/della-gatta-gene-gp.svg', \n", " transparent=True, frameon=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " \n", "\n", "Now we try a model initialized with a longer length scale." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "m_full2 = GPy.models.GPRegression(x,yhat)\n", "m_full2.kern.lengthscale=2000\n", "_ = m_full2.optimize() # Optimize parameters of covariance function" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import teaching_plots as plot" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=plot.big_wide_figsize)\n", "plot.model_output(m_full2, scale=scale, offset=offset, ax=ax, xlabel='time/min', ylabel='expression', fontsize=20, portion=0.2)\n", "ax.set_xlim(xlim)\n", "ax.set_ylim(ylim)\n", "ax.set_title('log likelihood: {ll:.3}'.format(ll=m_full2.log_likelihood()), fontsize=20)\n", "mlai.write_figure(figure=fig,\n", " filename='../slides/diagrams/gp/della-gatta-gene-gp2.svg', \n", " transparent=True, frameon=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " \n", "\n", "Now we try a model initialized with a lower noise." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "m_full3 = GPy.models.GPRegression(x,yhat)\n", "m_full3.kern.lengthscale=20\n", "m_full3.likelihood.variance=0.001\n", "_ = m_full3.optimize() # Optimize parameters of covariance function" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import teaching_plots as plot" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=plot.big_wide_figsize)\n", "plot.model_output(m_full3, scale=scale, offset=offset, ax=ax, xlabel='time/min', ylabel='expression', fontsize=20, portion=0.2)\n", "ax.set_xlim(xlim)\n", "ax.set_ylim(ylim)\n", "ax.set_title('log likelihood: {ll:.3}'.format(ll=m_full3.log_likelihood()), fontsize=20)\n", "mlai.write_figure(figure=fig,\n", " filename='../slides/diagrams/gp/della-gatta-gene-gp3.svg', \n", " transparent=True, frameon=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " \n", "\n", " \n", "\n", "\n", "## Example: Prediction of Malaria Incidence in Uganda [\${.editsection style=\"\"}\$]{.editsection-bracket style=\"\"}\n", "\n", "[ ]{style=\"text-align:right\"}\n", "\n", "As an example of using Gaussian process models within the full pipeline\n", "from data to decsion, we'll consider the prediction of Malaria incidence\n", "in Uganda. For the purposes of this study malaria reports come in two\n", "forms, HMIS reports from health centres and Sentinel data, which is\n", "curated by the WHO. There are limited sentinel sites and many HMIS\n", "sites.\n", "\n", "The work is from Ricardo Andrade Pacheco's PhD thesis, completed in\n", "collaboration with John Quinn and Martin Mubangizi\n", "[@Andrade:consistent14; @Mubangizi:malaria14]. John and Martin were\n", "initally from the AI-DEV group from the University of Makerere in\n", "Kampala and more latterly they were based at UN Global Pulse in Kampala.\n", "\n", "Malaria data is spatial data. Uganda is split into districts, and health\n", "reports can be found for each district. This suggests that models such\n", "as conditional random fields could be used for spatial modelling, but\n", "there are two complexities with this. First of all, occasionally\n", "districts split into two. Secondly, sentinel sites are a specific\n", "location within a district, such as Nagongera which is a sentinel site\n", "based in the Tororo district.\n", "\n", "
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "*Ugandan districs. Data SRTM/NASA from\n", "*\n", "
\n", "[[@Andrade:consistent14; @Mubangizi:malaria14]]{style=\"text-align:right\"}\n", "\n", " \n", "
\n", "*The Kapchorwa District, home district of Stephen Kiprotich.*\n", "
\n", "Stephen Kiprotich, the 2012 gold medal winner from the London Olympics,\n", "comes from Kapchorwa district, in eastern Uganda, near the border with\n", "Kenya.\n", "\n", "The common standard for collecting health data on the African continent\n", "is from the Health management information systems (HMIS). However, this\n", "data suffers from missing values [@Gething:hmis06] and diagnosis of\n", "diseases like typhoid and malaria may be confounded.\n", "\n", " \n", "
\n", "*The Tororo district, where the sentinel site, Nagongera is located*\n", "
\n", "[World Health Organization Sentinel Surveillance\n", "systems](https://www.who.int/immunization/monitoring_surveillance/burden/vpd/surveillance_type/sentinel/en/)\n", "are set up \"when high-quality data are needed about a particular disease\n", "that cannot be obtained through a passive system\". Several sentinel\n", "sites give accurate assessment of malaria disease levels in Uganda,\n", "including a site in Nagongera.\n", "\n", "
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "*Sentinel and HMIS data along with rainfall and temperature for the\n", "Nagongera sentinel station in the Tororo district.*\n", "
\n", "In collaboration with the AI Research Group at Makerere we chose to\n", "investigate whether Gaussian process models could be used to assimilate\n", "information from these two different sources of disease informaton.\n", "Further, we were interested in whether local information on rainfall and\n", "temperature could be used to improve malaria estimates.\n", "\n", "The aim of the project was to use WHO Sentinel sites, alongside rainfall\n", "and temperature, to improve predictions from HMIS data of levels of\n", "malaria.\n", "\n", " \n", "\n", "
\n", "\n", " \n", "\n", "
\n", "\n", "{\n", "
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "*The project arose out of the Gaussian process summer school held at\n", "Makerere in Kampala in 2013. The school led, in turn, to the Data\n", "Science Africa initiative.*\n", "
\n", "## Early Warning Systems\n", "\n", " \n", "\n", "
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "*Estimate of the current disease situation in the Kabarole district over\n", "time. Estimate is constructed with a Gaussian process with an additive\n", "covariance funciton.*\n", "
\n", "Health monitoring system for the Kabarole district. Here we have fitted\n", "the reports with a Gaussian process with an additive covariance\n", "function. It has two components, one is a long time scale component (in\n", "red above) the other is a short time scale component (in blue).\n", "\n", "Monitoring proceeds by considering two aspects of the curve. Is the blue\n", "line (the short term report signal) above the red (which represents the\n", "long term trend? If so we have higher than expected reports. If this is\n", "the case *and* the gradient is still positive (i.e. reports are going\n", "up) we encode this with a *red* color. If it is the case and the\n", "gradient of the blue line is negative (i.e. reports are going down) we\n", "encode this with an *amber* color. Conversely, if the blue line is below\n", "the red *and* decreasing, we color *green*. On the other hand if it is\n", "below red but increasing, we color *yellow*.\n", "\n", "This gives us an early warning system for disease. Red is a bad\n", "situation getting worse, amber is bad, but improving. Green is good and\n", "getting better and yellow good but degrading.\n", "\n", "Finally, there is a gray region which represents when the scale of the\n", "effect is small.\n", "\n", "
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "*The map of Ugandan districts with an overview of the Malaria situation\n", "in each district.*\n", "
\n", "These colors can now be observed directly on a spatial map of the\n", "districts to give an immediate impression of the current status of the\n", "disease across the country.\n", "\n", "## Additive Covariance [\${.editsection style=\"\"}\$]{.editsection-bracket style=\"\"}\n", "\n", "An additive covariance function is derived from considering the result\n", "of summing two Gaussian processes together. If the first Gaussian\n", "process is $g(\\cdot)$, governed by covariance\n", "$\\kernelScalar_g(\\cdot, \\cdot)$ and the second process is $h(\\cdot)$,\n", "governed by covariance $\\kernelScalar_h(\\cdot, \\cdot)$ then the combined\n", "process $f(\\cdot) = g(\\cdot) + h(\\cdot)$ is govererned by a covariance\n", "function, $$\n", "\\kernelScalar_f(\\inputVector, \\inputVector^\\prime) = \\kernelScalar_g(\\inputVector, \\inputVector^\\prime) + \\kernelScalar_h(\\inputVector, \\inputVector^\\prime)\n", "$$\n", "\n", "
\n", "$$\\kernelScalar_f(\\inputVector, \\inputVector^\\prime) = \\kernelScalar_g(\\inputVector, \\inputVector^\\prime) + \\kernelScalar_h(\\inputVector, \\inputVector^\\prime)$$\n", "
\n", "\n", "\n", "\n", "\n", "\n", "
 \n", " \n", " \n", "\n", "\n", " \n", "\n", "\n", "\n", "
\n", "
\n", "*An additive covariance function formed by combining two exponentiated\n", "quadratic covariance functions.*\n", "
\n", "## Analysis of US Birth Rates [\${.editsection style=\"\"}\$]{.editsection-bracket style=\"\"}\n", "\n", "
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "*This is a retrospective analysis of US births by Aki Vehtari. The\n", "challenges of forecasting. Even with seasonal and weekly effects removed\n", "there are significant effects on holidays, weekends, etc.*\n", "
\n", "There's a nice analysis of US birth rates by Gaussian processes with\n", "additive covariances in @Gelman:bayesian13. A combination of covariance\n", "functions are used to take account of weekly and yearly trends. The\n", "analysis is summarized on the cover of the book.\n", "\n", "## Basis Function Covariance [\${.editsection style=\"\"}\$]{.editsection-bracket style=\"\"}\n", "\n", "The fixed basis function covariance just comes from the properties of a\n", "multivariate Gaussian, if we decide $$\n", "\\mappingFunctionVector=\\basisMatrix\\mappingVector\n", "$$ and then we assume $$\n", "\\mappingVector \\sim \\gaussianSamp{\\zerosVector}{\\alpha\\eye}\n", "$$ then it follows from the properties of a multivariate Gaussian that\n", "$$\n", "\\mappingFunctionVector \\sim \\gaussianSamp{\\zerosVector}{\\alpha\\basisMatrix\\basisMatrix^\\top}\n", "$$ meaning that the vector of observations from the function is jointly\n", "distributed as a Gaussian process and the covariance matrix is\n", "$\\kernelMatrix = \\alpha\\basisMatrix \\basisMatrix^\\top$, each element of\n", "the covariance matrix can then be found as the inner product between two\n", "rows of the basis funciton matrix." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%load -s basis_cov mlai.py" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%load -s radial mlai.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "$$\\kernel(\\inputVector, \\inputVector^\\prime) = \\basisVector(\\inputVector)^\\top \\basisVector(\\inputVector^\\prime)$$\n", "
\n", "\n", "\n", "\n", "\n", "\n", "
 \n", " \n", " \n", "\n", "\n", " \n", "\n", "\n", "\n", "
\n", "
\n", "*A covariance function based on a non-linear basis given by\n", "$\\basisVector(\\inputVector)$.*\n", "
\n", "## Brownian Covariance [\${.editsection style=\"\"}\$]{.editsection-bracket style=\"\"}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%load -s brownian_cov mlai.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Brownian motion is also a Gaussian process. It follows a Gaussian random\n", "walk, with diffusion occuring at each time point driven by a Gaussian\n", "input. This implies it is both Markov and Gaussian. The covariance\n", "function for Brownian motion has the form $$\n", "\\kernelScalar(t, t^\\prime)=\\alpha \\min(t, t^\\prime)\n", "$$\n", "\n", "
\n", "$$\\kernelScalar(t, t^\\prime)=\\alpha \\min(t, t^\\prime)$$\n", "
\n", "\n", "\n", "\n", "\n", "\n", "
 \n", " \n", " \n", "\n", "\n", " \n", "\n", "\n", "\n", "
\n", "## MLP Covariance [\${.editsection style=\"\"}\$]{.editsection-bracket style=\"\"}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%load -s mlp_cov mlai.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The multi-layer perceptron (MLP) covariance, also known as the neural\n", "network covariance or the arcsin covariance, is derived by considering\n", "the infinite limit of a neural network.\n", "\n", "
\n", "$$\\kernelScalar(\\inputVector, \\inputVector^\\prime) = \\alpha \\arcsin\\left(\\frac{w \\inputVector^\\top \\inputVector^\\prime + b}{\\sqrt{\\left(w \\inputVector^\\top \\inputVector + b + 1\\right)\\left(w \\left.\\inputVector^\\prime\\right.^\\top \\inputVector^\\prime + b + 1\\right)}}\\right)$$\n", "
\n", "\n", "\n", "\n", "\n", "\n", "
 \n", " \n", " \n", "\n", "\n", " \n", "\n", "\n", "\n", "
\n", "
\n", "*The multi-layer perceptron covariance function. This is derived by\n", "considering the infinite limit of a neural network with probit\n", "activation functions.*\n", "
\n", "## GPSS: Gaussian Process Summer School [\${.editsection style=\"\"}\$]{.editsection-bracket style=\"\"}\n", "\n", "
\n", "\n", "\\includesvgclass{../slides/diagrams/logo/gpss-logo.svg}\n", "\n", "
\n", "\n", "If you're interested in finding out more about Gaussian processes, you\n", "can attend the Gaussian process summer school, or view the lectures and\n", "material on line. Details of the school, future events and past events\n", "can be found at the website .\n", "\n", "## GPy: A Gaussian Process Framework in Python [\${.editsection style=\"\"}\$]{.editsection-bracket style=\"\"}\n", "\n", "
\n", "\n", " \n", "\n", "
\n", "\n", "GPy is a BSD licensed software code base for implementing Gaussian\n", "process models in python. This allows GPs to be combined with a wide\n", "variety of software libraries.\n", "\n", "The software itself is available on\n", "[GitHub](https://github.com/SheffieldML/GPy) and the team welcomes\n", "contributions.\n", "\n", "The aim for GPy is to be a probabilistic-style programming language,\n", "i.e. you specify the model rather than the algorithm. As well as a large\n", "range of covariance functions the software allows for non-Gaussian\n", "likelihoods, multivariate outputs, dimensionality reduction and\n", "approximations for larger data sets.\n", "\n", "## Other Software [\${.editsection style=\"\"}\$]{.editsection-bracket style=\"\"}\n", "\n", "GPy has inspired other software solutions, first of all\n", "[GPflow](https://github.com/GPflow/GPflow), which uses Tensor Flow's\n", "automatic differentiation engine to allow rapid prototyping of new\n", "covariance functions and algorithms. More recently,\n", "[GPyTorch](https://github.com/cornellius-gp/gpytorch) uses PyTorch for\n", "the same purpose.\n", "\n", "GPy itself is being restructured with MXFusion as its computational\n", "engine to give similiar capabilities.\n", "\n", "## MXFusion: Modular Probabilistic Programming on MXNet [\${.editsection style=\"\"}\$]{.editsection-bracket style=\"\"}\n", "\n", "
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "\n", " \n", "\n", "
 \n", "- Work by Eric Meissner and Zhenwen Dai.\n", "- Probabilistic programming.\n", "- Available on [Github](https://github.com/amzn/mxfusion)\n", " \n", " " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ " \n", "\n", "\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Acknowledgments\n", "\n", "Stefanos Eleftheriadis, John Bronskill, Hugh Salimbeni, Rich Turner,\n", "Zhenwen Dai, Javier Gonzalez, Andreas Damianou, Mark Pullin, Michael\n", "Smith, James Hensman, John Quinn, Martin Mubangizi.\n", "\n", "# References {#references .unnumbered}\n", "\n", "[^1]: In classical statistics we often interpret these parameters,\n", " $\\beta$, whereas in machine learning we are normally more interested\n", " in the result of the prediction, and less in the prediction.\n", " Although this is changing with more need for accountability. In\n", " honour of this I normally use $\\boldsymbol{\\beta}$ when I care about\n", " the value of these parameters, and $\\mappingVector$ when I care more\n", " about the quality of the prediction." ] } ], "metadata": {}, "nbformat": 4, "nbformat_minor": 2 }