{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction to Gaussian Processes\n", "### [Neil D. Lawrence](http://inverseprobability.com), Amazon Cambridge and University of Sheffield\n", "### 2018-09-03\n", "\n", "**Abstract**: In this talk we introduce Gaussian process models. Motivating the\n", "representation of uncertainty through probability distributions we\n", "review Laplace's approach to understanding uncertainty and how\n", "uncertainty in functions can be represented through a multivariate\n", "Gaussian density.\n", "\n", "$$\n", "\\newcommand{\\Amatrix}{\\mathbf{A}}\n", "\\newcommand{\\KL}[2]{\\text{KL}\\left( #1\\,\\|\\,#2 \\right)}\n", "\\newcommand{\\Kaast}{\\kernelMatrix_{\\mathbf{ \\ast}\\mathbf{ \\ast}}}\n", "\\newcommand{\\Kastu}{\\kernelMatrix_{\\mathbf{ \\ast} \\inducingVector}}\n", "\\newcommand{\\Kff}{\\kernelMatrix_{\\mappingFunctionVector \\mappingFunctionVector}}\n", "\\newcommand{\\Kfu}{\\kernelMatrix_{\\mappingFunctionVector \\inducingVector}}\n", "\\newcommand{\\Kuast}{\\kernelMatrix_{\\inducingVector \\bf\\ast}}\n", "\\newcommand{\\Kuf}{\\kernelMatrix_{\\inducingVector \\mappingFunctionVector}}\n", "\\newcommand{\\Kuu}{\\kernelMatrix_{\\inducingVector \\inducingVector}}\n", "\\newcommand{\\Kuui}{\\Kuu^{-1}}\n", "\\newcommand{\\Qaast}{\\mathbf{Q}_{\\bf \\ast \\ast}}\n", "\\newcommand{\\Qastf}{\\mathbf{Q}_{\\ast \\mappingFunction}}\n", "\\newcommand{\\Qfast}{\\mathbf{Q}_{\\mappingFunctionVector \\bf \\ast}}\n", "\\newcommand{\\Qff}{\\mathbf{Q}_{\\mappingFunctionVector \\mappingFunctionVector}}\n", "\\newcommand{\\aMatrix}{\\mathbf{A}}\n", "\\newcommand{\\aScalar}{a}\n", "\\newcommand{\\aVector}{\\mathbf{a}}\n", "\\newcommand{\\acceleration}{a}\n", "\\newcommand{\\bMatrix}{\\mathbf{B}}\n", "\\newcommand{\\bScalar}{b}\n", "\\newcommand{\\bVector}{\\mathbf{b}}\n", "\\newcommand{\\basisFunc}{\\phi}\n", "\\newcommand{\\basisFuncVector}{\\boldsymbol{ \\basisFunc}}\n", "\\newcommand{\\basisFunction}{\\phi}\n", "\\newcommand{\\basisLocation}{\\mu}\n", "\\newcommand{\\basisMatrix}{\\boldsymbol{ \\Phi}}\n", "\\newcommand{\\basisScalar}{\\basisFunction}\n", "\\newcommand{\\basisVector}{\\boldsymbol{ \\basisFunction}}\n", "\\newcommand{\\activationFunction}{\\phi}\n", "\\newcommand{\\activationMatrix}{\\boldsymbol{ \\Phi}}\n", "\\newcommand{\\activationScalar}{\\basisFunction}\n", "\\newcommand{\\activationVector}{\\boldsymbol{ \\basisFunction}}\n", "\\newcommand{\\bigO}{\\mathcal{O}}\n", "\\newcommand{\\binomProb}{\\pi}\n", "\\newcommand{\\cMatrix}{\\mathbf{C}}\n", "\\newcommand{\\cbasisMatrix}{\\hat{\\boldsymbol{ \\Phi}}}\n", "\\newcommand{\\cdataMatrix}{\\hat{\\dataMatrix}}\n", "\\newcommand{\\cdataScalar}{\\hat{\\dataScalar}}\n", "\\newcommand{\\cdataVector}{\\hat{\\dataVector}}\n", "\\newcommand{\\centeredKernelMatrix}{\\mathbf{ \\MakeUppercase{\\centeredKernelScalar}}}\n", "\\newcommand{\\centeredKernelScalar}{b}\n", "\\newcommand{\\centeredKernelVector}{\\centeredKernelScalar}\n", "\\newcommand{\\centeringMatrix}{\\mathbf{H}}\n", "\\newcommand{\\chiSquaredDist}[2]{\\chi_{#1}^{2}\\left(#2\\right)}\n", "\\newcommand{\\chiSquaredSamp}[1]{\\chi_{#1}^{2}}\n", "\\newcommand{\\conditionalCovariance}{\\boldsymbol{ \\Sigma}}\n", "\\newcommand{\\coregionalizationMatrix}{\\mathbf{B}}\n", "\\newcommand{\\coregionalizationScalar}{b}\n", "\\newcommand{\\coregionalizationVector}{\\mathbf{ \\coregionalizationScalar}}\n", "\\newcommand{\\covDist}[2]{\\text{cov}_{#2}\\left(#1\\right)}\n", "\\newcommand{\\covSamp}[1]{\\text{cov}\\left(#1\\right)}\n", 
"\\newcommand{\\covarianceScalar}{c}\n", "\\newcommand{\\covarianceVector}{\\mathbf{ \\covarianceScalar}}\n", "\\newcommand{\\covarianceMatrix}{\\mathbf{C}}\n", "\\newcommand{\\covarianceMatrixTwo}{\\boldsymbol{ \\Sigma}}\n", "\\newcommand{\\croupierScalar}{s}\n", "\\newcommand{\\croupierVector}{\\mathbf{ \\croupierScalar}}\n", "\\newcommand{\\croupierMatrix}{\\mathbf{ \\MakeUppercase{\\croupierScalar}}}\n", "\\newcommand{\\dataDim}{p}\n", "\\newcommand{\\dataIndex}{i}\n", "\\newcommand{\\dataIndexTwo}{j}\n", "\\newcommand{\\dataMatrix}{\\mathbf{Y}}\n", "\\newcommand{\\dataScalar}{y}\n", "\\newcommand{\\dataSet}{\\mathcal{D}}\n", "\\newcommand{\\dataStd}{\\sigma}\n", "\\newcommand{\\dataVector}{\\mathbf{ \\dataScalar}}\n", "\\newcommand{\\decayRate}{d}\n", "\\newcommand{\\degreeMatrix}{\\mathbf{ \\MakeUppercase{\\degreeScalar}}}\n", "\\newcommand{\\degreeScalar}{d}\n", "\\newcommand{\\degreeVector}{\\mathbf{ \\degreeScalar}}\n", "% Already defined by latex\n", "%\\newcommand{\\det}[1]{\\left|#1\\right|}\n", "\\newcommand{\\diag}[1]{\\text{diag}\\left(#1\\right)}\n", "\\newcommand{\\diagonalMatrix}{\\mathbf{D}}\n", "\\newcommand{\\diff}[2]{\\frac{\\text{d}#1}{\\text{d}#2}}\n", "\\newcommand{\\diffTwo}[2]{\\frac{\\text{d}^2#1}{\\text{d}#2^2}}\n", "\\newcommand{\\displacement}{x}\n", "\\newcommand{\\displacementVector}{\\textbf{\\displacement}}\n", "\\newcommand{\\distanceMatrix}{\\mathbf{ \\MakeUppercase{\\distanceScalar}}}\n", "\\newcommand{\\distanceScalar}{d}\n", "\\newcommand{\\distanceVector}{\\mathbf{ \\distanceScalar}}\n", "\\newcommand{\\eigenvaltwo}{\\ell}\n", "\\newcommand{\\eigenvaltwoMatrix}{\\mathbf{L}}\n", "\\newcommand{\\eigenvaltwoVector}{\\mathbf{l}}\n", "\\newcommand{\\eigenvalue}{\\lambda}\n", "\\newcommand{\\eigenvalueMatrix}{\\boldsymbol{ \\Lambda}}\n", "\\newcommand{\\eigenvalueVector}{\\boldsymbol{ \\lambda}}\n", "\\newcommand{\\eigenvector}{\\mathbf{ \\eigenvectorScalar}}\n", "\\newcommand{\\eigenvectorMatrix}{\\mathbf{U}}\n", "\\newcommand{\\eigenvectorScalar}{u}\n", "\\newcommand{\\eigenvectwo}{\\mathbf{v}}\n", "\\newcommand{\\eigenvectwoMatrix}{\\mathbf{V}}\n", "\\newcommand{\\eigenvectwoScalar}{v}\n", "\\newcommand{\\entropy}[1]{\\mathcal{H}\\left(#1\\right)}\n", "\\newcommand{\\errorFunction}{E}\n", "\\newcommand{\\expDist}[2]{\\left<#1\\right>_{#2}}\n", "\\newcommand{\\expSamp}[1]{\\left<#1\\right>}\n", "\\newcommand{\\expectation}[1]{\\left\\langle #1 \\right\\rangle }\n", "\\newcommand{\\expectationDist}[2]{\\left\\langle #1 \\right\\rangle _{#2}}\n", "\\newcommand{\\expectedDistanceMatrix}{\\mathcal{D}}\n", "\\newcommand{\\eye}{\\mathbf{I}}\n", "\\newcommand{\\fantasyDim}{r}\n", "\\newcommand{\\fantasyMatrix}{\\mathbf{ \\MakeUppercase{\\fantasyScalar}}}\n", "\\newcommand{\\fantasyScalar}{z}\n", "\\newcommand{\\fantasyVector}{\\mathbf{ \\fantasyScalar}}\n", "\\newcommand{\\featureStd}{\\varsigma}\n", "\\newcommand{\\gammaCdf}[3]{\\mathcal{GAMMA CDF}\\left(#1|#2,#3\\right)}\n", "\\newcommand{\\gammaDist}[3]{\\mathcal{G}\\left(#1|#2,#3\\right)}\n", "\\newcommand{\\gammaSamp}[2]{\\mathcal{G}\\left(#1,#2\\right)}\n", "\\newcommand{\\gaussianDist}[3]{\\mathcal{N}\\left(#1|#2,#3\\right)}\n", "\\newcommand{\\gaussianSamp}[2]{\\mathcal{N}\\left(#1,#2\\right)}\n", "\\newcommand{\\given}{|}\n", "\\newcommand{\\half}{\\frac{1}{2}}\n", "\\newcommand{\\heaviside}{H}\n", "\\newcommand{\\hiddenMatrix}{\\mathbf{ \\MakeUppercase{\\hiddenScalar}}}\n", "\\newcommand{\\hiddenScalar}{h}\n", "\\newcommand{\\hiddenVector}{\\mathbf{ \\hiddenScalar}}\n", 
"\\newcommand{\\identityMatrix}{\\eye}\n", "\\newcommand{\\inducingInputScalar}{z}\n", "\\newcommand{\\inducingInputVector}{\\mathbf{ \\inducingInputScalar}}\n", "\\newcommand{\\inducingInputMatrix}{\\mathbf{Z}}\n", "\\newcommand{\\inducingScalar}{u}\n", "\\newcommand{\\inducingVector}{\\mathbf{ \\inducingScalar}}\n", "\\newcommand{\\inducingMatrix}{\\mathbf{U}}\n", "\\newcommand{\\inlineDiff}[2]{\\text{d}#1/\\text{d}#2}\n", "\\newcommand{\\inputDim}{q}\n", "\\newcommand{\\inputMatrix}{\\mathbf{X}}\n", "\\newcommand{\\inputScalar}{x}\n", "\\newcommand{\\inputSpace}{\\mathcal{X}}\n", "\\newcommand{\\inputVals}{\\inputVector}\n", "\\newcommand{\\inputVector}{\\mathbf{ \\inputScalar}}\n", "\\newcommand{\\iterNum}{k}\n", "\\newcommand{\\kernel}{\\kernelScalar}\n", "\\newcommand{\\kernelMatrix}{\\mathbf{K}}\n", "\\newcommand{\\kernelScalar}{k}\n", "\\newcommand{\\kernelVector}{\\mathbf{ \\kernelScalar}}\n", "\\newcommand{\\kff}{\\kernelScalar_{\\mappingFunction \\mappingFunction}}\n", "\\newcommand{\\kfu}{\\kernelVector_{\\mappingFunction \\inducingScalar}}\n", "\\newcommand{\\kuf}{\\kernelVector_{\\inducingScalar \\mappingFunction}}\n", "\\newcommand{\\kuu}{\\kernelVector_{\\inducingScalar \\inducingScalar}}\n", "\\newcommand{\\lagrangeMultiplier}{\\lambda}\n", "\\newcommand{\\lagrangeMultiplierMatrix}{\\boldsymbol{ \\Lambda}}\n", "\\newcommand{\\lagrangian}{L}\n", "\\newcommand{\\laplacianFactor}{\\mathbf{ \\MakeUppercase{\\laplacianFactorScalar}}}\n", "\\newcommand{\\laplacianFactorScalar}{m}\n", "\\newcommand{\\laplacianFactorVector}{\\mathbf{ \\laplacianFactorScalar}}\n", "\\newcommand{\\laplacianMatrix}{\\mathbf{L}}\n", "\\newcommand{\\laplacianScalar}{\\ell}\n", "\\newcommand{\\laplacianVector}{\\mathbf{ \\ell}}\n", "\\newcommand{\\latentDim}{q}\n", "\\newcommand{\\latentDistanceMatrix}{\\boldsymbol{ \\Delta}}\n", "\\newcommand{\\latentDistanceScalar}{\\delta}\n", "\\newcommand{\\latentDistanceVector}{\\boldsymbol{ \\delta}}\n", "\\newcommand{\\latentForce}{f}\n", "\\newcommand{\\latentFunction}{u}\n", "\\newcommand{\\latentFunctionVector}{\\mathbf{ \\latentFunction}}\n", "\\newcommand{\\latentFunctionMatrix}{\\mathbf{ \\MakeUppercase{\\latentFunction}}}\n", "\\newcommand{\\latentIndex}{j}\n", "\\newcommand{\\latentScalar}{z}\n", "\\newcommand{\\latentVector}{\\mathbf{ \\latentScalar}}\n", "\\newcommand{\\latentMatrix}{\\mathbf{Z}}\n", "\\newcommand{\\learnRate}{\\eta}\n", "\\newcommand{\\lengthScale}{\\ell}\n", "\\newcommand{\\rbfWidth}{\\ell}\n", "\\newcommand{\\likelihoodBound}{\\mathcal{L}}\n", "\\newcommand{\\likelihoodFunction}{L}\n", "\\newcommand{\\locationScalar}{\\mu}\n", "\\newcommand{\\locationVector}{\\boldsymbol{ \\locationScalar}}\n", "\\newcommand{\\locationMatrix}{\\mathbf{M}}\n", "\\newcommand{\\variance}[1]{\\text{var}\\left( #1 \\right)}\n", "\\newcommand{\\mappingFunction}{f}\n", "\\newcommand{\\mappingFunctionMatrix}{\\mathbf{F}}\n", "\\newcommand{\\mappingFunctionTwo}{g}\n", "\\newcommand{\\mappingFunctionTwoMatrix}{\\mathbf{G}}\n", "\\newcommand{\\mappingFunctionTwoVector}{\\mathbf{ \\mappingFunctionTwo}}\n", "\\newcommand{\\mappingFunctionVector}{\\mathbf{ \\mappingFunction}}\n", "\\newcommand{\\scaleScalar}{s}\n", "\\newcommand{\\mappingScalar}{w}\n", "\\newcommand{\\mappingVector}{\\mathbf{ \\mappingScalar}}\n", "\\newcommand{\\mappingMatrix}{\\mathbf{W}}\n", "\\newcommand{\\mappingScalarTwo}{v}\n", "\\newcommand{\\mappingVectorTwo}{\\mathbf{ \\mappingScalarTwo}}\n", "\\newcommand{\\mappingMatrixTwo}{\\mathbf{V}}\n", "\\newcommand{\\maxIters}{K}\n", 
"\\newcommand{\\meanMatrix}{\\mathbf{M}}\n", "\\newcommand{\\meanScalar}{\\mu}\n", "\\newcommand{\\meanTwoMatrix}{\\mathbf{M}}\n", "\\newcommand{\\meanTwoScalar}{m}\n", "\\newcommand{\\meanTwoVector}{\\mathbf{ \\meanTwoScalar}}\n", "\\newcommand{\\meanVector}{\\boldsymbol{ \\meanScalar}}\n", "\\newcommand{\\mrnaConcentration}{m}\n", "\\newcommand{\\naturalFrequency}{\\omega}\n", "\\newcommand{\\neighborhood}[1]{\\mathcal{N}\\left( #1 \\right)}\n", "\\newcommand{\\neilurl}{http://inverseprobability.com/}\n", "\\newcommand{\\noiseMatrix}{\\boldsymbol{ E}}\n", "\\newcommand{\\noiseScalar}{\\epsilon}\n", "\\newcommand{\\noiseVector}{\\boldsymbol{ \\epsilon}}\n", "\\newcommand{\\norm}[1]{\\left\\Vert #1 \\right\\Vert}\n", "\\newcommand{\\normalizedLaplacianMatrix}{\\hat{\\mathbf{L}}}\n", "\\newcommand{\\normalizedLaplacianScalar}{\\hat{\\ell}}\n", "\\newcommand{\\normalizedLaplacianVector}{\\hat{\\mathbf{ \\ell}}}\n", "\\newcommand{\\numActive}{m}\n", "\\newcommand{\\numBasisFunc}{m}\n", "\\newcommand{\\numComponents}{m}\n", "\\newcommand{\\numComps}{K}\n", "\\newcommand{\\numData}{n}\n", "\\newcommand{\\numFeatures}{K}\n", "\\newcommand{\\numHidden}{h}\n", "\\newcommand{\\numInducing}{m}\n", "\\newcommand{\\numLayers}{\\ell}\n", "\\newcommand{\\numNeighbors}{K}\n", "\\newcommand{\\numSequences}{s}\n", "\\newcommand{\\numSuccess}{s}\n", "\\newcommand{\\numTasks}{m}\n", "\\newcommand{\\numTime}{T}\n", "\\newcommand{\\numTrials}{S}\n", "\\newcommand{\\outputIndex}{j}\n", "\\newcommand{\\paramVector}{\\boldsymbol{ \\theta}}\n", "\\newcommand{\\parameterMatrix}{\\boldsymbol{ \\Theta}}\n", "\\newcommand{\\parameterScalar}{\\theta}\n", "\\newcommand{\\parameterVector}{\\boldsymbol{ \\parameterScalar}}\n", "\\newcommand{\\partDiff}[2]{\\frac{\\partial#1}{\\partial#2}}\n", "\\newcommand{\\precisionScalar}{j}\n", "\\newcommand{\\precisionVector}{\\mathbf{ \\precisionScalar}}\n", "\\newcommand{\\precisionMatrix}{\\mathbf{J}}\n", "\\newcommand{\\pseudotargetScalar}{\\widetilde{y}}\n", "\\newcommand{\\pseudotargetVector}{\\mathbf{ \\pseudotargetScalar}}\n", "\\newcommand{\\pseudotargetMatrix}{\\mathbf{ \\widetilde{Y}}}\n", "\\newcommand{\\rank}[1]{\\text{rank}\\left(#1\\right)}\n", "\\newcommand{\\rayleighDist}[2]{\\mathcal{R}\\left(#1|#2\\right)}\n", "\\newcommand{\\rayleighSamp}[1]{\\mathcal{R}\\left(#1\\right)}\n", "\\newcommand{\\responsibility}{r}\n", "\\newcommand{\\rotationScalar}{r}\n", "\\newcommand{\\rotationVector}{\\mathbf{ \\rotationScalar}}\n", "\\newcommand{\\rotationMatrix}{\\mathbf{R}}\n", "\\newcommand{\\sampleCovScalar}{s}\n", "\\newcommand{\\sampleCovVector}{\\mathbf{ \\sampleCovScalar}}\n", "\\newcommand{\\sampleCovMatrix}{\\mathbf{s}}\n", "\\newcommand{\\scalarProduct}[2]{\\left\\langle{#1},{#2}\\right\\rangle}\n", "\\newcommand{\\sign}[1]{\\text{sign}\\left(#1\\right)}\n", "\\newcommand{\\sigmoid}[1]{\\sigma\\left(#1\\right)}\n", "\\newcommand{\\singularvalue}{\\ell}\n", "\\newcommand{\\singularvalueMatrix}{\\mathbf{L}}\n", "\\newcommand{\\singularvalueVector}{\\mathbf{l}}\n", "\\newcommand{\\sorth}{\\mathbf{u}}\n", "\\newcommand{\\spar}{\\lambda}\n", "\\newcommand{\\trace}[1]{\\text{tr}\\left(#1\\right)}\n", "\\newcommand{\\BasalRate}{B}\n", "\\newcommand{\\DampingCoefficient}{C}\n", "\\newcommand{\\DecayRate}{D}\n", "\\newcommand{\\Displacement}{X}\n", "\\newcommand{\\LatentForce}{F}\n", "\\newcommand{\\Mass}{M}\n", "\\newcommand{\\Sensitivity}{S}\n", "\\newcommand{\\basalRate}{b}\n", "\\newcommand{\\dampingCoefficient}{c}\n", "\\newcommand{\\mass}{m}\n", 
"\\newcommand{\\sensitivity}{s}\n", "\\newcommand{\\springScalar}{\\kappa}\n", "\\newcommand{\\springVector}{\\boldsymbol{ \\kappa}}\n", "\\newcommand{\\springMatrix}{\\boldsymbol{ \\mathcal{K}}}\n", "\\newcommand{\\tfConcentration}{p}\n", "\\newcommand{\\tfDecayRate}{\\delta}\n", "\\newcommand{\\tfMrnaConcentration}{f}\n", "\\newcommand{\\tfVector}{\\mathbf{ \\tfConcentration}}\n", "\\newcommand{\\velocity}{v}\n", "\\newcommand{\\sufficientStatsScalar}{g}\n", "\\newcommand{\\sufficientStatsVector}{\\mathbf{ \\sufficientStatsScalar}}\n", "\\newcommand{\\sufficientStatsMatrix}{\\mathbf{G}}\n", "\\newcommand{\\switchScalar}{s}\n", "\\newcommand{\\switchVector}{\\mathbf{ \\switchScalar}}\n", "\\newcommand{\\switchMatrix}{\\mathbf{S}}\n", "\\newcommand{\\tr}[1]{\\text{tr}\\left(#1\\right)}\n", "\\newcommand{\\loneNorm}[1]{\\left\\Vert #1 \\right\\Vert_1}\n", "\\newcommand{\\ltwoNorm}[1]{\\left\\Vert #1 \\right\\Vert_2}\n", "\\newcommand{\\onenorm}[1]{\\left\\vert#1\\right\\vert_1}\n", "\\newcommand{\\twonorm}[1]{\\left\\Vert #1 \\right\\Vert}\n", "\\newcommand{\\vScalar}{v}\n", "\\newcommand{\\vVector}{\\mathbf{v}}\n", "\\newcommand{\\vMatrix}{\\mathbf{V}}\n", "\\newcommand{\\varianceDist}[2]{\\text{var}_{#2}\\left( #1 \\right)}\n", "% Already defined by latex\n", "%\\newcommand{\\vec}{#1:}\n", "\\newcommand{\\vecb}[1]{\\left(#1\\right):}\n", "\\newcommand{\\weightScalar}{w}\n", "\\newcommand{\\weightVector}{\\mathbf{ \\weightScalar}}\n", "\\newcommand{\\weightMatrix}{\\mathbf{W}}\n", "\\newcommand{\\weightedAdjacencyMatrix}{\\mathbf{A}}\n", "\\newcommand{\\weightedAdjacencyScalar}{a}\n", "\\newcommand{\\weightedAdjacencyVector}{\\mathbf{ \\weightedAdjacencyScalar}}\n", "\\newcommand{\\onesVector}{\\mathbf{1}}\n", "\\newcommand{\\zerosVector}{\\mathbf{0}}\n", "$$\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
\n", "\n", "\n", "\n", "
\n", "\n", "@Rasmussen:book06 is still one of the most important references on\n", "Gaussian process models. It is available freely online.\n", "\n", "## What is Machine Learning?\n", "\n", "What is machine learning? At its most basic level machine learning is a\n", "combination of\n", "\n", "$$\\text{data} + \\text{model} \\xrightarrow{\\text{compute}} \\text{prediction}$$\n", "\n", "where *data* is our observations. They can be actively or passively\n", "acquired (meta-data). The *model* contains our assumptions, based on\n", "previous experience. That experience can be other data, it can come from\n", "transfer learning, or it can merely be our beliefs about the\n", "regularities of the universe. In humans our models include our inductive\n", "biases. The *prediction* is an action to be taken or a categorization or\n", "a quality score. The reason that machine learning has become a mainstay\n", "of artificial intelligence is the importance of predictions in\n", "artificial intelligence. The data and the model are combined through\n", "computation.\n", "\n", "In practice we normally perform machine learning using two functions. To\n", "combine data with a model we typically make use of:\n", "\n", "**a prediction function** a function which is used to make the\n", "predictions. It includes our beliefs about the regularities of the\n", "universe, our assumptions about how the world works, e.g. smoothness,\n", "spatial similarities, temporal similarities.\n", "\n", "**an objective function** a function which defines the cost of\n", "misprediction. Typically it includes knowledge about the world's\n", "generating processes (probabilistic objectives) or the costs we pay for\n", "mispredictions (empiricial risk minimization).\n", "\n", "The combination of data and model through the prediction function and\n", "the objectie function leads to a *learning algorithm*. The class of\n", "prediction functions and objective functions we can make use of is\n", "restricted by the algorithms they lead to. If the prediction function or\n", "the objective function are too complex, then it can be difficult to find\n", "an appropriate learning algorithm. Much of the acdemic field of machine\n", "learning is the quest for new learning algorithms that allow us to bring\n", "different types of models and data together.\n", "\n", "A useful reference for state of the art in machine learning is the UK\n", "Royal Society Report, [Machine Learning: Power and Promise of Computers\n", "that Learn by\n", "Example](https://royalsociety.org/~/media/policy/projects/machine-learning/publications/machine-learning-report.pdf).\n", "\n", "You can also check my blog post on [\"What is Machine\n", "Learning?\"](http://inverseprobability.com/2017/07/17/what-is-machine-learning)\n", "\n", "### Olympic Marathon Data\n", "\n", "The first thing we will do is load a standard data set for regression\n", "modelling. The data consists of the pace of Olympic Gold Medal Marathon\n", "winners for the Olympics from 1896 to present. First we load in the data\n", "and plot.\n", "\n", "### Olympic Marathon Data\n", "\n", "\n", "\n", "\n", "\n", "\n", "
\n", "- Gold medal times for Olympic Marathon since 1896.\n", "\n", "- Marathons before 1924 didn’t have a standardised distance.\n", "\n", "- Present results using pace per km.\n", "\n", "- In 1904 Marathon was badly organised leading to very slow times.\n", "\n", "\n", "
\n", "\n", "\n", "\n", "
\n", "\n", "Image from Wikimedia Commons \n", "
\n", "
\n", "\n", "\n", "\n", "
\n", "\n", "Things to notice about the data include the outlier in 1904, in this\n", "year, the olympics was in St Louis, USA. Organizational problems and\n", "challenges with dust kicked up by the cars following the race meant that\n", "participants got lost, and only very few participants completed.\n", "\n", "More recent years see more consistently quick marathons.\n", "\n", "### Overdetermined System\n", "\n", "The challenge with a linear model is that it has two unknowns, $m$, and\n", "$c$. Observing data allows us to write down a system of simultaneous\n", "linear equations. So, for example if we observe two data points, the\n", "first with the input value, $\\inputScalar_1 = 1$ and the output value,\n", "$\\dataScalar_1 =3$ and a second data point, $\\inputScalar = 3$,\n", "$\\dataScalar=1$, then we can write two simultaneous linear equations of\n", "the form.\n", "\n", "point 1: $\\inputScalar = 1$, $\\dataScalar=3$ $$3 = m + c$$ point 2:\n", "$\\inputScalar = 3$, $\\dataScalar=1$ $$1 = 3m + c$$\n", "\n", "The solution to these two simultaneous equations can be represented\n", "graphically as\n", "\n", "\n", "
\n", "The solution of two linear equations represented as the fit of a\n", "straight line through two data\n", "
\n", "The challenge comes when a third data point is observed and it doesn't\n", "naturally fit on the straight line.\n", "\n", "point 3: $\\inputScalar = 2$, $\\dataScalar=2.5$ $$2.5 = 2m + c$$\n", "\n", "\n", "
\n", "A third observation of data is inconsistent with the solution\n", "dictated by the first two observations\n", "
\n", "Now there are three candidate lines, each consistent with our data.\n", "\n", "\n", "
\n", "Three solutions to the problem, each consistent with two points of\n", "the three observations\n", "
\n", "This is known as an *overdetermined* system because there are more data\n", "than we need to determine our parameters. The problem arises because the\n", "model is a simplification of the real world, and the data we observe is\n", "therefore inconsistent with our model." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pods.notebook.display_plots('over_determined_system{samp:0>3}.svg',\n", " directory='../slides/diagrams/ml', \n", " samp=IntSlider(1,1,7,1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The solution was proposed by Pierre-Simon Laplace. His idea was to\n", "accept that the model was an incomplete representation of the real\n", "world, and the manner in which it was incomplete is *unknown*. His idea\n", "was that such unknowns could be dealt with through probability.\n", "\n", "
\n", "\n", "\n", "\n", "
\n", "\n", "
\n", "Pierre Simon Laplace\n", "
\n", " import pods\n", " pods.notebook.display_google_book(id='1YQPAAAAQAAJ', page='PR17-IA2')\n", "\n", "Famously, Laplace considered the idea of a deterministic Universe, one\n", "in which the model is *known*, or as the below translation refers to it,\n", "\"an intelligence which could comprehend all the forces by which nature\n", "is animated\". He speculates on an \"intelligence\" that can submit this\n", "vast data to analysis and propsoses that such an entity would be able to\n", "predict the future.\n", "\n", "> Given for one instant an intelligence which could comprehend all the\n", "> forces by which nature is animated and the respective situation of the\n", "> beings who compose it---an intelligence sufficiently vast to submit\n", "> these data to analysis---it would embrace in the same formulate the\n", "> movements of the greatest bodies of the universe and those of the\n", "> lightest atom; for it, nothing would be uncertain and the future, as\n", "> the past, would be present in its eyes.\n", "\n", "This notion is known as *Laplace's demon* or *Laplace's superman*.\n", "\n", "
\n", "\n", "\n", "\n", "
\n", "\n", "Unfortunately, most analyses of his ideas stop at that point, whereas\n", "his real point is that such a notion is unreachable. Not so much\n", "*superman* as *strawman*. Just three pages later in the \"Philosophical\n", "Essay on Probabilities\" [@Laplace:essai14], Laplace goes on to observe:\n", "\n", "> The curve described by a simple molecule of air or vapor is regulated\n", "> in a manner just as certain as the planetary orbits; the only\n", "> difference between them is that which comes from our ignorance.\n", ">\n", "> Probability is relative, in part to this ignorance, in part to our\n", "> knowledge." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pods\n", "pods.notebook.display_google_book(id='1YQPAAAAQAAJ', page='PR17-IA4')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "\n", "\n", "
\n", "\n", "In other words, we can never make use of the idealistic deterministc\n", "Universe due to our ignorance about the world, Laplace's suggestion, and\n", "focus in this essay is that we turn to probability to deal with this\n", "uncertainty. This is also our inspiration for using probabilit in\n", "machine learning.\n", "\n", "The \"forces by which nature is animated\" is our *model*, the \"situation\n", "of beings that compose it\" is our *data* and the \"intelligence\n", "sufficiently vast enough to submit these data to analysis\" is our\n", "compute. The fly in the ointment is our *ignorance* about these aspects.\n", "And *probability* is the tool we use to incorporate this ignorance\n", "leading to uncertainty or *doubt* in our predictions.\n", "\n", "Laplace's concept was that the reason that the data doesn't match up to\n", "the model is because of unconsidered factors, and that these might be\n", "well represented through probability densities. He tackles the challenge\n", "of the unknown factors by adding a variable, $\\noiseScalar$, that\n", "represents the unknown. In modern parlance we would call this a *latent*\n", "variable. But in the context Laplace uses it, the variable is so common\n", "that it has other names such as a \"slack\" variable or the *noise* in the\n", "system.\n", "\n", "point 1: $\\inputScalar = 1$, $\\dataScalar=3$ $$\n", "3 = m + c + \\noiseScalar_1\n", "$$ point 2: $\\inputScalar = 3$, $\\dataScalar=1$ $$\n", "1 = 3m + c + \\noiseScalar_2\n", "$$ point 3: $\\inputScalar = 2$, $\\dataScalar=2.5$ $$\n", "2.5 = 2m + c + \\noiseScalar_3\n", "$$\n", "\n", "Laplace's trick has converted the *overdetermined* system into an\n", "*underdetermined* system. He has now added three variables,\n", "$\\{\\noiseScalar_i\\}_{i=1}^3$, which represent the unknown corruptions of\n", "the real world. Laplace's idea is that we should represent that unknown\n", "corruption with a *probability distribution*.\n", "\n", "### A Probabilistic Process\n", "\n", "However, it was left to an admirer of Gauss to develop a practical\n", "probability density for that purpose. It was Carl Friederich Gauss who\n", "suggested that the *Gaussian* density (which at the time was unnamed!)\n", "should be used to represent this error.\n", "\n", "The result is a *noisy* function, a function which has a deterministic\n", "part, and a stochastic part. This type of function is sometimes known as\n", "a probabilistic or stochastic process, to distinguish it from a\n", "deterministic process.\n", "\n", "### The Gaussian Density\n", "\n", "The Gaussian density is perhaps the most commonly used probability\n", "density. It is defined by a *mean*, $\\meanScalar$, and a *variance*,\n", "$\\dataStd^2$. The variance is taken to be the square of the *standard\n", "deviation*, $\\dataStd$.\n", "\n", "$$\\begin{align}\n", " p(\\dataScalar| \\meanScalar, \\dataStd^2) & = \\frac{1}{\\sqrt{2\\pi\\dataStd^2}}\\exp\\left(-\\frac{(\\dataScalar - \\meanScalar)^2}{2\\dataStd^2}\\right)\\\\& \\buildrel\\triangle\\over = \\gaussianDist{\\dataScalar}{\\meanScalar}{\\dataStd^2}\n", " \\end{align}$$\n", "\n", "\n", "\n", "
\n", "The Gaussian PDF with ${\\meanScalar}=1.7$ and variance\n", "${\\dataStd}^2=0.0225$. Mean shown as red line. It could represent the\n", "heights of a population of students.\n", "
\n", "### Two Important Gaussian Properties\n", "\n", "The Gaussian density has many important properties, but for the moment\n", "we'll review two of them.\n", "\n", "### Sum of Gaussians\n", "\n", "If we assume that a variable, $\\dataScalar_i$, is sampled from a\n", "Gaussian density,\n", "\n", "$$\\dataScalar_i \\sim \\gaussianSamp{\\meanScalar_i}{\\sigma_i^2}$$\n", "\n", "Then we can show that the sum of a set of variables, each drawn\n", "independently from such a density is also distributed as Gaussian. The\n", "mean of the resulting density is the sum of the means, and the variance\n", "is the sum of the variances,\n", "\n", "$$\\sum_{i=1}^{\\numData} \\dataScalar_i \\sim \\gaussianSamp{\\sum_{i=1}^\\numData \\meanScalar_i}{\\sum_{i=1}^\\numData \\sigma_i^2}$$\n", "\n", "Since we are very familiar with the Gaussian density and its properties,\n", "it is not immediately apparent how unusual this is. Most random\n", "variables, when you add them together, change the family of density they\n", "are drawn from. For example, the Gaussian is exceptional in this regard.\n", "Indeed, other random variables, if they are independently drawn and\n", "summed together tend to a Gaussian density. That is the [*central limit\n", "theorem*](https://en.wikipedia.org/wiki/Central_limit_theorem) which is\n", "a major justification for the use of a Gaussian density.\n", "\n", "### Scaling a Gaussian\n", "\n", "Less unusual is the *scaling* property of a Gaussian density. If a\n", "variable, $\\dataScalar$, is sampled from a Gaussian density,\n", "\n", "$$\\dataScalar \\sim \\gaussianSamp{\\meanScalar}{\\sigma^2}$$ and we choose\n", "to scale that variable by a *deterministic* value, $\\mappingScalar$,\n", "then the *scaled variable* is distributed as\n", "\n", "$$\\mappingScalar \\dataScalar \\sim \\gaussianSamp{\\mappingScalar\\meanScalar}{\\mappingScalar^2 \\sigma^2}.$$\n", "Unlike the summing properties, where adding two or more random variables\n", "independently sampled from a family of densitites typically brings the\n", "summed variable *outside* that family, scaling many densities leaves the\n", "distribution of that variable in the same *family* of densities. Indeed,\n", "many densities include a *scale* parameter (e.g. the [Gamma\n", "density](https://en.wikipedia.org/wiki/Gamma_distribution)) which is\n", "purely for this purpose. In the Gaussian the standard deviation,\n", "$\\dataStd$, is the scale parameter. To see why this makes sense, let's\n", "consider, $$z \\sim \\gaussianSamp{0}{1},$$ then if we scale by $\\dataStd$\n", "so we have, $\\dataScalar=\\dataStd z$, we can write,\n", "$$\\dataScalar =\\dataStd z \\sim \\gaussianSamp{0}{\\dataStd^2}$$\n", "\n", "### Regression Examples\n", "\n", "Regression involves predicting a real value, $\\dataScalar_i$, given an\n", "input vector, $\\inputVector_i$. For example, the Tecator data involves\n", "predicting the quality of meat given spectral measurements. Or in\n", "radiocarbon dating, the C14 calibration curve maps from radiocarbon age\n", "to age measured through a back-trace of tree rings. Regression has also\n", "been used to predict the quality of board game moves given expert rated\n", "training data.\n", "\n", "## Underdetermined System\n", "\n", "What about the situation where you have more parameters than data in\n", "your simultaneous equation? This is known as an *underdetermined*\n", "system. 
In fact, this setup is in some sense *easier* to solve, because\n", "we don't need to think about introducing a slack variable (although it\n", "might make a lot of sense from a *modelling* perspective to do so).\n", "\n", "The way Laplace proposed resolving an overdetermined system was to\n", "introduce slack variables, $\noiseScalar_i$, which needed to be\n", "estimated for each point. The slack variable represented the difference\n", "between our actual prediction and the true observation. This is known as\n", "the *residual*. By introducing the slack variable we now have an\n", "additional $\numData$ variables to estimate, one for each data point,\n", "$\{\noiseScalar_i\}$. This actually turns the overdetermined system into\n", "an underdetermined system. Introduction of $\numData$ variables, plus the\n", "original $m$ and $c$, gives us $\numData+2$ parameters to be estimated\n", "from $\numData$ observations, which makes the system\n", "*underdetermined*. However, we then made a probabilistic assumption\n", "about the slack variables: we assumed that the slack variables were\n", "distributed according to a probability density. And for the moment we\n", "have been assuming that density was the Gaussian,\n", "$$\noiseScalar_i \sim \gaussianSamp{0}{\dataStd^2},$$ with zero mean and\n", "variance $\dataStd^2$.\n", "\n", "The follow-up question is whether we can do the same thing with the\n", "parameters. If we have two parameters and only one observation, can we\n", "place a probability distribution over the parameters, as we did with the\n", "slack variables? The answer is yes.\n", "\n", "### Underdetermined System" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pods\n", "from ipywidgets import IntSlider" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pods.notebook.display_plots('under_determined_system{samp:0>3}.svg', \n", "                            directory='../slides/diagrams/ml', samp=IntSlider(0, 0, 10, 1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Fit underdetermined system by considering uncertainty\n", "
\n", "Classically, there are two types of uncertainty that we consider. The\n", "first is known as *aleatoric* uncertainty. This is uncertainty we\n", "couldn't resolve even if we wanted to. An example, would be the result\n", "of a football match before it's played, or where a sheet of paper lands\n", "on the floor.\n", "\n", "The second is known as *epistemic* uncertainty. This is uncertainty that\n", "we could, in principle, resolve. We just haven't yet made the\n", "observation. For example, the result of a football match *after* it is\n", "played, or the color of socks that a lecturer is wearing.\n", "\n", "Note, that there isn't a clean difference between the two. It is\n", "arguable, that if we knew enough about a football match, or the physics\n", "of a falling sheet of paper then we might be able to resolve the\n", "uncertainty. The reason we can't is because *chaotic* behaviour means\n", "that a very small change in any of the initial conditions we would need\n", "to resolve can have a large change in downstream effects. By this\n", "argument, the only truly aleatoric uncertainty might be quantum\n", "uncertainty. However, in practice the distinction is often applied.\n", "\n", "In classical statistics, the frequentist approach only treats\n", "*aleatoric* uncertainty with probability. The key philosophical\n", "difference in the *Bayesian* approach is to treat any unknowns through\n", "probability. This approach was formally justified seperately by\n", "@Cox:probability46 and @deFinetti:prevision37.\n", "\n", "The term Bayesian was a mocking term promoted by Fisher, it comes from\n", "the use, by Bayes, of a billiard table formulation to justify the\n", "Bernoulli distribution. Bayes considers a ball landing uniform at random\n", "between two sides of a billiard table. He then considers the outcome of\n", "the Bernoulli as being whether a second ball comes to rest to the right\n", "or left of the original. In this way, the parameter of his Bernoulli\n", "distribution is a *stochastic variable* (the uncertainty in the\n", "parameter is aleatoric). In contrast, when Bernoulli formulates the\n", "distribution he considers a bag of red and black balls. The parameter of\n", "his Bernoulli is the ratio of red balls to total balls, a deterministic\n", "variable.\n", "\n", "Note how this relates to Laplace's demon. Laplace describes the\n", "deterministic universe (\"... for it nothing would be uncertain and the\n", "future, as the past, would be present in its eyes\"), but acknowledges\n", "the impossibility of achieving this in practice, (\" ... the curve\n", "described by a simple molecule of air or vapor is regulated in a manner\n", "just as certain as the planetary orbits; the only difference between\n", "them is that which comes from our ignorance. *Probability* is relative\n", "in part to this ignorance, in part to our knowledge ...)\n", "\n", "### Prior Distribution\n", "\n", "The tradition in Bayesian inference is to place a probability density\n", "over the parameters of interest in your model. This choice is made\n", "regardless of whether you generally believe those parameters to be\n", "stochastic or deterministic in origin. In other words, to a Bayesian,\n", "the modelling treatment does not differentiate between epistemic and\n", "aleatoric uncertainty. 
For linear regression we could consider the\n", "following Gaussian prior on the intercept parameter,\n", "$$c \sim \gaussianSamp{0}{\alpha_1}$$ where $\alpha_1$ is the variance\n", "of the prior distribution, its mean being zero.\n", "\n", "### Posterior Distribution\n", "\n", "The prior distribution is combined with the likelihood of the data given\n", "the parameters $p(\dataScalar|c)$ to give the posterior via *Bayes'\n", "rule*, $$\n", "  p(c|\dataScalar) = \frac{p(\dataScalar|c)p(c)}{p(\dataScalar)}\n", "  $$ where $p(\dataScalar)$ is the marginal probability of the data,\n", "obtained through integration over the joint density,\n", "$p(\dataScalar, c)=p(\dataScalar|c)p(c)$. Overall the equation can be\n", "summarized as, $$\n", "  \text{posterior} = \frac{\text{likelihood}\times \text{prior}}{\text{marginal likelihood}}.\n", "  $$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from ipywidgets import IntSlider\n", "import pods" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pods.notebook.display_plots('dem_gaussian{stage:0>2}.svg', \n", "                            directory='../slides/diagrams/ml', \n", "                            stage=IntSlider(1, 1, 3, 1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "
\n", "Combining a Gaussian likelihood with a Gaussian prior to form a\n", "Gaussian posterior\n", "
\n", "Another way of seeing what's going on is to note that the numerator of\n", "Bayes' rule merely multiplies the likelihood by the prior. The\n", "denominator, is not a function of $c$. So the functional form is\n", "entirely determined by the multiplication of prior and likelihood. This\n", "has the effect of ensuring that the posterior only has probability mass\n", "in regions where both the prior and the likelihood have probability\n", "mass.\n", "\n", "The marginal likelihood, $p(\\dataScalar)$, operates to ensure that the\n", "distribution is normalised.\n", "\n", "For the Gaussian case, the normalisation of the posterior can be\n", "performed analytically. This is because both the prior and the\n", "likelihood have the form of an *exponentiated quadratic*, $$\n", "\\exp(a^2)\\exp(b^2) = \\exp(a^2 + b^2),\n", "$$ and the properties of the exponential mean that the product of two\n", "exponentiated quadratics is also an exponentiated quadratic. That\n", "implies that the posterior is also Gaussian, because a normalized\n", "exponentiated quadratic is a Gaussian distribution.[^1]\n", "\n", "For general Bayesian inference, over more than one parameter, we need\n", "*multivariate priors*. For example, consider the multivariate linear\n", "regression where an observation, $\\dataScalar_i$ is related to a vector\n", "of features, $\\inputVector_{i, :}$, through a vector of parameters,\n", "$\\weightVector$,\n", "$$\\dataScalar_i = \\sum_j \\weightScalar_j \\inputScalar_{i, j} + \\noiseScalar_i,$$\n", "or in vector notation,\n", "$$\\dataScalar_i = \\weightVector^\\top \\inputVector_{i, :} + \\noiseScalar_i.$$\n", "Here we've dropped the intercpet for convenience, it can be reintroduced\n", "by augmenting the feature vector, $\\inputVector_{i, :}$, with a constant\n", "valued feature.\n", "\n", "This motivates the need for a *multivariate* Gaussian density.\n", "\n", "### Multivariate Regression Likelihood\n", "\n", "- Noise corrupted data point\n", " $$\\dataScalar_i = \\weightVector^\\top \\inputVector_{i, :} + {\\noiseScalar}_i$$\n", "\n", ". . .\n", "\n", "- Multivariate regression likelihood:\n", " $$p(\\dataVector| \\inputMatrix, \\weightVector) = \\frac{1}{\\left(2\\pi {\\dataStd}^2\\right)^{\\numData/2}} \\exp\\left(-\\frac{1}{2{\\dataStd}^2}\\sum_{i=1}^{\\numData}\\left(\\dataScalar_i - \\weightVector^\\top \\inputVector_{i, :}\\right)^2\\right)$$\n", "\n", ". . .\n", "\n", "- Now use a *multivariate* Gaussian prior:\n", " $$p(\\weightVector) = \\frac{1}{\\left(2\\pi \\alpha\\right)^\\frac{\\dataDim}{2}} \\exp \\left(-\\frac{1}{2\\alpha} \\weightVector^\\top \\weightVector\\right)$$\n", "\n", "### Two Dimensional Gaussian\n", "\n", "Consider the distribution of height (in meters) of an adult male human\n", "population. We will approximate the marginal density of heights as a\n", "Gaussian density with mean given by $1.7\\text{m}$ and a standard\n", "deviation of $0.15\\text{m}$, implying a variance of $\\dataStd^2=0.0225$,\n", "$$\n", " p(h) \\sim \\gaussianSamp{1.7}{0.0225}.\n", " $$ Similarly, we assume that weights of the population are distributed\n", "a Gaussian density with a mean of $75 \\text{kg}$ and a standard\n", "deviation of $6 kg$ (implying a variance of 36), $$\n", " p(w) \\sim \\gaussianSamp{75}{36}.\n", " $$\n", "\n", "\n", "\n", "
\n", "Gaussian distributions for height and weight.\n", "
\n", "### Independence Assumption\n", "\n", "First of all, we make an independence assumption, we assume that height\n", "and weight are independent. The definition of probabilistic independence\n", "is that the joint density, $p(w, h)$, factorizes into its marginal\n", "densities, $$\n", " p(w, h) = p(w)p(h).\n", " $$ Given this assumption we can sample from the joint distribution by\n", "independently sampling weights and heights." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pods\n", "from ipywidgets import IntSlider" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pods.notebook.display_plots('independent_height_weight{fig:0>3}.svg', \n", " directory='../slides/diagrams/ml', \n", " fig=IntSlider(0, 0, 7, 1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Samples from independent Gaussian variables that might represent\n", "heights and weights.\n", "
\n", "In reality height and weight are *not* independent. Taller people tend\n", "on average to be heavier, and heavier people are likely to be taller.\n", "This is reflected by the *body mass index*. A ratio suggested by one of\n", "the fathers of statistics, Adolphe Quetelet. Quetelet was interested in\n", "the notion of the *average man* and collected various statistics about\n", "people. He defined the BMI to be, $$\n", "\\text{BMI} = \\frac{w}{h^2}\n", "$$To deal with this dependence we now introduce the notion of\n", "*correlation* to the multivariate Gaussian density.\n", "\n", "### Sampling Two Dimensional Variables" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pods\n", "from ipywidgets import IntSlider" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "pods.notebook.display_plots('correlated_height_weight{fig:0>3}.svg', \n", " directory='../slides/diagrams/ml', \n", " fig=IntSlider(0, 0, 7, 1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Samples from *correlated* Gaussian variables that might represent\n", "heights and weights.\n", "
\n", "### Independent Gaussians\n", "\n", "$$\n", "p(w, h) = p(w)p(h)\n", "$$\n", "\n", "$$\n", "p(w, h) = \\frac{1}{\\sqrt{2\\pi \\dataStd_1^2}\\sqrt{2\\pi\\dataStd_2^2}} \\exp\\left(-\\frac{1}{2}\\left(\\frac{(w-\\meanScalar_1)^2}{\\dataStd_1^2} + \\frac{(h-\\meanScalar_2)^2}{\\dataStd_2^2}\\right)\\right)\n", "$$\n", "\n", "$$\n", "p(w, h) = \\frac{1}{\\sqrt{2\\pi\\dataStd_1^22\\pi\\dataStd_2^2}} \\exp\\left(-\\frac{1}{2}\\left(\\begin{bmatrix}w \\\\ h\\end{bmatrix} - \\begin{bmatrix}\\meanScalar_1 \\\\ \\meanScalar_2\\end{bmatrix}\\right)^\\top\\begin{bmatrix}\\dataStd_1^2& 0\\\\0&\\dataStd_2^2\\end{bmatrix}^{-1}\\left(\\begin{bmatrix}w \\\\ h\\end{bmatrix} - \\begin{bmatrix}\\meanScalar_1 \\\\ \\meanScalar_2\\end{bmatrix}\\right)\\right)\n", "$$\n", "\n", "$$\n", "p(\\dataVector) = \\frac{1}{\\det{2\\pi \\mathbf{D}}^{\\frac{1}{2}}} \\exp\\left(-\\frac{1}{2}(\\dataVector - \\meanVector)^\\top\\mathbf{D}^{-1}(\\dataVector - \\meanVector)\\right)\n", "$$\n", "\n", "### Correlated Gaussian\n", "\n", "Form correlated from original by rotating the data space using matrix\n", "$\\rotationMatrix$.\n", "\n", "$$\n", "p(\\dataVector) = \\frac{1}{\\det{2\\pi\\mathbf{D}}^{\\frac{1}{2}}} \\exp\\left(-\\frac{1}{2}(\\dataVector - \\meanVector)^\\top\\mathbf{D}^{-1}(\\dataVector - \\meanVector)\\right)\n", "$$\n", "\n", "$$\n", "p(\\dataVector) = \\frac{1}{\\det{2\\pi\\mathbf{D}}^{\\frac{1}{2}}} \\exp\\left(-\\frac{1}{2}(\\rotationMatrix^\\top\\dataVector - \\rotationMatrix^\\top\\meanVector)^\\top\\mathbf{D}^{-1}(\\rotationMatrix^\\top\\dataVector - \\rotationMatrix^\\top\\meanVector)\\right)\n", "$$\n", "\n", "$$\n", "p(\\dataVector) = \\frac{1}{\\det{2\\pi\\mathbf{D}}^{\\frac{1}{2}}} \\exp\\left(-\\frac{1}{2}(\\dataVector - \\meanVector)^\\top\\rotationMatrix\\mathbf{D}^{-1}\\rotationMatrix^\\top(\\dataVector - \\meanVector)\\right)\n", "$$ this gives a covariance matrix: $$\n", "\\covarianceMatrix^{-1} = \\rotationMatrix \\mathbf{D}^{-1} \\rotationMatrix^\\top\n", "$$\n", "\n", "$$\n", "p(\\dataVector) = \\frac{1}{\\det{2\\pi\\covarianceMatrix}^{\\frac{1}{2}}} \\exp\\left(-\\frac{1}{2}(\\dataVector - \\meanVector)^\\top\\covarianceMatrix^{-1} (\\dataVector - \\meanVector)\\right)\n", "$$ this gives a covariance matrix: $$\n", "\\covarianceMatrix = \\rotationMatrix \\mathbf{D} \\rotationMatrix^\\top\n", "$$\n", "\n", "Let's first of all review the properties of the multivariate Gaussian\n", "distribution that make linear Gaussian models easier to deal with. We'll\n", "return to the, perhaps surprising, result on the parameters within the\n", "nonlinearity, $\\parameterVector$, shortly.\n", "\n", "To work with linear Gaussian models, to find the marginal likelihood all\n", "you need to know is the following rules. 
If $$\n", "\dataVector = \mappingMatrix \inputVector + \noiseVector,\n", "$$ where $\dataVector$, $\inputVector$ and $\noiseVector$ are vectors\n", "and we assume that $\inputVector$ and $\noiseVector$ are drawn from\n", "multivariate Gaussians, $$\begin{align}\n", "\inputVector & \sim \gaussianSamp{\meanVector}{\covarianceMatrix}\\\n", "\noiseVector & \sim \gaussianSamp{\zerosVector}{\covarianceMatrixTwo}\n", "\end{align}$$ then we know that $\dataVector$ is also drawn from a\n", "multivariate Gaussian with, $$\n", "\dataVector \sim \gaussianSamp{\mappingMatrix\meanVector}{\mappingMatrix\covarianceMatrix\mappingMatrix^\top + \covarianceMatrixTwo}.\n", "$$\n", "\n", "With appropriately defined covariance, $\covarianceMatrixTwo$, this is\n", "actually the marginal likelihood for Factor Analysis, or Probabilistic\n", "Principal Component Analysis [@Tipping:probpca99], because we integrated\n", "out the inputs (or *latent* variables they would be called in that\n", "case).\n", "\n", "However, we are focussing on what happens in models which are non-linear\n", "in the inputs, whereas the above would be *linear* in the inputs. To\n", "consider these, we introduce a matrix, called the design matrix. We set\n", "each activation function computed at each data point to be $$\n", "\activationScalar_{i,j} = \activationScalar(\mappingVector^{(1)}_{j}, \inputVector_{i})\n", "$$ and define the matrix of activations (known as the *design matrix* in\n", "statistics) to be, $$\n", "\activationMatrix = \n", "\begin{bmatrix}\n", "\activationScalar_{1, 1} & \activationScalar_{1, 2} & \dots & \activationScalar_{1, \numHidden} \\\n", "\activationScalar_{2, 1} & \activationScalar_{2, 2} & \dots & \activationScalar_{2, \numHidden} \\\n", "\vdots & \vdots & \ddots & \vdots \\\n", "\activationScalar_{\numData, 1} & \activationScalar_{\numData, 2} & \dots & \activationScalar_{\numData, \numHidden}\n", "\end{bmatrix}.\n", "$$ By convention this matrix always has $\numData$ rows and $\numHidden$\n", "columns. Now we define the vector of all noise corruptions,\n", "$\noiseVector = \left[\noiseScalar_1, \dots, \noiseScalar_\numData\right]^\top$.\n", "\n", "If we define the prior distribution over the vector $\mappingVector$ to\n", "be Gaussian, $$\n", "\mappingVector \sim \gaussianSamp{\zerosVector}{\alpha\eye},\n", "$$\n", "\n", "then we can use rules of multivariate Gaussians to see that, $$\n", "\dataVector \sim \gaussianSamp{\zerosVector}{\alpha \activationMatrix \activationMatrix^\top + \dataStd^2 \eye}.\n", "$$\n", "\n", "In other words, our training data is distributed as a multivariate\n", "Gaussian, with zero mean and a covariance given by $$\n", "\kernelMatrix = \alpha \activationMatrix \activationMatrix^\top + \dataStd^2 \eye.\n", "$$\n", "\n", "This is an $\numData \times \numData$ size matrix. Its elements are in\n", "the form of a function. The maths shows that any element, indexed by $i$\n", "and $j$, is a function *only* of inputs associated with data points $i$\n", "and $j$, $\inputVector_i$ and $\inputVector_j$,\n", "$\kernel_{i,j} = \kernel\left(\inputVector_i, \inputVector_j\right)$.\n", "\n", "If we look at the portion of this function associated only with\n", "$\mappingFunction(\cdot)$, i.e. 
we remove the noise, then we can write\n", "down the covariance associated with our neural network, $$\n", "\kernel_\mappingFunction\left(\inputVector_i, \inputVector_j\right) = \alpha \activationVector\left(\mappingMatrix_1, \inputVector_i\right)^\top \activationVector\left(\mappingMatrix_1, \inputVector_j\right)\n", "$$ so the elements of the covariance or *kernel* matrix are formed by\n", "inner products of the rows of the *design matrix*.\n", "\n", "### Gaussian Process\n", "\n", "This is the essence of a Gaussian process. Instead of making i.i.d.\n", "assumptions about the density over each data point, $\dataScalar_i$, we\n", "make a joint Gaussian assumption over our data. The covariance matrix is\n", "now a function of both the parameters of the activation function,\n", "$\mappingMatrixTwo$, and the input variables, $\inputMatrix$. This comes\n", "about through integrating out the parameters of the model,\n", "$\mappingVector$.\n", "\n", "### Basis Functions\n", "\n", "We can basically put anything inside the basis functions, and many\n", "people do. These can be deep kernels [@Cho:deep09] or we can learn the\n", "parameters of a convolutional neural network inside there.\n", "\n", "Viewing a neural network in this way is also what allows us to perform\n", "sensible *batch* normalizations [@Ioffe:batch15].\n", "\n", "### Bayesian Inference by Rejection Sampling\n", "\n", "One view of Bayesian inference is to assume we are given a mechanism for\n", "generating samples, where we assume that mechanism represents an\n", "accurate view of the way we believe the world works.\n", "\n", "This mechanism is known as our *prior* belief.\n", "\n", "We combine our prior belief with our observations of the real world by\n", "discarding all those samples that are inconsistent with our\n", "observations. The *likelihood* defines mathematically what we mean by\n", "inconsistent with the observations. The higher the noise level in the\n", "likelihood, the looser the notion of consistency.\n", "\n", "The samples that remain are considered to be samples from the\n", "*posterior*.\n", "\n", "This approach to Bayesian inference is closely related to two sampling\n", "techniques known as *rejection sampling* and *importance sampling*. It\n", "is realized in practice in an approach known as *approximate Bayesian\n", "computation* (ABC) or likelihood-free inference.\n", "\n", "In practice, the algorithm is often too slow to be practical, because\n", "most samples will be inconsistent with the data and as a result the\n", "mechanism has to be operated many times to obtain a few posterior\n", "samples.\n", "\n", "However, in the Gaussian process case, when the likelihood also assumes\n", "Gaussian noise, we can operate this mechanism mathematically, and obtain\n", "the posterior density *analytically*. This is the benefit of Gaussian\n", "processes."
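, "\n", "\n", "The following is a small illustrative sketch of this rejection idea (it\n", "is not the analytic Gaussian process solution, and the numbers are\n", "assumed for illustration): we draw candidate lines from a Gaussian prior\n", "over $m$ and $c$ and keep only those that pass close to the data.\n", "\n", "    import numpy as np\n", "\n", "    x = np.array([1., 3., 2.])\n", "    y = np.array([3., 1., 2.5])\n", "    m = np.random.normal(0., 1., size=100000)   # prior samples for slope\n", "    c = np.random.normal(0., 4., size=100000)   # prior samples for intercept\n", "    resid = y[None, :] - (np.outer(m, x) + c[:, None])\n", "    keep = np.all(np.abs(resid) < 0.5, axis=1)  # crude notion of consistency\n", "    print(keep.sum(), m[keep].mean(), c[keep].mean())"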
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%load -s compute_kernel mlai.py" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%load -s exponentiated_quadratic mlai.py" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pods\n", "from ipywidgets import IntSlider" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pods.notebook.display_plots('gp_rejection_sample{sample:0>3}.svg', \n", " directory='../slides/diagrams/gp', \n", " sample=IntSlider(1,1,5,1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "One view of Bayesian inference is we have a machine for generating\n", "samples (the *prior*), and we discard all samples inconsistent with our\n", "data, leaving the samples of interest (the *posterior*). The Gaussian\n", "process allows us to do this analytically.\n", "
\n", "### Sampling a Function\n", "\n", "We will consider a Gaussian distribution with a particular structure of\n", "covariance matrix. We will generate *one* sample from a 25-dimensional\n", "Gaussian density. $$\n", "\\mappingFunctionVector=\\left[\\mappingFunction_{1},\\mappingFunction_{2}\\dots \\mappingFunction_{25}\\right].\n", "$$ in the figure below we plot these data on the $y$-axis against their\n", "*indices* on the $x$-axis." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%load -s Kernel mlai.py" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%load -s polynomial_cov mlai.py" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%load -s exponentiated_quadratic mlai.py" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pods\n", "from ipywidgets import IntSlider" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pods.notebook.display_plots('two_point_sample{sample:0>3}.svg', '../slides/diagrams/gp', sample=IntSlider(0, 0, 8, 1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "A 25 dimensional correlated random variable (values ploted against\n", "index)\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pods\n", "from ipywidgets import IntSlider" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pods.notebook.display_plots('two_point_sample{sample:0>3}.svg', '../slides/diagrams/gp', sample=IntSlider(9, 9, 12, 1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "The joint Gaussian over $\\mappingFunction_1$ and $\\mappingFunction_2$\n", "along with the conditional distribution of $\\mappingFunction_2$ given\n", "$\\mappingFunction_1$\n", "
\n", "### Uluru\n", "\n", "
\n", "\n", "\n", "\n", "
\n", "\n", "When viewing these contour plots, I sometimes find it helpful to think\n", "of Uluru, the prominent rock formation in Australia. The rock rises\n", "above the surface of the plane, just like a probability density rising\n", "above the zero line. The rock is three dimensional, but when we view\n", "Uluru from the classical position, we are looking at one side of it.\n", "This is equivalent to viewing the marginal density.\n", "\n", "The joint density can be viewed from above, using contours. The\n", "conditional density is equivalent to *slicing* the rock. Uluru is a holy\n", "rock, so this has to be an imaginary slice. Imagine we cut down a\n", "vertical plane orthogonal to our view point (e.g. coming across our view\n", "point). This would give a profile of the rock, which when renormalized,\n", "would give us the conditional distribution, the value of conditioning\n", "would be the location of the slice in the direction we are facing.\n", "\n", "### Prediction with Correlated Gaussians\n", "\n", "Of course in practice, rather than manipulating mountains physically,\n", "the advantage of the Gaussian density is that we can perform these\n", "manipulations mathematically.\n", "\n", "Prediction of $\\mappingFunction_2$ given $\\mappingFunction_1$ requires\n", "the *conditional density*,\n", "$p(\\mappingFunction_2|\\mappingFunction_1)$.Another remarkable property\n", "of the Gaussian density is that this conditional distribution is *also*\n", "guaranteed to be a Gaussian density. It has the form, $$\n", " p(\\mappingFunction_2|\\mappingFunction_1) = \\gaussianDist{\\mappingFunction_2}{\\frac{\\kernelScalar_{1, 2}}{\\kernelScalar_{1, 1}}\\mappingFunction_1}{ \\kernelScalar_{2, 2} - \\frac{\\kernelScalar_{1,2}^2}{\\kernelScalar_{1,1}}}\n", " $$where we have assumed that the covariance of the original joint\n", "density was given by $$\n", " \\kernelMatrix = \\begin{bmatrix} \\kernelScalar_{1, 1} & \\kernelScalar_{1, 2}\\\\ \\kernelScalar_{2, 1} & \\kernelScalar_{2, 2}.\\end{bmatrix}\n", " $$\n", "\n", "Using these formulae we can determine the conditional density for any of\n", "the elements of our vector $\\mappingFunctionVector$. For example, the\n", "variable $\\mappingFunction_8$ is less correlated with\n", "$\\mappingFunction_1$ than $\\mappingFunction_2$. If we consider this\n", "variable we see the conditional density is more diffuse." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pods\n", "from ipywidgets import IntSlider" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pods.notebook.display_plots('two_point_sample{sample:0>3}.svg', '../slides/diagrams/gp', sample=IntSlider(13, 13, 17, 1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "The joint Gaussian over $\\mappingFunction_1$ and $\\mappingFunction_8$\n", "along with the conditional distribution of $\\mappingFunction_8$ given\n", "$\\mappingFunction_1$\n", "
\n", "### Where Did This Covariance Matrix Come From?\n", "\n", "$$\n", "k(\\inputVector, \\inputVector^\\prime) = \\alpha \\exp\\left(-\\frac{\\left\\Vert \\inputVector - \\inputVector^\\prime\\right\\Vert^2_2}{2\\lengthScale^2}\\right)$$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pods\n", "from ipywidgets import IntSlider" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pods.notebook.display_plots('computing_eq_three_covariance{sample:0>3}.svg', \n", " directory='../slides/diagrams/kern', \n", " sample=IntSlider(0, 0, 16, 1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Entrywise fill in of the covariance matrix from the covariance\n", "function.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pods\n", "from ipywidgets import IntSlider" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pods.notebook.display_plots('computing_eq_four_covariance{sample:0>3}.svg', \n", " directory='../slides/diagrams/kern', \n", " sample=IntSlider(0, 0, 27, 1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Entrywise fill in of the covariance matrix from the covariance\n", "function.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pods\n", "from ipywidgets import IntSlider" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pods.notebook.display_plots('computing_eq_three_2_covariance{sample:0>3}.svg', \n", " directory='../slides/diagrams/kern', \n", " sample=IntSlider(0, 0, 16, 1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Entrywise fill in of the covariance matrix from the covariance\n", "function.\n", "
\n", "### Polynomial Covariance\n", "\n", "\\loadplotcode{polynomial_cov}{mlai}\n", "\n", "
\n", "$$\\kernelScalar(\\inputVector, \\inputVector^\\prime) = \\alpha(w \\inputVector^\\top\\inputVector^\\prime + b)^d$$\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "
\n", "\n", "\n", "\n", "
\n", "### Brownian Covariance" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%load -s brownian_cov mlai.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Brownian motion is also a Gaussian process. It follows a Gaussian random\n", "walk, with diffusion occuring at each time point driven by a Gaussian\n", "input. This implies it is both Markov and Gaussian. The covariane\n", "function for Brownian motion has the form $$\n", "\\kernelScalar(t, t^\\prime) = \\alpha \\min(t, t^\\prime)\n", "$$\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
\n", "\n", "\n", "\n", "
\n", "### Periodic Covariance\n", "\n", "\\loadplotcode{periodic_cov}{mlai}\n", "\n", "
\n", "$$\\kernelScalar(\\inputVector, \\inputVector^\\prime) = \\alpha\\exp\\left(\\frac{-2\\sin(\\pi rw)^2}{\\lengthScale^2}\\right)$$\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "
\n", "\n", "\n", "\n", "
\n", "Any linear basis function can also be incorporated into a covariance\n", "function. For example, an RBF network is a type of neural network with a\n", "set of radial basis functions. Meaning, the basis funciton is radially\n", "symmetric. These basis functions take the form, $$\n", "\\basisFunction_k(\\inputScalar) = \\exp\\left(-\\frac{\\ltwoNorm{\\inputScalar-\\meanScalar_k}^{2}}{\\lengthScale^{2}}\\right).\n", "$$ Given a set of parameters, $$\n", "\\meanVector = \\begin{bmatrix} -1 \\\\ 0 \\\\ 1\\end{bmatrix},\n", "$$ we can construct the corresponding covariance function, which has the\n", "form, $$\n", "\\kernelScalar\\left(\\inputVals,\\inputVals^{\\prime}\\right)=\\alpha\\basisVector(\\inputVals)^\\top \\basisVector(\\inputVals^\\prime).\n", "$$\n", "\n", "### Basis Function Covariance\n", "\n", "The fixed basis function covariance just comes from the properties of a\n", "multivariate Gaussian, if we decide $$\n", "\\mappingFunctionVector=\\basisMatrix\\mappingVector\n", "$$ and then we assume $$\n", "\\mappingVector \\sim \\gaussianSamp{\\zerosVector}{\\alpha\\eye}\n", "$$ then it follows from the properties of a multivariate Gaussian that\n", "$$\n", "\\mappingFunctionVector \\sim \\gaussianSamp{\\zerosVector}{\\alpha\\basisMatrix\\basisMatrix^\\top}\n", "$$ meaning that the vector of observations from the function is jointly\n", "distributed as a Gaussian process and the covariance matrix is\n", "$\\kernelMatrix = \\alpha\\basisMatrix \\basisMatrix^\\top$, each element of\n", "the covariance matrix can then be found as the inner product between two\n", "rows of the basis funciton matrix." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%load -s basis_cov mlai.py" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%load -s radial mlai.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "$$\\kernel(\\inputVector, \\inputVector^\\prime) = \\basisVector(\\inputVector)^\\top \\basisVector(\\inputVector^\\prime)$$\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "
\n", "\n", "\n", "\n", "
\n", "{\n", "
\n", "A covariance function based on a non-linear basis given by\n", "$\\basisVector(\\inputVector)$.\n", "
\n", "### Selecting Number and Location of Basis\n", "\n", "In practice for a basis function model we need to choose both 1. the\n", "location of the basis functions 2. the number of basis functions\n", "\n", "One very clever of finessing this problem is to choose to have\n", "*infinite* basis functions and place them *everywhere*. To show how this\n", "is possible, we will consider a one dimensional system, $\\inputScalar$,\n", "which should give the intuition of how to do this. However, these ideas\n", "also extend to multidimensional systems as shown in, for example,\n", "@Williams:infinite96 and @Neal:thesis94.\n", "\n", "We consider a one dimensional set up with exponentiated quadratic basis\n", "functions, $$\n", "\\basisFunction_k(\\inputScalar_i) = \\exp\\left(\\frac{\\ltwoNorm{\\inputScalar_i - \\locationScalar_k}^2}{2\\rbfWidth^2}\\right)\n", "$$\n", "\n", "To place these basis functions, we first define the basis function\n", "centers in terms of a starting point on the left of our input, $a$, and\n", "a finishing point, $b$. The gap between basis is given by\n", "$\\Delta\\locationScalar$. The location of each basis is then given by\n", "$$\\locationScalar_k = a+\\Delta\\locationScalar\\cdot (k-1).$$ The\n", "covariance function can then be given as $$\n", "\\kernelScalar\\left(\\inputScalar_i,\\inputScalar_j\\right) = \\sum_{k=1}^\\numBasisFunc \\basisFunction_k(\\inputScalar_i)\\basisFunction_k(\\inputScalar_j)\n", "$$ $$\\begin{aligned}\n", " \\kernelScalar\\left(\\inputScalar_i,\\inputScalar_j\\right) = &\\alpha^\\prime\\Delta\\locationScalar \\sum_{k=1}^{\\numBasisFunc} \\exp\\Bigg(\n", " -\\frac{\\inputScalar_i^2 + \\inputScalar_j^2}{2\\rbfWidth^2}\\\\ \n", " & - \\frac{2\\left(a+\\Delta\\locationScalar\\cdot (k-1)\\right)\n", " \\left(\\inputScalar_i+\\inputScalar_j\\right) + 2\\left(a+\\Delta\\locationScalar \\cdot (k-1)\\right)^2}{2\\rbfWidth^2} \\Bigg)\n", " \\end{aligned}$$ where we've also scaled the variance of the process by\n", "$\\Delta\\locationScalar$.\n", "\n", "A consequence of our definition is that the first and last basis\n", "function locations are given by $$\n", " \\locationScalar_1=a \\ \\text{and}\\ \\locationScalar_\\numBasisFunc=b \\ \\text{so}\\ b= a+ \\Delta\\locationScalar\\cdot(\\numBasisFunc-1)\n", " $$ This implies that the distance between $b$ and $a$ is given by $$\n", " b-a = \\Delta\\locationScalar (\\numBasisFunc -1)\n", " $$ and since the basis functions are separated by\n", "$\\Delta\\locationScalar$ the number of basis functions is given by $$\n", " \\numBasisFunc = \\frac{b-a}{\\Delta \\locationScalar} + 1\n", " $$ The next step is to take the limit as\n", "$\\Delta\\locationScalar\\rightarrow 0$ so\n", "$\\numBasisFunc \\rightarrow \\infty$ where we have used\n", "$a + k\\cdot\\Delta\\locationScalar\\rightarrow \\locationScalar$.\n", "\n", "Performing the integration gives $$\\begin{aligned}\n", " \\kernelScalar(\\inputScalar_i,&\\inputScalar_j) = \\alpha^\\prime \\sqrt{\\pi\\rbfWidth^2}\n", " \\exp\\left( -\\frac{\\left(\\inputScalar_i-\\inputScalar_j\\right)^2}{4\\rbfWidth^2}\\right)\\\\ &\\times\n", " \\frac{1}{2}\\left[\\text{erf}\\left(\\frac{\\left(b - \\frac{1}{2}\\left(\\inputScalar_i +\n", " \\inputScalar_j\\right)\\right)}{\\rbfWidth} \\right)-\n", " \\text{erf}\\left(\\frac{\\left(a - \\frac{1}{2}\\left(\\inputScalar_i +\n", " \\inputScalar_j\\right)\\right)}{\\rbfWidth} \\right)\\right],\n", " \\end{aligned}$$Now we take the limit as $a\\rightarrow -\\infty$ and\n", "$b\\rightarrow \\infty$\n", 
"$$\\kernelScalar\\left(\\inputScalar_i,\\inputScalar_j\\right) = \\alpha\\exp\\left(\n", " -\\frac{\\left(\\inputScalar_i-\\inputScalar_j\\right)^2}{4\\rbfWidth^2}\\right).$$\n", "where $\\alpha=\\alpha^\\prime \\sqrt{\\pi\\rbfWidth^2}$.\n", "\n", "In conclusion, an RBF model with infinite basis functions is a Gaussian\n", "process with the exponentiated quadratic covariance function\n", "$$\\kernelScalar\\left(\\inputScalar_i,\\inputScalar_j\\right) = \\alpha \\exp\\left(\n", " -\\frac{\\left(\\inputScalar_i-\\inputScalar_j\\right)^2}{4\\rbfWidth^2}\\right).$$\n", "\n", "Note that while the functional form of the basis function and the\n", "covariance function are similar, in general if we repeated this analysis\n", "for other basis functions the covariance function will have a very\n", "different form. For example the error function, $\\text{erf}(\\cdot)$,\n", "results in an $\\asin(\\cdot)$ form. See @Williams:infinite96 for more\n", "details.\n", "\n", "### MLP Covariance" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%load -s mlp_cov mlai.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The multi-layer perceptron (MLP) covariance, also known as the neural\n", "network covariance or the arcsin covariance, is derived by considering\n", "the infinite limit of a neural network.\n", "\n", "
\n", "$$\\kernelScalar(\\inputVector, \\inputVector^\\prime) = \\alpha \\arcsin\\left(\\frac{w \\inputVector^\\top \\inputVector^\\prime + b}{\\sqrt{\\left(w \\inputVector^\\top \\inputVector + b + 1\\right)\\left(w \\left.\\inputVector^\\prime\\right.^\\top \\inputVector^\\prime + b + 1\\right)}}\\right)$$\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "
\n", "\n", "\n", "\n", "
\n", "
\n", "The multi-layer perceptron covariance function. This is derived by\n", "considering the infinite limit of a neural network with probit\n", "activation functions.\n", "
\n", "\n", "### Sinc Covariance\n", "\n", "Another approach to developing covariance function exploits Bochner's\n", "theorem @Bochner:book59. Bochner's theorem tells us that any positve\n", "filter in Fourier space implies has an associated Gaussian process with\n", "a stationary covariance function. The covariance function is the\n", "*inverse Fourier transform* of the filter applied in Fourier space.\n", "\n", "For example, in signal processing, *band limitations* are commonly\n", "applied as an assumption. For example, we may believe that no frequency\n", "above $w=2$ exists in the signal. This is equivalent to a rectangle\n", "function being applied as a the filter in Fourier space.\n", "\n", "The inverse Fourier transform of the rectangle function is the\n", "$\\text{sinc}(\\cdot)$ function. So the sinc is a valid covariance\n", "function, and it represents *band limited* signals.\n", "\n", "Note that other covariance functions we've introduced can also be\n", "interpreted in this way. For example, the exponentiated quadratic\n", "covariance function can be Fourier transformed to see what the implied\n", "filter in Fourier space is. The Fourier transform of the exponentiated\n", "quadratic is an exponentiated quadratic, so the standard EQ-covariance\n", "implies a EQ filter in Fourier space." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%load -s sinc_cov mlai.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\\includecovariane{sinc}{\\kernelScalar(\\inputVector, \\inputVector^\\prime) = \\alpha \\text{sinc}\\left(\\pi w r\\right)}\n", "\n", "\\tiny\n", "\n", "\\bibliographystyle{pdf_abbrvnat}\n", "\n", "[^1]: Note not all exponentiated quadratics can be normalized, to do so,\n", " the coefficient associated with the variable squared,\n", " $\\dataScalar^2$, must be strictly positive." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.3" } }, "nbformat": 4, "nbformat_minor": 2 }