{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Bayesian Regression\n", "### [Neil D. Lawrence](http://inverseprobability.com), University of Sheffield\n", "### 2015-11-03\n", "\n", "**Abstract**: Bayesian formalisms deal with uncertainty in parameters,\n", "\n", "$$\n", "\\newcommand{\\tk}[1]{}\n", "%\\newcommand{\\tk}[1]{\\textbf{TK}: #1}\n", "\\newcommand{\\Amatrix}{\\mathbf{A}}\n", "\\newcommand{\\KL}[2]{\\text{KL}\\left( #1\\,\\|\\,#2 \\right)}\n", "\\newcommand{\\Kaast}{\\kernelMatrix_{\\mathbf{ \\ast}\\mathbf{ \\ast}}}\n", "\\newcommand{\\Kastu}{\\kernelMatrix_{\\mathbf{ \\ast} \\inducingVector}}\n", "\\newcommand{\\Kff}{\\kernelMatrix_{\\mappingFunctionVector \\mappingFunctionVector}}\n", "\\newcommand{\\Kfu}{\\kernelMatrix_{\\mappingFunctionVector \\inducingVector}}\n", "\\newcommand{\\Kuast}{\\kernelMatrix_{\\inducingVector \\bf\\ast}}\n", "\\newcommand{\\Kuf}{\\kernelMatrix_{\\inducingVector \\mappingFunctionVector}}\n", "\\newcommand{\\Kuu}{\\kernelMatrix_{\\inducingVector \\inducingVector}}\n", "\\newcommand{\\Kuui}{\\Kuu^{-1}}\n", "\\newcommand{\\Qaast}{\\mathbf{Q}_{\\bf \\ast \\ast}}\n", "\\newcommand{\\Qastf}{\\mathbf{Q}_{\\ast \\mappingFunction}}\n", "\\newcommand{\\Qfast}{\\mathbf{Q}_{\\mappingFunctionVector \\bf \\ast}}\n", "\\newcommand{\\Qff}{\\mathbf{Q}_{\\mappingFunctionVector \\mappingFunctionVector}}\n", "\\newcommand{\\aMatrix}{\\mathbf{A}}\n", "\\newcommand{\\aScalar}{a}\n", "\\newcommand{\\aVector}{\\mathbf{a}}\n", "\\newcommand{\\acceleration}{a}\n", "\\newcommand{\\bMatrix}{\\mathbf{B}}\n", "\\newcommand{\\bScalar}{b}\n", "\\newcommand{\\bVector}{\\mathbf{b}}\n", "\\newcommand{\\basisFunc}{\\phi}\n", "\\newcommand{\\basisFuncVector}{\\boldsymbol{ \\basisFunc}}\n", "\\newcommand{\\basisFunction}{\\phi}\n", "\\newcommand{\\basisLocation}{\\mu}\n", "\\newcommand{\\basisMatrix}{\\boldsymbol{ \\Phi}}\n", "\\newcommand{\\basisScalar}{\\basisFunction}\n", "\\newcommand{\\basisVector}{\\boldsymbol{ \\basisFunction}}\n", "\\newcommand{\\activationFunction}{\\phi}\n", "\\newcommand{\\activationMatrix}{\\boldsymbol{ \\Phi}}\n", "\\newcommand{\\activationScalar}{\\basisFunction}\n", "\\newcommand{\\activationVector}{\\boldsymbol{ \\basisFunction}}\n", "\\newcommand{\\bigO}{\\mathcal{O}}\n", "\\newcommand{\\binomProb}{\\pi}\n", "\\newcommand{\\cMatrix}{\\mathbf{C}}\n", "\\newcommand{\\cbasisMatrix}{\\hat{\\boldsymbol{ \\Phi}}}\n", "\\newcommand{\\cdataMatrix}{\\hat{\\dataMatrix}}\n", "\\newcommand{\\cdataScalar}{\\hat{\\dataScalar}}\n", "\\newcommand{\\cdataVector}{\\hat{\\dataVector}}\n", "\\newcommand{\\centeredKernelMatrix}{\\mathbf{ \\MakeUppercase{\\centeredKernelScalar}}}\n", "\\newcommand{\\centeredKernelScalar}{b}\n", "\\newcommand{\\centeredKernelVector}{\\centeredKernelScalar}\n", "\\newcommand{\\centeringMatrix}{\\mathbf{H}}\n", "\\newcommand{\\chiSquaredDist}[2]{\\chi_{#1}^{2}\\left(#2\\right)}\n", "\\newcommand{\\chiSquaredSamp}[1]{\\chi_{#1}^{2}}\n", "\\newcommand{\\conditionalCovariance}{\\boldsymbol{ \\Sigma}}\n", "\\newcommand{\\coregionalizationMatrix}{\\mathbf{B}}\n", "\\newcommand{\\coregionalizationScalar}{b}\n", "\\newcommand{\\coregionalizationVector}{\\mathbf{ \\coregionalizationScalar}}\n", "\\newcommand{\\covDist}[2]{\\text{cov}_{#2}\\left(#1\\right)}\n", "\\newcommand{\\covSamp}[1]{\\text{cov}\\left(#1\\right)}\n", "\\newcommand{\\covarianceScalar}{c}\n", "\\newcommand{\\covarianceVector}{\\mathbf{ \\covarianceScalar}}\n", "\\newcommand{\\covarianceMatrix}{\\mathbf{C}}\n", "\\newcommand{\\covarianceMatrixTwo}{\\boldsymbol{ 
\\Sigma}}\n", "\\newcommand{\\croupierScalar}{s}\n", "\\newcommand{\\croupierVector}{\\mathbf{ \\croupierScalar}}\n", "\\newcommand{\\croupierMatrix}{\\mathbf{ \\MakeUppercase{\\croupierScalar}}}\n", "\\newcommand{\\dataDim}{p}\n", "\\newcommand{\\dataIndex}{i}\n", "\\newcommand{\\dataIndexTwo}{j}\n", "\\newcommand{\\dataMatrix}{\\mathbf{Y}}\n", "\\newcommand{\\dataScalar}{y}\n", "\\newcommand{\\dataSet}{\\mathcal{D}}\n", "\\newcommand{\\dataStd}{\\sigma}\n", "\\newcommand{\\dataVector}{\\mathbf{ \\dataScalar}}\n", "\\newcommand{\\decayRate}{d}\n", "\\newcommand{\\degreeMatrix}{\\mathbf{ \\MakeUppercase{\\degreeScalar}}}\n", "\\newcommand{\\degreeScalar}{d}\n", "\\newcommand{\\degreeVector}{\\mathbf{ \\degreeScalar}}\n", "% Already defined by latex\n", "%\\newcommand{\\det}[1]{\\left|#1\\right|}\n", "\\newcommand{\\diag}[1]{\\text{diag}\\left(#1\\right)}\n", "\\newcommand{\\diagonalMatrix}{\\mathbf{D}}\n", "\\newcommand{\\diff}[2]{\\frac{\\text{d}#1}{\\text{d}#2}}\n", "\\newcommand{\\diffTwo}[2]{\\frac{\\text{d}^2#1}{\\text{d}#2^2}}\n", "\\newcommand{\\displacement}{x}\n", "\\newcommand{\\displacementVector}{\\textbf{\\displacement}}\n", "\\newcommand{\\distanceMatrix}{\\mathbf{ \\MakeUppercase{\\distanceScalar}}}\n", "\\newcommand{\\distanceScalar}{d}\n", "\\newcommand{\\distanceVector}{\\mathbf{ \\distanceScalar}}\n", "\\newcommand{\\eigenvaltwo}{\\ell}\n", "\\newcommand{\\eigenvaltwoMatrix}{\\mathbf{L}}\n", "\\newcommand{\\eigenvaltwoVector}{\\mathbf{l}}\n", "\\newcommand{\\eigenvalue}{\\lambda}\n", "\\newcommand{\\eigenvalueMatrix}{\\boldsymbol{ \\Lambda}}\n", "\\newcommand{\\eigenvalueVector}{\\boldsymbol{ \\lambda}}\n", "\\newcommand{\\eigenvector}{\\mathbf{ \\eigenvectorScalar}}\n", "\\newcommand{\\eigenvectorMatrix}{\\mathbf{U}}\n", "\\newcommand{\\eigenvectorScalar}{u}\n", "\\newcommand{\\eigenvectwo}{\\mathbf{v}}\n", "\\newcommand{\\eigenvectwoMatrix}{\\mathbf{V}}\n", "\\newcommand{\\eigenvectwoScalar}{v}\n", "\\newcommand{\\entropy}[1]{\\mathcal{H}\\left(#1\\right)}\n", "\\newcommand{\\errorFunction}{E}\n", "\\newcommand{\\expDist}[2]{\\left<#1\\right>_{#2}}\n", "\\newcommand{\\expSamp}[1]{\\left<#1\\right>}\n", "\\newcommand{\\expectation}[1]{\\left\\langle #1 \\right\\rangle }\n", "\\newcommand{\\expectationDist}[2]{\\left\\langle #1 \\right\\rangle _{#2}}\n", "\\newcommand{\\expectedDistanceMatrix}{\\mathcal{D}}\n", "\\newcommand{\\eye}{\\mathbf{I}}\n", "\\newcommand{\\fantasyDim}{r}\n", "\\newcommand{\\fantasyMatrix}{\\mathbf{ \\MakeUppercase{\\fantasyScalar}}}\n", "\\newcommand{\\fantasyScalar}{z}\n", "\\newcommand{\\fantasyVector}{\\mathbf{ \\fantasyScalar}}\n", "\\newcommand{\\featureStd}{\\varsigma}\n", "\\newcommand{\\gammaCdf}[3]{\\mathcal{GAMMA CDF}\\left(#1|#2,#3\\right)}\n", "\\newcommand{\\gammaDist}[3]{\\mathcal{G}\\left(#1|#2,#3\\right)}\n", "\\newcommand{\\gammaSamp}[2]{\\mathcal{G}\\left(#1,#2\\right)}\n", "\\newcommand{\\gaussianDist}[3]{\\mathcal{N}\\left(#1|#2,#3\\right)}\n", "\\newcommand{\\gaussianSamp}[2]{\\mathcal{N}\\left(#1,#2\\right)}\n", "\\newcommand{\\given}{|}\n", "\\newcommand{\\half}{\\frac{1}{2}}\n", "\\newcommand{\\heaviside}{H}\n", "\\newcommand{\\hiddenMatrix}{\\mathbf{ \\MakeUppercase{\\hiddenScalar}}}\n", "\\newcommand{\\hiddenScalar}{h}\n", "\\newcommand{\\hiddenVector}{\\mathbf{ \\hiddenScalar}}\n", "\\newcommand{\\identityMatrix}{\\eye}\n", "\\newcommand{\\inducingInputScalar}{z}\n", "\\newcommand{\\inducingInputVector}{\\mathbf{ \\inducingInputScalar}}\n", "\\newcommand{\\inducingInputMatrix}{\\mathbf{Z}}\n", 
"\\newcommand{\\inducingScalar}{u}\n", "\\newcommand{\\inducingVector}{\\mathbf{ \\inducingScalar}}\n", "\\newcommand{\\inducingMatrix}{\\mathbf{U}}\n", "\\newcommand{\\inlineDiff}[2]{\\text{d}#1/\\text{d}#2}\n", "\\newcommand{\\inputDim}{q}\n", "\\newcommand{\\inputMatrix}{\\mathbf{X}}\n", "\\newcommand{\\inputScalar}{x}\n", "\\newcommand{\\inputSpace}{\\mathcal{X}}\n", "\\newcommand{\\inputVals}{\\inputVector}\n", "\\newcommand{\\inputVector}{\\mathbf{ \\inputScalar}}\n", "\\newcommand{\\iterNum}{k}\n", "\\newcommand{\\kernel}{\\kernelScalar}\n", "\\newcommand{\\kernelMatrix}{\\mathbf{K}}\n", "\\newcommand{\\kernelScalar}{k}\n", "\\newcommand{\\kernelVector}{\\mathbf{ \\kernelScalar}}\n", "\\newcommand{\\kff}{\\kernelScalar_{\\mappingFunction \\mappingFunction}}\n", "\\newcommand{\\kfu}{\\kernelVector_{\\mappingFunction \\inducingScalar}}\n", "\\newcommand{\\kuf}{\\kernelVector_{\\inducingScalar \\mappingFunction}}\n", "\\newcommand{\\kuu}{\\kernelVector_{\\inducingScalar \\inducingScalar}}\n", "\\newcommand{\\lagrangeMultiplier}{\\lambda}\n", "\\newcommand{\\lagrangeMultiplierMatrix}{\\boldsymbol{ \\Lambda}}\n", "\\newcommand{\\lagrangian}{L}\n", "\\newcommand{\\laplacianFactor}{\\mathbf{ \\MakeUppercase{\\laplacianFactorScalar}}}\n", "\\newcommand{\\laplacianFactorScalar}{m}\n", "\\newcommand{\\laplacianFactorVector}{\\mathbf{ \\laplacianFactorScalar}}\n", "\\newcommand{\\laplacianMatrix}{\\mathbf{L}}\n", "\\newcommand{\\laplacianScalar}{\\ell}\n", "\\newcommand{\\laplacianVector}{\\mathbf{ \\ell}}\n", "\\newcommand{\\latentDim}{q}\n", "\\newcommand{\\latentDistanceMatrix}{\\boldsymbol{ \\Delta}}\n", "\\newcommand{\\latentDistanceScalar}{\\delta}\n", "\\newcommand{\\latentDistanceVector}{\\boldsymbol{ \\delta}}\n", "\\newcommand{\\latentForce}{f}\n", "\\newcommand{\\latentFunction}{u}\n", "\\newcommand{\\latentFunctionVector}{\\mathbf{ \\latentFunction}}\n", "\\newcommand{\\latentFunctionMatrix}{\\mathbf{ \\MakeUppercase{\\latentFunction}}}\n", "\\newcommand{\\latentIndex}{j}\n", "\\newcommand{\\latentScalar}{z}\n", "\\newcommand{\\latentVector}{\\mathbf{ \\latentScalar}}\n", "\\newcommand{\\latentMatrix}{\\mathbf{Z}}\n", "\\newcommand{\\learnRate}{\\eta}\n", "\\newcommand{\\lengthScale}{\\ell}\n", "\\newcommand{\\rbfWidth}{\\ell}\n", "\\newcommand{\\likelihoodBound}{\\mathcal{L}}\n", "\\newcommand{\\likelihoodFunction}{L}\n", "\\newcommand{\\locationScalar}{\\mu}\n", "\\newcommand{\\locationVector}{\\boldsymbol{ \\locationScalar}}\n", "\\newcommand{\\locationMatrix}{\\mathbf{M}}\n", "\\newcommand{\\variance}[1]{\\text{var}\\left( #1 \\right)}\n", "\\newcommand{\\mappingFunction}{f}\n", "\\newcommand{\\mappingFunctionMatrix}{\\mathbf{F}}\n", "\\newcommand{\\mappingFunctionTwo}{g}\n", "\\newcommand{\\mappingFunctionTwoMatrix}{\\mathbf{G}}\n", "\\newcommand{\\mappingFunctionTwoVector}{\\mathbf{ \\mappingFunctionTwo}}\n", "\\newcommand{\\mappingFunctionVector}{\\mathbf{ \\mappingFunction}}\n", "\\newcommand{\\scaleScalar}{s}\n", "\\newcommand{\\mappingScalar}{w}\n", "\\newcommand{\\mappingVector}{\\mathbf{ \\mappingScalar}}\n", "\\newcommand{\\mappingMatrix}{\\mathbf{W}}\n", "\\newcommand{\\mappingScalarTwo}{v}\n", "\\newcommand{\\mappingVectorTwo}{\\mathbf{ \\mappingScalarTwo}}\n", "\\newcommand{\\mappingMatrixTwo}{\\mathbf{V}}\n", "\\newcommand{\\maxIters}{K}\n", "\\newcommand{\\meanMatrix}{\\mathbf{M}}\n", "\\newcommand{\\meanScalar}{\\mu}\n", "\\newcommand{\\meanTwoMatrix}{\\mathbf{M}}\n", "\\newcommand{\\meanTwoScalar}{m}\n", "\\newcommand{\\meanTwoVector}{\\mathbf{ 
\\meanTwoScalar}}\n", "\\newcommand{\\meanVector}{\\boldsymbol{ \\meanScalar}}\n", "\\newcommand{\\mrnaConcentration}{m}\n", "\\newcommand{\\naturalFrequency}{\\omega}\n", "\\newcommand{\\neighborhood}[1]{\\mathcal{N}\\left( #1 \\right)}\n", "\\newcommand{\\neilurl}{http://inverseprobability.com/}\n", "\\newcommand{\\noiseMatrix}{\\boldsymbol{ E}}\n", "\\newcommand{\\noiseScalar}{\\epsilon}\n", "\\newcommand{\\noiseVector}{\\boldsymbol{ \\epsilon}}\n", "\\newcommand{\\norm}[1]{\\left\\Vert #1 \\right\\Vert}\n", "\\newcommand{\\normalizedLaplacianMatrix}{\\hat{\\mathbf{L}}}\n", "\\newcommand{\\normalizedLaplacianScalar}{\\hat{\\ell}}\n", "\\newcommand{\\normalizedLaplacianVector}{\\hat{\\mathbf{ \\ell}}}\n", "\\newcommand{\\numActive}{m}\n", "\\newcommand{\\numBasisFunc}{m}\n", "\\newcommand{\\numComponents}{m}\n", "\\newcommand{\\numComps}{K}\n", "\\newcommand{\\numData}{n}\n", "\\newcommand{\\numFeatures}{K}\n", "\\newcommand{\\numHidden}{h}\n", "\\newcommand{\\numInducing}{m}\n", "\\newcommand{\\numLayers}{\\ell}\n", "\\newcommand{\\numNeighbors}{K}\n", "\\newcommand{\\numSequences}{s}\n", "\\newcommand{\\numSuccess}{s}\n", "\\newcommand{\\numTasks}{m}\n", "\\newcommand{\\numTime}{T}\n", "\\newcommand{\\numTrials}{S}\n", "\\newcommand{\\outputIndex}{j}\n", "\\newcommand{\\paramVector}{\\boldsymbol{ \\theta}}\n", "\\newcommand{\\parameterMatrix}{\\boldsymbol{ \\Theta}}\n", "\\newcommand{\\parameterScalar}{\\theta}\n", "\\newcommand{\\parameterVector}{\\boldsymbol{ \\parameterScalar}}\n", "\\newcommand{\\partDiff}[2]{\\frac{\\partial#1}{\\partial#2}}\n", "\\newcommand{\\precisionScalar}{j}\n", "\\newcommand{\\precisionVector}{\\mathbf{ \\precisionScalar}}\n", "\\newcommand{\\precisionMatrix}{\\mathbf{J}}\n", "\\newcommand{\\pseudotargetScalar}{\\widetilde{y}}\n", "\\newcommand{\\pseudotargetVector}{\\mathbf{ \\pseudotargetScalar}}\n", "\\newcommand{\\pseudotargetMatrix}{\\mathbf{ \\widetilde{Y}}}\n", "\\newcommand{\\rank}[1]{\\text{rank}\\left(#1\\right)}\n", "\\newcommand{\\rayleighDist}[2]{\\mathcal{R}\\left(#1|#2\\right)}\n", "\\newcommand{\\rayleighSamp}[1]{\\mathcal{R}\\left(#1\\right)}\n", "\\newcommand{\\responsibility}{r}\n", "\\newcommand{\\rotationScalar}{r}\n", "\\newcommand{\\rotationVector}{\\mathbf{ \\rotationScalar}}\n", "\\newcommand{\\rotationMatrix}{\\mathbf{R}}\n", "\\newcommand{\\sampleCovScalar}{s}\n", "\\newcommand{\\sampleCovVector}{\\mathbf{ \\sampleCovScalar}}\n", "\\newcommand{\\sampleCovMatrix}{\\mathbf{s}}\n", "\\newcommand{\\scalarProduct}[2]{\\left\\langle{#1},{#2}\\right\\rangle}\n", "\\newcommand{\\sign}[1]{\\text{sign}\\left(#1\\right)}\n", "\\newcommand{\\sigmoid}[1]{\\sigma\\left(#1\\right)}\n", "\\newcommand{\\singularvalue}{\\ell}\n", "\\newcommand{\\singularvalueMatrix}{\\mathbf{L}}\n", "\\newcommand{\\singularvalueVector}{\\mathbf{l}}\n", "\\newcommand{\\sorth}{\\mathbf{u}}\n", "\\newcommand{\\spar}{\\lambda}\n", "\\newcommand{\\trace}[1]{\\text{tr}\\left(#1\\right)}\n", "\\newcommand{\\BasalRate}{B}\n", "\\newcommand{\\DampingCoefficient}{C}\n", "\\newcommand{\\DecayRate}{D}\n", "\\newcommand{\\Displacement}{X}\n", "\\newcommand{\\LatentForce}{F}\n", "\\newcommand{\\Mass}{M}\n", "\\newcommand{\\Sensitivity}{S}\n", "\\newcommand{\\basalRate}{b}\n", "\\newcommand{\\dampingCoefficient}{c}\n", "\\newcommand{\\mass}{m}\n", "\\newcommand{\\sensitivity}{s}\n", "\\newcommand{\\springScalar}{\\kappa}\n", "\\newcommand{\\springVector}{\\boldsymbol{ \\kappa}}\n", "\\newcommand{\\springMatrix}{\\boldsymbol{ \\mathcal{K}}}\n", 
"\\newcommand{\\tfConcentration}{p}\n", "\\newcommand{\\tfDecayRate}{\\delta}\n", "\\newcommand{\\tfMrnaConcentration}{f}\n", "\\newcommand{\\tfVector}{\\mathbf{ \\tfConcentration}}\n", "\\newcommand{\\velocity}{v}\n", "\\newcommand{\\sufficientStatsScalar}{g}\n", "\\newcommand{\\sufficientStatsVector}{\\mathbf{ \\sufficientStatsScalar}}\n", "\\newcommand{\\sufficientStatsMatrix}{\\mathbf{G}}\n", "\\newcommand{\\switchScalar}{s}\n", "\\newcommand{\\switchVector}{\\mathbf{ \\switchScalar}}\n", "\\newcommand{\\switchMatrix}{\\mathbf{S}}\n", "\\newcommand{\\tr}[1]{\\text{tr}\\left(#1\\right)}\n", "\\newcommand{\\loneNorm}[1]{\\left\\Vert #1 \\right\\Vert_1}\n", "\\newcommand{\\ltwoNorm}[1]{\\left\\Vert #1 \\right\\Vert_2}\n", "\\newcommand{\\onenorm}[1]{\\left\\vert#1\\right\\vert_1}\n", "\\newcommand{\\twonorm}[1]{\\left\\Vert #1 \\right\\Vert}\n", "\\newcommand{\\vScalar}{v}\n", "\\newcommand{\\vVector}{\\mathbf{v}}\n", "\\newcommand{\\vMatrix}{\\mathbf{V}}\n", "\\newcommand{\\varianceDist}[2]{\\text{var}_{#2}\\left( #1 \\right)}\n", "% Already defined by latex\n", "%\\newcommand{\\vec}{#1:}\n", "\\newcommand{\\vecb}[1]{\\left(#1\\right):}\n", "\\newcommand{\\weightScalar}{w}\n", "\\newcommand{\\weightVector}{\\mathbf{ \\weightScalar}}\n", "\\newcommand{\\weightMatrix}{\\mathbf{W}}\n", "\\newcommand{\\weightedAdjacencyMatrix}{\\mathbf{A}}\n", "\\newcommand{\\weightedAdjacencyScalar}{a}\n", "\\newcommand{\\weightedAdjacencyVector}{\\mathbf{ \\weightedAdjacencyScalar}}\n", "\\newcommand{\\onesVector}{\\mathbf{1}}\n", "\\newcommand{\\zerosVector}{\\mathbf{0}}\n", "$$\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "## Overdetermined System\n", "\n", "We can motivate the introduction of probability by considering systems\n", "where there were more observations than unknowns. In particular we can\n", "consider the simple fitting of the gradient and an offset of a line, $$ \n", "\\dataScalar = m\\inputScalar +c.\n", "$$ What happens if we have three pairs of observations of $\\inputScalar$\n", "and $\\dataScalar$, $\\{\\inputScalar_i, \\dataScalar_i\\}_{i=1}^3$. The\n", "issue can be solved by introducing a type of [slack\n", "variable](http://en.wikipedia.org/wiki/Slack_variable),\n", "$\\noiseScalar_i$, known as noise, such that for each observation we had\n", "the equation, $$\n", "\\dataScalar_i = m\\inputScalar_i + c + \\noiseScalar_i.\n", "$$\n", "\n", "\n", "# Underdetermined System \\[edit\\]\n", "\n", "What about the situation where you have more parameters than data in\n", "your simultaneous equation? This is known as an *underdetermined*\n", "system. In fact this set up is in some sense *easier* to solve, because\n", "we don't need to think about introducing a slack variable (although it\n", "might make a lot of sense from a *modelling* perspective to do so).\n", "\n", "The way Laplace proposed resolving an overdetermined system, was to\n", "introduce slack variables, $\\noiseScalar_i$, which needed to be\n", "estimated for each point. The slack variable represented the difference\n", "between our actual prediction and the true observation. This is known as\n", "the *residual*. By introducing the slack variable we now have an\n", "additional $n$ variables to estimate, one for each data point,\n", "$\\{\\noiseScalar_i\\}$. This actually turns the overdetermined system into\n", "an underdetermined system. 
Introduction of $n$ variables, plus the\n", "original $m$ and $c$ gives us $\\numData+2$ parameters to be estimated\n", "from $n$ observations, which actually makes the system\n", "*underdetermined*. However, we then made a probabilistic assumption\n", "about the slack variables: we assumed that the slack variables were\n", "distributed according to a probability density. And for the moment we\n", "have been assuming that density was the Gaussian,\n", "$$\\noiseScalar_i \\sim \\gaussianSamp{0}{\\dataStd^2},$$ with zero mean and\n", "variance $\\dataStd^2$.\n", "\n", "The follow-up question is whether we can do the same thing with the\n", "parameters. If we have two parameters and only one observation, can we\n", "place a probability distribution over the parameters, as we did with the\n", "slack variables? The answer is yes.\n", "\n", "## Underdetermined System" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pods\n", "from ipywidgets import IntSlider" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pods.notebook.display_plots('under_determined_system{samp:0>3}.svg', \n", "                            directory='../slides/diagrams/ml', samp=IntSlider(0, 0, 10, 1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: An underdetermined system can be fit by considering\n", "uncertainty. Multiple solutions are consistent with one specified\n", "point.\n", "\n", "## A Philosophical Dispute: Probabilistic Treatment of Parameters? \\[edit\\]\n", "\n", "From a philosophical perspective, placing a probability distribution over\n", "the *parameters* is known as the *Bayesian* approach. This is because\n", "Thomas Bayes, in a [1763\n", "essay](http://en.wikipedia.org/wiki/An_Essay_towards_solving_a_Problem_in_the_Doctrine_of_Chances)\n", "published at the Royal Society, introduced the [Bernoulli\n", "distribution](http://en.wikipedia.org/wiki/Bernoulli_distribution) with\n", "a probabilistic interpretation for the *parameters*. Later statisticians\n", "such as [Ronald Fisher](http://en.wikipedia.org/wiki/Ronald_Fisher)\n", "objected to the use of probability distributions for *parameters*, and\n", "so in an effort to discredit the approach they referred to it as\n", "Bayesian. However, the earliest practitioners of modelling, such as\n", "Laplace, applied the approach as the most natural thing to do for dealing\n", "with unknowns (whether they were parameters or variables).\n", "Unfortunately, this dispute led to a split in the modelling community\n", "that still has echoes today. It is known as the Bayesian vs Frequentist\n", "controversy. From my own perspective, I think that it is a false\n", "dichotomy, and that the two approaches are actually complementary. My\n", "own research focus is on *modelling*, and in that context the use\n", "of probability is vital. For frequentist statisticians, such as Fisher,\n", "the emphasis was on the value of the evidence in the data for a\n", "particular hypothesis. This is known as hypothesis testing. 
The two\n", "approaches can be unified because one of the most important approaches\n", "to hypothesis testing is to [compute the ratio of the\n", "likelihoods](http://en.wikipedia.org/wiki/Likelihood-ratio_test), and\n", "the result of applying a probability distribution to the parameters is\n", "merely to arrive at a different form of the likelihood.\n", "\n", "## The Bayesian Controversy: Philosophical Underpinnings\n", "\n", "A segment from the lecture in 2012 on philsophical underpinnings." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from IPython.lib.display import YouTubeVideo\n", "YouTubeVideo('AvlnFnvFw_0')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Figure: The philosophical underpinnings of uncertainty, as discussed\n", "in 2012 MLAI lecture.\n", "\n", "\\addreading{@Bishop:book06}{Section 1.2.3 (pg 21–24)}\n", ")} \\addreading{@Rogers:book11}{Sections 3.1-3.4 (pg 95-117)}\n", "\\addreading{@Bishop:book06}{Section 1.2.3 (pg 21–24)}\n", "\\addreading{@Bishop:book06}{Section 1.2.6 (start from just past eq 1.64 pg 30-32)}\n", "\n", "\\reading\n", "## Sum of Squares and Probability\n", "\n", "In the overdetermined system we introduced a new set of slack variables,\n", "$\\{\\noiseScalar_i\\}_{i=1}^\\numData$, on top of our parameters $m$ and\n", "$c$. We dealt with the variables by placing a probability distribution\n", "over them. This gives rise to the likelihood and for the case of\n", "Gaussian distributed variables, it gives rise to the sum of squares\n", "error. It was Gauss who first made this connection in his volume on\n", "\"Theoria Motus Corprum Coelestium\" (written in Latin)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pods" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pods.notebook.display_google_book(id='ORUOAAAAQAAJ', page='213')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The relevant section roughly translates as\n", "\n", "> ... It is clear, that for the product\n", "> $\\Omega = h^\\mu \\pi^{-frac{1}{2}\\mu} e^{-hh(vv + v^\\prime v^\\prime + v^{\\prime\\prime} v^{\\prime\\prime} + \\dots)}$\n", "> to be maximised the sum\n", "> $vv + v ^\\prime v^\\prime + v^{\\prime\\prime} v^{\\prime\\prime} + \\text{etc}.$\n", "> ought to be minimized. *Therefore, the most probable values of the\n", "> unknown quantities $p , q, r , s \\text{etc}.$, should be that in which\n", "> the sum of the squares of the differences between the functions\n", "> $V, V^\\prime, V^{\\prime\\prime} \\text{etc}$, and the observed values is\n", "> minimized*, for all observations of the same degree of precision is\n", "> presumed.\n", "\n", "It's on the strength of this paragraph that the density is known as the\n", "Gaussian, despite the fact that four pages later Gauss credits the\n", "necessary integral for the density to Laplace, and it was also Laplace\n", "that did a lot of the original work on dealing with these errors through\n", "probability. [Stephen Stigler's book on the measurement of uncertainty\n", "before 1900](http://www.hup.harvard.edu/catalog.php?isbn=9780674403413)\n", "has a nice chapter on this." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pods.notebook.display_google_book(id='ORUOAAAAQAAJ', page='217')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "where the crediting to the Laplace is about halfway through the last\n", "paragraph. This book was published in 1809, four years after\n", "\\refnotes{Legendre presented least squares}{linear-regression} in an\n", "appendix to one of his chapters on the orbit of comets. Gauss goes on to\n", "make a claim for priority on the method on page 221 (towards the end of\n", "the first paragraph ...)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pods.notebook.display_google_book(id='ORUOAAAAQAAJ', page='221')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The Bayesian Approach \\[edit\\]\n", "\n", "Now we will study Bayesian approaches to regression. In the Bayesian\n", "approach we define a *prior* density over our parameters, $m$ and $c$ or\n", "more generally $\\mappingVector$. This prior distribution gives us a\n", "range of expected values for our parameter *before* we have seen the\n", "data. The object in Bayesian inference is to then compute the*posterior*\n", "density which is the effect on the density of having observed the data.\n", "In standard probability notation we write the prior distribution as, $$\n", "p(\\mappingVector),\n", "$$ so it is the *marginal* distribution for the parameters, i.e. the\n", "distribution we have for the parameters without any knowledge about the\n", "data. The posterior distribution is written as, $$\n", "p(\\mappingVector|\\dataVector, \\inputMatrix).\n", "$$ So the posterior distribution is the *conditional* distribution for\n", "the parameters given the data (which in this case consists of pairs of\n", "observations including response variables (or targets), $\\dataScalar_i$,\n", "and covariates (or inputs) $\\inputVector_i$. Where here we are allowing\n", "the inputs to be multivariate.\n", "\n", "The posterior is recovered from the prior using *Bayes' rule*. Which is\n", "simply a rewriting of the product rule. We can recover Bayes' rule as\n", "follows. The product rule of probability tells us that the joint\n", "distribution is given as the product of the conditional and the\n", "marginal. Dropping the inputs from our conditioning for the moment we\n", "have, $$\n", "p(\\mappingVector, \\dataVector)=p(\\dataVector|\\mappingVector)p(\\mappingVector),\n", "$$ where we see we have related the joint density to the prior density\n", "and the *likelihood* from our previous investigation of regression, $$\n", "p(\\dataVector|\\mappingVector) = \\prod_{i=1}^\\numData\\gaussianDist{\\dataScalar_i}{\\mappingVector^\\top \\inputVector_i}{ \\dataStd^2}\n", "$$ which arises from the assumption that our observation is given by $$\n", "\\dataScalar_i = \\mappingVector^\\top \\inputVector_i + \\noiseScalar_i.\n", "$$ In other words this is the Gaussian likelihood we have been fitting\n", "by minimizing the sum of squares. Have a look at [the session on\n", "multivariate regression](./week3.ipynb) as a reminder.\n", "\n", "We've introduce the likelihood, but we don't have relationship with the\n", "posterior, however, the product rule can also be written in the\n", "following way $$\n", "p(\\mappingVector, \\dataVector) = p(\\mappingVector|\\dataVector)p(\\dataVector),\n", "$$ where here we have simply used the opposite conditioning. 
We've\n", "already introduced the *posterior* density above. This is the density\n", "that represents our belief about the parameters *after* observing the\n", "data. This is combined with the *marginal likelihood*, sometimes also\n", "known as the evidence. It is the marginal likelihood, because it is the\n", "original likelihood of the data with the parameters marginalised,\n", "$p(\\dataVector)$. Here it's conditioned on nothing, but in practice you\n", "should always remember that everything here is conditioned on things\n", "like model choice: which set of basis functions. Because it's a\n", "regression problem, its also conditioned on the inputs. Using the\n", "equalitybetween the two different forms of the joint density we recover\n", "$$\n", "p(\\mappingVector|\\dataVector) = \\frac{p(\\dataVector|\\mappingVector)p(\\mappingVector)}{p(\\dataVector)}\n", "$$ where we divided both sides by $p(\\dataVector)$ to recover this\n", "result. Let's re-introduce the conditioning on the input locations (or\n", "covariates), $\\inputMatrix$ to write the full form of Bayes' rule for\n", "the regression problem. $$\n", "p(\\mappingVector|\\dataVector, \\inputMatrix) = \\frac{p(\\dataVector|\\mappingVector, \\inputMatrix)p(\\mappingVector)}{p(\\dataVector|\\inputMatrix)}\n", "$$ where the posterior density for the parameters given the data is\n", "$p(\\mappingVector|\\dataVector, \\inputMatrix)$, the marginal likelihood\n", "is $p(\\dataVector|\\inputMatrix)$, the prior density is\n", "$p(\\mappingVector)$ and our original regression likelihood is given by\n", "$p(\\dataVector|\\mappingVector, \\inputMatrix)$. It turns out that to\n", "compute the posterior the only things we need to do are define the prior\n", "and the likelihood. The other term on the right hand side can be\n", "computed by *the sum rule*. It is one of the key equations of Bayesian\n", "inference, the expectation of the likelihood under the prior, this\n", "process is known as marginalisation, $$\n", "p(\\dataVector|\\inputMatrix) = \\int p(\\dataVector|\\mappingVector,\\inputMatrix)p(\\mappingVector) \\text{d}\\mappingVector\n", "$$ I like the term marginalisation, and the description of the\n", "probability as the *marginal likelihood*, because (for me) it somewhat\n", "has the implication that the variable name has been removed, and\n", "(perhaps) written in the margin. Marginalisation of a variable goes from\n", "a likelihood where the variable is in place, to a new likelihood where\n", "all possible values of that variable (under the prior) have been\n", "considered and weighted in the integral. This implies that all we need\n", "for specifying our model is to define the likelihood and the prior. We\n", "already have our likelihood from our earlier discussion, so our focus\n", "now turns to the prior density.\n", "\n", "## Prior Distribution \\[edit\\]\n", "\n", "The tradition in Bayesian inference is to place a probability density\n", "over the parameters of interest in your model. This choice is made\n", "regardless of whether you generally believe those parameters to be\n", "stochastic or deterministic in origin. In other words, to a Bayesian,\n", "the modelling treatment does not differentiate between epistemic and\n", "aleatoric uncertainty. 
For linear regression we could consider the\n", "following Gaussian prior on the intercept parameter,\n", "$$c \\sim \\gaussianSamp{0}{\\alpha_1}$$ where $\\alpha_1$ is the variance\n", "of the prior distribution, its mean being zero.\n", "\n", "## Posterior Distribution\n", "\n", "The prior distribution is combined with the likelihood of the data given\n", "the parameters $p(\\dataScalar|c)$ to give the posterior via *Bayes'\n", "rule*, $$\n", " p(c|\\dataScalar) = \\frac{p(\\dataScalar|c)p(c)}{p(\\dataScalar)}\n", " $$ where $p(\\dataScalar)$ is the marginal probability of the data,\n", "obtained through integration over the joint density,\n", "$p(\\dataScalar, c)=p(\\dataScalar|c)p(c)$. Overall the equation can be\n", "summarized as, $$\n", " \\text{posterior} = \\frac{\\text{likelihood}\\times \\text{prior}}{\\text{marginal likelihood}}.\n", " $$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from ipywidgets import IntSlider\n", "import pods" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pods.notebook.display_plots('dem_gaussian{stage:0>2}.svg', \n", " diagrams='../slides/diagrams/ml', \n", " stage=IntSlider(1, 1, 3, 1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: Combining a Gaussian likelihood with a Gaussian prior to form\n", "a Gaussian posterior\n", "\n", "Another way of seeing what's going on is to note that the numerator of\n", "Bayes' rule merely multiplies the likelihood by the prior. The\n", "denominator, is not a function of $c$. So the functional form is\n", "entirely determined by the multiplication of prior and likelihood. This\n", "has the effect of ensuring that the posterior only has probability mass\n", "in regions where both the prior and the likelihood have probability\n", "mass.\n", "\n", "The marginal likelihood, $p(\\dataScalar)$, operates to ensure that the\n", "distribution is normalised.\n", "\n", "For the Gaussian case, the normalisation of the posterior can be\n", "performed analytically. This is because both the prior and the\n", "likelihood have the form of an *exponentiated quadratic*, $$\n", "\\exp(a^2)\\exp(b^2) = \\exp(a^2 + b^2),\n", "$$ and the properties of the exponential mean that the product of two\n", "exponentiated quadratics is also an exponentiated quadratic. 
That\n", "implies that the posterior is also Gaussian, because a normalized\n", "exponentiated quadratic is a Gaussian distribution.[^1]\n", "\n", "$$p(c) = \\frac{1}{\\sqrt{2\\pi\\alpha_1}} \\exp\\left(-\\frac{1}{2\\alpha_1}c^2\\right)$$\n", "$$p(\\dataVector|\\inputVector, c, m, \\dataStd^2) = \\frac{1}{\\left(2\\pi\\dataStd^2\\right)^{\\frac{\\numData}{2}}} \\exp\\left(-\\frac{1}{2\\dataStd^2}\\sum_{i=1}^\\numData(\\dataScalar_i - m\\inputScalar_i - c)^2\\right)$$\n", "\n", "$$p(c| \\dataVector, \\inputVector, m, \\dataStd^2) = \\frac{p(\\dataVector|\\inputVector, c, m, \\dataStd^2)p(c)}{p(\\dataVector|\\inputVector, m, \\dataStd^2)}$$\n", "\n", "$$p(c| \\dataVector, \\inputVector, m, \\dataStd^2) = \\frac{p(\\dataVector|\\inputVector, c, m, \\dataStd^2)p(c)}{\\int p(\\dataVector|\\inputVector, c, m, \\dataStd^2)p(c) \\text{d} c}$$\n", "\n", "$$p(c| \\dataVector, \\inputVector, m, \\dataStd^2) \\propto p(\\dataVector|\\inputVector, c, m, \\dataStd^2)p(c)$$\n", "\n", "$$\\begin{aligned}\n", " \\log p(c | \\dataVector, \\inputVector, m, \\dataStd^2) =&-\\frac{1}{2\\dataStd^2} \\sum_{i=1}^\\numData(\\dataScalar_i-c - m\\inputScalar_i)^2-\\frac{1}{2\\alpha_1} c^2 + \\text{const}\\\\\n", " = &-\\frac{1}{2\\dataStd^2}\\sum_{i=1}^\\numData(\\dataScalar_i-m\\inputScalar_i)^2 -\\left(\\frac{\\numData}{2\\dataStd^2} + \\frac{1}{2\\alpha_1}\\right)c^2\\\\\n", " & + c\\frac{\\sum_{i=1}^\\numData(\\dataScalar_i-m\\inputScalar_i)}{\\dataStd^2},\n", " \\end{aligned}$$\n", "\n", "complete the square of the quadratic form to obtain\n", "$$\\log p(c | \\dataVector, \\inputVector, m, \\dataStd^2) = -\\frac{1}{2\\tau^2}(c - \\mu)^2 +\\text{const},$$\n", "where $\\tau^2 = \\left(\\numData\\dataStd^{-2} +\\alpha_1^{-1}\\right)^{-1}$\n", "and\n", "$\\mu = \\frac{\\tau^2}{\\dataStd^2} \\sum_{i=1}^\\numData(\\dataScalar_i-m\\inputScalar_i)$.\n", "\n", "## The Joint Density \\[edit\\]\n", "\n", "- Really want to know the *joint* posterior density over the\n", " parameters $c$ *and* $m$.\n", "- Could now integrate out over $m$, but it's easier to consider the\n", " multivariate case.\n", "\n", "## Two Dimensional Gaussian \\[edit\\]\n", "\n", "Consider the distribution of height (in meters) of an adult male human\n", "population. We will approximate the marginal density of heights as a\n", "Gaussian density with mean given by $1.7\\text{m}$ and a standard\n", "deviation of $0.15\\text{m}$, implying a variance of $\\dataStd^2=0.0225$,\n", "$$\n", " p(h) \\sim \\gaussianSamp{1.7}{0.0225}.\n", " $$ Similarly, we assume that weights of the population are distributed\n", "a Gaussian density with a mean of $75 \\text{kg}$ and a standard\n", "deviation of $6 kg$ (implying a variance of 36), $$\n", " p(w) \\sim \\gaussianSamp{75}{36}.\n", " $$\n", "\n", "\n", "\n", "Figure: Gaussian distributions for height and weight.\n", "\n", "## Independence Assumption\n", "\n", "First of all, we make an independence assumption, we assume that height\n", "and weight are independent. The definition of probabilistic independence\n", "is that the joint density, $p(w, h)$, factorizes into its marginal\n", "densities, $$\n", " p(w, h) = p(w)p(h).\n", " $$ Given this assumption we can sample from the joint distribution by\n", "independently sampling weights and heights." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pods\n", "from ipywidgets import IntSlider" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pods.notebook.display_plots('independent_height_weight{fig:0>3}.svg', \n", " directory='../slides/diagrams/ml', \n", " fig=IntSlider(0, 0, 7, 1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: Samples from independent Gaussian variables that might\n", "represent heights and weights.\n", "\n", "In reality height and weight are *not* independent. Taller people tend\n", "on average to be heavier, and heavier people are likely to be taller.\n", "This is reflected by the *body mass index*. A ratio suggested by one of\n", "the fathers of statistics, Adolphe Quetelet. Quetelet was interested in\n", "the notion of the *average man* and collected various statistics about\n", "people. He defined the BMI to be, $$\n", "\\text{BMI} = \\frac{w}{h^2}\n", "$$To deal with this dependence we now introduce the notion of\n", "*correlation* to the multivariate Gaussian density.\n", "\n", "## Sampling Two Dimensional Variables \\[edit\\]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pods\n", "from ipywidgets import IntSlider" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "pods.notebook.display_plots('correlated_height_weight{fig:0>3}.svg', \n", " directory='../slides/diagrams/ml', \n", " fig=IntSlider(0, 0, 7, 1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: Samples from *correlated* Gaussian variables that might\n", "represent heights and weights.\n", "\n", "## Independent Gaussians \\[edit\\]\n", "\n", "$$\n", "p(w, h) = p(w)p(h)\n", "$$\n", "\n", "$$\n", "p(w, h) = \\frac{1}{\\sqrt{2\\pi \\dataStd_1^2}\\sqrt{2\\pi\\dataStd_2^2}} \\exp\\left(-\\frac{1}{2}\\left(\\frac{(w-\\meanScalar_1)^2}{\\dataStd_1^2} + \\frac{(h-\\meanScalar_2)^2}{\\dataStd_2^2}\\right)\\right)\n", "$$\n", "\n", "$$\n", "p(w, h) = \\frac{1}{\\sqrt{2\\pi\\dataStd_1^22\\pi\\dataStd_2^2}} \\exp\\left(-\\frac{1}{2}\\left(\\begin{bmatrix}w \\\\ h\\end{bmatrix} - \\begin{bmatrix}\\meanScalar_1 \\\\ \\meanScalar_2\\end{bmatrix}\\right)^\\top\\begin{bmatrix}\\dataStd_1^2& 0\\\\0&\\dataStd_2^2\\end{bmatrix}^{-1}\\left(\\begin{bmatrix}w \\\\ h\\end{bmatrix} - \\begin{bmatrix}\\meanScalar_1 \\\\ \\meanScalar_2\\end{bmatrix}\\right)\\right)\n", "$$\n", "\n", "$$\n", "p(\\dataVector) = \\frac{1}{\\det{2\\pi \\mathbf{D}}^{\\frac{1}{2}}} \\exp\\left(-\\frac{1}{2}(\\dataVector - \\meanVector)^\\top\\mathbf{D}^{-1}(\\dataVector - \\meanVector)\\right)\n", "$$\n", "\n", "## Correlated Gaussian\n", "\n", "Form correlated from original by rotating the data space using matrix\n", "$\\rotationMatrix$.\n", "\n", "$$\n", "p(\\dataVector) = \\frac{1}{\\det{2\\pi\\mathbf{D}}^{\\frac{1}{2}}} \\exp\\left(-\\frac{1}{2}(\\dataVector - \\meanVector)^\\top\\mathbf{D}^{-1}(\\dataVector - \\meanVector)\\right)\n", "$$\n", "\n", "$$\n", "p(\\dataVector) = \\frac{1}{\\det{2\\pi\\mathbf{D}}^{\\frac{1}{2}}} \\exp\\left(-\\frac{1}{2}(\\rotationMatrix^\\top\\dataVector - \\rotationMatrix^\\top\\meanVector)^\\top\\mathbf{D}^{-1}(\\rotationMatrix^\\top\\dataVector - \\rotationMatrix^\\top\\meanVector)\\right)\n", "$$\n", "\n", "$$\n", "p(\\dataVector) = \\frac{1}{\\det{2\\pi\\mathbf{D}}^{\\frac{1}{2}}} \\exp\\left(-\\frac{1}{2}(\\dataVector - 
\\meanVector)^\\top\\rotationMatrix\\mathbf{D}^{-1}\\rotationMatrix^\\top(\\dataVector - \\meanVector)\\right)\n", "$$ this gives a covariance matrix: $$\n", "\\covarianceMatrix^{-1} = \\rotationMatrix \\mathbf{D}^{-1} \\rotationMatrix^\\top\n", "$$\n", "\n", "$$\n", "p(\\dataVector) = \\frac{1}{\\det{2\\pi\\covarianceMatrix}^{\\frac{1}{2}}} \\exp\\left(-\\frac{1}{2}(\\dataVector - \\meanVector)^\\top\\covarianceMatrix^{-1} (\\dataVector - \\meanVector)\\right)\n", "$$ this gives a covariance matrix: $$\n", "\\covarianceMatrix = \\rotationMatrix \\mathbf{D} \\rotationMatrix^\\top\n", "$$\n", "\n", "## The Prior Density\n", "\n", "Let's assume that the prior density is given by a zero mean Gaussian,\n", "which is independent across each of the parameters, $$\n", "\\mappingVector \\sim \\gaussianSamp{\\zerosVector}{\\alpha \\eye}\n", "$$ In other words, we are assuming, for the prior, that each element of\n", "the parameters vector, $\\mappingScalar_i$, was drawn from a Gaussian\n", "density as follows $$\n", "\\mappingScalar_i \\sim \\gaussianSamp{0}{\\alpha}\n", "$$ Let's start by assigning the parameter of the prior distribution,\n", "which is the variance of the prior distribution, $\\alpha$." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# set prior variance on w\n", "alpha = 4.\n", "# set the order of the polynomial basis set\n", "order = 5\n", "# set the noise variance\n", "sigma2 = 0.01" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\\addreading{@Bishop:book06}{Multivariate Gaussians: Section 2.3 up to top of pg 85}\n", "\\addreading{@Bishop:book06}{Section 3.3 up to 159 (pg 152–159)}\n", "\\reading\n", "## Generating from the Model \\[edit\\]\n", "\n", "A very important aspect of probabilistic modelling is to *sample* from\n", "your model to see what type of assumptions you are making about your\n", "data. In this case that involves a two stage process.\n", "\n", "1. Sample a candiate parameter vector from the prior.\n", "2. Place the candidate parameter vector in the likelihood and sample\n", " functions conditiond on that candidate vector.\n", "3. Repeat to try and characterise the type of functions you are\n", " generating.\n", "\n", "Given a prior variance (as defined above) we can now sample from the\n", "prior distribution and combine with a basis set to see what assumptions\n", "we are making about the functions *a priori* (i.e. before we've seen the\n", "data). Firstly we compute the basis function matrix. We will do it both\n", "for our training data, and for a range of prediction locations\n", "(`x_pred`)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pods" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = pods.datasets.olympic_marathon_men()\n", "x = data['X']\n", "y = data['Y']\n", "num_data = x.shape[0]\n", "num_pred_data = 100 # how many points to use for plotting predictions\n", "x_pred = np.linspace(1890, 2016, num_pred_data)[:, None] # input locations for predictions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "now let's build the basis matrices. We define the polynomial basis as\n", "follows." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def polynomial(x, num_basis=2, loc=0., scale=1.):\n", " degree=num_basis-1\n", " degrees = np.arange(degree+1)\n", " return ((x-loc)/scale)**degrees" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import mlai" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "loc=1950\n", "scale=1\n", "degree=4\n", "basis = mlai.Basis(polynomial, number=degree+1, loc=loc, scale=scale)\n", "Phi_pred = basis.Phi(x_pred)\n", "Phi = basis.Phi(x)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Sampling from the Prior\n", "\n", "Now we will sample from the prior to produce a vector $\\mappingVector$\n", "and use it to plot a function which is representative of our belief\n", "*before* we fit the data. To do this we are going to use the properties\n", "of the Gaussian density and a sample from a *standard normal* using the\n", "function `np.random.normal`.\n", "\n", "## Scaling Gaussian-distributed Variables\n", "\n", "First, let's consider the case where we have one data point and one\n", "feature in our basis set. In otherwords $\\mappingFunctionVector$ would\n", "be a scalar, $\\mappingVector$ would be a scalar and $\\basisMatrix$ would\n", "be a scalar. In this case we have $$\n", "\\mappingFunction = \\basisScalar \\mappingScalar\n", "$$ If $\\mappingScalar$ is drawn from a normal density, $$\n", "\\mappingScalar \\sim \\gaussianSamp{\\meanScalar_\\mappingScalar}{c_\\mappingScalar}\n", "$$ and $\\basisScalar$ is a scalar value which we are given, then\n", "properties of the Gaussian density tell us that $$\n", "\\basisScalar \\mappingScalar \\sim \\gaussianSamp{\\basisScalar\\meanScalar_\\mappingScalar}{\\basisScalar^2c_\\mappingScalar}\n", "$$ Let's test this out numerically. First we will draw 200 samples from\n", "a standard normal," ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "w_vec = np.random.normal(size=200)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can compute the mean of these samples and their variance" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print('w sample mean is ', w_vec.mean())\n", "print('w sample variance is ', w_vec.var())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These are close to zero (the mean) and one (the variance) as you'd\n", "expect. Now compute the mean and variance of the scaled version," ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "phi = 7\n", "f_vec = phi*w_vec\n", "print('True mean should be phi*0 = 0.')\n", "print('True variance should be phi*phi*1 = ', phi*phi)\n", "print('f sample mean is ', f_vec.mean())\n", "print('f sample variance is ', f_vec.var())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you increase the number of samples then you will see that the sample\n", "mean and the sample variance begin to converge towards the true mean and\n", "the true variance. Obviously adding an offset to a sample from\n", "`np.random.normal` will change the mean. 
So if you want to sample from a\n", "Gaussian with mean `mu` and standard deviation `sigma` one way of doing\n", "it is to sample from the standard normal and scale and shift the result,\n", "so to sample a set of $\\mappingScalar$ from a Gaussian with mean\n", "$\\meanScalar$ and variance $\\alpha$,\n", "$$\\mappingScalar \\sim \\gaussianSamp{\\meanScalar}{\\alpha}$$ We can simply\n", "scale and offset samples from the *standard normal*." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mu = 4 # mean of the distribution\n", "alpha = 2 # variance of the distribution\n", "w_vec = np.random.normal(size=200)*np.sqrt(alpha) + mu\n", "print('w sample mean is ', w_vec.mean())\n", "print('w sample variance is ', w_vec.var())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here the `np.sqrt` is necesssary because we need to multiply by the\n", "standard deviation and we specified the variance as `alpha`. So scaling\n", "and offsetting a Gaussian distributed variable keeps the variable\n", "Gaussian, but it effects the mean and variance of the resulting\n", "variable.\n", "\n", "To get an idea of the overall shape of the resulting distribution, let's\n", "do the same thing with a histogram of the results." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now re-run this histogram with 100,000 samples and check that the both\n", "histograms look qualitatively Gaussian.\n", "\n", "## Sampling from the Prior\n", "\n", "Let's use this way of constructing samples from a Gaussian to check what\n", "functions look like *a priori*. The process will be as follows. First,\n", "we sample a random vector $K$ dimensional from `np.random.normal`. Then\n", "we scale it by $\\sqrt{\\alpha}$ to obtain a prior sample of\n", "$\\mappingVector$." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "K = degree + 1\n", "z_vec = np.random.normal(size=K)\n", "w_sample = z_vec*np.sqrt(alpha)\n", "print(w_sample)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can combine our sample from the prior with the basis functions to\n", "create a function,\n", "\n", "This shows the recurring problem with the polynomial basis (note the\n", "scale on the left hand side!). Our prior allows relatively large\n", "coefficients for the basis associated with high polynomial degrees.\n", "Because we are operating with input values of around 2000, this leads to\n", "output functions of very high values. The fix we have used for this\n", "before is to rescale our data before we apply the polynomial basis to\n", "it. Above, we set the scale of the basis to 1. Here let's set it to 100\n", "and try again." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "scale = 100.\n", "basis = mlai.Basis(polynomial, number=degree+1, loc=loc, scale=scale)\n", "Phi_pred = basis.Phi(x_pred)\n", "Phi = basis.Phi(x)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we need to recompute the basis functions from above,\n", "\n", "Now let's loop through some samples and plot various functions as\n", "samples from this system,\n", "\n", "The predictions for the mean output can now be computed. We want the\n", "expected value of the predictions under the posterior distribution. 
In\n", "matrix form, the predictions can be computed as $$\n", "\\mappingFunctionVector = \\basisMatrix \\mappingVector.\n", "$$ This involves a matrix multiplication between a fixed matrix\n", "$\\basisMatrix$ and a vector that is drawn from a distribution\n", "$\\mappingVector$. Because $\\mappingVector$ is drawn from a distribution,\n", "this imples that $\\mappingFunctionVector$ should also be drawn from a\n", "distribution. There are two distributions we are interested in though.\n", "We have just been sampling from the *prior* distribution to see what\n", "sort of functions we get *before* looking at the data. In Bayesian\n", "inference, we need to computer the *posterior* distribution and sample\n", "from that density.\n", "\n", "## Computing the Posterior \\[edit\\]\n", "\n", "We will now attampt to compute the *posterior distribution*. In the\n", "lecture we went through the maths that allows us to compute the\n", "posterior distribution for $\\mappingVector$. This distribution is also\n", "Gaussian, $$\n", "p(\\mappingVector | \\dataVector, \\inputVector, \\dataStd^2) = \\gaussianDist{\\mappingVector}{\\meanVector_\\mappingScalar}{\\covarianceMatrix_\\mappingScalar}\n", "$$ with covariance, $\\covarianceMatrix_\\mappingScalar$, given by $$\n", "\\covarianceMatrix_\\mappingScalar = \\left(\\dataStd^{-2}\\basisMatrix^\\top \\basisMatrix + \\alpha^{-1}\\eye\\right)^{-1}\n", "$$ whilst the mean is given by $$\n", "\\meanVector_\\mappingScalar = \\covarianceMatrix_\\mappingScalar \\dataStd^{-2}\\basisMatrix^\\top \\dataVector\n", "$$ Let's compute the posterior covariance and mean, then we'll sample\n", "from these densities to have a look at the posterior belief about\n", "$\\mappingVector$ once the data has been accounted for. Remember, the\n", "process of Bayesian inference involves combining the prior,\n", "$p(\\mappingVector)$ with the likelihood,\n", "$p(\\dataVector|\\inputVector, \\mappingVector)$ to form the posterior,\n", "$p(\\mappingVector | \\dataVector, \\inputVector)$ through Bayes' rule, $$\n", "p(\\mappingVector|\\dataVector, \\inputVector) = \\frac{p(\\dataVector|\\inputVector, \\mappingVector)p(\\mappingVector)}{p(\\dataVector)}\n", "$$ We've looked at the samples for our function\n", "$\\mappingFunctionVector = \\basisMatrix\\mappingVector$, which forms the\n", "mean of the Gaussian likelihood, under the prior distribution. I.e.\n", "we've sampled from $p(\\mappingVector)$ and multiplied the result by the\n", "basis matrix. Now we will sample from the posterior density,\n", "$p(\\mappingVector|\\dataVector, \\inputVector)$, and check that the new\n", "samples fit do correspond to the data, i.e. we want to check that the\n", "updated distribution includes information from the data set. First we\n", "need to compute the posterior mean and *covariance*.\n", "\n", "## Bayesian Inference in the Univariate Case\n", "\n", "This video talks about Bayesian inference across the single parameter,\n", "the offset $c$, illustrating how the prior and the likelihood combine in\n", "one dimension to form a posterior." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from IPython.lib.display import YouTubeVideo\n", "YouTubeVideo('AvlnFnvFw_0')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Figure: Univariate Bayesian inference. 
Lecture 10 from 2012 MLAI\n", "Course.\n", "\n", "## Multivariate Bayesian Inference\n", "\n", "This section of the lecture talks about how we extend the idea of\n", "Bayesian inference for the multivariate case. It goes through the\n", "multivariate Gaussian and how to complete the square in the linear\n", "algebra as we managed below." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from IPython.lib.display import YouTubeVideo\n", "YouTubeVideo('Os1iqgpelPw')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Figure: Multivariate Bayesian inference. Lecture 11 from 2012 MLAI\n", "course.\n", "\n", "The lecture informs us the the posterior density for $\\mappingVector$ is\n", "given by a Gaussian density with covariance $$\n", "\\covarianceMatrix_w = \\left(\\dataStd^{-2}\\basisMatrix^\\top \\basisMatrix + \\alpha^{-1}\\eye\\right)^{-1}\n", "$$ and mean $$\n", "\\meanVector_w = \\covarianceMatrix_w\\dataStd^{-2}\\basisMatrix^\\top \\dataVector.\n", "$$\n", "\n", "### Question 1\n", "\n", "Compute the covariance for $\\mappingVector$ given the training data,\n", "call the resulting variable `w_cov`. Compute the mean for\n", "$\\mappingVector$ given the training data. Call the resulting variable\n", "`w_mean`. Assume that $\\dataStd^2 = 0.01$\n", "\n", "*10 marks*" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Write code for your answer to Question 1 in this box\n", "# provide the answers so that the code runs correctly otherwise you will loose marks!\n", "\n", "sigma2 = \n", "w_cov = \n", "w_mean = \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Olympic Marathon Data\n", "\n", "\n", "\n", "\n", "\n", "\n", "
\n", "- Gold medal times for Olympic Marathon since 1896.\n", "- Marathons before 1924 didn't have a standardised distance.\n", "- Present results using pace per km.\n", "- In 1904 Marathon was badly organised leading to very slow times.\n", "\n", "\n", "\n", "Image from Wikimedia Commons \n", "
\n", "The first thing we will do is load a standard data set for regression\n", "modelling. The data consists of the pace of Olympic Gold Medal Marathon\n", "winners for the Olympics from 1896 to present. First we load in the data\n", "and plot." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pods" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = pods.datasets.olympic_marathon_men()\n", "x = data['X']\n", "y = data['Y']\n", "\n", "offset = y.mean()\n", "scale = np.sqrt(y.var())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import teaching_plots as plot\n", "import mlai" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "xlim = (1875,2030)\n", "ylim = (2.5, 6.5)\n", "yhat = (y-offset)/scale\n", "\n", "fig, ax = plt.subplots(figsize=plot.big_wide_figsize)\n", "_ = ax.plot(x, y, 'r.',markersize=10)\n", "ax.set_xlabel('year', fontsize=20)\n", "ax.set_ylabel('pace min/km', fontsize=20)\n", "ax.set_xlim(xlim)\n", "ax.set_ylim(ylim)\n", "\n", "mlai.write_figure(figure=fig, \n", " filename='../slides/diagrams/datasets/olympic-marathon.svg', \n", " transparent=True, \n", " frameon=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: Olympic marathon pace times since 1892.\n", "\n", "Things to notice about the data include the outlier in 1904, in this\n", "year, the olympics was in St Louis, USA. Organizational problems and\n", "challenges with dust kicked up by the cars following the race meant that\n", "participants got lost, and only very few participants completed.\n", "\n", "More recent years see more consistently quick marathons.\n", "\n", "## Olympic Data with Bayesian Polynomials \\[edit\\]\n", "\n", "Five fold cross validation tests the ability of the model to\n", "*interpolate*." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import mlai\n", "import pods" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pods\n", "from ipywidgets import IntSlider" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pods.notebook.display_plots('olympic_BLM_polynomial_number{num_basis:0>3}.svg', \n", " directory='../slides/diagrams/ml/', \n", " num_basis=IntSlider(1, 1, 27, 1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: Bayesian fit with 26th degree polynomial and negative\n", "marginal log likelihood.\n", "\n", "## Hold Out Validation\n", "\n", "For the polynomial fit, we will now look at *hold out* validation, where\n", "we are holding out some of the most recent points. This tests the abilit\n", "of our model to *extrapolate*." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pods\n", "from ipywidgets import IntSlider" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pods.notebook.display_plots('olympic_val_BLM_polynomial_number{num_basis:0>3}.svg', \n", " directory='../slides/diagrams/ml', \n", " num_basis=IntSlider(1, 1, 27, 1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: Bayesian fit with 26th degree polynomial and hold out\n", "validation scores.\n", "\n", "## 5-fold Cross Validation\n", "\n", "Five fold cross validation tests the ability of the model to\n", "*interpolate*." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pods\n", "from ipywidgets import IntSlider" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pods.notebook.display_plots('olympic_5cv{part:0>2}_BLM_polynomial_number{num_basis:0>3}.svg', \n", " directory='../slides/diagrams/ml', \n", " part=(0, 5), \n", " num_basis=IntSlider(1, 1, 27, 1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: Bayesian fit with 26th degree polynomial and five fold cross\n", "validation scores.\n", "\n", "## Sampling from the Posterior \\[edit\\]\n", "\n", "Before we were able to sample the prior values for the mean\n", "*independently* from a Gaussian using `np.random.normal` and scaling the\n", "result. However, observing the data *correlates* the parameters. Recall\n", "this from the first lab where we had a correlation between the offset,\n", "$c$ and the slope $m$ which caused such problems with the coordinate\n", "ascent algorithm. We need to sample from a *correlated* Gaussian. For\n", "this we can use `np.random.multivariate_normal`.\n", "\n", "Now let's sample several functions and plot them all to see how the\n", "predictions fluctuate.\n", "\n", "This gives us an idea of what our predictions are. These are the\n", "predictions that are consistent with data and our prior. Try plotting\n", "different numbers of predictions. You can also try plotting beyond the\n", "range of where the data is and see what the functions do there.\n", "\n", "Rather than sampling from the posterior each time to compute our\n", "predictions, it might be better if we just summarised the predictions by\n", "the expected value of the output funciton, $\\mappingFunction(x)$, for\n", "any particular input. If we can get formulae for this we don't need to\n", "sample the values of $\\mappingFunction(x)$ we might be able to compute\n", "the distribution directly. Fortunately, in the Gaussian case, we can use\n", "properties of multivariate Gaussians to compute both the mean and the\n", "variance of these samples.\n", "\n", "## Properties of Gaussian Variables\n", "\n", "Gaussian variables have very particular properties, that many other\n", "densities don't exhibit. Perhaps foremost amoungst them is that the sum\n", "of any Gaussian distributed set of random variables also turns out to be\n", "Gaussian distributed. 
"\n", "Rather than sampling from the posterior each time to compute our\n", "predictions, it might be better if we just summarised the predictions by\n", "the expected value of the output function, $\mappingFunction(x)$, for\n", "any particular input. If we can derive formulae for this, we don't need to\n", "sample the values of $\mappingFunction(x)$; we might be able to compute\n", "the distribution directly. Fortunately, in the Gaussian case, we can use\n", "properties of multivariate Gaussians to compute both the mean and the\n", "variance of these samples.\n", "\n", "## Properties of Gaussian Variables\n", "\n", "Gaussian variables have very particular properties that many other\n", "densities don't exhibit. Perhaps foremost amongst them is that the sum\n", "of any Gaussian distributed set of random variables also turns out to be\n", "Gaussian distributed. This property is much rarer than you might expect.\n", "\n", "## Sum of Gaussian-distributed Variables\n", "\n", "The sum of Gaussian random variables is also Gaussian, so if we have a\n", "random variable $\dataScalar_i$ drawn from a Gaussian density with mean\n", "$\meanScalar_i$ and variance $\dataStd^2_i$, $$\n", "\dataScalar_i \sim \gaussianSamp{\meanScalar_i}{\dataStd^2_i},\n", "$$ then the sum of $K$ independently sampled values of $\dataScalar_i$\n", "will be drawn from a Gaussian with mean $\sum_{i=1}^K \meanScalar_i$ and\n", "variance $\sum_{i=1}^K \dataStd_i^2$, $$\n", "\sum_{i=1}^K \dataScalar_i \sim \gaussianSamp{\sum_{i=1}^K \meanScalar_i}{\sum_{i=1}^K \dataStd_i^2}.\n", "$$ Let's try that experimentally. First let's generate a vector of\n", "samples from a standard normal distribution,\n", "$z \sim \gaussianSamp{0}{1}$, then we will scale and offset them, then\n", "keep adding them into a vector `y_vec`.\n", "\n", "## Sampling from Gaussians and Summing Up" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "K = 10 # how many Gaussians to add.\n", "num_samples = 1000 # how many samples to have in y_vec\n", "mus = np.linspace(0, 5, K) # mean values generated linearly spaced between 0 and 5\n", "sigmas = np.linspace(0.5, 2, K) # sigmas generated linearly spaced between 0.5 and 2\n", "y_vec = np.zeros(num_samples)\n", "for mu, sigma in zip(mus, sigmas):\n", "    z_vec = np.random.normal(size=num_samples) # z is from standard normal\n", "    y_vec += z_vec*sigma + mu # add z*sigma + mu to y\n", "\n", "# now y_vec is the sum of each scaled and offset z.\n", "print('Sample mean is ', y_vec.mean(), ' and sample variance is ', y_vec.var())\n", "print('True mean should be ', mus.sum())\n", "print('True variance should be ', (sigmas**2).sum(), ' standard deviation ', np.sqrt((sigmas**2).sum())) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Of course, we can histogram `y_vec` as well.\n",
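"For example, a quick sketch (assuming `y_vec` from the cell above and `matplotlib` imported as before):\n", "\n", "```python\n", "fig, ax = plt.subplots(figsize=plot.big_wide_figsize)\n", "ax.hist(y_vec, bins=30)  # distribution of the summed samples\n", "ax.set_xlabel('y', fontsize=20)\n", "ax.set_ylabel('count', fontsize=20)\n", "```\n",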
"\n", "## Matrix Multiplication of Gaussian Variables\n", "\n", "We are interested in what our model is saying about the sort of\n", "functions we are observing. The fact that summing of Gaussian variables\n", "leads to new Gaussian variables, and scaling of Gaussian variables\n", "*also* leads to Gaussian variables, means that matrix multiplication\n", "(which is just a series of sums and scales) also leads to Gaussian\n", "densities. Matrix multiplication is just adding and scaling together. In\n", "the formula $\mappingFunctionVector = \basisMatrix \mappingVector$ we\n", "can extract the $i$th element of $\mappingFunctionVector$ as $$\n", "\mappingFunction_i = \basisVector_i^\top \mappingVector,\n", "$$ where $\basisVector_i$ is a column vector formed from the $i$th row of\n", "$\basisMatrix$ and $\mappingFunction_i$ is the $i$th element of\n", "$\mappingFunctionVector$. This vector inner product itself merely implies\n", "that $$\n", "\mappingFunction_i = \sum_{j=1}^K \mappingScalar_j \basisScalar_{i, j}\n", "$$ and if we now say that $\mappingScalar_i$ is Gaussian distributed,\n", "then because a scaled Gaussian is also Gaussian, and because a sum of\n", "Gaussians is also Gaussian, we know that $\mappingFunction_i$ is also\n", "Gaussian distributed. It merely remains to work out its mean and\n", "covariance. We can do this by looking at the expectation under a\n", "Gaussian distribution. The expectation of the mean vector is given by $$\n", "\expDist{\mappingFunctionVector}{\gaussianDist{\mappingVector}{\meanVector}{\covarianceMatrix}} = \int \mappingFunctionVector\n", "\gaussianDist{\mappingVector}{\meanVector}{\covarianceMatrix}\n", "\text{d}\mappingVector = \int \basisMatrix\mappingVector\n", "\gaussianDist{\mappingVector}{\meanVector}{\covarianceMatrix}\n", "\text{d}\mappingVector = \basisMatrix \int \mappingVector\n", "\gaussianDist{\mappingVector}{\meanVector}{\covarianceMatrix}\n", "\text{d}\mappingVector = \basisMatrix \meanVector\n", "$$\n", "\n", "This is straightforward. The expectation of\n", "$\mappingFunctionVector=\basisMatrix\mappingVector$ under the Gaussian\n", "distribution for $\mappingVector$ is simply\n", "$\basisMatrix\meanVector$, where $\meanVector$ is\n", "the *mean* of the Gaussian density for $\mappingVector$. Because our\n", "prior distribution was Gaussian with zero mean, the expectation under\n", "the prior is given by $$\n", "\expDist{\mappingFunctionVector}{\gaussianDist{\mappingVector}{\zerosVector}{\alpha\eye}} = \zerosVector\n", "$$\n", "\n", "The covariance is a little more complicated. A covariance matrix is\n", "defined as $$\n", "\text{cov}\left(\mappingFunctionVector\right)_{\gaussianDist{\mappingVector}{\meanVector}{\covarianceMatrix}}\n", "= \expDist{\mappingFunctionVector\mappingFunctionVector^\top}{\gaussianDist{\mappingVector}{\meanVector}{\covarianceMatrix}}\n", "- \expDist{\mappingFunctionVector}{\gaussianDist{\mappingVector}{\meanVector}{\covarianceMatrix}}\expDist{\mappingFunctionVector}{\gaussianDist{\mappingVector}{\meanVector}{\covarianceMatrix}}^\top.\n", "$$ We've already computed\n", "$\expDist{\mappingFunctionVector}{\gaussianDist{\mappingVector}{\meanVector}{\covarianceMatrix}}=\basisMatrix \meanVector$\n", "so we can substitute that in to recover $$\n", "\text{cov}\left(\mappingFunctionVector\right)_{\gaussianDist{\mappingVector}{\meanVector}{\covarianceMatrix}} = \expDist{\mappingFunctionVector\mappingFunctionVector^\top}{\gaussianDist{\mappingVector}{\meanVector}{\covarianceMatrix}} - \basisMatrix \meanVector \meanVector^\top \basisMatrix^\top\n", "$$\n", "\n", "So we need the expectation of\n", "$\mappingFunctionVector\mappingFunctionVector^\top$. ",
"Substituting in\n", "$\mappingFunctionVector = \basisMatrix \mappingVector$ we have $$\n", "\text{cov}\left(\mappingFunctionVector\right)_{\gaussianDist{\mappingVector}{\meanVector}{\covarianceMatrix}} = \expDist{\basisMatrix\mappingVector\mappingVector^\top \basisMatrix^\top}{\gaussianDist{\mappingVector}{\meanVector}{\covarianceMatrix}} - \basisMatrix \meanVector \meanVector^\top \basisMatrix^\top\n", "$$ $$\n", "\text{cov}\left(\mappingFunctionVector\right)_{\gaussianDist{\mappingVector}{\meanVector}{\covarianceMatrix}} = \basisMatrix\expDist{\mappingVector\mappingVector^\top}{\gaussianDist{\mappingVector}{\meanVector}{\covarianceMatrix}} \basisMatrix^\top - \basisMatrix \meanVector \meanVector^\top\basisMatrix^\top,\n", "$$ which depends on the second moment of the Gaussian, $$\n", "\expDist{\mappingVector\mappingVector^\top}{\gaussianDist{\mappingVector}{\meanVector}{\covarianceMatrix}} = \covarianceMatrix + \meanVector\meanVector^\top,\n", "$$ that can be substituted in to recover $$\n", "\text{cov}\left(\mappingFunctionVector\right)_{\gaussianDist{\mappingVector}{\meanVector}{\covarianceMatrix}} = \basisMatrix\covarianceMatrix \basisMatrix^\top,\n", "$$ so in the case of the prior distribution, where we have\n", "$\covarianceMatrix = \alpha \eye$, we can write $$\n", "\text{cov}\left(\mappingFunctionVector\right)_{\gaussianDist{\mappingVector}{\zerosVector}{\alpha \eye}} = \alpha \basisMatrix \basisMatrix^\top\n", "$$\n", "\n", "This implies that the prior we have suggested for $\mappingVector$,\n", "which is Gaussian with a mean of zero and a covariance of $\alpha \eye$,\n", "leads to a distribution for $\mappingFunctionVector$ that is also Gaussian,\n", "with a mean of zero and a covariance of\n", "$\alpha \basisMatrix\basisMatrix^\top$. Since our observed output,\n", "$\dataVector$, is given by a noise corrupted variation of\n", "$\mappingFunctionVector$, the final distribution for $\dataVector$ is\n", "given as $$\n", "\dataVector = \mappingFunctionVector + \noiseVector\n", "$$ where the noise, $\noiseVector$, is sampled from a Gaussian density:\n", "$\noiseVector \sim \gaussianSamp{\zerosVector}{\dataStd^2\eye}$. So, in\n", "other words, we are taking a Gaussian distributed random value\n", "$\mappingFunctionVector$, $$\n", "\mappingFunctionVector \sim \gaussianSamp{\zerosVector}{\alpha\basisMatrix\basisMatrix^\top}\n", "$$ and adding to it another Gaussian distributed value,\n", "$\noiseVector \sim \gaussianSamp{\zerosVector}{\dataStd^2\eye}$, to form\n", "our data observations, $\dataVector$. Once again the sum of two\n", "(multivariate) Gaussian distributed variables is also Gaussian, with a\n", "mean given by the sum of the means (both zero in this case) and the\n", "covariance given by the sum of the covariances. So we now have that the\n", "marginal likelihood for the data, $p(\dataVector)$, is given by $$\n", "p(\dataVector) = \gaussianDist{\dataVector}{\zerosVector}{\alpha \basisMatrix \basisMatrix^\top + \dataStd^2\eye}.\n", "$$ This is our *implicit* assumption for $\dataVector$ given our prior\n", "assumption for $\mappingVector$.\n",
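"\n", "As a small illustration of this implicit assumption (a sketch, assuming `Phi`, `alpha` and `sigma2` as defined above), we can construct this covariance and draw sample data vectors from it:\n", "\n", "```python\n", "# covariance of the implied marginal distribution for y\n", "K_marginal = alpha*Phi@Phi.T + sigma2*np.eye(Phi.shape[0])\n", "# draw one sample data vector from that marginal distribution\n", "y_sample = np.random.multivariate_normal(np.zeros(Phi.shape[0]), K_marginal)\n", "```\n",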
"\n", "## Marginal Likelihood \[edit\]\n", "\n", "- The marginal likelihood can also be computed; it has the form: $$\n", " p(\dataVector|\inputMatrix, \dataStd^2, \alpha) = \frac{1}{(2\pi)^\frac{n}{2}\left|\kernelMatrix\right|^\frac{1}{2}} \exp\left(-\frac{1}{2} \dataVector^\top \kernelMatrix^{-1} \dataVector\right)\n", " $$ where\n", " $\kernelMatrix = \alpha \basisMatrix\basisMatrix^\top + \dataStd^2 \eye$.\n", "\n", "- So it is a zero mean $\numData$-dimensional Gaussian with covariance\n", " matrix $\kernelMatrix$.\n", "\n", "## Computing the Mean and Error Bars of the Functions \[edit\]\n", "\n", "These ideas together now allow us to compute the mean and error bars of\n", "the predictions. The mean prediction, before corruption by noise, is\n", "given by $$\n", "\mappingFunctionVector = \basisMatrix\mappingVector\n", "$$ in matrix form. This gives you enough information to compute the\n", "predictive mean.\n", "\n", "### Question 2\n", "\n", "Compute the predictive mean for the function at all the values of the\n", "basis function given by `Phi_pred`. Call the vector of predictions\n", "`f_pred_mean`. Plot the predictions alongside the data.\n", "We can also compute what the training error was. Use the output from\n", "your model to compute the predictive mean, and then compute the sum of\n", "squares error of that predictive mean, $$\n", "E = \sum_{i=1}^\numData (\dataScalar_i - \langle \mappingFunction_i\rangle)^2,\n", "$$ where $\langle \mappingFunction_i\rangle$ is the expected output of\n", "the model at point $\inputScalar_i$.\n", "\n", "*15 marks*" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Write code for your answer to Question 2 in this box\n", "# provide the answers so that the code runs correctly, otherwise you will lose marks!\n", "\n", "# compute mean under posterior density\n", "f_pred_mean = \n", "\n", "# plot the predictions\n", "\n", "# compute mean at the training data and sum of squares error\n", "f_mean = \n", "sum_squares = \n", "print('The error is: ', sum_squares)" ] },
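{ "cell_type": "markdown", "metadata": {}, "source": [ "As a hint, a minimal sketch of the predictive mean computation is given below. It assumes the `w_mean` from Question 1, the training basis matrix `Phi` and targets `y`, and the prediction basis matrix `Phi_pred` with inputs `x_pred` (adapt the names to match your own code)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Minimal sketch for Question 2 (assumes w_mean, Phi, Phi_pred, x_pred, x and y exist).\n", "f_pred_mean = Phi_pred@w_mean  # predictive mean at the prediction points\n", "\n", "# plot the predictions alongside the data\n", "fig, ax = plt.subplots(figsize=plot.big_wide_figsize)\n", "ax.plot(x_pred, f_pred_mean, 'b-')\n", "ax.plot(x, y, 'r.', markersize=10)\n", "\n", "# predictive mean at the training inputs and the sum of squares error\n", "f_mean = Phi@w_mean\n", "sum_squares = ((y - f_mean)**2).sum()\n", "print('The error is: ', sum_squares)" ] },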
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Computing Error Bars\n", "\n", "Finally, we can compute error bars for the predictions. The error bars\n", "are the standard deviations of the predictions for\n", "$\mappingFunctionVector=\basisMatrix\mappingVector$ under the posterior\n", "density for $\mappingVector$. The standard deviations of these\n", "predictions can be found from the variance of the prediction at each\n", "point. Those variances are the diagonal entries of the covariance\n", "matrix. We've already computed the form of the covariance under Gaussian\n", "expectations, $$\n", "\text{cov}\left(\mappingFunctionVector\right)_{\gaussianDist{\mappingVector}{\meanVector}{\covarianceMatrix}} = \basisMatrix\covarianceMatrix \basisMatrix^\top,\n", "$$ which under the posterior density is given by $$\n", "\text{cov}\left(\mappingFunctionVector\right)_{\gaussianDist{\mappingVector}{\meanVector_w}{\covarianceMatrix_w}} = \basisMatrix\covarianceMatrix_w \basisMatrix^\top.\n", "$$\n", "\n", "### Question 3\n", "\n", "The error bars are given by computing the standard deviation of the\n", "predictions, $\mappingFunction$. For a given prediction\n", "$\mappingFunction_i$ the variance is\n", "$\text{var}(\mappingFunction_i) = \langle \mappingFunction_i^2\rangle - \langle \mappingFunction_i \rangle^2$.\n", "This is given by the diagonal element of the covariance of\n", "$\mappingFunctionVector$, $$\n", "\text{var}(\mappingFunction_i) =\n", "\basisVector_{i, :}^\top \covarianceMatrix_w \basisVector_{i, :},\n", "$$ where $\basisVector_{i, :}$ is the basis vector associated with the\n", "input location, $\inputVector_i$.\n", "\n", "Plot the mean function and the error bars for your basis.\n", "\n", "*20 marks*" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Write code for your answer to Question 3 in this box\n", "# provide the answers so that the code runs correctly, otherwise you will lose marks!\n", "\n", "# Compute variance at function values\n", "f_pred_var = \n", "f_pred_std = \n", "\n", "# plot the mean and error bars at 2 standard deviations above and below the mean" ] },
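{ "cell_type": "markdown", "metadata": {}, "source": [ "As a hint, a minimal sketch is given below. It assumes `w_cov` from Question 1 and `Phi_pred`, `x_pred` and `f_pred_mean` from the sketch for Question 2 (adapt the names to match your own code)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Minimal sketch for Question 3 (assumes w_cov, Phi_pred, x_pred, f_pred_mean, x and y exist).\n", "# variance of f at each prediction point: diagonal of Phi_pred w_cov Phi_pred^T\n", "f_pred_var = np.diag(Phi_pred@w_cov@Phi_pred.T)[:, None]\n", "f_pred_std = np.sqrt(f_pred_var)\n", "\n", "# plot the mean and error bars at 2 standard deviations above and below the mean\n", "fig, ax = plt.subplots(figsize=plot.big_wide_figsize)\n", "ax.plot(x_pred, f_pred_mean, 'b-')\n", "ax.plot(x_pred, f_pred_mean + 2*f_pred_std, 'b--')\n", "ax.plot(x_pred, f_pred_mean - 2*f_pred_std, 'b--')\n", "ax.plot(x, y, 'r.', markersize=10)" ] },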
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Validation\n", "\n", "Now we will test the generalisation ability of these models. Firstly, we\n", "are going to use hold out validation to attempt to see which model is\n", "best for extrapolating.\n", "\n", "### Question 4\n", "\n", "Now split the data into training and *hold out* validation sets. Hold\n", "out the data for years after 1980. Compute the predictions for different\n", "model orders between 0 and 8. Find the model order which fits best\n", "according to *hold out* validation. Is it the same as the maximum\n", "likelihood result from last week?\n", "\n", "*25 marks*" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Write code for your answer to Question 4 in this box\n", "# provide the answers so that the code runs correctly, otherwise you will lose marks!\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Question 5\n", "\n", "Now we will use leave one out cross validation to attempt to see which\n", "model is best at interpolating. Do you get the same result as for hold\n", "out validation? Compare plots of the hold out validation error for\n", "different degrees and the cross validation error for different degrees.\n", "Why are they so different? Select a suitable polynomial for\n", "characterising the differences in the predictions. Plot the mean\n", "function and the error bars for the full data set (to represent the\n", "leave one out solution) and the training data from the hold out\n", "experiment. Discuss your answer.\n", "\n", "*30 marks*" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Write code for your answer to Question 5 in this box\n", "# provide the answers so that the code runs correctly, otherwise you will lose marks!\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\addreading{@Rogers:book11}{Section 3.7–3.8 (pg 122–133)}\n", "\addreading{@Bishop:book06}{Section 3.4 (pg 161–165)}\n", "\reading\n", "# References\n", "\n", "[^1]: Note that not all exponentiated quadratics can be normalized; to do so,\n", " the coefficient associated with the variable squared,\n", " $\dataScalar^2$, must be strictly positive." ] } ], "metadata": {}, "nbformat": 4, "nbformat_minor": 2 }