{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Uncertainty in Loss Functions\n",
    "### [Neil D. Lawrence](http://inverseprobability.com), Amazon Cambridge and University of Sheffield\n",
    "### 2018-05-29\n",
    "\n",
    "**Abstract**: Bayesian formalisms deal with uncertainty in parameters, frequentist\n",
    "formalisms deal with the *risk* of a data set, uncertainty in the data\n",
    "sample. In this talk, we consider uncertainty in the *loss function*.\n",
    "Uncertainty in the loss function. We introduce uncertainty through\n",
    "linear weightings of terms in the loss function and show how a\n",
    "distribution over the loss can be maintained through the *maximum\n",
    "entropy principle*. This allows us minimize the expected loss under our\n",
    "maximum entropy distribution of the loss function. We recover weighted\n",
    "least squares and a LOESS-like regression from the formalism.\n",
    "\n",
    "$$\n",
    "\\newcommand{\\Amatrix}{\\mathbf{A}}\n",
    "\\newcommand{\\KL}[2]{\\text{KL}\\left( #1\\,\\|\\,#2 \\right)}\n",
    "\\newcommand{\\Kaast}{\\kernelMatrix_{\\mathbf{ \\ast}\\mathbf{ \\ast}}}\n",
    "\\newcommand{\\Kastu}{\\kernelMatrix_{\\mathbf{ \\ast} \\inducingVector}}\n",
    "\\newcommand{\\Kff}{\\kernelMatrix_{\\mappingFunctionVector \\mappingFunctionVector}}\n",
    "\\newcommand{\\Kfu}{\\kernelMatrix_{\\mappingFunctionVector \\inducingVector}}\n",
    "\\newcommand{\\Kuast}{\\kernelMatrix_{\\inducingVector \\bf\\ast}}\n",
    "\\newcommand{\\Kuf}{\\kernelMatrix_{\\inducingVector \\mappingFunctionVector}}\n",
    "\\newcommand{\\Kuu}{\\kernelMatrix_{\\inducingVector \\inducingVector}}\n",
    "\\newcommand{\\Kuui}{\\Kuu^{-1}}\n",
    "\\newcommand{\\Qaast}{\\mathbf{Q}_{\\bf \\ast \\ast}}\n",
    "\\newcommand{\\Qastf}{\\mathbf{Q}_{\\ast \\mappingFunction}}\n",
    "\\newcommand{\\Qfast}{\\mathbf{Q}_{\\mappingFunctionVector \\bf \\ast}}\n",
    "\\newcommand{\\Qff}{\\mathbf{Q}_{\\mappingFunctionVector \\mappingFunctionVector}}\n",
    "\\newcommand{\\aMatrix}{\\mathbf{A}}\n",
    "\\newcommand{\\aScalar}{a}\n",
    "\\newcommand{\\aVector}{\\mathbf{a}}\n",
    "\\newcommand{\\acceleration}{a}\n",
    "\\newcommand{\\bMatrix}{\\mathbf{B}}\n",
    "\\newcommand{\\bScalar}{b}\n",
    "\\newcommand{\\bVector}{\\mathbf{b}}\n",
    "\\newcommand{\\basisFunc}{\\phi}\n",
    "\\newcommand{\\basisFuncVector}{\\boldsymbol{ \\basisFunc}}\n",
    "\\newcommand{\\basisFunction}{\\phi}\n",
    "\\newcommand{\\basisLocation}{\\mu}\n",
    "\\newcommand{\\basisMatrix}{\\boldsymbol{ \\Phi}}\n",
    "\\newcommand{\\basisScalar}{\\basisFunction}\n",
    "\\newcommand{\\basisVector}{\\boldsymbol{ \\basisFunction}}\n",
    "\\newcommand{\\activationFunction}{\\phi}\n",
    "\\newcommand{\\activationMatrix}{\\boldsymbol{ \\Phi}}\n",
    "\\newcommand{\\activationScalar}{\\basisFunction}\n",
    "\\newcommand{\\activationVector}{\\boldsymbol{ \\basisFunction}}\n",
    "\\newcommand{\\bigO}{\\mathcal{O}}\n",
    "\\newcommand{\\binomProb}{\\pi}\n",
    "\\newcommand{\\cMatrix}{\\mathbf{C}}\n",
    "\\newcommand{\\cbasisMatrix}{\\hat{\\boldsymbol{ \\Phi}}}\n",
    "\\newcommand{\\cdataMatrix}{\\hat{\\dataMatrix}}\n",
    "\\newcommand{\\cdataScalar}{\\hat{\\dataScalar}}\n",
    "\\newcommand{\\cdataVector}{\\hat{\\dataVector}}\n",
    "\\newcommand{\\centeredKernelMatrix}{\\mathbf{ \\MakeUppercase{\\centeredKernelScalar}}}\n",
    "\\newcommand{\\centeredKernelScalar}{b}\n",
    "\\newcommand{\\centeredKernelVector}{\\centeredKernelScalar}\n",
    "\\newcommand{\\centeringMatrix}{\\mathbf{H}}\n",
    "\\newcommand{\\chiSquaredDist}[2]{\\chi_{#1}^{2}\\left(#2\\right)}\n",
    "\\newcommand{\\chiSquaredSamp}[1]{\\chi_{#1}^{2}}\n",
    "\\newcommand{\\conditionalCovariance}{\\boldsymbol{ \\Sigma}}\n",
    "\\newcommand{\\coregionalizationMatrix}{\\mathbf{B}}\n",
    "\\newcommand{\\coregionalizationScalar}{b}\n",
    "\\newcommand{\\coregionalizationVector}{\\mathbf{ \\coregionalizationScalar}}\n",
    "\\newcommand{\\covDist}[2]{\\text{cov}_{#2}\\left(#1\\right)}\n",
    "\\newcommand{\\covSamp}[1]{\\text{cov}\\left(#1\\right)}\n",
    "\\newcommand{\\covarianceScalar}{c}\n",
    "\\newcommand{\\covarianceVector}{\\mathbf{ \\covarianceScalar}}\n",
    "\\newcommand{\\covarianceMatrix}{\\mathbf{C}}\n",
    "\\newcommand{\\covarianceMatrixTwo}{\\boldsymbol{ \\Sigma}}\n",
    "\\newcommand{\\croupierScalar}{s}\n",
    "\\newcommand{\\croupierVector}{\\mathbf{ \\croupierScalar}}\n",
    "\\newcommand{\\croupierMatrix}{\\mathbf{ \\MakeUppercase{\\croupierScalar}}}\n",
    "\\newcommand{\\dataDim}{p}\n",
    "\\newcommand{\\dataIndex}{i}\n",
    "\\newcommand{\\dataIndexTwo}{j}\n",
    "\\newcommand{\\dataMatrix}{\\mathbf{Y}}\n",
    "\\newcommand{\\dataScalar}{y}\n",
    "\\newcommand{\\dataSet}{\\mathcal{D}}\n",
    "\\newcommand{\\dataStd}{\\sigma}\n",
    "\\newcommand{\\dataVector}{\\mathbf{ \\dataScalar}}\n",
    "\\newcommand{\\decayRate}{d}\n",
    "\\newcommand{\\degreeMatrix}{\\mathbf{ \\MakeUppercase{\\degreeScalar}}}\n",
    "\\newcommand{\\degreeScalar}{d}\n",
    "\\newcommand{\\degreeVector}{\\mathbf{ \\degreeScalar}}\n",
    "% Already defined by latex\n",
    "%\\newcommand{\\det}[1]{\\left|#1\\right|}\n",
    "\\newcommand{\\diag}[1]{\\text{diag}\\left(#1\\right)}\n",
    "\\newcommand{\\diagonalMatrix}{\\mathbf{D}}\n",
    "\\newcommand{\\diff}[2]{\\frac{\\text{d}#1}{\\text{d}#2}}\n",
    "\\newcommand{\\diffTwo}[2]{\\frac{\\text{d}^2#1}{\\text{d}#2^2}}\n",
    "\\newcommand{\\displacement}{x}\n",
    "\\newcommand{\\displacementVector}{\\textbf{\\displacement}}\n",
    "\\newcommand{\\distanceMatrix}{\\mathbf{ \\MakeUppercase{\\distanceScalar}}}\n",
    "\\newcommand{\\distanceScalar}{d}\n",
    "\\newcommand{\\distanceVector}{\\mathbf{ \\distanceScalar}}\n",
    "\\newcommand{\\eigenvaltwo}{\\ell}\n",
    "\\newcommand{\\eigenvaltwoMatrix}{\\mathbf{L}}\n",
    "\\newcommand{\\eigenvaltwoVector}{\\mathbf{l}}\n",
    "\\newcommand{\\eigenvalue}{\\lambda}\n",
    "\\newcommand{\\eigenvalueMatrix}{\\boldsymbol{ \\Lambda}}\n",
    "\\newcommand{\\eigenvalueVector}{\\boldsymbol{ \\lambda}}\n",
    "\\newcommand{\\eigenvector}{\\mathbf{ \\eigenvectorScalar}}\n",
    "\\newcommand{\\eigenvectorMatrix}{\\mathbf{U}}\n",
    "\\newcommand{\\eigenvectorScalar}{u}\n",
    "\\newcommand{\\eigenvectwo}{\\mathbf{v}}\n",
    "\\newcommand{\\eigenvectwoMatrix}{\\mathbf{V}}\n",
    "\\newcommand{\\eigenvectwoScalar}{v}\n",
    "\\newcommand{\\entropy}[1]{\\mathcal{H}\\left(#1\\right)}\n",
    "\\newcommand{\\errorFunction}{E}\n",
    "\\newcommand{\\expDist}[2]{\\left<#1\\right>_{#2}}\n",
    "\\newcommand{\\expSamp}[1]{\\left<#1\\right>}\n",
    "\\newcommand{\\expectation}[1]{\\left\\langle #1 \\right\\rangle }\n",
    "\\newcommand{\\expectationDist}[2]{\\left\\langle #1 \\right\\rangle _{#2}}\n",
    "\\newcommand{\\expectedDistanceMatrix}{\\mathcal{D}}\n",
    "\\newcommand{\\eye}{\\mathbf{I}}\n",
    "\\newcommand{\\fantasyDim}{r}\n",
    "\\newcommand{\\fantasyMatrix}{\\mathbf{ \\MakeUppercase{\\fantasyScalar}}}\n",
    "\\newcommand{\\fantasyScalar}{z}\n",
    "\\newcommand{\\fantasyVector}{\\mathbf{ \\fantasyScalar}}\n",
    "\\newcommand{\\featureStd}{\\varsigma}\n",
    "\\newcommand{\\gammaCdf}[3]{\\mathcal{GAMMA CDF}\\left(#1|#2,#3\\right)}\n",
    "\\newcommand{\\gammaDist}[3]{\\mathcal{G}\\left(#1|#2,#3\\right)}\n",
    "\\newcommand{\\gammaSamp}[2]{\\mathcal{G}\\left(#1,#2\\right)}\n",
    "\\newcommand{\\gaussianDist}[3]{\\mathcal{N}\\left(#1|#2,#3\\right)}\n",
    "\\newcommand{\\gaussianSamp}[2]{\\mathcal{N}\\left(#1,#2\\right)}\n",
    "\\newcommand{\\given}{|}\n",
    "\\newcommand{\\half}{\\frac{1}{2}}\n",
    "\\newcommand{\\heaviside}{H}\n",
    "\\newcommand{\\hiddenMatrix}{\\mathbf{ \\MakeUppercase{\\hiddenScalar}}}\n",
    "\\newcommand{\\hiddenScalar}{h}\n",
    "\\newcommand{\\hiddenVector}{\\mathbf{ \\hiddenScalar}}\n",
    "\\newcommand{\\identityMatrix}{\\eye}\n",
    "\\newcommand{\\inducingInputScalar}{z}\n",
    "\\newcommand{\\inducingInputVector}{\\mathbf{ \\inducingInputScalar}}\n",
    "\\newcommand{\\inducingInputMatrix}{\\mathbf{Z}}\n",
    "\\newcommand{\\inducingScalar}{u}\n",
    "\\newcommand{\\inducingVector}{\\mathbf{ \\inducingScalar}}\n",
    "\\newcommand{\\inducingMatrix}{\\mathbf{U}}\n",
    "\\newcommand{\\inlineDiff}[2]{\\text{d}#1/\\text{d}#2}\n",
    "\\newcommand{\\inputDim}{q}\n",
    "\\newcommand{\\inputMatrix}{\\mathbf{X}}\n",
    "\\newcommand{\\inputScalar}{x}\n",
    "\\newcommand{\\inputSpace}{\\mathcal{X}}\n",
    "\\newcommand{\\inputVals}{\\inputVector}\n",
    "\\newcommand{\\inputVector}{\\mathbf{ \\inputScalar}}\n",
    "\\newcommand{\\iterNum}{k}\n",
    "\\newcommand{\\kernel}{\\kernelScalar}\n",
    "\\newcommand{\\kernelMatrix}{\\mathbf{K}}\n",
    "\\newcommand{\\kernelScalar}{k}\n",
    "\\newcommand{\\kernelVector}{\\mathbf{ \\kernelScalar}}\n",
    "\\newcommand{\\kff}{\\kernelScalar_{\\mappingFunction \\mappingFunction}}\n",
    "\\newcommand{\\kfu}{\\kernelVector_{\\mappingFunction \\inducingScalar}}\n",
    "\\newcommand{\\kuf}{\\kernelVector_{\\inducingScalar \\mappingFunction}}\n",
    "\\newcommand{\\kuu}{\\kernelVector_{\\inducingScalar \\inducingScalar}}\n",
    "\\newcommand{\\lagrangeMultiplier}{\\lambda}\n",
    "\\newcommand{\\lagrangeMultiplierMatrix}{\\boldsymbol{ \\Lambda}}\n",
    "\\newcommand{\\lagrangian}{L}\n",
    "\\newcommand{\\laplacianFactor}{\\mathbf{ \\MakeUppercase{\\laplacianFactorScalar}}}\n",
    "\\newcommand{\\laplacianFactorScalar}{m}\n",
    "\\newcommand{\\laplacianFactorVector}{\\mathbf{ \\laplacianFactorScalar}}\n",
    "\\newcommand{\\laplacianMatrix}{\\mathbf{L}}\n",
    "\\newcommand{\\laplacianScalar}{\\ell}\n",
    "\\newcommand{\\laplacianVector}{\\mathbf{ \\ell}}\n",
    "\\newcommand{\\latentDim}{q}\n",
    "\\newcommand{\\latentDistanceMatrix}{\\boldsymbol{ \\Delta}}\n",
    "\\newcommand{\\latentDistanceScalar}{\\delta}\n",
    "\\newcommand{\\latentDistanceVector}{\\boldsymbol{ \\delta}}\n",
    "\\newcommand{\\latentForce}{f}\n",
    "\\newcommand{\\latentFunction}{u}\n",
    "\\newcommand{\\latentFunctionVector}{\\mathbf{ \\latentFunction}}\n",
    "\\newcommand{\\latentFunctionMatrix}{\\mathbf{ \\MakeUppercase{\\latentFunction}}}\n",
    "\\newcommand{\\latentIndex}{j}\n",
    "\\newcommand{\\latentScalar}{z}\n",
    "\\newcommand{\\latentVector}{\\mathbf{ \\latentScalar}}\n",
    "\\newcommand{\\latentMatrix}{\\mathbf{Z}}\n",
    "\\newcommand{\\learnRate}{\\eta}\n",
    "\\newcommand{\\lengthScale}{\\ell}\n",
    "\\newcommand{\\rbfWidth}{\\ell}\n",
    "\\newcommand{\\likelihoodBound}{\\mathcal{L}}\n",
    "\\newcommand{\\likelihoodFunction}{L}\n",
    "\\newcommand{\\locationScalar}{\\mu}\n",
    "\\newcommand{\\locationVector}{\\boldsymbol{ \\locationScalar}}\n",
    "\\newcommand{\\locationMatrix}{\\mathbf{M}}\n",
    "\\newcommand{\\variance}[1]{\\text{var}\\left( #1 \\right)}\n",
    "\\newcommand{\\mappingFunction}{f}\n",
    "\\newcommand{\\mappingFunctionMatrix}{\\mathbf{F}}\n",
    "\\newcommand{\\mappingFunctionTwo}{g}\n",
    "\\newcommand{\\mappingFunctionTwoMatrix}{\\mathbf{G}}\n",
    "\\newcommand{\\mappingFunctionTwoVector}{\\mathbf{ \\mappingFunctionTwo}}\n",
    "\\newcommand{\\mappingFunctionVector}{\\mathbf{ \\mappingFunction}}\n",
    "\\newcommand{\\scaleScalar}{s}\n",
    "\\newcommand{\\mappingScalar}{w}\n",
    "\\newcommand{\\mappingVector}{\\mathbf{ \\mappingScalar}}\n",
    "\\newcommand{\\mappingMatrix}{\\mathbf{W}}\n",
    "\\newcommand{\\mappingScalarTwo}{v}\n",
    "\\newcommand{\\mappingVectorTwo}{\\mathbf{ \\mappingScalarTwo}}\n",
    "\\newcommand{\\mappingMatrixTwo}{\\mathbf{V}}\n",
    "\\newcommand{\\maxIters}{K}\n",
    "\\newcommand{\\meanMatrix}{\\mathbf{M}}\n",
    "\\newcommand{\\meanScalar}{\\mu}\n",
    "\\newcommand{\\meanTwoMatrix}{\\mathbf{M}}\n",
    "\\newcommand{\\meanTwoScalar}{m}\n",
    "\\newcommand{\\meanTwoVector}{\\mathbf{ \\meanTwoScalar}}\n",
    "\\newcommand{\\meanVector}{\\boldsymbol{ \\meanScalar}}\n",
    "\\newcommand{\\mrnaConcentration}{m}\n",
    "\\newcommand{\\naturalFrequency}{\\omega}\n",
    "\\newcommand{\\neighborhood}[1]{\\mathcal{N}\\left( #1 \\right)}\n",
    "\\newcommand{\\neilurl}{http://inverseprobability.com/}\n",
    "\\newcommand{\\noiseMatrix}{\\boldsymbol{ E}}\n",
    "\\newcommand{\\noiseScalar}{\\epsilon}\n",
    "\\newcommand{\\noiseVector}{\\boldsymbol{ \\epsilon}}\n",
    "\\newcommand{\\norm}[1]{\\left\\Vert #1 \\right\\Vert}\n",
    "\\newcommand{\\normalizedLaplacianMatrix}{\\hat{\\mathbf{L}}}\n",
    "\\newcommand{\\normalizedLaplacianScalar}{\\hat{\\ell}}\n",
    "\\newcommand{\\normalizedLaplacianVector}{\\hat{\\mathbf{ \\ell}}}\n",
    "\\newcommand{\\numActive}{m}\n",
    "\\newcommand{\\numBasisFunc}{m}\n",
    "\\newcommand{\\numComponents}{m}\n",
    "\\newcommand{\\numComps}{K}\n",
    "\\newcommand{\\numData}{n}\n",
    "\\newcommand{\\numFeatures}{K}\n",
    "\\newcommand{\\numHidden}{h}\n",
    "\\newcommand{\\numInducing}{m}\n",
    "\\newcommand{\\numLayers}{\\ell}\n",
    "\\newcommand{\\numNeighbors}{K}\n",
    "\\newcommand{\\numSequences}{s}\n",
    "\\newcommand{\\numSuccess}{s}\n",
    "\\newcommand{\\numTasks}{m}\n",
    "\\newcommand{\\numTime}{T}\n",
    "\\newcommand{\\numTrials}{S}\n",
    "\\newcommand{\\outputIndex}{j}\n",
    "\\newcommand{\\paramVector}{\\boldsymbol{ \\theta}}\n",
    "\\newcommand{\\parameterMatrix}{\\boldsymbol{ \\Theta}}\n",
    "\\newcommand{\\parameterScalar}{\\theta}\n",
    "\\newcommand{\\parameterVector}{\\boldsymbol{ \\parameterScalar}}\n",
    "\\newcommand{\\partDiff}[2]{\\frac{\\partial#1}{\\partial#2}}\n",
    "\\newcommand{\\precisionScalar}{j}\n",
    "\\newcommand{\\precisionVector}{\\mathbf{ \\precisionScalar}}\n",
    "\\newcommand{\\precisionMatrix}{\\mathbf{J}}\n",
    "\\newcommand{\\pseudotargetScalar}{\\widetilde{y}}\n",
    "\\newcommand{\\pseudotargetVector}{\\mathbf{ \\pseudotargetScalar}}\n",
    "\\newcommand{\\pseudotargetMatrix}{\\mathbf{ \\widetilde{Y}}}\n",
    "\\newcommand{\\rank}[1]{\\text{rank}\\left(#1\\right)}\n",
    "\\newcommand{\\rayleighDist}[2]{\\mathcal{R}\\left(#1|#2\\right)}\n",
    "\\newcommand{\\rayleighSamp}[1]{\\mathcal{R}\\left(#1\\right)}\n",
    "\\newcommand{\\responsibility}{r}\n",
    "\\newcommand{\\rotationScalar}{r}\n",
    "\\newcommand{\\rotationVector}{\\mathbf{ \\rotationScalar}}\n",
    "\\newcommand{\\rotationMatrix}{\\mathbf{R}}\n",
    "\\newcommand{\\sampleCovScalar}{s}\n",
    "\\newcommand{\\sampleCovVector}{\\mathbf{ \\sampleCovScalar}}\n",
    "\\newcommand{\\sampleCovMatrix}{\\mathbf{s}}\n",
    "\\newcommand{\\scalarProduct}[2]{\\left\\langle{#1},{#2}\\right\\rangle}\n",
    "\\newcommand{\\sign}[1]{\\text{sign}\\left(#1\\right)}\n",
    "\\newcommand{\\sigmoid}[1]{\\sigma\\left(#1\\right)}\n",
    "\\newcommand{\\singularvalue}{\\ell}\n",
    "\\newcommand{\\singularvalueMatrix}{\\mathbf{L}}\n",
    "\\newcommand{\\singularvalueVector}{\\mathbf{l}}\n",
    "\\newcommand{\\sorth}{\\mathbf{u}}\n",
    "\\newcommand{\\spar}{\\lambda}\n",
    "\\newcommand{\\trace}[1]{\\text{tr}\\left(#1\\right)}\n",
    "\\newcommand{\\BasalRate}{B}\n",
    "\\newcommand{\\DampingCoefficient}{C}\n",
    "\\newcommand{\\DecayRate}{D}\n",
    "\\newcommand{\\Displacement}{X}\n",
    "\\newcommand{\\LatentForce}{F}\n",
    "\\newcommand{\\Mass}{M}\n",
    "\\newcommand{\\Sensitivity}{S}\n",
    "\\newcommand{\\basalRate}{b}\n",
    "\\newcommand{\\dampingCoefficient}{c}\n",
    "\\newcommand{\\mass}{m}\n",
    "\\newcommand{\\sensitivity}{s}\n",
    "\\newcommand{\\springScalar}{\\kappa}\n",
    "\\newcommand{\\springVector}{\\boldsymbol{ \\kappa}}\n",
    "\\newcommand{\\springMatrix}{\\boldsymbol{ \\mathcal{K}}}\n",
    "\\newcommand{\\tfConcentration}{p}\n",
    "\\newcommand{\\tfDecayRate}{\\delta}\n",
    "\\newcommand{\\tfMrnaConcentration}{f}\n",
    "\\newcommand{\\tfVector}{\\mathbf{ \\tfConcentration}}\n",
    "\\newcommand{\\velocity}{v}\n",
    "\\newcommand{\\sufficientStatsScalar}{g}\n",
    "\\newcommand{\\sufficientStatsVector}{\\mathbf{ \\sufficientStatsScalar}}\n",
    "\\newcommand{\\sufficientStatsMatrix}{\\mathbf{G}}\n",
    "\\newcommand{\\switchScalar}{s}\n",
    "\\newcommand{\\switchVector}{\\mathbf{ \\switchScalar}}\n",
    "\\newcommand{\\switchMatrix}{\\mathbf{S}}\n",
    "\\newcommand{\\tr}[1]{\\text{tr}\\left(#1\\right)}\n",
    "\\newcommand{\\loneNorm}[1]{\\left\\Vert #1 \\right\\Vert_1}\n",
    "\\newcommand{\\ltwoNorm}[1]{\\left\\Vert #1 \\right\\Vert_2}\n",
    "\\newcommand{\\onenorm}[1]{\\left\\vert#1\\right\\vert_1}\n",
    "\\newcommand{\\twonorm}[1]{\\left\\Vert #1 \\right\\Vert}\n",
    "\\newcommand{\\vScalar}{v}\n",
    "\\newcommand{\\vVector}{\\mathbf{v}}\n",
    "\\newcommand{\\vMatrix}{\\mathbf{V}}\n",
    "\\newcommand{\\varianceDist}[2]{\\text{var}_{#2}\\left( #1 \\right)}\n",
    "% Already defined by latex\n",
    "%\\newcommand{\\vec}{#1:}\n",
    "\\newcommand{\\vecb}[1]{\\left(#1\\right):}\n",
    "\\newcommand{\\weightScalar}{w}\n",
    "\\newcommand{\\weightVector}{\\mathbf{ \\weightScalar}}\n",
    "\\newcommand{\\weightMatrix}{\\mathbf{W}}\n",
    "\\newcommand{\\weightedAdjacencyMatrix}{\\mathbf{A}}\n",
    "\\newcommand{\\weightedAdjacencyScalar}{a}\n",
    "\\newcommand{\\weightedAdjacencyVector}{\\mathbf{ \\weightedAdjacencyScalar}}\n",
    "\\newcommand{\\onesVector}{\\mathbf{1}}\n",
    "\\newcommand{\\zerosVector}{\\mathbf{0}}\n",
    "$$\n",
    "\n",
    "## What is Machine Learning?\n",
    "\n",
    "What is machine learning? At its most basic level machine learning is a\n",
    "combination of\n",
    "\n",
    "$$ \\text{data} + \\text{model} \\xrightarrow{\\text{compute}} \\text{prediction}$$\n",
    "\n",
    "where *data* is our observations. They can be actively or passively\n",
    "acquired (meta-data). The *model* contains our assumptions, based on\n",
    "previous experience. THat experience can be other data, it can come from\n",
    "transfer learning, or it can merely be our beliefs about the\n",
    "regularities of the universe. In humans our models include our inductive\n",
    "biases. The *prediction* is an action to be taken or a categorization or\n",
    "a quality score. The reason that machine learning has become a mainstay\n",
    "of artificial intelligence is the importance of predictions in\n",
    "artificial intelligence. The data and the model are combined through\n",
    "computation.\n",
    "\n",
    "In practice we normally perform machine learning using two functions. To\n",
    "combine data with a model we typically make use of:\n",
    "\n",
    "**a prediction function** a function which is used to make the\n",
    "predictions. It includes our beliefs about the regularities of the\n",
    "universe, our assumptions about how the world works, e.g. smoothness,\n",
    "spatial similarities, temporal similarities.\n",
    "\n",
    "**an objective function** a function which defines the cost of\n",
    "misprediction. Typically it includes knowledge about the world's\n",
    "generating processes (probabilistic objectives) or the costs we pay for\n",
    "mispredictions (empiricial risk minimization).\n",
    "\n",
    "The combination of data and model through the prediction function and\n",
    "the objectie function leads to a *learning algorithm*. The class of\n",
    "prediction functions and objective functions we can make use of is\n",
    "restricted by the algorithms they lead to. If the prediction function or\n",
    "the objective function are too complex, then it can be difficult to find\n",
    "an appropriate learning algorithm. Much of the acdemic field of machine\n",
    "learning is the quest for new learning algorithms that allow us to bring\n",
    "different types of models and data together.\n",
    "\n",
    "A useful reference for state of the art in machine learning is the UK\n",
    "Royal Society Report, [Machine Learning: Power and Promise of Computers\n",
    "that Learn by\n",
    "Example](https://royalsociety.org/~/media/policy/projects/machine-learning/publications/machine-learning-report.pdf).\n",
    "\n",
    "You can also check my blog post on [\"What is Machine\n",
    "Learning?\"](http://inverseprobability.com/2017/07/17/what-is-machine-learning)\n",
    "\n",
    "## Artificial vs Natural Systems\n",
    "\n",
    "### Natural Systems are Evolved\n",
    "\n",
    "> Survival of the fittest\n",
    ">\n",
 [Herbet Spencer]">
    "> [Herbert Spencer](https://en.wikipedia.org/wiki/Herbert_Spencer), 1864\n",
    "\n",
    "Darwin never said \"Survival of the Fittest\" he talked about evolution by\n",
    "natural selection.\n",
    "\n",
    "Evolution is better described as \"non-survival of the non-fit\". You\n",
    "don't have to be the fittest to survive, you just need to avoid the\n",
    "pitfalls of life. This is the first priority.\n",
    "\n",
    "A mistake we make in the design of our systems is to equate fitness with\n",
    "the objective function, and to assume it is known and static. In\n",
    "practice, a real environment would have an evolving fitness function\n",
    "which would be unknown at any given time.\n",
    "\n",
    "Uncertainty in models is handled by Bayesian inference, here we consider\n",
    "uncertainty arising in loss functions.\n",
    "\n",
    "Consider a loss function which decomposes across individual\n",
    "observations, $\\dataScalar_{k,j}$, each of which is dependent on some\n",
    "set of features, $\\inputVector_k$.\n",
    "\n",
    "$$\n",
    "\\errorFunction(\\dataVector, \\inputMatrix) = \\sum_{k}\\sum_{j}\n",
    "L(\\dataScalar_{k,j}, \\inputVector_k)\n",
    "$$ Assume that the loss function depends on the features through some\n",
    "mapping function, $\\mappingFunction_j(\\cdot)$ which we call the\n",
    "*prediction function*.\n",
    "\n",
    "$$\n",
    "\\errorFunction(\\dataVector, \\inputMatrix) = \\sum_{k}\\sum_{j} L(\\dataScalar_{k,j},\n",
    "\\mappingFunction_j(\\inputVector_k))\n",
    "$$ without loss of generality, we can move the index to the inputs, so\n",
    "we have $\\inputVector_i =\\left[\\inputVector \\quad j\\right]$, and we set\n",
    "$\\dataScalar_i = \\dataScalar_{k, j}$. So we have\n",
    "\n",
    "$$\n",
    "\\errorFunction(\\dataVector, \\inputMatrix) = \\sum_{i} L(\\dataScalar_i, \\mappingFunction(\\inputVector_i))\n",
    "$$ Bayesian inference considers uncertainty in $\\mappingFunction$, often\n",
    "through parameterizing it,\n",
    "$\\mappingFunction(\\inputVector; \\parameterVector)$, and considering a\n",
    "*prior* distribution for the parameters, $p(\\parameterVector)$, this in\n",
    "turn implies a distribution over functions, $p(\\mappingFunction)$.\n",
    "Process models, such as Gaussian processes specify this distribution,\n",
    "known as a process, directly.\n",
    "\n",
    "Bayesian inference proceeds by specifying a *likelihood* which relates\n",
    "the data, $\\dataScalar$, to the parameters. Here we choose not to do\n",
    "this, but instead we only consider the *loss* function for our\n",
    "objective. The loss is the cost we pay for a misclassification.\n",
    "\n",
    "The *risk function* is the expectation of the loss under the\n",
    "distribution of the data. Here we are using the framework of *empirical\n",
    "risk* minimization, because we have a sample based approximation. The\n",
    "new expectation we are considering is around the loss function itself,\n",
    "not the uncertainty in the data.\n",
    "\n",
    "The loss function and the log likelihood may take a mathematically\n",
    "similar form but they are philosophically very different. The log\n",
    "likelihood assumes something about the *generating* function of the\n",
    "data, whereas the loss function assumes something about the cost we pay.\n",
    "Importantly the loss function in Bayesian inference only normally enters\n",
    "at the point of decision.\n",
    "\n",
    "The key idea in Bayesian inference is that the probabilistic inference\n",
    "can be performed *without* knowing the loss becasue if the model is\n",
    "correct, then the form of the loss function is irrelevant when\n",
    "performing inference. In practice, however, for real data sets the model\n",
    "is almost never correct.\n",
    "\n",
    "Some of the maths below looks similar to the maths we can find in\n",
    "Bayesian methods, in particular variational Bayes, but that is merely a\n",
    "consequence of the availability of analytical mathematics. There are\n",
    "only particular ways of developing tractable algorithms, one route\n",
    "involves linear algebra. However, the similarity of the mathematics\n",
    "belies a difference in interpretation. It is similar to travelling a\n",
    "road (e.g. Ermine Street) in a wild landscape. We travel together\n",
    "because that is where efficient progress is to be made, but in practice\n",
    "a our destinations (Lincoln, York), may be different.\n",
    "\n",
    "### Introduce Uncertainty\n",
    "\n",
    "To introduce uncertainty we consider a weighted version of the loss\n",
    "function, we introduce positive weights,\n",
    "$\\left\\{ \\scaleScalar_i\\right\\}_{i=1}^\\numData$. $$\n",
    "\\errorFunction(\\dataVector, \\inputMatrix) = \\sum_{i}\n",
    "\\scaleScalar_i L(\\dataScalar_i, \\mappingFunction(\\inputVector_i))\n",
    "$$ We now assume that tmake the assumption that these weights are drawn\n",
    "from a distribution, $q(\\scaleScalar)$. Instead of looking to minimize\n",
    "the loss direction, we look at the expected loss under this\n",
    "distribution.\n",
    "\n",
    "$$\n",
    "\\begin{align*}\n",
    "\\errorFunction(\\dataVector, \\inputMatrix) = & \\sum_{i}\\expectationDist{\\scaleScalar_i L(\\dataScalar_i, \\mappingFunction(\\inputVector_i))}{q(\\scaleScalar)} \\\\\n",
    "& =\\sum_{i}\\expectationDist{\\scaleScalar_i }{q(\\scaleScalar)}L(\\dataScalar_i, \\mappingFunction(\\inputVector_i))\n",
    "\\end{align*}\n",
    "$$ We will assume that our process, $q(\\scaleScalar)$ can depend on a\n",
    "variety of inputs such as $\\dataVector$, $\\inputMatrix$ and time, $t$.\n",
    "\n",
    "### Principle of Maximum Entropy\n",
    "\n",
    "To maximize uncertainty in $q(\\scaleScalar)$ we maximize its entropy.\n",
    "Following Jaynes formalism of maximum entropy, in the continuous space\n",
    "we do this with respect to an invariant measure, $$\n",
    "H(\\scaleScalar)= - \\int q(\\scaleScalar) \\log \\frac{q(\\scaleScalar)}{m(\\scaleScalar)} \\text{d}\\scaleScalar\n",
    "$$ and since we minimize the loss, we balance this by adding in this\n",
    "term to form $$\n",
    "\\begin{align*}\n",
    "\\errorFunction = & \\beta\\sum_{i}\\expectationDist{\\scaleScalar_i }{q(\\scaleScalar)}L(\\dataScalar_i, \\mappingFunction(\\inputVector_i)) - H(\\scaleScalar)\\\\\n",
    "&= \\beta\\sum_{i}\\expectationDist{\\scaleScalar_i }{q(\\scaleScalar)}L(\\dataScalar_i, \\mappingFunction(\\inputVector_i)) +  \\int q(\\scaleScalar) \\log \\frac{q(\\scaleScalar)}{m(\\scaleScalar)}\\text{d}\\scaleScalar\n",
    "\\end{align*}\n",
    "$$ where $\\beta$ serves to weight the relative contribution of the\n",
    "entropy term and the loss term.\n",
    "\n",
    "We can now minimize this modified loss with respect to the density\n",
    "$q(\\scaleScalar)$, the freeform optimization over this term leads to $$\n",
    "\\begin{align*}\n",
    "q(\\scaleScalar) \\propto & \\exp\\left(- \\beta \\sum_{i=1}^\\numData \\scaleScalar_i L(\\dataScalar_i, \\mappingFunction(\\inputVector_i)) \\right) m(\\scaleScalar)\\\\\n",
    " \\propto & \\prod_{i=1}^\\numData \\exp\\left(- \\beta \\scaleScalar_i L(\\dataScalar_i, \\mappingFunction(\\inputVector_i)) \\right) m(\\scaleScalar)\n",
    "\\end{align*}\n",
    "$$\n",
    "\n",
    "### Example\n",
    "\n",
    "Assume $$\n",
    "m(\\scaleScalar) = \\prod_i \\lambda\\exp\\left(-\\lambda\\scaleScalar_i\\right)\n",
    "$$ which is the distribution with the maximum entropy for a given mean,\n",
    "$\\scaleScalar$. Then we have $$ \n",
    "q(\\scaleScalar) = \\prod_i q(\\scaleScalar_i)\n",
    "$$ $$\n",
    "q(\\scaleScalar_i) \\propto \\frac{1}{\\lambda+\\beta L_i} \\exp\\left(-(\\lambda+\\beta L_i) \\scaleScalar_i\\right)\n",
    "$$ and we can compute $$\n",
    "\\expectationDist{\\scaleScalar_i}{q(\\scaleScalar)} =\n",
    "\\frac{1}{\\lambda + \\beta L_i}\n",
    "$$\n",
    "\n",
    "### Coordinate Descent\n",
    "\n",
    "We can minimize with respect to $q(\\scaleScalar)$ recovering, $$\n",
    "q(\\scaleScalar_i) = \\frac{1}{\\lambda+\\beta L_i} \\exp\\left(-(\\lambda+\\beta L_i) \\scaleScalar_i\\right)\n",
    "$$t allowing us to compute the expectation of $\\scaleScalar$, $$\n",
    "\\expectationDist{\\scaleScalar_i}{q(\\scaleScalar_i)} = \\frac{1}{\\lambda+\\beta\n",
    "L_i}\n",
    "$$ then, we can minimize our expected loss with respect to\n",
    "$\\mappingFunction(\\cdot)$ $$\n",
    "\\beta \\sum_{i=1}^\\numData \\expectationDist{\\scaleScalar_i}{q(\\scaleScalar_i)} L(\\dataScalar_i, \\mappingFunction(\\inputVector_i))\n",
    "$$ If the loss is the *squared loss*, then this is recognised as a\n",
    "*reweighted least squares algorithm*. However, the loss can be of any\n",
    "form as long as $q(\\scaleScalar)$ defined above exists.\n",
    "\n",
    "In addition to the above, in our example below, we updated $\\beta$ to\n",
    "normalize the expected loss to be $\\numData$ at each iteration, so we\n",
    "have $$\n",
    "\\beta = \\frac{\\numData}{\\sum_{i=1}^\\numData \\expectationDist{\\scaleScalar_i}{q(\\scaleScalar_i)} L(\\dataScalar_i, \\mappingFunction(\\inputVector_i))}\n",
    "$$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import pods\n",
    "import teaching_plots as plot\n",
    "import mlai"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Olympic Marathon Data\n",
    "\n",
    "The first thing we will do is load a standard data set for regression\n",
    "modelling. The data consists of the pace of Olympic Gold Medal Marathon\n",
    "winners for the Olympics from 1896 to present. First we load in the data\n",
    "and plot."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = pods.datasets.olympic_marathon_men()\n",
    "x = data['X']\n",
    "y = data['Y']\n",
    "\n",
    "offset = y.mean()\n",
    "scale = np.sqrt(y.var())\n",
    "\n",
    "xlim = (1875,2030)\n",
    "ylim = (2.5, 6.5)\n",
    "yhat = (y-offset)/scale\n",
    "\n",
    "fig, ax = plt.subplots(figsize=plot.big_wide_figsize)\n",
    "_ = ax.plot(x, y, 'r.',markersize=10)\n",
    "ax.set_xlabel('year', fontsize=20)\n",
    "ax.set_ylabel('pace min/km', fontsize=20)\n",
    "ax.set_xlim(xlim)\n",
    "ax.set_ylim(ylim)\n",
    "\n",
    "mlai.write_figure(figure=fig, filename='../slides/diagrams/datasets/olympic-marathon.svg', transparent=True, frameon=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Olympic Marathon Data\n",
    "\n",
    "<table>\n",
    "<tr>\n",
    "<td width=\"70%\">\n",
    "-   Gold medal times for Olympic Marathon since 1896.\n",
    "\n",
    "-   Marathons before 1924 didn’t have a standardised distance.\n",
    "\n",
    "-   Present results using pace per km.\n",
    "\n",
    "-   In 1904 Marathon was badly organised leading to very slow times.\n",
    "\n",
    "</td>\n",
    "<td width=\"30%\">\n",
    "![image](../slides/diagrams/Stephen_Kiprotich.jpg) <small>Image from\n",
    "Wikimedia Commons <http://bit.ly/16kMKHQ></small>\n",
    "</td>\n",
    "</tr>\n",
    "</table>\n",
    "<img src=\"../slides/diagrams/datasets/olympic-marathon.svg\" align=\"\">\n",
    "\n",
    "Things to notice about the data include the outlier in 1904, in this\n",
    "year, the olympics was in St Louis, USA. Organizational problems and\n",
    "challenges with dust kicked up by the cars following the race meant that\n",
    "participants got lost, and only very few participants completed.\n",
    "\n",
    "More recent years see more consistently quick marathons.\n",
    "\n",
    "### Example: Linear Regression"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import mlai\n",
    "import numpy as np\n",
    "import scipy as sp"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create a weighted linear regression class, inheriting from the `mlai.LM`\n",
    "class."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class LML(mlai.LM):\n",
    "    \"\"\"Linear model with evolving loss\n",
    "    :param X: input values\n",
    "    :type X: numpy.ndarray\n",
    "    :param y: target values\n",
    "    :type y: numpy.ndarray\n",
    "    :param basis: basis function \n",
    "    :param type: function\n",
    "    :param beta: weight of the loss function\n",
    "    :param type: float\"\"\"\n",
    "\n",
    "    def __init__(self, X, y, basis=None, beta=1.0, lambd=1.0):\n",
    "        \"Initialise\"\n",
    "        if basis is None:\n",
    "            basis = mlai.basis(mlai.polynomial, number=2)\n",
    "        mlai.LM.__init__(self, X, y, basis)\n",
    "        self.s = np.ones((self.num_data, 1))#np.random.rand(self.num_data, 1)>0.5\n",
    "        self.update_w()\n",
    "        self.sigma2 = 1/beta\n",
    "        self.beta = beta\n",
    "        self.name = 'LML_'+basis.function.__name__\n",
    "        self.objective_name = 'Weighted Sum of Square Training Error'\n",
    "        self.lambd = lambd\n",
    "\n",
    "    def update_QR(self):\n",
    "        \"Perform the QR decomposition on the basis matrix.\"\n",
    "        self.Q, self.R = np.linalg.qr(self.Phi*np.sqrt(self.s))\n",
    "\n",
    "    def fit(self):\n",
    "        \"\"\"Minimize the objective function with respect to the parameters\"\"\"\n",
    "        for i in range(30):\n",
    "            self.update_w() # In the linear regression clas\n",
    "            self.update_s()\n",
    "        \n",
    "    def update_w(self):\n",
    "        self.update_QR()\n",
    "        self.w_star = sp.linalg.solve_triangular(self.R, np.dot(self.Q.T, self.y*np.sqrt(self.s)))\n",
    "        self.update_losses()\n",
    "\n",
    "    def predict(self, X):\n",
    "        \"\"\"Return the result of the prediction function.\"\"\"\n",
    "        return np.dot(self.basis.Phi(X), self.w_star), None\n",
    "        \n",
    "    def update_s(self):\n",
    "        \"\"\"Update the weights\"\"\"\n",
    "        self.s = 1/(self.lambd + self.beta*self.losses)\n",
    "                                                 \n",
    "    def update_losses(self):\n",
    "        \"\"\"Compute the loss functions for each data point.\"\"\"\n",
    "        self.update_f()\n",
    "        self.losses = ((self.y-self.f)**2)\n",
    "        self.beta = 1/(self.losses*self.s).mean()\n",
    "        \n",
    "    def objective(self):\n",
    "        \"\"\"Compute the objective function.\"\"\"\n",
    "        self.update_losses()\n",
    "        return (self.losses*self.s).sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Set up a linear model (polynomial with two basis functions)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "num_basis=2 \n",
    "data_limits=[1890, 2020]\n",
    "basis = mlai.basis(mlai.polynomial, num_basis, data_limits=data_limits)\n",
    "model = LML(x, y, basis=basis, lambd=1, beta=1)\n",
    "model2 = mlai.LM(x, y, basis=basis)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model.fit()\n",
    "model2.fit()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "x_test = np.linspace(data_limits[0], data_limits[1], 130)[:, None]\n",
    "f_test, f_var = model.predict(x_test)\n",
    "f2_test, f2_var = model2.predict(x_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import teaching_plots as plot\n",
    "from matplotlib import rc, rcParams\n",
    "rcParams.update({'font.size': 22})\n",
    "rc('text', usetex=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, ax = plt.subplots(figsize=plot.big_wide_figsize)\n",
    "ax.plot(x_test, f2_test, linewidth=3, color='r')\n",
    "ax.plot(x, y, 'g.', markersize=10)\n",
    "ax.set_xlim(data_limits[0], data_limits[1])\n",
    "ax.set_xlabel('year')\n",
    "ax.set_ylabel('pace min/km')\n",
    "_ = ax.set_ylim(2, 6)\n",
    "mlai.write_figure('../slides/diagrams/ml/olympic-loss-linear-regression000.svg', transparent=True)\n",
    "ax.plot(x_test, f_test, linewidth=3, color='b')\n",
    "ax.plot(x, y, 'g.', markersize=10)\n",
    "ax2 = ax.twinx()\n",
    "ax2.bar(x.flatten(), model.s.flatten(), width=2, color='b')\n",
    "ax2.set_ylim(0, 4)\n",
    "ax2.set_yticks([0, 1, 2])\n",
    "ax2.set_ylabel('$\\langle s_i \\\\rangle$')\n",
    "mlai.write_figure('../slides/diagrams/ml/olympic-loss-linear-regression001.svg', transparent=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pods\n",
    "pods.notebook.display_plots('olympic-loss-linear-regression{number:0>3}.svg', \n",
    "                            directory='../slides/diagrams/ml', number=(0, 1))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"../slides/diagrams/ml/olympic-loss-linear-regression001.svg\" align=\"\">\n",
    "<center>\n",
    "*Linear regression for the standard quadratic loss in *red\\* and the\n",
    "probabilistically weighted loss in *blue*. \\*\n",
    "</center>\n",
    "### Parameter Uncertainty\n",
    "\n",
    "Classical Bayesian inference is concerned with parameter uncertainty,\n",
    "which equates to uncertainty in the *prediction function*,\n",
    "$\\mappingFunction(\\inputVector)$. The prediction function is normally an\n",
    "estimate of the value of $\\dataScalar$ or constructs a probability\n",
    "density for $\\dataScalar$.\n",
    "\n",
    "Uncertainty in the prediction function can arise through uncertainty in\n",
    "our loss function, but also through uncertainty in parameters in the\n",
    "classical Bayesian sense. The full maximum entropy formalism would now\n",
    "be $$\n",
    "\\expectationDist{\\beta \\scaleScalar_i L(\\dataScalar_i,\n",
    "\\mappingFunction(\\inputVector_i))}{q(\\scaleScalar, \\mappingFunction)} + \\int\n",
    "q(\\scaleScalar, \\mappingFunction) \\log \\frac{q(\\scaleScalar,\n",
    "\\mappingFunction)}{m(\\scaleScalar)m(\\mappingFunction)}\\text{d}\\scaleScalar\n",
    "\\text{d}\\mappingFunction\n",
    "$$\n",
    "\n",
    "$$\n",
    "q(\\mappingFunction, \\scaleScalar) \\propto\n",
    "\\prod_{i=1}^\\numData \\exp\\left(- \\beta \\scaleScalar_i L(\\dataScalar_i,\n",
    "\\mappingFunction(\\inputVector_i)) \\right) m(\\scaleScalar)m(\\mappingFunction)\n",
    "$$\n",
    "\n",
    "### Approximation\n",
    "\n",
    "-   Generally intractable, so assume: $$\n",
    "    q(\\mappingFunction, \\scaleScalar) = q(\\mappingFunction)q(\\scaleScalar)\n",
    "    $$\n",
    "\n",
    "-   Entropy maximization proceeds as before but with $$\n",
    "    q(\\scaleScalar) \\propto\n",
    "    \\prod_{i=1}^\\numData \\exp\\left(- \\beta \\scaleScalar_i \\expectationDist{L(\\dataScalar_i,\n",
    "    \\mappingFunction(\\inputVector_i))}{q(\\mappingFunction)} \\right) m(\\scaleScalar)\n",
    "    $$ and $$\n",
    "    q(\\mappingFunction) \\propto\n",
    "    \\prod_{i=1}^\\numData \\exp\\left(- \\beta \\expectationDist{\\scaleScalar_i}{q(\\scaleScalar)} L(\\dataScalar_i,\n",
    "    \\mappingFunction(\\inputVector_i)) \\right) m(\\mappingFunction)\n",
    "    $$\n",
    "\n",
    "-   Can now proceed with iteration between $q(\\scaleScalar)$,\n",
    "    $q(\\mappingFunction)$"
   ]
  },
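  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a sketch of how the iteration looks for a concrete choice, assume an\n",
    "exponential measure over each weight,\n",
    "$m(\\scaleScalar_i) = \\lambda \\exp(-\\lambda \\scaleScalar_i)$ (the rate\n",
    "$\\lambda$ is an assumption of this sketch, chosen to match `lambd` in the code).\n",
    "The factorized update for $q(\\scaleScalar)$ is then itself exponential, $$\n",
    "q(\\scaleScalar_i) \\propto \\exp\\left(-\\left(\\lambda + \\beta\n",
    "\\expectationDist{L(\\dataScalar_i, \\mappingFunction(\\inputVector_i))}{q(\\mappingFunction)}\\right)\\scaleScalar_i\\right),\n",
    "$$ with mean $$\n",
    "\\expectationDist{\\scaleScalar_i}{q(\\scaleScalar)} = \\frac{1}{\\lambda + \\beta\n",
    "\\expectationDist{L(\\dataScalar_i, \\mappingFunction(\\inputVector_i))}{q(\\mappingFunction)}}.\n",
    "$$ For the quadratic loss with a Gaussian $q(\\mappingFunction)$ whose marginal at\n",
    "$\\inputVector_i$ has mean $\\bar{f}_i$ and variance $c_i$ (symbols\n",
    "introduced here for illustration), the expected loss is\n",
    "$(\\dataScalar_i - \\bar{f}_i)^2 + c_i$."
   ]
  },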
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class BLML(mlai.BLM):\n",
    "    \"\"\"Bayesian Linear model with evolving loss\n",
    "    :param X: input values\n",
    "    :type X: numpy.ndarray\n",
    "    :param y: target values\n",
    "    :type y: numpy.ndarray\n",
    "    :param basis: basis function \n",
    "    :param type: function\n",
    "    :param beta: weight of the loss function\n",
    "    :param type: float\"\"\"\n",
    "\n",
    "    def __init__(self, X, y, basis=None, alpha=1.0, beta=1.0, lambd=1.0):\n",
    "        \"Initialise\"\n",
    "        if basis is None:\n",
    "            basis = mlai.basis(mlai.polynomial, number=2)\n",
    "        mlai.BLM.__init__(self, X, y, basis=basis, alpha=alpha, sigma2=1/beta)\n",
    "        self.s = np.ones((self.num_data, 1))#np.random.rand(self.num_data, 1)>0.5       \n",
    "        self.update_w()\n",
    "        self.beta = beta\n",
    "        self.name = 'BLML_'+basis.function.__name__\n",
    "        self.objective_name = 'Weighted Sum of Square Training Error'\n",
    "        self.lambd = lambd     \n",
    "\n",
    "    def update_QR(self):\n",
    "        \"Perform the QR decomposition on the basis matrix.\"\n",
    "        self.Q, self.R = np.linalg.qr(np.vstack([self.Phi*np.sqrt(self.s), np.sqrt(self.sigma2/self.alpha)*np.eye(self.basis.number)]))\n",
    "\n",
    "    def fit(self):\n",
    "        \"\"\"Minimize the objective function with respect to the parameters\"\"\"\n",
    "        for i in range(30):\n",
    "            self.update_w()\n",
    "            self.update_s()\n",
    "        \n",
    "    def update_w(self):\n",
    "        self.update_QR()\n",
    "        self.QTy = np.dot(self.Q[:self.y.shape[0], :].T, self.y*np.sqrt(self.s))\n",
    "        self.mu_w = sp.linalg.solve_triangular(self.R, self.QTy)\n",
    "        self.RTinv = sp.linalg.solve_triangular(self.R, np.eye(self.R.shape[0]), trans='T')\n",
    "        self.C_w = np.dot(self.RTinv, self.RTinv.T)\n",
    "        self.update_losses()\n",
    "\n",
    "    def update_s(self):\n",
    "        \"\"\"Update the weights\"\"\"\n",
    "        self.s = 1/(self.lambd + self.beta*self.losses)\n",
    "                                                 \n",
    "    def update_losses(self):\n",
    "        \"\"\"Compute the loss functions for each data point.\"\"\"\n",
    "        self.update_f()\n",
    "        self.losses = ((self.y-self.f_bar)**2) + self.f_cov[:, np.newaxis]\n",
    "        self.beta = 1/(self.losses*self.s).mean()\n",
    "        self.sigma2=1/self.beta\n",
    "        \n",
    "\n",
    " "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = BLML(x, y, basis=basis, alpha=1000, lambd=1, beta=1)\n",
    "model2 = mlai.BLM(x, y, basis=basis, alpha=1000, sigma2=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model.fit()\n",
    "model2.fit()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "x_test = np.linspace(data_limits[0], data_limits[1], 130)[:, None]\n",
    "f_test, f_var = model.predict(x_test)\n",
    "f2_test, f2_var = model2.predict(x_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import gp_tutorial"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, ax = plt.subplots(figsize=plot.big_wide_figsize)\n",
    "from matplotlib import rc, rcParams\n",
    "rcParams.update({'font.size': 22})\n",
    "rc('text', usetex=True)\n",
    "gp_tutorial.gpplot(x_test, f2_test, f2_test - 2*np.sqrt(f2_var), f2_test + 2*np.sqrt(f2_var), ax=ax, edgecol='r', fillcol='#CC3300')\n",
    "ax.plot(x, y, 'g.', markersize=10)\n",
    "ax.set_xlim(data_limits[0], data_limits[1])\n",
    "ax.set_xlabel('year')\n",
    "ax.set_ylabel('pace min/km')\n",
    "_ = ax.set_ylim(2, 6)\n",
    "mlai.write_figure('../slides/diagrams/ml/olympic-loss-bayes-linear-regression000.svg', transparent=True)\n",
    "gp_tutorial.gpplot(x_test, f_test, f_test - 2*np.sqrt(f_var), f_test + 2*np.sqrt(f_var), ax=ax, edgecol='b', fillcol='#0033CC')\n",
    "#ax.plot(x_test, f_test, linewidth=3, color='b')\n",
    "ax.plot(x, y, 'g.', markersize=10)\n",
    "ax2 = ax.twinx()\n",
    "ax2.bar(x.flatten(), model.s.flatten(), width=2, color='b')\n",
    "ax2.set_ylim(0, 0.2)\n",
    "ax2.set_yticks([0, 0.1, 0.2])\n",
    "ax2.set_ylabel('$\\langle s_i \\\\rangle$')\n",
    "mlai.write_figure('../slides/diagrams/ml/olympic-loss-bayes-linear-regression001.svg', transparent=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pods\n",
    "pods.notebook.display_plots('olympic-loss-bayes-linear-regression{number:0>3}.svg', \n",
    "                            directory='../slides/diagrams/ml', number=(0, 1))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"../slides/diagrams/ml/olympic-loss-bayes-linear-regression001.svg\" align=\"\">\n",
    "<center>\n",
    "*Probabilistic linear regression for the standard quadratic loss in\n",
    "*red\\* and the probabilistically weighted loss in *blue*. \\*\n",
    "</center>\n",
    "### Correlated Scales\n",
    "\n",
    "Going beyond independence between weights, we now consider $m(\\vScalar)$\n",
    "to be a Gaussian process, and scale by the *square* of $\\vScalar$,\n",
    "$\\scaleScalar=\\vScalar^2$ $$\n",
    "\\vScalar \\sim \\mathcal{GP}\\left(\\meanScalar(\\inputVector), \\kernel(\\inputVector, \\inputVector^\\prime)\\right)\n",
    "$$\n",
    "\n",
    "$$\n",
    "q(\\vScalar) \\propto\n",
    "\\prod_{i=1}^\\numData \\exp\\left(- \\beta \\vScalar_i^2 L(\\dataScalar_i,\n",
    "\\mappingFunction(\\inputVector_i)) \\right)\n",
    "\\exp\\left(-\\frac{1}{2}(\\vVector-\\meanVector)^\\top \\kernelMatrix^{-1}\n",
    "(\\vVector-\\meanVector)\\right)\n",
    "$$ where $\\kernelMatrix$ is the covariance of the process made up of\n",
    "elements taken from the covariance function,\n",
    "$\\kernelScalar(\\inputVector, t, \\dataVector; \\inputVector^\\prime, t^\\prime, \\dataVector^\\prime)$\n",
    "so $q(\\vScalar)$ itself is Gaussian with covariance $$\n",
    "\\covarianceMatrix = \\left(\\beta\\mathbf{L} + \\kernelMatrix^{-1}\\right)^{-1}\n",
    "$$ and mean $$\n",
    "\\meanTwoVector = \\beta\\covarianceMatrix\\mathbf{L}\\meanVector\n",
    "$$ where $\\mathbf{L}$ is a matrix containing the loss functions,\n",
    "$L(\\dataScalar_i, \\mappingFunction(\\inputVector_i))$ along its diagonal\n",
    "elements with zeros elsewhere.\n",
    "\n",
    "The update is given by $$\n",
    "\\expectationDist{\\vScalar_i^2}{q(\\vScalar)} = \\meanTwoScalar_i^2 +\n",
    "\\covarianceScalar_{i, i}.\n",
    "$$ To compare with before, if the mean of the measure $m(\\vScalar)$ was\n",
    "zero and the prior covariance was spherical,\n",
    "$\\kernelMatrix=\\lambda^{-1}\\eye$. Then this would equate to an update,\n",
    "$$\n",
    "\\expectationDist{\\vScalar_i^2}{q(\\vScalar)} = \\frac{1}{\\lambda + \\beta L_i}\n",
    "$$ which is the same as we had before for the exponential prior over\n",
    "$\\scaleScalar$.\n",
    "\n",
    "### Conditioning the Measure\n",
    "\n",
    "Now that we have defined a process over $\\vScalar$, we could define a\n",
    "region in which we're certain that we would like the weights to be high.\n",
    "For example, if we were looking to have a test point at location\n",
    "$\\inputVector_\\ast$, we could update our measure to be a Gaussian\n",
    "process that is conditioned on the observation of $\\vScalar_\\ast$ set\n",
    "appropriately at $\\inputScalar_\\ast$. In this case we have,\n",
    "\n",
    "$$\n",
    "\\kernelMatrix^\\prime = \\kernelMatrix - \\frac{\\kernelVector_\\ast\\kernelVector^\\top_\\ast}{\\kernelScalar_{*,*}}\n",
    "$$ and $$\n",
    "\\meanVector^\\prime = \\meanVector + \\frac{\\kernelVector_\\ast}{\\kernelScalar_{*,*}}\n",
    "(\\vScalar_\\ast-\\meanScalar)\n",
    "$$ where $\\kernelScalar_\\ast$ is the vector computed through the\n",
    "covariance function between the training data $\\inputMatrix$ and the\n",
    "proposed point that we are conditioning the scale upon,\n",
    "$\\inputVector_\\ast$ and $\\kernelScalar_{*,*}$ is the covariance function\n",
    "computed for $\\inputVector_\\ast$. Now the updated mean and covariance\n",
    "can be used in the maximum entropy formulation as before. $$\n",
    "q(\\vScalar) \\propto \\prod_{i=1}^\\numData \\exp\\left(-\n",
    "\\beta \\vScalar_i^2 L(\\dataScalar_i, \\mappingFunction(\\inputVector_i)) \\right)\n",
    "\\exp\\left(-\\frac{1}{2}(\\vVector-\\meanVector^\\prime)^\\top\n",
    "\\left.\\kernelMatrix^\\prime\\right.^{-1} (\\vVector-\\meanVector^\\prime)\\right)\n",
    "$$\n",
    "\n",
    "We will consider the same data set as above. We first create a Gaussian\n",
    "process model for the update."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class GPL(mlai.GP):\n",
    "    def __init__(self, X, losses, kernel, beta=1.0, mu=0.0, X_star=None, v_star=None):\n",
    "        # Bring together locations\n",
    "        self.kernel = kernel\n",
    "        self.K = self.kernel.K(X)\n",
    "        self.mu = np.ones((X.shape[0],1))*mu\n",
    "        self.beta = beta\n",
    "        if X_star is not None:\n",
    "            kstar = kernel.K(X, X_star)\n",
    "            kstarstar = kernel.K(X_star, X_star)\n",
    "            kstarstarInv = np.linalg.inv(kstarstar)\n",
    "            kskssInv = np.dot(kstar, kstarstarInv)\n",
    "            self.K -= np.dot(kskssInv,kstar.T)\n",
    "            if v_star is not None:\n",
    "                self.mu = kskssInv*(v_star-self.mu)+self.mu\n",
    "                Xaug = np.vstack((X, X_star))\n",
    "            else:\n",
    "                raise ValueError(\"v_star should not be None when X_star is None\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class BLMLGP(BLML):\n",
    "    def __init__(self, X, y, basis=None, kernel=None, beta=1.0, mu=0.0, alpha=1.0, X_star=None, v_star=None):\n",
    "        BLML.__init__(self, X, y, basis=basis, alpha=alpha, beta=beta, lambd=None)\n",
    "        self.gp_model=GPL(self.X, self.losses, kernel=kernel, beta=beta, mu=mu, X_star=X_star, v_star=v_star)\n",
    "    def update_s(self):\n",
    "        \"\"\"Update the weights\"\"\"\n",
    "        self.gp_model.C = sp.linalg.inv(sp.linalg.inv(self.gp_model.K+np.eye(self.X.shape[0])*1e-6) + self.beta*np.diag(self.losses.flatten()))\n",
    "        self.gp_model.diagC = np.diag(self.gp_model.C)[:, np.newaxis]\n",
    "        self.gp_model.f = self.gp_model.beta*np.dot(np.dot(self.gp_model.C,np.diag(self.losses.flatten())),self.gp_model.mu) +self.gp_model.mu\n",
    "        \n",
    "        #f, v = self.gp_model.K self.gp_model.predict(self.X)\n",
    "        self.s = self.gp_model.f*self.gp_model.f + self.gp_model.diagC # + 1.0/(self.losses*self.gp_model.beta)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = BLMLGP(x, y, \n",
    "           basis=basis, \n",
    "           kernel=mlai.kernel(mlai.eq_cov, lengthscale=20, variance=1.0),\n",
    "           mu=0.0,\n",
    "           beta=1.0, \n",
    "           alpha=1000,\n",
    "           X_star=np.asarray([[2020]]), \n",
    "           v_star=np.asarray([[1]]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model.fit()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "f_test, f_var = model.predict(x_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, ax = plt.subplots(figsize=plot.big_wide_figsize)\n",
    "ax.cla()\n",
    "from matplotlib import rc, rcParams\n",
    "rcParams.update({'font.size': 22})\n",
    "rc('text', usetex=True)\n",
    "gp_tutorial.gpplot(x_test, f2_test, f2_test - 2*np.sqrt(f2_var), f2_test + 2*np.sqrt(f2_var), ax=ax, edgecol='r', fillcol='#CC3300')\n",
    "ax.plot(x, y, 'g.', markersize=10)\n",
    "ax.set_xlim(data_limits[0], data_limits[1])\n",
    "ax.set_xlabel('year')\n",
    "ax.set_ylabel('pace min/km')\n",
    "_ = ax.set_ylim(2, 6)\n",
    "mlai.write_figure('../slides/diagrams/ml/olympic-gp-loss-bayes-linear-regression000.svg', transparent=True)\n",
    "gp_tutorial.gpplot(x_test, f_test, f_test - 2*np.sqrt(f_var), f_test + 2*np.sqrt(f_var), ax=ax, edgecol='b', fillcol='#0033CC')\n",
    "#ax.plot(x_test, f_test, linewidth=3, color='b')\n",
    "ax.plot(x, y, 'g.', markersize=10)\n",
    "ax2 = ax.twinx()\n",
    "ax2.bar(x.flatten(), model.s.flatten(), width=2, color='b')\n",
    "ax2.set_ylim(0, 3)\n",
    "ax2.set_yticks([0, 0.5, 1])\n",
    "ax2.set_ylabel('$\\langle s_i \\\\rangle$')\n",
    "mlai.write_figure('../slides/diagrams/ml/olympic-gp-loss-bayes-linear-regression001.svg', transparent=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pods"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pods.notebook.display_plots('olympic-gp-loss-bayes-linear-regression{number:0>3}.svg', \n",
    "                            directory='../slides/diagrams/ml', number=(0, 1))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"../slides/diagrams/ml/olympic-gp-loss-bayes-linear-regression001.svg\" align=\"\">\n",
    "<center>\n",
    "*Probabilistic linear regression for the standard quadratic loss in\n",
    "*red\\* and the probabilistically weighted loss with a Gaussian process\n",
    "measure in *blue*. \\*\n",
    "</center>\n",
    "Finally, we make an attempt to show the joint uncertainty by first of\n",
    "all sampling from the loss function weights density, $q(\\scaleScalar)$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, ax = plt.subplots(figsize=plot.big_wide_figsize)\n",
    "num_samps=10\n",
    "samps=np.random.multivariate_normal(model.gp_model.f.flatten(), model.gp_model.C, size=100).T**2\n",
    "ax.plot(x, samps, '-x', markersize=10, linewidth=2)\n",
    "ax.set_xlim(data_limits[0], data_limits[1])\n",
    "ax.set_xlabel('year')\n",
    "_ = ax.set_ylabel('$s_i$')\n",
    "mlai.write_figure('../slides/diagrams/ml/olympic-gp-loss-samples.svg', transparent=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"../slides/diagrams/ml/olympic-gp-loss-samples.svg\" align=\"\">\n",
    "<center>\n",
    "*Samples of loss weightings from the density $q(\\scaleSamples)$. *\n",
    "</center>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, ax = plt.subplots(figsize=plot.big_wide_figsize)\n",
    "ax.plot(x, y, 'r.', markersize=10)\n",
    "ax.set_xlim(data_limits[0], data_limits[1])\n",
    "ax.set_ylim(2, 6)\n",
    "ax.set_xlabel('year')\n",
    "ax.set_ylabel('pace min/km')\n",
    "gp_tutorial.gpplot(x_test, f_test, f_test - 2*np.sqrt(f_var), f_test + 2*np.sqrt(f_var), ax=ax, edgecol='b', fillcol='#0033CC')\n",
    "mlai.write_figure('../slides/diagrams/ml/olympic-gp-loss-bayes-linear-regression-and-samples000.svg', transparent=True)\n",
    "allsamps = []\n",
    "for i in range(samps.shape[1]):\n",
    "    model.s = samps[:, i:i+1]\n",
    "    model.update_w()\n",
    "    f_bar, f_cov =model.predict(x_test, full_cov=True)\n",
    "    f_samp = np.random.multivariate_normal(f_bar.flatten(), f_cov, size=10).T\n",
    "    ax.plot(x_test, f_samp, linewidth=0.5, color='k')\n",
    "    allsamps+=list(f_samp[-1, :])\n",
    "mlai.write_figure('../slides/diagrams/ml/olympic-gp-loss-bayes-linear-regression-and-samples001.svg', transparent=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pods\n",
    "pods.notebook.display_plots('olympic-gp-loss-bayes-linear-regression-and-samples{number:0>3}.svg', \n",
    "                            directory='../slides/diagrams/ml', number=(0, 1))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"../slides/diagrams/ml/olympic-gp-loss-bayes-linear-regression-and-samples001.svg\" align=\"\">\n",
    "<center>\n",
    "*Samples from the joint density of loss weightings and regression\n",
    "weights show the full distribution of function predictions. *\n",
    "</center>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, ax = plt.subplots(figsize=plot.big_figsize)\n",
    "ax.hist(np.asarray(allsamps), bins=30, density=True)\n",
    "ax.set_xlabel='pace min/kim'\n",
    "mlai.write_figure('../slides/diagrams/ml/olympic-gp-loss-histogram-2020.svg', transparent=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"../slides/diagrams/ml/olympic-gp-loss-histogram-2020.svg\" align=\"\">\n",
    "<center>\n",
    "*Histogram of samples from the year 2020, where the weight of the loss\n",
    "function was pinned to ensure that the model focussed its predictions on\n",
    "this region for test data. *\n",
    "</center>\n",
    "### Conclusions\n",
    "\n",
    "-   Maximum Entropy Framework for uncertainty in\n",
    "    -   Loss functions\n",
    "    -   Prediction functions\n",
    "\n",
    "### Thanks!\n",
    "\n",
    "-   twitter: @lawrennd\n",
    "-   blog:\n",
    "    [http://inverseprobability.com](http://inverseprobability.com/blog.html)"
   ]
  }
 ],
 "metadata": {},
 "nbformat": 4,
 "nbformat_minor": 2
}