{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Uncertainty in Loss Functions\n", "### [Neil D. Lawrence](http://inverseprobability.com), Amazon Cambridge and University of Sheffield\n", "### 2018-05-29\n", "\n", "**Abstract**: Bayesian formalisms deal with uncertainty in parameters, frequentist\n", "formalisms deal with the *risk* of a data set, uncertainty in the data\n", "sample. In this talk, we consider uncertainty in the *loss function*.\n", "Uncertainty in the loss function. We introduce uncertainty through\n", "linear weightings of terms in the loss function and show how a\n", "distribution over the loss can be maintained through the *maximum\n", "entropy principle*. This allows us minimize the expected loss under our\n", "maximum entropy distribution of the loss function. We recover weighted\n", "least squares and a LOESS-like regression from the formalism.\n", "\n", "$$\n", "\\newcommand{\\Amatrix}{\\mathbf{A}}\n", "\\newcommand{\\KL}[2]{\\text{KL}\\left( #1\\,\\|\\,#2 \\right)}\n", "\\newcommand{\\Kaast}{\\kernelMatrix_{\\mathbf{ \\ast}\\mathbf{ \\ast}}}\n", "\\newcommand{\\Kastu}{\\kernelMatrix_{\\mathbf{ \\ast} \\inducingVector}}\n", "\\newcommand{\\Kff}{\\kernelMatrix_{\\mappingFunctionVector \\mappingFunctionVector}}\n", "\\newcommand{\\Kfu}{\\kernelMatrix_{\\mappingFunctionVector \\inducingVector}}\n", "\\newcommand{\\Kuast}{\\kernelMatrix_{\\inducingVector \\bf\\ast}}\n", "\\newcommand{\\Kuf}{\\kernelMatrix_{\\inducingVector \\mappingFunctionVector}}\n", "\\newcommand{\\Kuu}{\\kernelMatrix_{\\inducingVector \\inducingVector}}\n", "\\newcommand{\\Kuui}{\\Kuu^{-1}}\n", "\\newcommand{\\Qaast}{\\mathbf{Q}_{\\bf \\ast \\ast}}\n", "\\newcommand{\\Qastf}{\\mathbf{Q}_{\\ast \\mappingFunction}}\n", "\\newcommand{\\Qfast}{\\mathbf{Q}_{\\mappingFunctionVector \\bf \\ast}}\n", "\\newcommand{\\Qff}{\\mathbf{Q}_{\\mappingFunctionVector \\mappingFunctionVector}}\n", "\\newcommand{\\aMatrix}{\\mathbf{A}}\n", "\\newcommand{\\aScalar}{a}\n", 
"\\newcommand{\\aVector}{\\mathbf{a}}\n", "\\newcommand{\\acceleration}{a}\n", "\\newcommand{\\bMatrix}{\\mathbf{B}}\n", "\\newcommand{\\bScalar}{b}\n", "\\newcommand{\\bVector}{\\mathbf{b}}\n", "\\newcommand{\\basisFunc}{\\phi}\n", "\\newcommand{\\basisFuncVector}{\\boldsymbol{ \\basisFunc}}\n", "\\newcommand{\\basisFunction}{\\phi}\n", "\\newcommand{\\basisLocation}{\\mu}\n", "\\newcommand{\\basisMatrix}{\\boldsymbol{ \\Phi}}\n", "\\newcommand{\\basisScalar}{\\basisFunction}\n", "\\newcommand{\\basisVector}{\\boldsymbol{ \\basisFunction}}\n", "\\newcommand{\\activationFunction}{\\phi}\n", "\\newcommand{\\activationMatrix}{\\boldsymbol{ \\Phi}}\n", "\\newcommand{\\activationScalar}{\\basisFunction}\n", "\\newcommand{\\activationVector}{\\boldsymbol{ \\basisFunction}}\n", "\\newcommand{\\bigO}{\\mathcal{O}}\n", "\\newcommand{\\binomProb}{\\pi}\n", "\\newcommand{\\cMatrix}{\\mathbf{C}}\n", "\\newcommand{\\cbasisMatrix}{\\hat{\\boldsymbol{ \\Phi}}}\n", "\\newcommand{\\cdataMatrix}{\\hat{\\dataMatrix}}\n", "\\newcommand{\\cdataScalar}{\\hat{\\dataScalar}}\n", "\\newcommand{\\cdataVector}{\\hat{\\dataVector}}\n", "\\newcommand{\\centeredKernelMatrix}{\\mathbf{ \\MakeUppercase{\\centeredKernelScalar}}}\n", "\\newcommand{\\centeredKernelScalar}{b}\n", "\\newcommand{\\centeredKernelVector}{\\centeredKernelScalar}\n", "\\newcommand{\\centeringMatrix}{\\mathbf{H}}\n", "\\newcommand{\\chiSquaredDist}[2]{\\chi_{#1}^{2}\\left(#2\\right)}\n", "\\newcommand{\\chiSquaredSamp}[1]{\\chi_{#1}^{2}}\n", "\\newcommand{\\conditionalCovariance}{\\boldsymbol{ \\Sigma}}\n", "\\newcommand{\\coregionalizationMatrix}{\\mathbf{B}}\n", "\\newcommand{\\coregionalizationScalar}{b}\n", "\\newcommand{\\coregionalizationVector}{\\mathbf{ \\coregionalizationScalar}}\n", "\\newcommand{\\covDist}[2]{\\text{cov}_{#2}\\left(#1\\right)}\n", "\\newcommand{\\covSamp}[1]{\\text{cov}\\left(#1\\right)}\n", "\\newcommand{\\covarianceScalar}{c}\n", "\\newcommand{\\covarianceVector}{\\mathbf{ 
\\covarianceScalar}}\n", "\\newcommand{\\covarianceMatrix}{\\mathbf{C}}\n", "\\newcommand{\\covarianceMatrixTwo}{\\boldsymbol{ \\Sigma}}\n", "\\newcommand{\\croupierScalar}{s}\n", "\\newcommand{\\croupierVector}{\\mathbf{ \\croupierScalar}}\n", "\\newcommand{\\croupierMatrix}{\\mathbf{ \\MakeUppercase{\\croupierScalar}}}\n", "\\newcommand{\\dataDim}{p}\n", "\\newcommand{\\dataIndex}{i}\n", "\\newcommand{\\dataIndexTwo}{j}\n", "\\newcommand{\\dataMatrix}{\\mathbf{Y}}\n", "\\newcommand{\\dataScalar}{y}\n", "\\newcommand{\\dataSet}{\\mathcal{D}}\n", "\\newcommand{\\dataStd}{\\sigma}\n", "\\newcommand{\\dataVector}{\\mathbf{ \\dataScalar}}\n", "\\newcommand{\\decayRate}{d}\n", "\\newcommand{\\degreeMatrix}{\\mathbf{ \\MakeUppercase{\\degreeScalar}}}\n", "\\newcommand{\\degreeScalar}{d}\n", "\\newcommand{\\degreeVector}{\\mathbf{ \\degreeScalar}}\n", "% Already defined by latex\n", "%\\newcommand{\\det}[1]{\\left|#1\\right|}\n", "\\newcommand{\\diag}[1]{\\text{diag}\\left(#1\\right)}\n", "\\newcommand{\\diagonalMatrix}{\\mathbf{D}}\n", "\\newcommand{\\diff}[2]{\\frac{\\text{d}#1}{\\text{d}#2}}\n", "\\newcommand{\\diffTwo}[2]{\\frac{\\text{d}^2#1}{\\text{d}#2^2}}\n", "\\newcommand{\\displacement}{x}\n", "\\newcommand{\\displacementVector}{\\textbf{\\displacement}}\n", "\\newcommand{\\distanceMatrix}{\\mathbf{ \\MakeUppercase{\\distanceScalar}}}\n", "\\newcommand{\\distanceScalar}{d}\n", "\\newcommand{\\distanceVector}{\\mathbf{ \\distanceScalar}}\n", "\\newcommand{\\eigenvaltwo}{\\ell}\n", "\\newcommand{\\eigenvaltwoMatrix}{\\mathbf{L}}\n", "\\newcommand{\\eigenvaltwoVector}{\\mathbf{l}}\n", "\\newcommand{\\eigenvalue}{\\lambda}\n", "\\newcommand{\\eigenvalueMatrix}{\\boldsymbol{ \\Lambda}}\n", "\\newcommand{\\eigenvalueVector}{\\boldsymbol{ \\lambda}}\n", "\\newcommand{\\eigenvector}{\\mathbf{ \\eigenvectorScalar}}\n", "\\newcommand{\\eigenvectorMatrix}{\\mathbf{U}}\n", "\\newcommand{\\eigenvectorScalar}{u}\n", "\\newcommand{\\eigenvectwo}{\\mathbf{v}}\n", 
"\\newcommand{\\eigenvectwoMatrix}{\\mathbf{V}}\n", "\\newcommand{\\eigenvectwoScalar}{v}\n", "\\newcommand{\\entropy}[1]{\\mathcal{H}\\left(#1\\right)}\n", "\\newcommand{\\errorFunction}{E}\n", "\\newcommand{\\expDist}[2]{\\left<#1\\right>_{#2}}\n", "\\newcommand{\\expSamp}[1]{\\left<#1\\right>}\n", "\\newcommand{\\expectation}[1]{\\left\\langle #1 \\right\\rangle }\n", "\\newcommand{\\expectationDist}[2]{\\left\\langle #1 \\right\\rangle _{#2}}\n", "\\newcommand{\\expectedDistanceMatrix}{\\mathcal{D}}\n", "\\newcommand{\\eye}{\\mathbf{I}}\n", "\\newcommand{\\fantasyDim}{r}\n", "\\newcommand{\\fantasyMatrix}{\\mathbf{ \\MakeUppercase{\\fantasyScalar}}}\n", "\\newcommand{\\fantasyScalar}{z}\n", "\\newcommand{\\fantasyVector}{\\mathbf{ \\fantasyScalar}}\n", "\\newcommand{\\featureStd}{\\varsigma}\n", "\\newcommand{\\gammaCdf}[3]{\\mathcal{GAMMA CDF}\\left(#1|#2,#3\\right)}\n", "\\newcommand{\\gammaDist}[3]{\\mathcal{G}\\left(#1|#2,#3\\right)}\n", "\\newcommand{\\gammaSamp}[2]{\\mathcal{G}\\left(#1,#2\\right)}\n", "\\newcommand{\\gaussianDist}[3]{\\mathcal{N}\\left(#1|#2,#3\\right)}\n", "\\newcommand{\\gaussianSamp}[2]{\\mathcal{N}\\left(#1,#2\\right)}\n", "\\newcommand{\\given}{|}\n", "\\newcommand{\\half}{\\frac{1}{2}}\n", "\\newcommand{\\heaviside}{H}\n", "\\newcommand{\\hiddenMatrix}{\\mathbf{ \\MakeUppercase{\\hiddenScalar}}}\n", "\\newcommand{\\hiddenScalar}{h}\n", "\\newcommand{\\hiddenVector}{\\mathbf{ \\hiddenScalar}}\n", "\\newcommand{\\identityMatrix}{\\eye}\n", "\\newcommand{\\inducingInputScalar}{z}\n", "\\newcommand{\\inducingInputVector}{\\mathbf{ \\inducingInputScalar}}\n", "\\newcommand{\\inducingInputMatrix}{\\mathbf{Z}}\n", "\\newcommand{\\inducingScalar}{u}\n", "\\newcommand{\\inducingVector}{\\mathbf{ \\inducingScalar}}\n", "\\newcommand{\\inducingMatrix}{\\mathbf{U}}\n", "\\newcommand{\\inlineDiff}[2]{\\text{d}#1/\\text{d}#2}\n", "\\newcommand{\\inputDim}{q}\n", "\\newcommand{\\inputMatrix}{\\mathbf{X}}\n", "\\newcommand{\\inputScalar}{x}\n", 
"\\newcommand{\\inputSpace}{\\mathcal{X}}\n", "\\newcommand{\\inputVals}{\\inputVector}\n", "\\newcommand{\\inputVector}{\\mathbf{ \\inputScalar}}\n", "\\newcommand{\\iterNum}{k}\n", "\\newcommand{\\kernel}{\\kernelScalar}\n", "\\newcommand{\\kernelMatrix}{\\mathbf{K}}\n", "\\newcommand{\\kernelScalar}{k}\n", "\\newcommand{\\kernelVector}{\\mathbf{ \\kernelScalar}}\n", "\\newcommand{\\kff}{\\kernelScalar_{\\mappingFunction \\mappingFunction}}\n", "\\newcommand{\\kfu}{\\kernelVector_{\\mappingFunction \\inducingScalar}}\n", "\\newcommand{\\kuf}{\\kernelVector_{\\inducingScalar \\mappingFunction}}\n", "\\newcommand{\\kuu}{\\kernelVector_{\\inducingScalar \\inducingScalar}}\n", "\\newcommand{\\lagrangeMultiplier}{\\lambda}\n", "\\newcommand{\\lagrangeMultiplierMatrix}{\\boldsymbol{ \\Lambda}}\n", "\\newcommand{\\lagrangian}{L}\n", "\\newcommand{\\laplacianFactor}{\\mathbf{ \\MakeUppercase{\\laplacianFactorScalar}}}\n", "\\newcommand{\\laplacianFactorScalar}{m}\n", "\\newcommand{\\laplacianFactorVector}{\\mathbf{ \\laplacianFactorScalar}}\n", "\\newcommand{\\laplacianMatrix}{\\mathbf{L}}\n", "\\newcommand{\\laplacianScalar}{\\ell}\n", "\\newcommand{\\laplacianVector}{\\mathbf{ \\ell}}\n", "\\newcommand{\\latentDim}{q}\n", "\\newcommand{\\latentDistanceMatrix}{\\boldsymbol{ \\Delta}}\n", "\\newcommand{\\latentDistanceScalar}{\\delta}\n", "\\newcommand{\\latentDistanceVector}{\\boldsymbol{ \\delta}}\n", "\\newcommand{\\latentForce}{f}\n", "\\newcommand{\\latentFunction}{u}\n", "\\newcommand{\\latentFunctionVector}{\\mathbf{ \\latentFunction}}\n", "\\newcommand{\\latentFunctionMatrix}{\\mathbf{ \\MakeUppercase{\\latentFunction}}}\n", "\\newcommand{\\latentIndex}{j}\n", "\\newcommand{\\latentScalar}{z}\n", "\\newcommand{\\latentVector}{\\mathbf{ \\latentScalar}}\n", "\\newcommand{\\latentMatrix}{\\mathbf{Z}}\n", "\\newcommand{\\learnRate}{\\eta}\n", "\\newcommand{\\lengthScale}{\\ell}\n", "\\newcommand{\\rbfWidth}{\\ell}\n", 
"\\newcommand{\\likelihoodBound}{\\mathcal{L}}\n", "\\newcommand{\\likelihoodFunction}{L}\n", "\\newcommand{\\locationScalar}{\\mu}\n", "\\newcommand{\\locationVector}{\\boldsymbol{ \\locationScalar}}\n", "\\newcommand{\\locationMatrix}{\\mathbf{M}}\n", "\\newcommand{\\variance}[1]{\\text{var}\\left( #1 \\right)}\n", "\\newcommand{\\mappingFunction}{f}\n", "\\newcommand{\\mappingFunctionMatrix}{\\mathbf{F}}\n", "\\newcommand{\\mappingFunctionTwo}{g}\n", "\\newcommand{\\mappingFunctionTwoMatrix}{\\mathbf{G}}\n", "\\newcommand{\\mappingFunctionTwoVector}{\\mathbf{ \\mappingFunctionTwo}}\n", "\\newcommand{\\mappingFunctionVector}{\\mathbf{ \\mappingFunction}}\n", "\\newcommand{\\scaleScalar}{s}\n", "\\newcommand{\\mappingScalar}{w}\n", "\\newcommand{\\mappingVector}{\\mathbf{ \\mappingScalar}}\n", "\\newcommand{\\mappingMatrix}{\\mathbf{W}}\n", "\\newcommand{\\mappingScalarTwo}{v}\n", "\\newcommand{\\mappingVectorTwo}{\\mathbf{ \\mappingScalarTwo}}\n", "\\newcommand{\\mappingMatrixTwo}{\\mathbf{V}}\n", "\\newcommand{\\maxIters}{K}\n", "\\newcommand{\\meanMatrix}{\\mathbf{M}}\n", "\\newcommand{\\meanScalar}{\\mu}\n", "\\newcommand{\\meanTwoMatrix}{\\mathbf{M}}\n", "\\newcommand{\\meanTwoScalar}{m}\n", "\\newcommand{\\meanTwoVector}{\\mathbf{ \\meanTwoScalar}}\n", "\\newcommand{\\meanVector}{\\boldsymbol{ \\meanScalar}}\n", "\\newcommand{\\mrnaConcentration}{m}\n", "\\newcommand{\\naturalFrequency}{\\omega}\n", "\\newcommand{\\neighborhood}[1]{\\mathcal{N}\\left( #1 \\right)}\n", "\\newcommand{\\neilurl}{http://inverseprobability.com/}\n", "\\newcommand{\\noiseMatrix}{\\boldsymbol{ E}}\n", "\\newcommand{\\noiseScalar}{\\epsilon}\n", "\\newcommand{\\noiseVector}{\\boldsymbol{ \\epsilon}}\n", "\\newcommand{\\norm}[1]{\\left\\Vert #1 \\right\\Vert}\n", "\\newcommand{\\normalizedLaplacianMatrix}{\\hat{\\mathbf{L}}}\n", "\\newcommand{\\normalizedLaplacianScalar}{\\hat{\\ell}}\n", "\\newcommand{\\normalizedLaplacianVector}{\\hat{\\mathbf{ \\ell}}}\n", 
"\\newcommand{\\numActive}{m}\n", "\\newcommand{\\numBasisFunc}{m}\n", "\\newcommand{\\numComponents}{m}\n", "\\newcommand{\\numComps}{K}\n", "\\newcommand{\\numData}{n}\n", "\\newcommand{\\numFeatures}{K}\n", "\\newcommand{\\numHidden}{h}\n", "\\newcommand{\\numInducing}{m}\n", "\\newcommand{\\numLayers}{\\ell}\n", "\\newcommand{\\numNeighbors}{K}\n", "\\newcommand{\\numSequences}{s}\n", "\\newcommand{\\numSuccess}{s}\n", "\\newcommand{\\numTasks}{m}\n", "\\newcommand{\\numTime}{T}\n", "\\newcommand{\\numTrials}{S}\n", "\\newcommand{\\outputIndex}{j}\n", "\\newcommand{\\paramVector}{\\boldsymbol{ \\theta}}\n", "\\newcommand{\\parameterMatrix}{\\boldsymbol{ \\Theta}}\n", "\\newcommand{\\parameterScalar}{\\theta}\n", "\\newcommand{\\parameterVector}{\\boldsymbol{ \\parameterScalar}}\n", "\\newcommand{\\partDiff}[2]{\\frac{\\partial#1}{\\partial#2}}\n", "\\newcommand{\\precisionScalar}{j}\n", "\\newcommand{\\precisionVector}{\\mathbf{ \\precisionScalar}}\n", "\\newcommand{\\precisionMatrix}{\\mathbf{J}}\n", "\\newcommand{\\pseudotargetScalar}{\\widetilde{y}}\n", "\\newcommand{\\pseudotargetVector}{\\mathbf{ \\pseudotargetScalar}}\n", "\\newcommand{\\pseudotargetMatrix}{\\mathbf{ \\widetilde{Y}}}\n", "\\newcommand{\\rank}[1]{\\text{rank}\\left(#1\\right)}\n", "\\newcommand{\\rayleighDist}[2]{\\mathcal{R}\\left(#1|#2\\right)}\n", "\\newcommand{\\rayleighSamp}[1]{\\mathcal{R}\\left(#1\\right)}\n", "\\newcommand{\\responsibility}{r}\n", "\\newcommand{\\rotationScalar}{r}\n", "\\newcommand{\\rotationVector}{\\mathbf{ \\rotationScalar}}\n", "\\newcommand{\\rotationMatrix}{\\mathbf{R}}\n", "\\newcommand{\\sampleCovScalar}{s}\n", "\\newcommand{\\sampleCovVector}{\\mathbf{ \\sampleCovScalar}}\n", "\\newcommand{\\sampleCovMatrix}{\\mathbf{s}}\n", "\\newcommand{\\scalarProduct}[2]{\\left\\langle{#1},{#2}\\right\\rangle}\n", "\\newcommand{\\sign}[1]{\\text{sign}\\left(#1\\right)}\n", "\\newcommand{\\sigmoid}[1]{\\sigma\\left(#1\\right)}\n", 
"\\newcommand{\\singularvalue}{\\ell}\n", "\\newcommand{\\singularvalueMatrix}{\\mathbf{L}}\n", "\\newcommand{\\singularvalueVector}{\\mathbf{l}}\n", "\\newcommand{\\sorth}{\\mathbf{u}}\n", "\\newcommand{\\spar}{\\lambda}\n", "\\newcommand{\\trace}[1]{\\text{tr}\\left(#1\\right)}\n", "\\newcommand{\\BasalRate}{B}\n", "\\newcommand{\\DampingCoefficient}{C}\n", "\\newcommand{\\DecayRate}{D}\n", "\\newcommand{\\Displacement}{X}\n", "\\newcommand{\\LatentForce}{F}\n", "\\newcommand{\\Mass}{M}\n", "\\newcommand{\\Sensitivity}{S}\n", "\\newcommand{\\basalRate}{b}\n", "\\newcommand{\\dampingCoefficient}{c}\n", "\\newcommand{\\mass}{m}\n", "\\newcommand{\\sensitivity}{s}\n", "\\newcommand{\\springScalar}{\\kappa}\n", "\\newcommand{\\springVector}{\\boldsymbol{ \\kappa}}\n", "\\newcommand{\\springMatrix}{\\boldsymbol{ \\mathcal{K}}}\n", "\\newcommand{\\tfConcentration}{p}\n", "\\newcommand{\\tfDecayRate}{\\delta}\n", "\\newcommand{\\tfMrnaConcentration}{f}\n", "\\newcommand{\\tfVector}{\\mathbf{ \\tfConcentration}}\n", "\\newcommand{\\velocity}{v}\n", "\\newcommand{\\sufficientStatsScalar}{g}\n", "\\newcommand{\\sufficientStatsVector}{\\mathbf{ \\sufficientStatsScalar}}\n", "\\newcommand{\\sufficientStatsMatrix}{\\mathbf{G}}\n", "\\newcommand{\\switchScalar}{s}\n", "\\newcommand{\\switchVector}{\\mathbf{ \\switchScalar}}\n", "\\newcommand{\\switchMatrix}{\\mathbf{S}}\n", "\\newcommand{\\tr}[1]{\\text{tr}\\left(#1\\right)}\n", "\\newcommand{\\loneNorm}[1]{\\left\\Vert #1 \\right\\Vert_1}\n", "\\newcommand{\\ltwoNorm}[1]{\\left\\Vert #1 \\right\\Vert_2}\n", "\\newcommand{\\onenorm}[1]{\\left\\vert#1\\right\\vert_1}\n", "\\newcommand{\\twonorm}[1]{\\left\\Vert #1 \\right\\Vert}\n", "\\newcommand{\\vScalar}{v}\n", "\\newcommand{\\vVector}{\\mathbf{v}}\n", "\\newcommand{\\vMatrix}{\\mathbf{V}}\n", "\\newcommand{\\varianceDist}[2]{\\text{var}_{#2}\\left( #1 \\right)}\n", "% Already defined by latex\n", "%\\newcommand{\\vec}{#1:}\n", 
"\\newcommand{\\vecb}[1]{\\left(#1\\right):}\n", "\\newcommand{\\weightScalar}{w}\n", "\\newcommand{\\weightVector}{\\mathbf{ \\weightScalar}}\n", "\\newcommand{\\weightMatrix}{\\mathbf{W}}\n", "\\newcommand{\\weightedAdjacencyMatrix}{\\mathbf{A}}\n", "\\newcommand{\\weightedAdjacencyScalar}{a}\n", "\\newcommand{\\weightedAdjacencyVector}{\\mathbf{ \\weightedAdjacencyScalar}}\n", "\\newcommand{\\onesVector}{\\mathbf{1}}\n", "\\newcommand{\\zerosVector}{\\mathbf{0}}\n", "$$\n", "\n", "## What is Machine Learning?\n", "\n", "What is machine learning? At its most basic level machine learning is a\n", "combination of\n", "\n", "$$ \\text{data} + \\text{model} \\xrightarrow{\\text{compute}} \\text{prediction}$$\n", "\n", "where *data* is our observations. They can be actively or passively\n", "acquired (meta-data). The *model* contains our assumptions, based on\n", "previous experience. THat experience can be other data, it can come from\n", "transfer learning, or it can merely be our beliefs about the\n", "regularities of the universe. In humans our models include our inductive\n", "biases. The *prediction* is an action to be taken or a categorization or\n", "a quality score. The reason that machine learning has become a mainstay\n", "of artificial intelligence is the importance of predictions in\n", "artificial intelligence. The data and the model are combined through\n", "computation.\n", "\n", "In practice we normally perform machine learning using two functions. To\n", "combine data with a model we typically make use of:\n", "\n", "**a prediction function** a function which is used to make the\n", "predictions. It includes our beliefs about the regularities of the\n", "universe, our assumptions about how the world works, e.g. smoothness,\n", "spatial similarities, temporal similarities.\n", "\n", "**an objective function** a function which defines the cost of\n", "misprediction. 
Typically it includes knowledge about the world's\n", "generating processes (probabilistic objectives) or the costs we pay for\n", "mispredictions (empirical risk minimization).\n", "\n", "The combination of data and model through the prediction function and\n", "the objective function leads to a *learning algorithm*. The class of\n", "prediction functions and objective functions we can make use of is\n", "restricted by the algorithms they lead to. If the prediction function or\n", "the objective function is too complex, then it can be difficult to find\n", "an appropriate learning algorithm. Much of the academic field of machine\n", "learning is the quest for new learning algorithms that allow us to bring\n", "different types of models and data together.\n", "\n", "A useful reference for the state of the art in machine learning is the UK\n", "Royal Society Report, [Machine Learning: Power and Promise of Computers\n", "that Learn by\n", "Example](https://royalsociety.org/~/media/policy/projects/machine-learning/publications/machine-learning-report.pdf).\n", "\n", "You can also check my blog post on [\"What is Machine\n", "Learning?\"](http://inverseprobability.com/2017/07/17/what-is-machine-learning).\n", "\n", "## Artificial vs Natural Systems\n", "\n", "### Natural Systems are Evolved\n", "\n", "> Survival of the fittest\n", ">\n", "> [Herbert Spencer](https://en.wikipedia.org/wiki/Herbert_Spencer), 1864\n", "\n", "Darwin never said \"Survival of the Fittest\"; he talked about evolution by\n", "natural selection.\n", "\n", "Evolution is better described as \"non-survival of the non-fit\". You\n", "don't have to be the fittest to survive, you just need to avoid the\n", "pitfalls of life. This is the first priority.\n", "\n", "A mistake we make in the design of our systems is to equate fitness with\n", "the objective function, and to assume it is known and static. 
In\n", "practice, a real environment would have an evolving fitness function\n", "which would be unknown at any given time.\n", "\n", "Uncertainty in models is handled by Bayesian inference; here we consider\n", "uncertainty arising in loss functions.\n", "\n", "Consider a loss function which decomposes across individual\n", "observations, $\\dataScalar_{k,j}$, each of which is dependent on some\n", "set of features, $\\inputVector_k$.\n", "\n", "$$\n", "\\errorFunction(\\dataVector, \\inputMatrix) = \\sum_{k}\\sum_{j}\n", "L(\\dataScalar_{k,j}, \\inputVector_k)\n", "$$ Assume that the loss function depends on the features through some\n", "mapping function, $\\mappingFunction_j(\\cdot)$, which we call the\n", "*prediction function*.\n", "\n", "$$\n", "\\errorFunction(\\dataVector, \\inputMatrix) = \\sum_{k}\\sum_{j} L(\\dataScalar_{k,j},\n", "\\mappingFunction_j(\\inputVector_k))\n", "$$ Without loss of generality, we can move the index $j$ to the inputs, so\n", "we have $\\inputVector_i =\\left[\\inputVector_k \\quad j\\right]$, and we set\n", "$\\dataScalar_i = \\dataScalar_{k, j}$. So we have\n", "\n", "$$\n", "\\errorFunction(\\dataVector, \\inputMatrix) = \\sum_{i} L(\\dataScalar_i, \\mappingFunction(\\inputVector_i))\n", "$$ Bayesian inference considers uncertainty in $\\mappingFunction$, often\n", "through parameterizing it,\n", "$\\mappingFunction(\\inputVector; \\parameterVector)$, and considering a\n", "*prior* distribution for the parameters, $p(\\parameterVector)$. This in\n", "turn implies a distribution over functions, $p(\\mappingFunction)$.\n", "Process models, such as Gaussian processes, specify this distribution,\n", "known as a process, directly.\n", "\n", "Bayesian inference proceeds by specifying a *likelihood* which relates\n", "the data, $\\dataScalar$, to the parameters. Here we choose not to do\n", "this, but instead we only consider the *loss* function for our\n", "objective. 
The loss is the cost we pay for a misprediction.\n", "\n", "The *risk function* is the expectation of the loss under the\n", "distribution of the data. Here we are using the framework of *empirical\n", "risk* minimization, because we have a sample-based approximation. The\n", "new expectation we are considering is around the loss function itself,\n", "not the uncertainty in the data.\n", "\n", "The loss function and the log likelihood may take a mathematically\n", "similar form but they are philosophically very different. The log\n", "likelihood assumes something about the *generating* function of the\n", "data, whereas the loss function assumes something about the cost we pay.\n", "Importantly, the loss function in Bayesian inference normally only enters\n", "at the point of decision.\n", "\n", "The key idea in Bayesian inference is that the probabilistic inference\n", "can be performed *without* knowing the loss because, if the model is\n", "correct, then the form of the loss function is irrelevant when\n", "performing inference. In practice, however, for real data sets the model\n", "is almost never correct.\n", "\n", "Some of the maths below looks similar to the maths we can find in\n", "Bayesian methods, in particular variational Bayes, but that is merely a\n", "consequence of the availability of analytical mathematics. There are\n", "only particular ways of developing tractable algorithms; one route\n", "involves linear algebra. However, the similarity of the mathematics\n", "belies a difference in interpretation. It is similar to travelling a\n", "road (e.g. Ermine Street) in a wild landscape. 
We travel together\n", "because that is where efficient progress is to be made, but in practice\n", "our destinations (Lincoln, York) may be different.\n", "\n", "### Introduce Uncertainty\n", "\n", "To introduce uncertainty we consider a weighted version of the loss\n", "function, introducing positive weights,\n", "$\\left\\{ \\scaleScalar_i\\right\\}_{i=1}^\\numData$. $$\n", "\\errorFunction(\\dataVector, \\inputMatrix) = \\sum_{i}\n", "\\scaleScalar_i L(\\dataScalar_i, \\mappingFunction(\\inputVector_i))\n", "$$ We now make the assumption that these weights are drawn\n", "from a distribution, $q(\\scaleScalar)$. Instead of looking to minimize\n", "the loss directly, we look at the expected loss under this\n", "distribution.\n", "\n", "$$\n", "\\begin{align*}\n", "\\errorFunction(\\dataVector, \\inputMatrix) = & \\sum_{i}\\expectationDist{\\scaleScalar_i L(\\dataScalar_i, \\mappingFunction(\\inputVector_i))}{q(\\scaleScalar)} \\\\\n", "= & \\sum_{i}\\expectationDist{\\scaleScalar_i }{q(\\scaleScalar)}L(\\dataScalar_i, \\mappingFunction(\\inputVector_i))\n", "\\end{align*}\n", "$$ We will assume that our process, $q(\\scaleScalar)$, can depend on a\n", "variety of inputs such as $\\dataVector$, $\\inputMatrix$ and time, $t$.\n", "\n", "### Principle of Maximum Entropy\n", "\n", "To maximize uncertainty in $q(\\scaleScalar)$ we maximize its entropy.\n", "Following Jaynes's formalism of maximum entropy, in the continuous space\n", "we do this with respect to an invariant measure, $m(\\scaleScalar)$, $$\n", "H(\\scaleScalar)= - \\int q(\\scaleScalar) \\log \\frac{q(\\scaleScalar)}{m(\\scaleScalar)} \\text{d}\\scaleScalar\n", "$$ and since we minimize the loss, we balance this by subtracting the\n", "entropy term to form $$\n", "\\begin{align*}\n", "\\errorFunction = & \\beta\\sum_{i}\\expectationDist{\\scaleScalar_i }{q(\\scaleScalar)}L(\\dataScalar_i, \\mappingFunction(\\inputVector_i)) - H(\\scaleScalar)\\\\\n", "= & \\beta\\sum_{i}\\expectationDist{\\scaleScalar_i 
}{q(\\scaleScalar)}L(\\dataScalar_i, \\mappingFunction(\\inputVector_i)) + \\int q(\\scaleScalar) \\log \\frac{q(\\scaleScalar)}{m(\\scaleScalar)}\\text{d}\\scaleScalar\n", "\\end{align*}\n", "$$ where $\\beta$ serves to weight the relative contribution of the\n", "entropy term and the loss term.\n", "\n", "We can now minimize this modified loss with respect to the density\n", "$q(\\scaleScalar)$. The free-form optimization over this density leads to $$\n", "\\begin{align*}\n", "q(\\scaleScalar) \\propto & \\exp\\left(- \\beta \\sum_{i=1}^\\numData \\scaleScalar_i L(\\dataScalar_i, \\mappingFunction(\\inputVector_i)) \\right) m(\\scaleScalar)\\\\\n", " \\propto & \\prod_{i=1}^\\numData \\exp\\left(- \\beta \\scaleScalar_i L(\\dataScalar_i, \\mappingFunction(\\inputVector_i)) \\right) m(\\scaleScalar)\n", "\\end{align*}\n", "$$\n", "\n", "### Example\n", "\n", "Assume $$\n", "m(\\scaleScalar) = \\prod_i \\lambda\\exp\\left(-\\lambda\\scaleScalar_i\\right)\n", "$$ which is the distribution with the maximum entropy for a given mean\n", "of each positive $\\scaleScalar_i$. 
Then, writing $L_i = L(\\dataScalar_i, \\mappingFunction(\\inputVector_i))$, we have $$ \n", "q(\\scaleScalar) = \\prod_i q(\\scaleScalar_i)\n", "$$ $$\n", "q(\\scaleScalar_i) \\propto \\exp\\left(-(\\lambda+\\beta L_i) \\scaleScalar_i\\right)\n", "$$ and we can compute $$\n", "\\expectationDist{\\scaleScalar_i}{q(\\scaleScalar)} =\n", "\\frac{1}{\\lambda + \\beta L_i}\n", "$$\n", "\n", "### Coordinate Descent\n", "\n", "We can minimize with respect to $q(\\scaleScalar)$, recovering $$\n", "q(\\scaleScalar_i) = (\\lambda+\\beta L_i) \\exp\\left(-(\\lambda+\\beta L_i) \\scaleScalar_i\\right)\n", "$$ allowing us to compute the expectation of $\\scaleScalar$, $$\n", "\\expectationDist{\\scaleScalar_i}{q(\\scaleScalar_i)} = \\frac{1}{\\lambda+\\beta\n", "L_i}\n", "$$ then, we can minimize our expected loss with respect to\n", "$\\mappingFunction(\\cdot)$ $$\n", "\\beta \\sum_{i=1}^\\numData \\expectationDist{\\scaleScalar_i}{q(\\scaleScalar_i)} L(\\dataScalar_i, \\mappingFunction(\\inputVector_i))\n", "$$ If the loss is the *squared loss*, then this is recognised as a\n", "*reweighted least squares algorithm*. However, the loss can be of any\n", "form as long as $q(\\scaleScalar)$ defined above exists.\n", "\n", "In addition to the above, in our example below, we update $\\beta$ to\n", "normalize the expected loss to be $\\numData$ at each iteration, so we\n", "have $$\n", "\\beta = \\frac{\\numData}{\\sum_{i=1}^\\numData \\expectationDist{\\scaleScalar_i}{q(\\scaleScalar_i)} L(\\dataScalar_i, \\mappingFunction(\\inputVector_i))}\n", "$$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import pods\n", "import teaching_plots as plot\n", "import mlai" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Olympic Marathon Data\n", "\n", "The first thing we will do is load a standard data set for regression\n", "modelling. 
The data consists of the pace of Olympic Gold Medal Marathon\n", "winners for the Olympics from 1896 to present. First we load in the data\n", "and plot." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = pods.datasets.olympic_marathon_men()\n", "x = data['X']\n", "y = data['Y']\n", "\n", "offset = y.mean()\n", "scale = np.sqrt(y.var())\n", "\n", "xlim = (1875,2030)\n", "ylim = (2.5, 6.5)\n", "yhat = (y-offset)/scale\n", "\n", "fig, ax = plt.subplots(figsize=plot.big_wide_figsize)\n", "_ = ax.plot(x, y, 'r.',markersize=10)\n", "ax.set_xlabel('year', fontsize=20)\n", "ax.set_ylabel('pace min/km', fontsize=20)\n", "ax.set_xlim(xlim)\n", "ax.set_ylim(ylim)\n", "\n", "mlai.write_figure(figure=fig, filename='../slides/diagrams/datasets/olympic-marathon.svg', transparent=True, frameon=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Olympic Marathon Data\n", "\n", "
\n", "- Gold medal times for Olympic Marathon since 1896.\n", "\n", "- Marathons before 1924 didn’t have a standardised distance.\n", "\n", "- Present results using pace per km.\n", "\n", "- In 1904 Marathon was badly organised leading to very slow times.\n", "\n", " | \n", "\n",
"![image](../slides/diagrams/Stephen_Kiprotich.jpg) Image from\n",
"Wikimedia Commons | \n",
"