{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Modeling Things\n", "===============\n", "\n", "### [Neil D. Lawrence](http://inverseprobability.com), University of\n", "\n", "Cambridge \\#\\#\\# 2020-08-23" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Abstract**: Machine learning solutions, in particular those based on\n", "deep learning methods, form an underpinning of the current revolution in\n", "“artificial intelligence” that has dominated popular press headlines and\n", "is having a significant influence on the wider tech agenda. In some ways\n", "the these deep learning methods are radically new: they raise questions\n", "about how we think of model regularization. Regularization arises\n", "implicitly through the optimization. Yet in others they remain rigidly\n", "traditional, and unsuited for an emerging world of unstructured,\n", "streaming data. In this paper we relate these new methods to traditional\n", "approaches and speculate on new directions that might take us beyond\n", "modeling structured data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$\n", "\\newcommand{\\tk}[1]{}\n", "\\newcommand{\\Amatrix}{\\mathbf{A}}\n", "\\newcommand{\\KL}[2]{\\text{KL}\\left( #1\\,\\|\\,#2 \\right)}\n", "\\newcommand{\\Kaast}{\\kernelMatrix_{\\mathbf{ \\ast}\\mathbf{ \\ast}}}\n", "\\newcommand{\\Kastu}{\\kernelMatrix_{\\mathbf{ \\ast} \\inducingVector}}\n", "\\newcommand{\\Kff}{\\kernelMatrix_{\\mappingFunctionVector \\mappingFunctionVector}}\n", "\\newcommand{\\Kfu}{\\kernelMatrix_{\\mappingFunctionVector \\inducingVector}}\n", "\\newcommand{\\Kuast}{\\kernelMatrix_{\\inducingVector \\bf\\ast}}\n", "\\newcommand{\\Kuf}{\\kernelMatrix_{\\inducingVector \\mappingFunctionVector}}\n", "\\newcommand{\\Kuu}{\\kernelMatrix_{\\inducingVector \\inducingVector}}\n", "\\newcommand{\\Kuui}{\\Kuu^{-1}}\n", "\\newcommand{\\Qaast}{\\mathbf{Q}_{\\bf \\ast \\ast}}\n", "\\newcommand{\\Qastf}{\\mathbf{Q}_{\\ast \\mappingFunction}}\n", "\\newcommand{\\Qfast}{\\mathbf{Q}_{\\mappingFunctionVector \\bf \\ast}}\n", "\\newcommand{\\Qff}{\\mathbf{Q}_{\\mappingFunctionVector \\mappingFunctionVector}}\n", "\\newcommand{\\aMatrix}{\\mathbf{A}}\n", "\\newcommand{\\aScalar}{a}\n", "\\newcommand{\\aVector}{\\mathbf{a}}\n", "\\newcommand{\\acceleration}{a}\n", "\\newcommand{\\bMatrix}{\\mathbf{B}}\n", "\\newcommand{\\bScalar}{b}\n", "\\newcommand{\\bVector}{\\mathbf{b}}\n", "\\newcommand{\\basisFunc}{\\phi}\n", "\\newcommand{\\basisFuncVector}{\\boldsymbol{ \\basisFunc}}\n", "\\newcommand{\\basisFunction}{\\phi}\n", "\\newcommand{\\basisLocation}{\\mu}\n", "\\newcommand{\\basisMatrix}{\\boldsymbol{ \\Phi}}\n", "\\newcommand{\\basisScalar}{\\basisFunction}\n", "\\newcommand{\\basisVector}{\\boldsymbol{ \\basisFunction}}\n", "\\newcommand{\\activationFunction}{\\phi}\n", "\\newcommand{\\activationMatrix}{\\boldsymbol{ \\Phi}}\n", "\\newcommand{\\activationScalar}{\\basisFunction}\n", "\\newcommand{\\activationVector}{\\boldsymbol{ \\basisFunction}}\n", "\\newcommand{\\bigO}{\\mathcal{O}}\n", "\\newcommand{\\binomProb}{\\pi}\n", "\\newcommand{\\cMatrix}{\\mathbf{C}}\n", "\\newcommand{\\cbasisMatrix}{\\hat{\\boldsymbol{ \\Phi}}}\n", "\\newcommand{\\cdataMatrix}{\\hat{\\dataMatrix}}\n", "\\newcommand{\\cdataScalar}{\\hat{\\dataScalar}}\n", "\\newcommand{\\cdataVector}{\\hat{\\dataVector}}\n", "\\newcommand{\\centeredKernelMatrix}{\\mathbf{ \\MakeUppercase{\\centeredKernelScalar}}}\n", "\\newcommand{\\centeredKernelScalar}{b}\n", 
"\\newcommand{\\centeredKernelVector}{\\centeredKernelScalar}\n", "\\newcommand{\\centeringMatrix}{\\mathbf{H}}\n", "\\newcommand{\\chiSquaredDist}[2]{\\chi_{#1}^{2}\\left(#2\\right)}\n", "\\newcommand{\\chiSquaredSamp}[1]{\\chi_{#1}^{2}}\n", "\\newcommand{\\conditionalCovariance}{\\boldsymbol{ \\Sigma}}\n", "\\newcommand{\\coregionalizationMatrix}{\\mathbf{B}}\n", "\\newcommand{\\coregionalizationScalar}{b}\n", "\\newcommand{\\coregionalizationVector}{\\mathbf{ \\coregionalizationScalar}}\n", "\\newcommand{\\covDist}[2]{\\text{cov}_{#2}\\left(#1\\right)}\n", "\\newcommand{\\covSamp}[1]{\\text{cov}\\left(#1\\right)}\n", "\\newcommand{\\covarianceScalar}{c}\n", "\\newcommand{\\covarianceVector}{\\mathbf{ \\covarianceScalar}}\n", "\\newcommand{\\covarianceMatrix}{\\mathbf{C}}\n", "\\newcommand{\\covarianceMatrixTwo}{\\boldsymbol{ \\Sigma}}\n", "\\newcommand{\\croupierScalar}{s}\n", "\\newcommand{\\croupierVector}{\\mathbf{ \\croupierScalar}}\n", "\\newcommand{\\croupierMatrix}{\\mathbf{ \\MakeUppercase{\\croupierScalar}}}\n", "\\newcommand{\\dataDim}{p}\n", "\\newcommand{\\dataIndex}{i}\n", "\\newcommand{\\dataIndexTwo}{j}\n", "\\newcommand{\\dataMatrix}{\\mathbf{Y}}\n", "\\newcommand{\\dataScalar}{y}\n", "\\newcommand{\\dataSet}{\\mathcal{D}}\n", "\\newcommand{\\dataStd}{\\sigma}\n", "\\newcommand{\\dataVector}{\\mathbf{ \\dataScalar}}\n", "\\newcommand{\\decayRate}{d}\n", "\\newcommand{\\degreeMatrix}{\\mathbf{ \\MakeUppercase{\\degreeScalar}}}\n", "\\newcommand{\\degreeScalar}{d}\n", "\\newcommand{\\degreeVector}{\\mathbf{ \\degreeScalar}}\n", "\\newcommand{\\diag}[1]{\\text{diag}\\left(#1\\right)}\n", "\\newcommand{\\diagonalMatrix}{\\mathbf{D}}\n", "\\newcommand{\\diff}[2]{\\frac{\\text{d}#1}{\\text{d}#2}}\n", "\\newcommand{\\diffTwo}[2]{\\frac{\\text{d}^2#1}{\\text{d}#2^2}}\n", "\\newcommand{\\displacement}{x}\n", "\\newcommand{\\displacementVector}{\\textbf{\\displacement}}\n", "\\newcommand{\\distanceMatrix}{\\mathbf{ \\MakeUppercase{\\distanceScalar}}}\n", "\\newcommand{\\distanceScalar}{d}\n", "\\newcommand{\\distanceVector}{\\mathbf{ \\distanceScalar}}\n", "\\newcommand{\\eigenvaltwo}{\\ell}\n", "\\newcommand{\\eigenvaltwoMatrix}{\\mathbf{L}}\n", "\\newcommand{\\eigenvaltwoVector}{\\mathbf{l}}\n", "\\newcommand{\\eigenvalue}{\\lambda}\n", "\\newcommand{\\eigenvalueMatrix}{\\boldsymbol{ \\Lambda}}\n", "\\newcommand{\\eigenvalueVector}{\\boldsymbol{ \\lambda}}\n", "\\newcommand{\\eigenvector}{\\mathbf{ \\eigenvectorScalar}}\n", "\\newcommand{\\eigenvectorMatrix}{\\mathbf{U}}\n", "\\newcommand{\\eigenvectorScalar}{u}\n", "\\newcommand{\\eigenvectwo}{\\mathbf{v}}\n", "\\newcommand{\\eigenvectwoMatrix}{\\mathbf{V}}\n", "\\newcommand{\\eigenvectwoScalar}{v}\n", "\\newcommand{\\entropy}[1]{\\mathcal{H}\\left(#1\\right)}\n", "\\newcommand{\\errorFunction}{E}\n", "\\newcommand{\\expDist}[2]{\\left<#1\\right>_{#2}}\n", "\\newcommand{\\expSamp}[1]{\\left<#1\\right>}\n", "\\newcommand{\\expectation}[1]{\\left\\langle #1 \\right\\rangle }\n", "\\newcommand{\\expectationDist}[2]{\\left\\langle #1 \\right\\rangle _{#2}}\n", "\\newcommand{\\expectedDistanceMatrix}{\\mathcal{D}}\n", "\\newcommand{\\eye}{\\mathbf{I}}\n", "\\newcommand{\\fantasyDim}{r}\n", "\\newcommand{\\fantasyMatrix}{\\mathbf{ \\MakeUppercase{\\fantasyScalar}}}\n", "\\newcommand{\\fantasyScalar}{z}\n", "\\newcommand{\\fantasyVector}{\\mathbf{ \\fantasyScalar}}\n", "\\newcommand{\\featureStd}{\\varsigma}\n", "\\newcommand{\\gammaCdf}[3]{\\mathcal{GAMMA CDF}\\left(#1|#2,#3\\right)}\n", 
"\\newcommand{\\gammaDist}[3]{\\mathcal{G}\\left(#1|#2,#3\\right)}\n", "\\newcommand{\\gammaSamp}[2]{\\mathcal{G}\\left(#1,#2\\right)}\n", "\\newcommand{\\gaussianDist}[3]{\\mathcal{N}\\left(#1|#2,#3\\right)}\n", "\\newcommand{\\gaussianSamp}[2]{\\mathcal{N}\\left(#1,#2\\right)}\n", "\\newcommand{\\given}{|}\n", "\\newcommand{\\half}{\\frac{1}{2}}\n", "\\newcommand{\\heaviside}{H}\n", "\\newcommand{\\hiddenMatrix}{\\mathbf{ \\MakeUppercase{\\hiddenScalar}}}\n", "\\newcommand{\\hiddenScalar}{h}\n", "\\newcommand{\\hiddenVector}{\\mathbf{ \\hiddenScalar}}\n", "\\newcommand{\\identityMatrix}{\\eye}\n", "\\newcommand{\\inducingInputScalar}{z}\n", "\\newcommand{\\inducingInputVector}{\\mathbf{ \\inducingInputScalar}}\n", "\\newcommand{\\inducingInputMatrix}{\\mathbf{Z}}\n", "\\newcommand{\\inducingScalar}{u}\n", "\\newcommand{\\inducingVector}{\\mathbf{ \\inducingScalar}}\n", "\\newcommand{\\inducingMatrix}{\\mathbf{U}}\n", "\\newcommand{\\inlineDiff}[2]{\\text{d}#1/\\text{d}#2}\n", "\\newcommand{\\inputDim}{q}\n", "\\newcommand{\\inputMatrix}{\\mathbf{X}}\n", "\\newcommand{\\inputScalar}{x}\n", "\\newcommand{\\inputSpace}{\\mathcal{X}}\n", "\\newcommand{\\inputVals}{\\inputVector}\n", "\\newcommand{\\inputVector}{\\mathbf{ \\inputScalar}}\n", "\\newcommand{\\iterNum}{k}\n", "\\newcommand{\\kernel}{\\kernelScalar}\n", "\\newcommand{\\kernelMatrix}{\\mathbf{K}}\n", "\\newcommand{\\kernelScalar}{k}\n", "\\newcommand{\\kernelVector}{\\mathbf{ \\kernelScalar}}\n", "\\newcommand{\\kff}{\\kernelScalar_{\\mappingFunction \\mappingFunction}}\n", "\\newcommand{\\kfu}{\\kernelVector_{\\mappingFunction \\inducingScalar}}\n", "\\newcommand{\\kuf}{\\kernelVector_{\\inducingScalar \\mappingFunction}}\n", "\\newcommand{\\kuu}{\\kernelVector_{\\inducingScalar \\inducingScalar}}\n", "\\newcommand{\\lagrangeMultiplier}{\\lambda}\n", "\\newcommand{\\lagrangeMultiplierMatrix}{\\boldsymbol{ \\Lambda}}\n", "\\newcommand{\\lagrangian}{L}\n", "\\newcommand{\\laplacianFactor}{\\mathbf{ \\MakeUppercase{\\laplacianFactorScalar}}}\n", "\\newcommand{\\laplacianFactorScalar}{m}\n", "\\newcommand{\\laplacianFactorVector}{\\mathbf{ \\laplacianFactorScalar}}\n", "\\newcommand{\\laplacianMatrix}{\\mathbf{L}}\n", "\\newcommand{\\laplacianScalar}{\\ell}\n", "\\newcommand{\\laplacianVector}{\\mathbf{ \\ell}}\n", "\\newcommand{\\latentDim}{q}\n", "\\newcommand{\\latentDistanceMatrix}{\\boldsymbol{ \\Delta}}\n", "\\newcommand{\\latentDistanceScalar}{\\delta}\n", "\\newcommand{\\latentDistanceVector}{\\boldsymbol{ \\delta}}\n", "\\newcommand{\\latentForce}{f}\n", "\\newcommand{\\latentFunction}{u}\n", "\\newcommand{\\latentFunctionVector}{\\mathbf{ \\latentFunction}}\n", "\\newcommand{\\latentFunctionMatrix}{\\mathbf{ \\MakeUppercase{\\latentFunction}}}\n", "\\newcommand{\\latentIndex}{j}\n", "\\newcommand{\\latentScalar}{z}\n", "\\newcommand{\\latentVector}{\\mathbf{ \\latentScalar}}\n", "\\newcommand{\\latentMatrix}{\\mathbf{Z}}\n", "\\newcommand{\\learnRate}{\\eta}\n", "\\newcommand{\\lengthScale}{\\ell}\n", "\\newcommand{\\rbfWidth}{\\ell}\n", "\\newcommand{\\likelihoodBound}{\\mathcal{L}}\n", "\\newcommand{\\likelihoodFunction}{L}\n", "\\newcommand{\\locationScalar}{\\mu}\n", "\\newcommand{\\locationVector}{\\boldsymbol{ \\locationScalar}}\n", "\\newcommand{\\locationMatrix}{\\mathbf{M}}\n", "\\newcommand{\\variance}[1]{\\text{var}\\left( #1 \\right)}\n", "\\newcommand{\\mappingFunction}{f}\n", "\\newcommand{\\mappingFunctionMatrix}{\\mathbf{F}}\n", "\\newcommand{\\mappingFunctionTwo}{g}\n", 
"\\newcommand{\\mappingFunctionTwoMatrix}{\\mathbf{G}}\n", "\\newcommand{\\mappingFunctionTwoVector}{\\mathbf{ \\mappingFunctionTwo}}\n", "\\newcommand{\\mappingFunctionVector}{\\mathbf{ \\mappingFunction}}\n", "\\newcommand{\\scaleScalar}{s}\n", "\\newcommand{\\mappingScalar}{w}\n", "\\newcommand{\\mappingVector}{\\mathbf{ \\mappingScalar}}\n", "\\newcommand{\\mappingMatrix}{\\mathbf{W}}\n", "\\newcommand{\\mappingScalarTwo}{v}\n", "\\newcommand{\\mappingVectorTwo}{\\mathbf{ \\mappingScalarTwo}}\n", "\\newcommand{\\mappingMatrixTwo}{\\mathbf{V}}\n", "\\newcommand{\\maxIters}{K}\n", "\\newcommand{\\meanMatrix}{\\mathbf{M}}\n", "\\newcommand{\\meanScalar}{\\mu}\n", "\\newcommand{\\meanTwoMatrix}{\\mathbf{M}}\n", "\\newcommand{\\meanTwoScalar}{m}\n", "\\newcommand{\\meanTwoVector}{\\mathbf{ \\meanTwoScalar}}\n", "\\newcommand{\\meanVector}{\\boldsymbol{ \\meanScalar}}\n", "\\newcommand{\\mrnaConcentration}{m}\n", "\\newcommand{\\naturalFrequency}{\\omega}\n", "\\newcommand{\\neighborhood}[1]{\\mathcal{N}\\left( #1 \\right)}\n", "\\newcommand{\\neilurl}{http://inverseprobability.com/}\n", "\\newcommand{\\noiseMatrix}{\\boldsymbol{ E}}\n", "\\newcommand{\\noiseScalar}{\\epsilon}\n", "\\newcommand{\\noiseVector}{\\boldsymbol{ \\epsilon}}\n", "\\newcommand{\\norm}[1]{\\left\\Vert #1 \\right\\Vert}\n", "\\newcommand{\\normalizedLaplacianMatrix}{\\hat{\\mathbf{L}}}\n", "\\newcommand{\\normalizedLaplacianScalar}{\\hat{\\ell}}\n", "\\newcommand{\\normalizedLaplacianVector}{\\hat{\\mathbf{ \\ell}}}\n", "\\newcommand{\\numActive}{m}\n", "\\newcommand{\\numBasisFunc}{m}\n", "\\newcommand{\\numComponents}{m}\n", "\\newcommand{\\numComps}{K}\n", "\\newcommand{\\numData}{n}\n", "\\newcommand{\\numFeatures}{K}\n", "\\newcommand{\\numHidden}{h}\n", "\\newcommand{\\numInducing}{m}\n", "\\newcommand{\\numLayers}{\\ell}\n", "\\newcommand{\\numNeighbors}{K}\n", "\\newcommand{\\numSequences}{s}\n", "\\newcommand{\\numSuccess}{s}\n", "\\newcommand{\\numTasks}{m}\n", "\\newcommand{\\numTime}{T}\n", "\\newcommand{\\numTrials}{S}\n", "\\newcommand{\\outputIndex}{j}\n", "\\newcommand{\\paramVector}{\\boldsymbol{ \\theta}}\n", "\\newcommand{\\parameterMatrix}{\\boldsymbol{ \\Theta}}\n", "\\newcommand{\\parameterScalar}{\\theta}\n", "\\newcommand{\\parameterVector}{\\boldsymbol{ \\parameterScalar}}\n", "\\newcommand{\\partDiff}[2]{\\frac{\\partial#1}{\\partial#2}}\n", "\\newcommand{\\precisionScalar}{j}\n", "\\newcommand{\\precisionVector}{\\mathbf{ \\precisionScalar}}\n", "\\newcommand{\\precisionMatrix}{\\mathbf{J}}\n", "\\newcommand{\\pseudotargetScalar}{\\widetilde{y}}\n", "\\newcommand{\\pseudotargetVector}{\\mathbf{ \\pseudotargetScalar}}\n", "\\newcommand{\\pseudotargetMatrix}{\\mathbf{ \\widetilde{Y}}}\n", "\\newcommand{\\rank}[1]{\\text{rank}\\left(#1\\right)}\n", "\\newcommand{\\rayleighDist}[2]{\\mathcal{R}\\left(#1|#2\\right)}\n", "\\newcommand{\\rayleighSamp}[1]{\\mathcal{R}\\left(#1\\right)}\n", "\\newcommand{\\responsibility}{r}\n", "\\newcommand{\\rotationScalar}{r}\n", "\\newcommand{\\rotationVector}{\\mathbf{ \\rotationScalar}}\n", "\\newcommand{\\rotationMatrix}{\\mathbf{R}}\n", "\\newcommand{\\sampleCovScalar}{s}\n", "\\newcommand{\\sampleCovVector}{\\mathbf{ \\sampleCovScalar}}\n", "\\newcommand{\\sampleCovMatrix}{\\mathbf{s}}\n", "\\newcommand{\\scalarProduct}[2]{\\left\\langle{#1},{#2}\\right\\rangle}\n", "\\newcommand{\\sign}[1]{\\text{sign}\\left(#1\\right)}\n", "\\newcommand{\\sigmoid}[1]{\\sigma\\left(#1\\right)}\n", "\\newcommand{\\singularvalue}{\\ell}\n", 
"\\newcommand{\\singularvalueMatrix}{\\mathbf{L}}\n", "\\newcommand{\\singularvalueVector}{\\mathbf{l}}\n", "\\newcommand{\\sorth}{\\mathbf{u}}\n", "\\newcommand{\\spar}{\\lambda}\n", "\\newcommand{\\trace}[1]{\\text{tr}\\left(#1\\right)}\n", "\\newcommand{\\BasalRate}{B}\n", "\\newcommand{\\DampingCoefficient}{C}\n", "\\newcommand{\\DecayRate}{D}\n", "\\newcommand{\\Displacement}{X}\n", "\\newcommand{\\LatentForce}{F}\n", "\\newcommand{\\Mass}{M}\n", "\\newcommand{\\Sensitivity}{S}\n", "\\newcommand{\\basalRate}{b}\n", "\\newcommand{\\dampingCoefficient}{c}\n", "\\newcommand{\\mass}{m}\n", "\\newcommand{\\sensitivity}{s}\n", "\\newcommand{\\springScalar}{\\kappa}\n", "\\newcommand{\\springVector}{\\boldsymbol{ \\kappa}}\n", "\\newcommand{\\springMatrix}{\\boldsymbol{ \\mathcal{K}}}\n", "\\newcommand{\\tfConcentration}{p}\n", "\\newcommand{\\tfDecayRate}{\\delta}\n", "\\newcommand{\\tfMrnaConcentration}{f}\n", "\\newcommand{\\tfVector}{\\mathbf{ \\tfConcentration}}\n", "\\newcommand{\\velocity}{v}\n", "\\newcommand{\\sufficientStatsScalar}{g}\n", "\\newcommand{\\sufficientStatsVector}{\\mathbf{ \\sufficientStatsScalar}}\n", "\\newcommand{\\sufficientStatsMatrix}{\\mathbf{G}}\n", "\\newcommand{\\switchScalar}{s}\n", "\\newcommand{\\switchVector}{\\mathbf{ \\switchScalar}}\n", "\\newcommand{\\switchMatrix}{\\mathbf{S}}\n", "\\newcommand{\\tr}[1]{\\text{tr}\\left(#1\\right)}\n", "\\newcommand{\\loneNorm}[1]{\\left\\Vert #1 \\right\\Vert_1}\n", "\\newcommand{\\ltwoNorm}[1]{\\left\\Vert #1 \\right\\Vert_2}\n", "\\newcommand{\\onenorm}[1]{\\left\\vert#1\\right\\vert_1}\n", "\\newcommand{\\twonorm}[1]{\\left\\Vert #1 \\right\\Vert}\n", "\\newcommand{\\vScalar}{v}\n", "\\newcommand{\\vVector}{\\mathbf{v}}\n", "\\newcommand{\\vMatrix}{\\mathbf{V}}\n", "\\newcommand{\\varianceDist}[2]{\\text{var}_{#2}\\left( #1 \\right)}\n", "\\newcommand{\\vecb}[1]{\\left(#1\\right):}\n", "\\newcommand{\\weightScalar}{w}\n", "\\newcommand{\\weightVector}{\\mathbf{ \\weightScalar}}\n", "\\newcommand{\\weightMatrix}{\\mathbf{W}}\n", "\\newcommand{\\weightedAdjacencyMatrix}{\\mathbf{A}}\n", "\\newcommand{\\weightedAdjacencyScalar}{a}\n", "\\newcommand{\\weightedAdjacencyVector}{\\mathbf{ \\weightedAdjacencyScalar}}\n", "\\newcommand{\\onesVector}{\\mathbf{1}}\n", "\\newcommand{\\zerosVector}{\\mathbf{0}}\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n", "\n", "\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Introduction\n", "============\n", "\n", "Machine learning and Statistics are academic cousins, founded on the\n", "same mathematical princples, but often with different objectives in\n", "mind. But the differences can be as informative as the overlaps.\n", "\n", "Efron (2020) rightly alludes to the fundamental differences to the new\n", "wave of predictive models that have arisen in the last decades of\n", "machine learning. And these cultures were also beautifully described by\n", "Breiman (2001a).\n", "\n", "In the discussion of Professor Efron’s paper Friedman et al. (2020)\n", "highlight the continuum between the classical approaches and the\n", "emphasis on prediction. 
Indeed, the prediction culture does not sit\n", "entirely in the machine learning domain; an excellent example of a\n", "prediction-focused approach is Leo Breiman’s bagging of models\n", "(Breiman, 1996), although it’s notable that Breiman, a statistician,\n", "chose to publish this paper in a machine learning journal.\n", "\n", "From a personal perspective, a strand of work that is highly\n", "inspirational in prediction also comes from a statistician. The\n", "prequential formalism (Dawid, 1984, 1982) emerges from statistics.\n", "It provides some hope that a predictive approach can be reconciled with\n", "attribution in the manner highlighted also by Friedman et al. (2020).\n", "The prequential approach is predictive but allows us to falsify poorly\n", "calibrated models (Lawrence, 2010). So while it doesn’t give us truth,\n", "it does give us falsehood in line with Popper’s vision of the philosophy\n", "of science (Popper, 1963)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lies and Damned Lies\n", "--------------------\n", "\n", "There is a quote that is apocryphally credited to Disraeli by Mark\n", "Twain.\n", "\n", "> There are three kinds of lies: lies, damned lies and statistics\n", "\n", "It stems from the late 19th century, from a time after Laplace, Gauss,\n", "Legendre and Galton made their forays into regression, but before\n", "Fisher, Pearson and others had begun to formalize the process of\n", "estimation. Today, the academic discipline of statistics is so widely\n", "understood to be underpinned by mathematical rigor that we typically\n", "drop the full title of *mathematical statistics*, but this history\n", "can be informative when looking at the new challenges we face.\n", "\n", "The new phenomenon that underpins the challenges that Professor Efron\n", "outlines has been called “big data”: the vast quantities of data that\n", "are accumulating in the course of normal human activities. Where people\n", "have seen data, they have also seen the potential to draw inferences,\n", "but a lack of rigor in some of the approaches means that\n", "today we can see a new phenomenon.\n", "\n", "> There are three kinds of lies: lies, damned lies and big data\n", "\n", "The challenge we face is to repeat the work of Fisher, Pearson, Gosset,\n", "Neyman and others, and give the new wave of approaches a rigorous mathematical\n", "foundation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Happenstance Data\n", "-----------------\n", "\n", "Following the revolution in mathematical statistics, data became a\n", "carefully curated commodity. It was actively collected in response to a\n", "scientific hypothesis. Popper suggests (Popper, 1963) that the answer to\n", "which comes first, the hypothesis or the data, is the same as the\n", "chicken and the egg. The answer is that they co-evolve.\n", "\n", "In the last century, the expense of data collection dominated the cost\n", "of science. As a result, the classical approach to statistical testing\n", "is to formulate the scientific question first. The study is then\n", "launched once an appropriate null hypothesis has been described and\n", "the requisite sample size established.\n", "\n", "What Tukey described as confirmatory data analysis (Tukey, 1977) is a\n", "mainstay of statistics. 
While the philosophy of statistical hypothesis\n", "testing has been the subject of longstanding debates, there is no\n", "controversy around the notion that in order to remove confounders you\n", "must have a well-designed experiment, and randomization for statistical\n", "data collection is the foundation of confirmatory work. Today,\n", "randomized trials are deployed more than ever before, in\n", "particular due to their widespread use in computer interface design.\n", "Without our knowledge, we are daily assigned to social experiments that\n", "place us in treatment and control groups to determine what dose of\n", "different interface ideas will keep us more tightly engaged with our\n", "machines. These A/B tests are social experiments that involve randomization\n", "across many millions of users, and they dictate our modern user experience\n", "(see e.g. Kohavi and Longbotham (2017)).\n", "\n", "Such experiments are still carefully designed to remain valid, but the\n", "modern data environment is not only about larger experimental data, but\n", "perhaps more so about what I term “happenstance data”: data that was not\n", "collected with a particular purpose in mind, but which is simply being\n", "recorded in the normal course of events due to the increasing\n", "interconnection between portable digital devices and the decreasing cost of\n", "storage.\n", "\n", "Happenstance data are the breadcrumbs of our trail through the forest of\n", "life. They may be written for a particular purpose, but later we\n", "wish to consume them for a different purpose. For example, within the\n", "recent Covid-19 pandemic, the Royal Society DELVE initiative (The DELVE\n", "Initiative, 2020) was able to draw on transaction data to give near-real-time\n", "assessments of the effect of the pandemic and governmental response\n", "on GDP[1] (see also Carvalho et al. (2020)). The data wasn’t recorded\n", "with pandemic responses in mind, but it can be used to help inform\n", "interventions. Other data sets of use include mobility data from mobile\n", "telecoms companies (see e.g. Oliver et al. (2020)).\n", "\n", "Historically, data was expensive. It was carefully collected according\n", "to a design. Statistical surveys can still be expensive, but today there\n", "is a strong temptation to do them on the cheap, to use happenstance data\n", "to achieve what had been done in the past only through rigorous\n", "data-fieldwork, but care needs to be taken (Wang et al., 2015). As\n", "Professor Efron points out, early attempts to achieve this, such as the\n", "Google flu predictor, have been somewhat naive (Ginsberg et al., 2009;\n", "Halevy et al., 2009).[2] As these methodologies are gaining traction in\n", "the social sciences (Salganik, 2018) and the field of Computational\n", "Social Science (Alvarez, 2016) emerges, we can expect more innovation and\n", "more ideas that may help us bridge the fundamentally different\n", "characters of qualitative and quantitative research. For the moment, one\n", "particularly promising approach is to use measures derived from\n", "happenstance data (such as searches for flu) as proxy indicators for\n", "statistics that are rigorously surveilled. With the Royal Society’s\n", "DELVE initiative, examples of this approach include the work of Peter Diggle\n", "to visualize the progression of the Covid-19 disease. 
Across the UK the\n", "“Zoe App” has been used for self-reporting of Covid-19 symptoms (Menni\n", "et al., 2020), and by interconnecting this data with Office for National\n", "Statistics surveys (Office for National Statistics, 2020), Peter has\n", "been able to calibrate the Zoe map of Covid-19 prevalence, allowing\n", "nowcasting of the disease that was validated by the production of ONS\n", "surveys. These enriched surveys can already be done without innovation\n", "to our underlying mathematical methodologies.\n", "\n", "Classical statistical methodologies remain the gold-standard by which\n", "these new methodologies should be judged. The situation reminds me\n", "somewhat of the challenges Xerox faced with the advent of the computer\n", "revolution. With great prescience, Xerox realized that the advent of the\n", "computer meant that information was going to be shared more often via\n", "electronic screens. As a company whose main revenue stream was coming\n", "from photocopying documents, the notion of the paperless office\n", "represented something of a threat to Xerox. They responded by funding a\n", "research center, known as Xerox PARC. They developed many of the\n", "innovations that underpin the modern information revolution: the Xerox\n", "Alto (the first computer with a graphical user interface), the laser printer, Ethernet.\n", "All of these inventions were commercial successes, but created a need\n", "for *more* paper, not less. The computers produced more information, and\n", "much of it was still shared on paper. Per capita paper consumption\n", "continued to rise in the US until it peaked at around the turn of the\n", "millennium (Andrés et al., 2014). A similar story will now apply with\n", "the advent of predictive models and data science. The increasing use of\n", "predictive methodologies does not obviate the need for confirmatory data\n", "analysis; it makes it more important than ever before.\n", "\n", "Not only is there an ongoing role for the classical methodologies we\n", "have at our disposal, there is likely to be an increasing demand for them in the\n", "near future. But what about new mathematical theories? How can we come\n", "to a formalism for the new approaches of mathematical data science, just\n", "as early 20th century statisticians were able to reformulate statistics\n", "on a rigorous mathematical footing?\n", "\n", "[1] Challenges with the availability of payments data within the UK\n", "meant that the researchers were able to get a good assessment of the\n", "Spanish and French economies, but struggled to assess their main target,\n", "the United Kingdom.\n", "\n", "[2] Despite conventional wisdom, it appears that election polls\n", "haven’t got worse over recent years; see Jennings and Wlezien (2018)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Generalization\n", "==============\n", "\n", "Machine learning practitioners focus on out-of-sample predictive\n", "capability as their main objective. This is the ability of a model to\n", "generalize its learnings.\n", "\n", "Professor Efron’s paper does an excellent job of summarizing the range of\n", "predictive models that now lies at our disposal, but of particular\n", "interest are deep neural networks. This is because they go beyond the\n", "traditional notions of what generalization is, or rather what it has\n", "been, to practitioners on both the statistical and machine learning\n", "sides of the data sciences."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Deep Models and Generalization\n", "------------------------------\n", "\n", "The new wave of predictive modelling techniques owes a great deal to the\n", "tradition of regression. But their success in generalizing to\n", "out-of-sample examples owes little to our traditional understanding of\n", "the theory of generalization. These models are highly overparameterized.\n", "As such, the traditional view would be that they should ‘overfit’ the\n", "data. But in reality, these very large models generalize well. Is it\n", "because they can’t overfit?\n", "\n", "When it comes to the mismatch between our expectations about\n", "generalization and the reality of deep models, perhaps the paper that\n", "most clearly demonstrated something was amiss was (Zhang et al., 2017),\n", "who trained a large neural network via stochastic gradient descent to\n", "label an image data set. Within Professor Efron’s categorization of\n", "regression model, such a model is a very complex regression model with a\n", "particular link function and highly structured adaptive basis functions,\n", "which are by tradition called neurons. Despite the structuring of these\n", "basis functions (known as convolutional layers), their adaptive nature\n", "means that the model contains many millions of parameters. Traditional\n", "approaches to generalization suggest that the model should over fit and\n", "(Zhang et al., 2017) proved that such models can do just that. The data\n", "they used to fit the model, the training set, was modified. They flipped\n", "labels randomly, removing any information in the data. After training,\n", "the resulting model was able to classify the training data with 100%\n", "accuracy. The experiment clearly demonstrates that all our gravest\n", "doubts about overparameterized models are true. If this model has the\n", "capacity to fit data which is obviously nonsense, then it is clearly not\n", "regularized. Our classical theories suggest that such models should not\n", "generalize well on previously unseen data, or test data, but yet the\n", "empirical experience is that they do generalize well. So, what’s going\n", "on?\n", "\n", "During a period of deploying machine learning models at Amazon, I was\n", "introduced to a set of leadership principles, fourteen different ideas\n", "to help individual Amazonians structure their thinking. One of them was\n", "called “Dive Deep”, and a key trigger for a “Dive Deep” was when\n", "anecdote and data are in conflict. If there were to be a set of Academic\n", "leadership principles, then clearly “Dive Deep” should be triggered when\n", "empirical evidence and theory are in conflict. The purpose of the\n", "principle within Amazon was to ensure people don’t depend overly on\n", "anecdotes *or* data when making their decisions, but to develop deeper\n", "understandings of their business. In academia, we are equally guilty of\n", "relying too much on empirical studies or theory without ensuring they\n", "are reconciled. The theoreticians’ disbelief of what the experimenter\n", "tells them is encapsulated in Kahnemann’s idea of “theory induced\n", "blindness” (Kahneman, 2011). 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Bias Variance Decomposition\n", "---------------------------\n", "\n", "The bias-variance decomposition considers the expected test error for\n", "different variations of the *training data* sampled from\n", "$\\Pr(\\mathbf{ x}, y)$, $$\n", "\\mathbb{E}\\left[ \\left(y- f^*(\\mathbf{ x})\\right)^2 \\right].\n", "$$ This can be decomposed into two parts, $$\n", "\\mathbb{E}\\left[ \\left(y- f^*(\\mathbf{ x})\\right)^2 \\right] = \\text{bias}\\left[f^*(\\mathbf{ x})\\right]^2 + \\text{variance}\\left[f^*(\\mathbf{ x})\\right] +\\sigma^2,\n", "$$ where the bias is given by $$\n", " \\text{bias}\\left[f^*(\\mathbf{ x})\\right] =\n", "\\mathbb{E}\\left[f^*(\\mathbf{ x})\\right] - f(\\mathbf{ x}),\n", "$$ with $f(\\mathbf{ x})$ denoting the true function and $f^*(\\mathbf{ x})$ the\n", "fitted model. The bias summarizes the error that arises from the model’s inability to\n", "represent the underlying complexity of the data. For example, if we were\n", "to model the marathon pace of the winning runner from the Olympics by\n", "computing the average pace across time, then that model would exhibit\n", "*bias* error because the reality is that Olympic marathon pace is\n", "changing (typically getting faster).\n", "\n", "The variance term is given by $$\n", " \\text{variance}\\left[f^*(\\mathbf{ x})\\right] = \\mathbb{E}\\left[\\left(f^*(\\mathbf{ x}) - \\mathbb{E}\\left[f^*(\\mathbf{ x})\\right]\\right)^2\\right].\n", " $$ The variance term is often described as arising from a model that\n", "is too complex, but we have to be careful with this idea. Is the model\n", "really too complex relative to the real world that generates the data?\n", "The real world is a complex place, and it is rare that we are\n", "constructing mathematical models that are more complex than the world\n", "around us. Rather, the ‘too complex’ refers to our ability to estimate the\n", "parameters of the model given the data we have. Slight variations in the\n", "training set cause changes in prediction.\n", "\n", "Models that exhibit high variance are sometimes said to ‘overfit’ the\n", "data, whereas models that exhibit high bias are sometimes described as\n", "‘underfitting’ the data."
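] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The decomposition can be estimated numerically. The sketch below is a minimal\n", "illustration under assumed choices (a sinusoidal true function, Gaussian noise\n", "and a fixed-degree polynomial fit): it repeatedly samples training sets, refits\n", "the model and uses the spread of the resulting predictions to estimate the bias\n", "and variance terms." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "# Monte Carlo estimate of the bias and variance terms for a polynomial fit.\n", "# The true function, noise level and polynomial degree are illustrative choices.\n", "rng = np.random.default_rng(0)\n", "def true_f(x):\n", "    return np.sin(np.pi*x)\n", "\n", "x_grid = np.linspace(-1, 1, 50)   # test inputs where predictions are compared\n", "degree = 3                        # try 1 (more bias) or 12 (more variance)\n", "n_train, n_repeats, noise = 20, 500, 0.3\n", "\n", "predictions = np.zeros((n_repeats, x_grid.size))\n", "for i in range(n_repeats):\n", "    x = rng.uniform(-1, 1, n_train)\n", "    y = true_f(x) + noise*rng.standard_normal(n_train)\n", "    predictions[i] = np.polyval(np.polyfit(x, y, degree), x_grid)\n", "\n", "bias2 = np.mean((predictions.mean(axis=0) - true_f(x_grid))**2)\n", "variance = np.mean(predictions.var(axis=0))\n", "print('average squared bias:', bias2)\n", "print('average variance:', variance)"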
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Bias vs Variance Error Plots\n", "----------------------------\n", "\n", "Helper function for sampling data from two different classes." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def create_data(per_cluster=30):\n", "    \"\"\"Create a randomly sampled data set\n", "    \n", "    :param per_cluster: number of points in each cluster\n", "    \"\"\"\n", "    X = []\n", "    y = []\n", "    scale = 3\n", "    prec = 1/(scale*scale)\n", "    pos_mean = [[-1, 0],[0,0.5],[1,0]]\n", "    pos_cov = [[prec, 0.], [0., prec]]\n", "    neg_mean = [[0, -0.5],[0,-0.5],[0,-0.5]]\n", "    neg_cov = [[prec, 0.], [0., prec]]\n", "    for mean in pos_mean:\n", "        X.append(np.random.multivariate_normal(mean=mean, cov=pos_cov, size=per_cluster))\n", "        y.append(np.ones((per_cluster, 1)))\n", "    for mean in neg_mean:\n", "        X.append(np.random.multivariate_normal(mean=mean, cov=neg_cov, size=per_cluster))\n", "        y.append(np.zeros((per_cluster, 1)))\n", "    return np.vstack(X), np.vstack(y).flatten()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Helper function for plotting the decision boundary of the SVM." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def plot_contours(ax, cl, xx, yy, **params):\n", "    \"\"\"Plot the decision boundaries for a classifier.\n", "\n", "    :param ax: matplotlib axes object\n", "    :param cl: a classifier\n", "    :param xx: meshgrid ndarray\n", "    :param yy: meshgrid ndarray\n", "    :param params: dictionary of params to pass to contourf, optional\n", "    \"\"\"\n", "    Z = cl.decision_function(np.c_[xx.ravel(), yy.ravel()])\n", "    Z = Z.reshape(xx.shape)\n", "    # Plot decision boundary and regions\n", "    out = ax.contour(xx, yy, Z, \n", "                     levels=[-1., 0., 1], \n", "                     colors='black', \n", "                     linestyles=['dashed', 'solid', 'dashed'])\n", "    out = ax.contourf(xx, yy, Z, \n", "                      levels=[Z.min(), 0, Z.max()], \n", "                      colors=[[0.5, 1.0, 0.5], [1.0, 0.5, 0.5]])\n", "    return out" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import urllib.request" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "urllib.request.urlretrieve('https://raw.githubusercontent.com/lawrennd/talks/gh-pages/mlai.py','mlai.py')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import mlai\n", "import os" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def decision_boundary_plot(models, X, y, axs, filename, titles, xlim, ylim):\n", "    \"\"\"Plot a decision boundary on the given axes\n", "    \n", "    :param axs: the axes to plot on.\n", "    :param models: the SVM models to plot\n", "    :param titles: the titles for each axis\n", "    :param X: input training data\n", "    :param y: target training data\"\"\"\n", "    for ax in axs.flatten():\n", "        ax.clear()\n", "    X0, X1 = X[:, 0], X[:, 1]\n", "    if xlim is None:\n", "        xlim = [X0.min()-1, X0.max()+1]\n", "    if 
ylim is None:\n", " ylim = [X1.min()-1, X1.max()+1]\n", " xx, yy = np.meshgrid(np.arange(xlim[0], xlim[1], 0.02),\n", " np.arange(ylim[0], ylim[1], 0.02))\n", " for cl, title, ax in zip(models, titles, axs.flatten()):\n", " plot_contours(ax, cl, xx, yy,\n", " cmap=plt.cm.coolwarm, alpha=0.8)\n", " ax.plot(X0[y==1], X1[y==1], 'r.', markersize=10)\n", " ax.plot(X0[y==0], X1[y==0], 'g.', markersize=10)\n", " ax.set_xlim(xlim)\n", " ax.set_ylim(ylim)\n", " ax.set_xticks(())\n", " ax.set_yticks(())\n", " ax.set_title(title)\n", " mlai.write_figure(os.path.join(filename),\n", " figure=fig,\n", " transparent=True)\n", " return xlim, ylim" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib\n", "font = {'family' : 'sans',\n", " 'weight' : 'bold',\n", " 'size' : 22}\n", "\n", "matplotlib.rc('font', **font)\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn import svm" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create an instance of SVM and fit the data. \n", "C = 100.0 # SVM regularization parameter\n", "gammas = [0.001, 0.01, 0.1, 1]\n", "\n", "\n", "per_class=30\n", "num_samps = 20\n", "# Set-up 2x2 grid for plotting.\n", "fig, ax = plt.subplots(1, 4, figsize=(10,3))\n", "xlim=None\n", "ylim=None\n", "for samp in range(num_samps):\n", " X, y=create_data(per_class)\n", " models = []\n", " titles = []\n", " for gamma in gammas:\n", " models.append(svm.SVC(kernel='rbf', gamma=gamma, C=C))\n", " titles.append('$\\gamma={}$'.format(gamma))\n", " models = (cl.fit(X, y) for cl in models)\n", " xlim, ylim = decision_boundary_plot(models, X, y, \n", " axs=ax, \n", " filename='./ml/bias-variance{samp:0>3}.svg'.format(samp=samp), \n", " titles=titles,\n", " xlim=xlim,\n", " ylim=ylim)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%pip install --upgrade git+https://github.com/sods/ods" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pods\n", "from ipywidgets import IntSlider" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pods.notebook.display_plots('bias-variance{samp:0>3}.svg', \n", " directory='./ml', \n", " samp=IntSlider(0,0,10,1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n", "\n", "Figure: In each figure the simpler model is on the left, and the more\n", "complex model is on the right. Each fit is done to a different version\n", "of the data set. The simpler model is more consistent in its errors\n", "(bias error), whereas the more complex model is varying in its errors\n", "(variance error)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Double Descent\n", "--------------\n", "\n", "One of Breiman’s ideas for improving predictive performance is known as\n", "bagging (Breiman, 1996). The idea is to train a number of models on the\n", "data such that they overfit (high variance). Then average the\n", "predictions of these models. The models are trained on different\n", "bootstrap samples (Efron, 1979) and their predictions are aggregated\n", "giving us the acronym, Bagging. 
By combining decision trees with\n", "bagging, we recover random forests (Breiman, 2001b).\n", "\n", "Bias and variance can also be estimated through Professor Efron’s\n", "bootstrap (Efron, 1979), and the traditional view has been that there’s\n", "a form of Goldilocks effect, where the best predictions are given by the\n", "model that is ‘just right’ for the amount of data available. Not too\n", "simple, not too complex. The idea is that bias decreases with increasing\n", "model complexity and variance increases with increasing model\n", "complexity. Typically, plots begin with the Mummy bear on the left (too\n", "much bias), end with the Daddy bear on the right (too much variance) and\n", "show a dip in the middle where the Baby bear (just right) finds\n", "themselves.\n", "\n", "The Daddy bear is typically positioned at the point where the model is\n", "able to exactly interpolate the data. For a generalized linear model\n", "(McCullagh and Nelder, 1989), this is the point at which the number of\n", "parameters is equal to the number of data[1]. But the modern empirical\n", "finding is that when we move beyond Daddy bear, into the dark forest of\n", "the massively overparameterized model, we can achieve good\n", "generalization.\n", "\n", "As Zhang et al. (2017) starkly illustrated with their random labels\n", "experiment, within the dark forest there are some terrible places, big\n", "bad wolves of overfitting that will gobble up your model. But as\n", "empirical evidence shows, there is also a safe and hospitable Grandma’s\n", "house where these highly overparameterized models are safely consumed.\n", "Fundamentally, it must be about the route you take through the forest,\n", "and the precautions you take to ensure the wolf doesn’t see where you’re\n", "going and beat you to the door.\n", "\n", "There are two implications of this empirical result. Firstly, that there\n", "is a great deal of new theory that needs to be developed. Secondly, that\n", "theory is now obliged to conflate two aspects of modelling that we\n", "generally like to keep separate: the model and the algorithm.\n", "\n", "Classical statistical theory around predictive generalization focusses\n", "specifically on the class of models that is being used for data fitting.\n", "Historically, whether that theory follows a Fisher-aligned estimation\n", "approach (see e.g. Vapnik (1998)) or a model-based Bayesian approach (see\n", "e.g. Ghahramani (2015)), neither is fully equipped to deal with these\n", "new circumstances because, to continue our rather tortured analogy,\n", "these theories provide us with a characterization of the *destination*\n", "of the algorithm, and seek to ensure that we reach that destination.\n", "Modern machine learning requires theories of the *journey* and what our\n", "route through the forest should be.\n", "\n", "Crucially, the destination is always associated with 100% accuracy on\n", "the training set, an objective that is always achievable for the\n", "overparameterized model.\n", "\n", "Intuitively, it seems that a highly overparameterized model places\n", "Grandma’s house on the edge of the dark forest, making it easily and\n", "quickly accessible to the algorithm. The larger the model, the more\n", "exposed Grandma’s house becomes. Perhaps some form of\n", "blessing of dimensionality brings Grandma’s house closer to the edge of\n", "the forest in a high dimensional setting. Really, we should think of\n", "Grandma’s house as a low dimensional manifold of destinations that are\n", "safe: a path through the forest where the wolf of overfitting doesn’t\n", "venture. In the GLM case, we know already that when the number of\n", "parameters matches the number of data, there is precisely one location in\n", "parameter space where accuracy on the training data is 100%. Our\n", "previous misunderstanding of generalization stemmed from the fact that\n", "(seemingly) it is highly unlikely that this single point is a good place\n", "to be from the perspective of generalization. Additionally, it is often\n", "difficult to find. Finding the precise polynomial coefficients in a\n", "least squares regression to exactly fit the basis to a small data set\n", "such as the Olympic marathon data requires careful consideration of the\n", "numerical properties and an orthogonalization of the design matrix\n", "(Lawson and Hanson, 1995).\n", "\n", "It seems that with a highly overparameterized model, these locations\n", "become easier to find and they provide good generalization properties.\n", "In machine learning this is known as the “double descent phenomenon”\n", "(see e.g. Belkin et al. (2019)).\n", "\n", "As Professor Efron points out, modern machine learning models are often\n", "fitted using many millions of data points. The most extreme example of\n", "late is known as GPT-3. This neural network model, known as a\n", "Transformer, has in its largest form 175 billion parameters. The model\n", "was trained on a data set containing 499 billion tokens (about 2\n", "terabytes of text). Estimates suggest that the model costs around \\$4.5\n", "million to train (see e.g. Li (2020)).\n", "\n", "[1] Assuming we are ignoring parameters in the link function and the\n", "distribution function."
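] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The sketch below is a hedged illustration of this double descent behaviour in a\n", "linear-in-the-parameters setting. It sweeps the number of random nonlinear\n", "features through the interpolation point (features equal to data) using a\n", "minimum-norm least-squares fit. The feature construction, data sizes and noise\n", "level are assumptions made for illustration rather than the experiments of\n", "Belkin et al. (2019); the test error typically peaks near the interpolation\n", "point before falling again as the model grows." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "# Minimum-norm least squares on random ReLU features, sweeping the number of\n", "# features through the interpolation point (n_features == n_train) and beyond.\n", "rng = np.random.default_rng(0)\n", "n_train, n_test = 40, 500\n", "x_train = rng.uniform(-1, 1, (n_train, 1))\n", "x_test = rng.uniform(-1, 1, (n_test, 1))\n", "y_train = np.sin(3*x_train).ravel() + 0.1*rng.standard_normal(n_train)\n", "y_test = np.sin(3*x_test).ravel()\n", "\n", "for n_features in [5, 10, 20, 40, 80, 160, 320]:\n", "    feature_rng = np.random.default_rng(1)   # same random features for train and test\n", "    W = feature_rng.standard_normal((1, n_features))\n", "    b = feature_rng.uniform(-1, 1, n_features)\n", "    Phi_train = np.maximum(x_train @ W + b, 0.0)\n", "    Phi_test = np.maximum(x_test @ W + b, 0.0)\n", "    # lstsq returns the minimum-norm solution once the system is underdetermined.\n", "    w, *_ = np.linalg.lstsq(Phi_train, y_train, rcond=None)\n", "    print(n_features, 'features, test error:', np.mean((Phi_test @ w - y_test)**2))"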
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Empirical Effectiveness of Deep Learning\n", "----------------------------------------\n", "\n", "The OpenAI model represents just a recent example from a wave of *deep*\n", "neural network models that has proved highly performant across a range of\n", "challenges that were previously seen as being beyond our statistical\n", "modelling capabilities.\n", "\n", "They stem from the courage of a group of researchers who saw that\n", "methods were improving with increasing data and chose to collect and\n", "label data sets of ever increasing size, in particular the ImageNet team\n", "led by Fei-Fei Li (Russakovsky et al., 2015), who collected a large data\n", "set of images for object detection (currently 14 million images). To\n", "make these neural network methods work on such large data sets, new\n", "implementations were required. By deploying neural network training\n", "algorithms on graphics processing units (GPUs), breakthrough results were\n", "achieved on these large data sets (Krizhevsky et al., n.d.). Similar\n", "capabilities have since been shown in the domains of face identification\n", "(Taigman et al., 2014), speech recognition (Hinton et al., 2012),\n", "translation (Sutskever et al., 2014) and language modelling (Devlin et\n", "al., 2019; Radford et al., 2019).\n", "\n", "Impressive though these performances are, they are reliant on massive\n", "data and enormous costs of training. Yet they can be seen through the\n", "lens of regression, as outlined by Professor Efron in his paper. They\n", "map from inputs to outputs. For language modelling, extensive use of\n", "auto-regression allows for sequences to be generated."
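] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a toy illustration of what auto-regressive generation means (a deliberately\n", "tiny assumption, far from the Transformer architecture of GPT-3), the sketch\n", "below samples a sequence one symbol at a time, each symbol drawn from a\n", "conditional distribution given the symbol that precedes it." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "# Toy auto-regressive generation: each symbol is sampled conditioned on the\n", "# previous one. The bigram table stands in for a learned model of\n", "# p(next symbol | context); in a large language model the context is much longer.\n", "rng = np.random.default_rng(0)\n", "vocabulary = ['a', 'b', 'c']\n", "transition = np.array([[0.1, 0.6, 0.3],\n", "                       [0.5, 0.2, 0.3],\n", "                       [0.3, 0.3, 0.4]])  # assumed conditional probabilities\n", "\n", "sequence = [0]  # start from 'a'\n", "for _ in range(20):\n", "    sequence.append(rng.choice(len(vocabulary), p=transition[sequence[-1]]))\n", "print(''.join(vocabulary[i] for i in sequence))"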
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "New Methods Required\n", "====================\n", "\n", "The challenges of big data emerged due to companies being able to\n", "rapidly interconnect multiple sources of information. This leads to\n", "challenges in storage, distributed processing and modeling. Google’s\n", "search engine was at the forefront of this revolution in data\n", "processing. Google researchers were among the first to notice that some\n", "tasks can be easily resolved with fairly simple models and very large\n", "data sets (Halevy et al., 2009). What we are now learning is that many\n", "tasks can be solved with complex models and even bigger data sets.\n", "\n", "While GPT-3 does an impressive job on language generation, it can do so\n", "because of the vast quantities of language data we have made available\n", "to it. What happens if we take a more complex system, for which such\n", "vast data is not available. Or, at least not available in the\n", "homogeneous form that language data can be found. Let’s take human\n", "health.\n", "\n", "Consider we take a holistic view of health and the many ways in which we\n", "can become unhealthy, through genomic and environmental effects.\n", "Firstly, let’s remember that we don’t have a full understanding, even on\n", "all the operations in a single eukaryotic cell. Indeed, we don’t even\n", "understand all the mechanisms by which transcription and translation\n", "occur in bacterial and archaeal cells. That is despite the wealth of\n", "gene expression and protein data about these cells. Even if we were to\n", "pull all this information together, would it be enough to develop that\n", "understanding?\n", "\n", "There are large quantities of data, but the complexity of the underlying\n", "systems in these domains, even the terabytes of data we have today may\n", "not be enough to determine the parameters of such complex models.\n", "\n", "\n", "\n", "Machine learning involves taking data and combining it with a model in\n", "order to make a prediction. The data consist of measurements recorded\n", "about the world around us. A model consists of our assumptions about how\n", "the data is likely to interrelate, typical assumptions include\n", "smoothness. Our assumptions reflect some underlying belief about the\n", "regularities of the universe that we expect to hold across a range of\n", "data sets. $$\n", "\\text{data} + \\text{model} \\stackrel{\\text{algorithm}}{\\rightarrow} \\text{prediction}\n", "$$ {The data and the model are combined in computation through an\n", "algorithm. The etymology of the data indicates that it is given (data\n", "comes from Latin *dare*). In some cases, for example an approach known\n", "as active learning, we have a choice as to how the data is gotten. But\n", "our main control is over the model and the algorithm.\n", "\n", "This is true for both statisticians and machine learning scientists.\n", "Although there is a difference in the core motivating philosophy. The\n", "mathematical statisticians were motivated by a desire to remove\n", "subjectivity from the analysis, reducing the problem to rigorous\n", "statistical proof. The statistician is nervous of the inductive biases\n", "humans exhibit when drawing conclusions from data. Machine learning\n", "scientists, on the other hand, sit closer to the artificial intelligence\n", "community. 
Traditionally, they are inspired by human inductive biases to\n", "develop algorithms that allow computers to emulate human performance on\n", "tasks. In the past I’ve summarized the situation as\n", "\n", "> Statisticians want to turn humans into computers, machine learners\n", "> want to turn computers into humans. Neither is possible so we meet\n", "> somewhere in the middle." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Traditional Model-Algorithm Separation\n", "--------------------------------------\n", "\n", "Across both statistics and machine learning, the traditional view was\n", "that the modeling assumptions are the key to making good predictions.\n", "Those assumptions might include smoothness assumptions, or linearity\n", "assumptions. In some domains we might also wish to incorporate some of\n", "our mechanistic understanding of the data (see e.g. Álvarez et al.\n", "(2013)). The paradigm of model-based machine learning (Winn et al.,\n", "n.d.) builds on the idea that the aim of machine learning is to\n", "describe one’s views about the world as accurately as possible within a\n", "model. The domain expert becomes the model-designer. The process of\n", "algorithm design is then automated to as great an extent as possible.\n", "This idea originates in the ground-breaking work of the MRC\n", "Biostatistics Unit on BUGS, which dates to 1997 (see e.g. Lunn et al. (2009)). It\n", "is no surprise that this notion has gained most traction in the Bayesian\n", "community, because the probabilistic philosophy promises the separation\n", "of modeling and inference. As long as the probabilistic model we build\n", "is complex enough to capture the true generating process, we can\n", "separate the process of model building and probabilistic inference.\n", "Inference becomes turning the handle on the machine. Unfortunately, the\n", "handle turning in Bayesian inference involves high dimensional integrals\n", "and much of the work in this domain has focused on developing new\n", "methods of inference based around either sampling (see e.g. Carpenter et\n", "al. (2017)) or deterministic approximations (see e.g. Tran et al.\n", "(2016)).\n", "\n", "There are two principal challenges for model-based machine learning. The\n", "first is the model design challenge, and the second is the algorithm\n", "design challenge. The basic philosophy of the model-based approach is to\n", "make it as easy as possible for experts to express their ideas in a\n", "modeling language (typically probabilistic) and then automate as much as\n", "possible the algorithmic process of fitting that model to data\n", "(typically probabilistic inference).\n", "\n", "The challenge of combining that model with the data, the algorithm\n", "design challenge, is then the process of probabilistic inference.\n", "\n", "The model is a mathematical abstraction of the regularities of the\n", "universe that we believe underlie the data as collected. If the model is\n", "well-chosen, we will be able to interpolate the data and predict likely\n", "values of future data points. If it is chosen badly, our predictions will\n", "be overconfident and wrong.\n", "\n", "Deep learning methods conflate two aspects that we used to try to keep\n", "distinct. The mathematical model encodes our assumptions about the data.\n", "The algorithm is a set of computational instructions that combine our\n", "modeling assumptions with data to make predictions.\n", "\n", "Much of the technical focus in machine learning is on algorithms. 
In\n", "this document I want to retain a strong separation between the *model*\n", "and the *algorithm*. The model is a mathematical abstraction of the\n", "world that encapsulates our assumptions about the data. Normally it will\n", "depend on one or more parameters which are adaptable. The algorithm\n", "provides a procedure for adapting the model to different contexts, often\n", "through the provision of a set of data that is used for training the\n", "model.\n", "\n", "Despite the different role of model and algorithm, the two concepts are\n", "often conflated. This sometimes leads to a confused discussion. I was\n", "recently asked “Is it correct to remove the mean from the data before\n", "you do principal component analysis.” This question is about an\n", "algorithmic procedure, but the correct answer depends on what modelling\n", "assumption you are seeking to make when you are constructing your\n", "principal component analysis. Principal component analysis was\n", "originally proposed by a *model* for data by (Hotelling, 1933). It is a\n", "latent variable model that was directly inspired by work in the social\n", "sciences on factor analysis. However, much of our discussion of PCA\n", "today focusses on PCA as an algorithm. The algorithm for fitting the PCA\n", "model is to seek the eigenvectors of the covariance matrix, and people\n", "often refer to this algorithm as principal component analysis. However,\n", "that algorithm also finds the linear directions of maximum variance in\n", "the data. Seeking directions of maximum variance in the data was not the\n", "objective of Hotelling, but it is related to a challenge posed by\n", "Pearson (1901) who sought a variant of regression that predicted\n", "symmetrically regardless of which variable was considered to be the\n", "covariate and which variable the response. Coincidentally the algorithm\n", "for this model is also the eigenvector decomposition of the covariance\n", "matrix. However, the underlying model is different. The difference\n", "becomes clear when you begin to seek non-linear variants of principal\n", "component analysis. Depending on your interpretation (finding directions\n", "of maximum variance in the data or a latent variable model) the\n", "corresponding algorithm differs. For the Pearson model a valid\n", "non-linearization is kernel PCA (Schölkopf et al., 1998), but for the\n", "Hotelling model this generalization doesn’t make sense. A valid\n", "generalization of the Hotelling model is provided by the Gaussian\n", "process latent variable model (Lawrence, 2005). This confusion is often\n", "unhelpful, so for the moment we will leave algorithmic considerations to\n", "one side and focus *only* on the model." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Is my Model Useful?\n", "-------------------\n", "\n", "In the first half of the 20th Century, the Bayesian approach was termed\n", "*inverse probability*. Fisher disliked the subjectivist nature of the\n", "approach (Aldrich, 2008) and introduced the term *Bayesian* in 1950 to\n", "distinguish these probabilities from the likelihood function (Fisher,\n", "1950). The word Bayesian has connotations of a strand of religious\n", "thinking or a cult.[1] Whether Fisher was fully aware of this when he\n", "coined the term we cannot know, but there is a faith-based-tenet that at\n", "the heart of Bayesian ideas.\n", "\n", "Many critics of the Bayesian approach ask where the Bayesian prior comes\n", "from. 
But one might just as well ask, where does the likelihood come\n", "from? Or where does the link function come from? All of these are\n", "subjective modeling questions. The prior is not the problem. The\n", "challenge is providing objective guarantees for our subjective model\n", "choices. The classical Bayesian approach provides guarantees, but only\n", "for the $\\mathcal{M}$-closed domain (Bernardo and Smith, 1994), where\n", "the *true* model is one of the models under consideration. This is the\n", "critical belief at the heart of the Church of Bayes: the doctrine of\n", "model correctness.\n", "\n", "The beauty of the Bayesian approach is that you don’t have to have much\n", "imagination. You work as hard as possible with your models of\n", "probability distributions to represent the problem as best you\n", "understand it, then you work as hard as possible to approximate the\n", "posterior distributions of interest and estimate the marginal\n", "likelihoods and any corresponding Bayes’s factors. If we have faith in\n", "the doctrine of model correctness, then we can pursue our modeling aims\n", "without the shadows of doubt to disturb us.\n", "\n", "Bayesian approaches have a lot in common with more traditional\n", "regularization techniques. Ridge regression imposes L2 shrinkage on the\n", "model’s weights and has an interpretation as the maximum a posteriori\n", "estimate of the Bayesian solution. For linear models the mathematics of\n", "the two approaches is strikingly similar, and they both assume stages of\n", "careful design of model/regularizer followed by either Bayesian\n", "inference or optimization. The situation with the new wave of\n", "overparameterized models is strikingly different.\n", "\n", "[1] This was pointed out to me by Bernhard Schölkopf, who by\n", "recollection credited the observation to the philosopher David Corfield." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Implicit Regularization\n", "\n", "The new wave of overparameterized machine learning methods are *not*\n", "making this conceptual separation between the model and the algorithm.\n", "Deep learning approaches are simple algorithmically, but highly complex\n", "mathematically. By this I mean that it is relatively simple to write\n", "code that creates algorithms for optimizing the neural network models,\n", "particularly with the wide availability of automatic differentiation\n", "software. But analyzing what these algorithms are doing mathematically\n", "is much harder. Fortunately, insight can already be gained by\n", "considering overparameterized models in the *linear* paradigm.\n", "\n", "These analyses consider ‘gradient flow’ algorithms. A gradient flow is\n", "an idealization of the optimization where the usual stochastic gradient\n", "descent optimization (Robbins and Monro, 1951) is replaced with\n", "differential equations representing the idealized trajectory of a\n", "*batch* gradient descent approach. Under these analyses, highly\n", "overparameterized linear models can be shown to converge to the L2 norm\n", "solution (see e.g. Soudry et al. (2018)). It seems that stacking layers\n", "of networks also has interesting effects, because even when *linear*\n", "models are stacked analysis of the learning algorithm indicates a\n", "tendency towards low rank linear solutions for the parameters (Arora,\n", "Cohen, Golowich, et al., 2019). 
These regularization approaches are\n", "known as *implicit* regularization, because the regularization is\n", "implicit in the optimizer rather than explicitly declared by the\n", "model-designer. These theoretical insights have also proven useful in\n", "practice. Arora, Cohen, Hu, et al. (2019) show state-of-the-art\n", "performance for matrix factorization through exploiting implicit\n", "regularization.\n", "\n", "Studying the implicit regularization of these modern overparameterized\n", "machine learning methods is likely to prove a fruitful avenue for\n", "theoretical investigation that will deliver deeper understanding of how\n", "we can design traditional statistical models. But the conflation of\n", "model and algorithm can make it hard to unpick what design choices are\n", "influencing which aspects of the model.\n", "\n", "While I’ve motivated much of this paper through the lens of happenstance\n", "data, the training sets that are used with these deep learning\n", "models have striking similarities to classical data acquisition. The\n", "deep learning revolution is really a revolution of ambition in the scale\n", "of data which we are collecting. The data set which drove this paradigm\n", "shift in data collection was ImageNet (see e.g. Russakovsky et al.\n", "(2015)). It was only with the millions of labeled images available in\n", "the data that Fei-Fei Li’s team assembled that these highly\n", "overparameterized models were able to significantly differentiate\n", "themselves from traditional approaches (Krizhevsky et al., n.d.).\n", "Collection of such massive data is statistical design on an\n", "unprecedented scale requiring massive human effort for labeling. That\n", "doesn’t sit well with the notion of happenstance data, which by its\n", "nature is accumulating fast and is only lightly curated if at all. So,\n", "while acknowledging the importance and benefits of implicit\n", "regularization we will revert to the conceptual separation between model\n", "and algorithm as we consider what approaches might be useful for this\n", "new data landscape." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Reviewing the Traditional Paradigm\n", "----------------------------------\n", "\n", "The modern Bayesian has a suite of different software available to make\n", "it easier for her to design models and perform inference in the\n", "resulting design. Given this extra time, it might depress her to reflect\n", "on the fact that the entire premise of the approach is mistaken unless\n", "you are in the $\mathcal{M}$-closed domain. So, when do we fall outside\n", "this important domain? According to Box (1976), all the time.\n", "\n", "> All models are wrong, but some are useful\n", "\n", "Box’s important quote has become worn by overuse (like a favorite\n", "sweater). Indeed, I most often see it quoted at the beginning of a talk\n", "in a way that confuses correlation with causality. Presentations proceed\n", "in the following way. (1) Here is my model. (2) It is wrong. (3) Here is\n", "George Box’s quote. (4) My model is wrong, but it might be useful.\n", "Sometimes I feel at stage (4) a confusion about the arrow of causality\n", "occurs; it feels to me that people are almost saying “*Because* my model\n", "is wrong it *might* be useful.”\n", "\n", "Perhaps we should focus more on the quote from the same paper[1]\n", "\n", "> the scientist must be alert to what is importantly wrong.\n", "> 
It is\n", "> inappropriate to be concerned about mice when there are tigers abroad.\n", ">\n", "> Box (1976)\n", "\n", "Let’s have a think about where the tigers might be in the domain of big\n", "data. To consider this, let’s first see what we can write down about our\n", "data that isn’t implicitly wrong. If we are interested in multivariate\n", "data, we could first write down our data as a *design matrix* $$\n", "\text{data} = \mathbf{Y} \in \Re^{n\times p}.\n", "$$ Here we are assuming we have $n$ data points and $p$ features. As\n", "soon as we write down our data in this form it invites particular\n", "assumptions about the data that may have been valid in the 1930s, when\n", "there was more focus on survey data. Experimental designs stipulated a\n", "table of data with a specific set of hypotheses in mind. The data\n", "naturally sat in a matrix.\n", "\n", "The first thing that I was taught about probabilistic modeling was\n", "i.i.d. noise. And as soon as we see a matrix of data, I believe it is\n", "second nature for most of us to start writing down factorization\n", "assumptions. In particular, an independence assumption across the $n$\n", "data points, giving the likelihood function $$\n", "p(\mathbf{Y}|\boldsymbol{ \theta}) = \prod_{i=1}^{n} p(\mathbf{ y}_{i, :}|\boldsymbol{ \theta}).\n", "$$ This factorization gives\n", "both practical and theoretical advantages. It allows us to show that by\n", "optimizing the likelihood function, we are minimizing a sample-based\n", "approximation to a Kullback-Leibler divergence between our model and the\n", "true data generating density (see e.g. Wasserman (2003)), $$\n", "\text{KL}\left( \Pr(\mathbf{Y}) \,\|\, p(\mathbf{Y}|\boldsymbol{ \theta}) \right) \approx -\frac{1}{n}\sum_{i=1}^{n} \log p(\mathbf{ y}_{i, :}|\boldsymbol{ \theta}) + \text{const.},\n", "$$ where\n", "$\Pr(\mathbf{Y})$ is the true data generating distribution.\n", "\n", "From a pragmatist’s perspective, the assumption allows us to make\n", "predictions about new data points given a parameter vector that is\n", "derived from the training data. If the uncertainty in the system is\n", "truly independent between different data observations, then the\n", "information about the data is entirely stored in our model parameters,\n", "$\boldsymbol{ \theta}$.\n", "\n", "Much of classical statistics focusses on encouraging the data to conform\n", "to this independence assumption, for example through randomizing the\n", "design to distribute the influence of confounders and through selection\n", "of appropriate covariates. Or on methodologies that analyze the model\n", "fit to verify the validity of these assumptions, for example residual\n", "analysis.\n", "\n", "From a Bayesian perspective, this assumption *is* only an assumption\n", "about the nature of the residuals. It is *not* a model of the process of\n", "interest. The philosophical separation of aleatory uncertainty and\n", "epistemic uncertainty occurs here. Probability is being used only to\n", "deal with the aleatory uncertainty.\n", "\n", "In the world of happenstance data, there is insufficient influence from\n", "the model-designer on the way that the data is collected to rely on\n", "randomization. We find ourselves needing to explicitly model\n", "relationships between confounders. This requires us to be more\n", "imaginative about our probabilistic models than pure independence\n", "assumptions.\n", "\n", "An additional challenge arising from this independence assumption is the\n", "domain where the number of data features, $p$, is larger than the number\n", "of data points, $n<p$.\n", "\n", "Figure: The most general graphical model. It makes no assumptions\n", "about conditional probability relationships between variables in the\n", "vector $\mathbf{ y}$.\n", "
It represents the unconstrained probability\n", "distribution $p(\mathbf{ y})$.\n", "\n", "Figure gives a graphical representation of a model that’s not wrong,\n", "just not useful. I like graphical representations of probabilistic\n", "models and this is my favorite graph. It is the simplest graph but also\n", "the most general model. It says that all the data in our vector\n", "$\mathbf{ y}$ is governed by an unspecified probability distribution\n", "$p(\mathbf{ y})$.\n", "\n", "Graphical models normally express the conditional independence\n", "relationships in the data; with this graph we are not a priori\n", "considering any such relationships. This is the most general model (it\n", "includes all factorized models as special cases). It is not wrong, but\n", "since it doesn’t suggest what the next steps are or give us any handles\n", "on the problem it is also not useful." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Big Data Consistency\n", "--------------------\n", "\n", "This is the nature of modern streaming data, what has been called big\n", "data, although in the UK it looks like that term will gain a more\n", "diffuse meaning now that the government has associated a putative 189\n", "billion pounds of funding with it. But the characteristic of massive\n", "missing data is particularly obvious when we look at clinical domains.\n", "EMIS, a Yorkshire based provider of software to General Practitioners,\n", "has 39 million patient records. When we consider clinical measurements,\n", "we need to build models that not only take into account all current\n", "clinical tests, but all tests that will be invented in the future. This\n", "leads to the idea of massive missing data. The classical statistical\n", "table of data is merely the special case where someone has filled in a\n", "block of information.\n", "\n", "To deal with massively missing data we need to think about the\n", "*Kolmogorov consistency* of a process. Let me introduce Kolmogorov\n", "consistency by way of an example heard from Tony O’Hagan, but one that\n", "he credits originally to Michael Goldstein. Imagine you are on jury\n", "duty. You are asked to adjudicate on the guilt or innocence of Lord\n", "Safebury, and you are going to base your judgement on a model that is\n", "weighing all the evidence. You are just about to pronounce your decision\n", "when a maid comes running in and shouts “He didn’t do it! He didn’t do\n", "it!”. The maid wasn’t on the witness list and isn’t accounted for in\n", "your model. How does this affect your inference? The pragmatist’s answer\n", "might be: “not at all, because the maid wasn’t in the model.” But in the\n", "interests of justice we might want to include this information in our\n", "inference process. If, as a result of the maid’s entry, we now think it\n", "is less likely that Lord Safebury committed the crime, then necessarily\n", "every time that the (unannounced) maid doesn’t enter the room we have to\n", "assume that it is more likely that Safebury committed the crime (to\n", "ensure that the conditional probability of guilt given the maid’s\n", "evidence normalizes). But we didn’t know about the maid, so how can we\n", "account for this? Further, how can we account for all possible other\n", "surprise evidence, from the unannounced butlers, gardeners, chauffeurs and\n", "footmen? Kolmogorov consistency says that the net effect of\n", "marginalizing for all these potential bits of new information is null.\n", "It is a particular property of the model.\n", "
Making it (only slightly) more\n", "formal, we can consider Kolmogorov consistency as a marginalization\n", "property of the model. We take the $n$ dimensional vector,\n", "$\mathbf{ y}$, to be an (indexed) vector of all our instantiated\n", "observations of the world that we have *at the current time*. Then we\n", "take the $n^*$ dimensional vector, $\mathbf{ y}^*$ to be the\n", "observations of the world that we are *yet to see*. From the sum rule of\n", "probability we have $$\n", "p(\mathbf{ y}|n^*) = \int p(\mathbf{ y}, \mathbf{ y}^*) \text{d}\mathbf{ y}^*,\n", "$$ where the dependence of the marginal distribution\n", "for $\mathbf{ y}$ arises from the fact that we are forming an $n^*$\n", "dimensional integral over $\mathbf{ y}^*$. If our distribution is\n", "Kolmogorov consistent, then we know that the distribution over\n", "$\mathbf{ y}$ is *independent* of the value of $n^*$. So, in other words\n", "$p(\mathbf{ y}|n^*)=p(\mathbf{ y})$. Kolmogorov consistency says that the\n", "form of $p(\mathbf{ y})$ remains the same *regardless* of the number of\n", "observations of the world that are yet to come." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Parametric Models\n", "-----------------\n", "\n", "We can achieve Kolmogorov consistency almost trivially in a parametric\n", "model if we assume that the probability distribution is independent\n", "given the parameters. Then the property of being closed under\n", "marginalization is trivially satisfied through the independence, which\n", "allows us to marginalize for all future data leaving a joint\n", "distribution which isn’t dependent on $n^*$ because each future data\n", "point can be marginalized independently. But, as we’ve already argued,\n", "this involves an assumption that is often flawed in practice. It is\n", "unlikely that, in a complex model, we will be able to determine the\n", "parameter vector well enough, given limited data, for us to truly\n", "believe that all the information about the training data that is\n", "required for predicting the test data could be passed through a fixed\n", "length parameter vector. This is what this independence assumption\n", "implies. If we consider that the model will also be acquiring new data\n", "at run time, then there is the question of how to update the parameter\n", "vector in a consistent manner, accounting for new information, e.g. new\n", "clinical results in the case of personalized health.\n", "\n", "Conversely, a general assumption about independence across *features*\n", "would lead to models which *don’t* exhibit *Kolmogorov consistency*. In\n", "these models the dimensionality of the test data, $\mathbf{ y}^*$,\n", "denoted by $n^*$, would have to be fixed and each $y^*_i$ would require\n", "marginalization. So, the nature of the test data would need to be known\n", "at model *design* time." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Parametric Bottleneck\n", "---------------------\n", "\n", "In practice Bayesian methods suggest placing a prior over\n", "$\boldsymbol{\theta}$ and using the posterior,\n", "$p(\boldsymbol{\theta}|\mathbf{ y})$ for making predictions. If we have a\n", "model that obeys Kolmogorov consistency and is sophisticated enough to\n", "represent the behavior of a very complex dataset, it may well require a\n", "large number of parameters, just like those deep learning models. One\n", "way of seeing the requirement for a large number of parameters is to\n", "look at how we are storing information from the training data to pass to\n", "the test data.\n", "
The sum of all our knowledge about the training data is\n", "stored in the conditional distribution of the parameters given the\n", "training data, the Bayesian *posterior* distribution,\n", "$p(\boldsymbol{ \theta}| \mathbf{ y}).$\n", "\n", "A key design-time problem is the *parametric bottleneck*. If we choose\n", "the number of parameters at design time, but the system turns out to be\n", "more complicated than we expected, we need to design a new model to deal\n", "with this complexity. The communication between the training data and\n", "the test data is like an information channel. This training-to-test (TT) channel has a\n", "bandwidth that is restricted by our choice of the dimensionality of\n", "$\boldsymbol{\theta}$ at *design* time. This seems foolish. It is the\n", "bonds of this constraint that deep learning models are breaking by being\n", "so over-parameterized.\n", "\n", "To deal with this challenge in a probabilistic model, we should allow\n", "for a communication channel that is very large, or potentially infinite.\n", "In mathematical terms this implies we should look at nonparametric\n", "approaches.\n", "\n", "If highly overparameterized models are the solution for deep learning,\n", "then extending to nonparametrics could be the solution to retaining the\n", "necessary sense of indeterminacy that is required to cope with very\n", "complex systems when we have seen relatively little data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The Nonparametric Challenge\n", "---------------------------\n", "\n", "We have argued that we want models that are unconstrained, at design\n", "time, by a fixed bandwidth for the communication between the training\n", "data, $\mathbf{ y}$, and the test data, $\mathbf{ y}^*$, and that the\n", "answer is to be nonparametric. By nonparametric we are proposing using\n", "classes of models for which the conditional distribution,\n", "$p(\mathbf{ y}^*|\mathbf{ y})$ is not decomposable into the expectation\n", "of $p(\mathbf{ y}^*|\boldsymbol{ \theta})$ under the posterior\n", "distribution of the parameters, $p(\boldsymbol{ \theta}|\mathbf{ y})$\n", "for any fixed length parameter vector $\boldsymbol{ \theta}$. We don’t\n", "want to impose such a strong constraint on our model at *design time*.\n", "Our model may be required to be operational for many years and the true\n", "complexity of the system being modeled may not even be well understood\n", "at *design time*. We must turn to paradigms that allow us to be\n", "adaptable at *run time*. Nonparametrics provides just such a paradigm,\n", "because the effective parameter vector increases in size as we observe more\n", "data. This seems ideal, but it also presents a problem.\n", "\n", "Human beings, despite our large, interconnected brains, only have finite\n", "storage. It is estimated that we have between 100 and 1000 trillion\n", "synapses in our brains. Similarly for digital computers, even the GPT-3\n", "model is restricted to 175 billion parameters. So, we need to assume\n", "that we can only store a finite number of things about the data\n", "$\mathbf{ y}$. This seems to push us back towards parametric models.\n", "Here, though, we choose to go a different way. We choose to introduce a\n", "set of auxiliary variables, $\mathbf{ u}$, which are $m$ in length.\n", "Rather than representing the nonparametric density directly, we choose\n", "to focus on storing information about $\mathbf{ u}$.\n", "
By storing\n", "information about these variables, rather than storing all the data\n", "$\mathbf{ y}$ we hope to get around this problem. In order for us to be\n", "nonparametric about our predictions for $\mathbf{ y}^*$ we must condition\n", "on all the data, $\mathbf{ y}$. We can no longer store an\n", "intermediate distribution to represent our sum knowledge,\n", "$p(\boldsymbol{ \theta}|\mathbf{ y})$. Such an intermediate distribution\n", "is a finite dimensional object, and nonparametrics implies that we\n", "cannot store all the information in a finite dimensional distribution.\n", "This presents a problem for real systems in practice. We are now faced\n", "with a compromise: how can we have a distribution which is flexible\n", "enough to respond at *run time* to unforeseen complexity in the training\n", "data, yet simultaneously doesn’t require unbounded storage to retain\n", "all the information in the training data? We will now introduce a\n", "perspective on variational inference that will allow us to retain the\n", "advantages of both worlds. We will construct a parametric approximation\n", "to the true nonparametric conditional distribution. But, importantly,\n", "whilst this parametric approximation will suffer from the constraints on\n", "the bandwidth of the TT channel that drove us to nonparametric models in\n", "the first place, we will be able to change the number of parameters at\n", "*run time* not simply at design time." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The Multivariate Gaussian: Closure Under Marginalization\n", "\n", "Being closed under marginalization is a remarkable property of our old\n", "friend the multivariate Gaussian distribution (old friends often have\n", "remarkable properties that we often take for granted, I think this is\n", "particularly true for the multivariate Gaussian). In particular, if we\n", "consider a joint distribution across $p(\mathbf{ y}, \mathbf{ y}^*)$,\n", "then the covariance matrix of the marginal distribution for the subset\n", "of variables, $\mathbf{ y}$, is unaffected by the length of\n", "$\mathbf{ y}^*$. Taking this to its logical conclusion, if the length of\n", "the data, $\mathbf{ y}$, is $n=2$, then the covariance\n", "of $\mathbf{ y}$, as defined by $\mathbf{K}$, is only a $2\times 2$\n", "matrix, and it can only depend on the indices of the two data points in\n", "$\mathbf{ y}$. Since this covariance matrix must remain the same for any\n", "two values *regardless* of the length of $\mathbf{ y}$ and\n", "$\mathbf{ y}^*$, the value of the elements of this covariance must\n", "depend only on the two indices associated with $\mathbf{ y}$.\n", "\n", "Since the covariance matrix is specified pairwise, this implies that the\n", "covariance matrix must be dependent only on the index of the two\n", "observations $y_i$ and $y_j$ for which the covariance is being computed.\n", "In general, we can also think of this index as being infinite: it could\n", "be a spatial or temporal location, so that each $y_i$ is now indexed by a point\n", "on the real line, and the dimensionality of $\mathbf{ y}^*$ is\n", "irrelevant. Prediction consists of conditioning the joint density on\n", "$\mathbf{ y}^*$. So, for any new value of $\mathbf{ y}^*$, given its\n", "index we compute $p(\mathbf{ y}^* | \mathbf{ y})$."
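, "\n", "A minimal sketch, assuming a zero mean and an exponentiated quadratic covariance over a one-dimensional index (the kernel and the numbers here are illustrative assumptions only), checks this closure property numerically and forms the conditional $p(\mathbf{ y}^*|\mathbf{ y})$.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "def k_exp_quad(t, tprime, lengthscale=1.0):\n", "    # Exponentiated quadratic covariance over a one dimensional index.\n", "    return np.exp(-0.5*(t - tprime)**2/lengthscale**2)\n", "\n", "def cov(index):\n", "    # Build the covariance pairwise from the index alone (plus jitter).\n", "    return k_exp_quad(index[:, None], index[None, :]) + 1e-8*np.eye(len(index))\n", "\n", "n, n_star = 2, 5                      # observed points and points yet to be seen\n", "index = np.arange(n + n_star, dtype=float)\n", "K_joint = cov(index)                  # covariance over y and y*\n", "K_marginal = cov(index[:n])           # covariance built over y alone\n", "\n", "# Closure under marginalization: the top-left block of the joint covariance\n", "# matches the covariance built from the first n indices, whatever n* is.\n", "print(np.allclose(K_joint[:n, :n], K_marginal))\n", "\n", "# Prediction: condition the joint Gaussian on the observed y.\n", "y = np.array([0.5, -0.3])\n", "K_yy = K_joint[:n, :n]\n", "K_sy = K_joint[n:, :n]\n", "K_ss = K_joint[n:, n:]\n", "mean_star = K_sy @ np.linalg.solve(K_yy, y)\n", "cov_star = K_ss - K_sy @ np.linalg.solve(K_yy, K_sy.T)\n", "print(mean_star.shape, cov_star.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because the covariance is built pairwise from the index, the marginal over $\mathbf{ y}$ never depends on $n^*$, which is the Kolmogorov consistency property discussed above."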
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Making Parameters Nonparametric\n", "-------------------------------\n", "\n", "We will start by introducing a set of variables, $\\mathbf{ u}$, that are\n", "finite dimensional. These variables will eventually be used to\n", "communicate information between the training and test data, i.e. across\n", "the TT channel. $$\n", "p(\\mathbf{ y}^*|\\mathbf{ y}) = \\int p(\\mathbf{ y}^*|\\mathbf{ u}) q(\\mathbf{ u}|\\mathbf{ y}) \\text{d}\\mathbf{ u}\n", "$$ where we have introduced a distribution over $\\mathbf{ u}$,\n", "$q(\\mathbf{ u}|\\mathbf{ y})$ which is not necessarily the true posterior\n", "distribution, indeed we typically derive it through a variational\n", "approximation (see e.g. Titsias (n.d.))." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import daft\n", "from matplotlib import rc\n", "\n", "rc(\"font\", **{'family':'sans-serif','sans-serif':['Helvetica']}, size=30)\n", "rc(\"text\", usetex=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pgm = daft.PGM(shape=[2, 2],\n", " origin=[0, 0], \n", " grid_unit=5, \n", " node_unit=1.9, \n", " observed_style='shaded',\n", " line_width=3)\n", "\n", "pgm.add_node(daft.Node(\"y\", r\"$\\dataVector$\", 0.5, 0.5, fixed=False, observed=True))\n", "pgm.add_node(daft.Node(\"u\", r\"$\\inducingVector$\", 0.5, 1.5, fixed=False))\n", "pgm.add_edge(\"u\", \"y\", directed=False)\n", "\n", "pgm.render().figure.savefig(\"./ml/u-to-y.svg\", transparent=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: Augmenting the variable space with a set of latent *inducing\n", "vectors*. The graphical model represents the factorization of the\n", "distribution into the form\n", "$\\int p(\\mathbf{ y}|\\mathbf{ u})p(\\mathbf{ u})\\text{d}\\mathbf{ u}$\n", "\n", "In Figure we have augmented our simple graphical model augmented with a\n", "vector $\\mathbf{ u}$ which we refer to as inducing variables. Note that\n", "the model is still totally general because $p(\\mathbf{ y}, \\mathbf{ u})$\n", "is an augmented variable model and the original $p(\\mathbf{ y})$ is\n", "easily recovered by simple marginalization of $\\mathbf{ u}$. So we\n", "haven’t yet made any assumptions about our data, our model is still\n", "correct, but useless.\n", "\n", "The model we’ve introduced now looks somewhat like the parametric model\n", "we argued against in the previous section, $$\n", "p(\\mathbf{ y}^* | \\mathbf{ y})=\\int\n", "p(\\mathbf{ y}^*|\\boldsymbol{ \\theta})p(\\boldsymbol{ \\theta}|\\mathbf{ y})\\text{d}\\boldsymbol{ \\theta}.\n", "$$ What’s going on here? Is there going to be some kind of\n", "parametric/nonparametric 3 card trick where with sleight of hand we are\n", "trying to introduce a parametric model? Well clearly not, because I’ve\n", "just given the game away. But I believe there are some important\n", "differences to the traditional approach for parameterizing a model.\n", "Philosophically, our variables $\\mathbf{ u}$ are variables that augment\n", "the model. We have not yet made any assumptions by introducing them.\n", "Normally, parameterization of the model instantiates assumptions, but\n", "this is not happening here. In particular note that we have *not*\n", "assumed that the training data factorize given the inducing variables.\n", "Secondly, have not specified the dimensionality of $\\mathbf{ u}$\n", "(i.e. 
the size of the TT channel) at *design* time. We are going to\n", "allow it to change at *run* time. We will do this by ensuring that the\n", "inducing variables also obey Kolmogorov consistency. In particular, we build a joint density as follows, $$\n", "p(\mathbf{ y}, \mathbf{ u}) = \int p(\mathbf{ y}, \mathbf{ y}^*, \mathbf{ u}, \mathbf{ u}^*) \text{d}\mathbf{ y}^* \text{d}\mathbf{ u}^*,\n", "$$ where $\mathbf{ u}$\n", "are the inducing variables we might choose to instantiate at any\n", "given time (of dimensionality $m$) and $\mathbf{ u}^*$ is the $m^*$\n", "dimensional pool of future inducing variables we have *not yet* chosen\n", "to instantiate (where $m^*$ could be infinite). Our new Kolmogorov\n", "consistency condition requires that $$ p(\mathbf{ y},\n", "\mathbf{ u}|m^*,n^*) = p(\mathbf{ y},\n", "\mathbf{ u}). $$ The size of the TT channel doesn’t need to be predetermined at *design time*\n", "because we allow for the presence of a (potentially infinite) number of\n", "inducing variables $\mathbf{ u}^*$ that we may wish to *later*\n", "instantiate to improve the quality of our model. In other words, it is\n", "very similar to the parametric approach, but now we have access to a\n", "future pool of additional parameters, $\mathbf{ u}^*$, that we can call\n", "upon to increase the bandwidth of the TT channel as appropriate. In\n", "parametric modelling, calling upon such parameters has a significant\n", "effect on the likelihood of the model, but here these variables are\n", "auxiliary variables that will *not* affect the likelihood of the model.\n", "They merely affect our ability to approximate the true bandwidth of the\n", "TT channel. The quality of this approximation can be varied at run time.\n", "It is not necessary to specify it at design time. This gives us the\n", "flexibility we need in terms of modeling, whilst keeping computational\n", "complexity and memory demands manageable and appropriate to the task at\n", "hand." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import daft\n", "from matplotlib import rc\n", "\n", "rc(\"font\", **{'family':'sans-serif','sans-serif':['Helvetica']}, size=30)\n", "rc(\"text\", usetex=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pgm = daft.PGM(shape=[2, 2],\n", " origin=[0, 0], \n", " grid_unit=5, \n", " node_unit=1.9, \n", " observed_style='shaded',\n", " line_width=3)\n", "\n", "pgm.add_node(daft.Node(\"y\", r\"$\dataVector$\", 0.5, 0.5, fixed=False, observed=True))\n", "pgm.add_node(daft.Node(\"u\", r\"$\inducingVector$\", 0.5, 1.5, fixed=False))\n", "pgm.add_node(daft.Node(\"ystar\", r\"$\dataVector^*$\", 1.5, 0.5, fixed=False))\n", "pgm.add_node(daft.Node(\"ustar\", r\"$\inducingVector^*$\", 1.5, 1.5, fixed=False))\n", "\n", "pgm.add_edge(\"u\", \"y\", directed=False)\n", "pgm.add_edge(\"ustar\", \"y\", directed=False)\n", "pgm.add_edge(\"u\", \"ustar\", directed=False)\n", "pgm.add_edge(\"ystar\", \"y\", directed=False)\n", "pgm.add_edge(\"ustar\", \"ystar\", directed=False)\n", "pgm.add_edge(\"u\", \"ystar\", directed=False)\n", "\n", "pgm.render().figure.savefig(\"./ml/u-to-y-ustar-to-y.svg\", transparent=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: We can also augment the graphical model with data that is\n", "only seen at ‘run time’, or ‘test data’. In this case we use the\n", "superscript of $*$ to indicate the test data.\n", "
This graph represents the\n", "interaction between data we’ve seen, $\mathbf{ y}$, and data we’ve yet\n", "to see, $\mathbf{ y}^*$, as well as the augmented variables $\mathbf{ u}$\n", "and $\mathbf{ u}^*$,\n", "$p(\mathbf{ y}) = \int p(\mathbf{ y}, \mathbf{ y}^*, \mathbf{ u}, \mathbf{ u}^*) \text{d}\mathbf{ y}^* \text{d}\mathbf{ u}\text{d}\mathbf{ u}^*$.\n", "As the fully connected graph implies we are making no assumptions about\n", "the data.\n", "\n", "In Figure we add in the test data and the inducing variables we have not yet\n", "chosen to instantiate. Here we see that we still haven’t\n", "defined any structure in the graph, and therefore we have not yet made\n", "any assumptions about our data. Not shown in the graph is the additional\n", "assumption that whilst $\mathbf{ y}$ has $n$ dimensions and\n", "$\mathbf{ u}$ has $m$ dimensions, $\mathbf{ y}^*$ and $\mathbf{ u}^*$\n", "are potentially infinite dimensional." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Fundamental Variables\n", "\n", "To focus our model further, we assume that our observations,\n", "$\mathbf{ y}$, are derived from some underlying fundamental variables,\n", "$\mathbf{ f}$, through simple factorized likelihoods. The idea of the\n", "fundamental variables is that they are sufficient to describe the world\n", "around us, but we might not be able to observe them directly. In\n", "particular we might observe relatively simple corruptions of the\n", "fundamental variables such as independent addition of noise, or\n", "thresholding. We might observe something relative about two fundamental\n", "variables. For example, if we took $f_{12,345}$ to be the height of Tom\n", "Cruise and $f_{23,789}$ to be the height of Penelope Cruz then we might\n", "take for an observation a binary value indicating the relative heights,\n", "so $y_{72,394} = f_{12,345} < f_{23,789}$. The fundamental variable is\n", "an artificial construct, but it can prove to be a useful one. In\n", "particular we’d like to assume that the relationship between our\n", "observations, $\mathbf{ y}$, and the fundamental variables, $\mathbf{ f}$,\n", "might factorize in some way. In this framework we think of this\n", "relationship, given by $p(\mathbf{ y}|\mathbf{ f})$, as the *likelihood*.\n", "We can ensure that assuming the likelihood factorizes does not at all\n", "reduce the generality of our model, by forcing the distribution over the\n", "fundamentals, $p(\mathbf{ f})$, to also be Kolmogorov consistent. This\n", "ensures that in the case where the likelihood is fully factorized over\n", "$n$ the model is still general if we allow the factors of the likelihood\n", "to be Dirac delta functions suggesting that $y_i = f_i$. Since we haven’t\n", "yet specified any forms for the probability distributions this *is*\n", "allowed and therefore the formulation is still totally general.\n", "
$$\n", "p(\\mathbf{ y}|n^*) = \\int p(\\mathbf{ y}|\\mathbf{ f}) p(\\mathbf{ f}, \\mathbf{ f}^*)\\text{d}\\mathbf{ f}\\text{d}\\mathbf{ f}^*\n", "$$ and since we enforce Kolmogorov consistency, we have $$\n", "p(\\mathbf{ y}|n^*) = p(\\mathbf{ y}).\n", "$$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import daft\n", "from matplotlib import rc\n", "\n", "rc(\"font\", **{'family':'sans-serif','sans-serif':['Helvetica']}, size=30)\n", "rc(\"text\", usetex=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pgm = daft.PGM(shape=[2, 3],\n", " origin=[0, 0], \n", " grid_unit=5, \n", " node_unit=1.9, \n", " observed_style='shaded',\n", " line_width=3)\n", "\n", "pgm.add_node(daft.Node(\"y\", r\"$\\dataVector$\", 0.5, 0.5, fixed=False, observed=True))\n", "pgm.add_node(daft.Node(\"f\", r\"$\\mappingFunctionVector$\", 0.5, 1.5, fixed=False))\n", "pgm.add_node(daft.Node(\"u\", r\"$\\inducingVector$\", 0.5, 2.5, fixed=False))\n", "pgm.add_node(daft.Node(\"ustar\", r\"$\\inducingVector^*$\", 1.5, 2.5, fixed=False))\n", "\n", "pgm.add_edge(\"u\", \"f\", directed=False)\n", "pgm.add_edge(\"f\", \"y\")\n", "pgm.add_edge(\"ustar\", \"f\", directed=False)\n", "pgm.add_edge(\"u\", \"ustar\", directed=False)\n", "\n", "pgm.render().figure.savefig(\"./ml/u-to-f-to-y-ustar-to-f.svg\", transparent=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: We introduce the fundamental variable $\\mathbf{ f}$ which\n", "sits between $\\mathbf{ u}$ and $\\mathbf{ y}$.\n", "\n", "Now we assume some form of factorization for our data observations,\n", "$\\mathbf{ y}$, given the fundamental variables, $\\mathbf{ f}$, so that\n", "we have $$\n", "p(\\mathbf{ y}|\\mathbf{ f}) = \\prod_{i} p(\\mathbf{ y}^i| \\mathbf{ f}^i)\n", "$$ so that we have subsets of the data $\\mathbf{ y}^i$ which are\n", "dependent on subsets of the fundamental variables, $f$. For simplicity\n", "of notation we will assume a factorization across the entire data set,\n", "so each observation, $y_i$, has a single underlying fundamental\n", "variable, $f_i$, although more complex factorizations are also possible\n", "and can be considered within the analysis. 
$$\n", "p(\\mathbf{ y}|\\mathbf{ f}) = \\prod_{i=1}^np(y_i|f_i)\n", "$$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import daft\n", "\n", "from matplotlib import rc\n", "\n", "rc(\"font\", **{'family':'sans-serif','sans-serif':['Helvetica']}, size=30)\n", "rc(\"text\", usetex=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pgm = daft.PGM(shape=[2, 3],\n", " origin=[0, 0], \n", " grid_unit=5, \n", " node_unit=1.9, \n", " observed_style='shaded',\n", " line_width=3)\n", "reduce_alpha={\"alpha\": 0.3}\n", "pgm.add_node(daft.Node(\"y\", r\"$\\dataScalar_i$\", 0.5, 0.5, fixed=False, observed=True))\n", "pgm.add_node(daft.Node(\"f\", r\"$\\mappingFunction_i$\", 0.5, 1.5, fixed=False))\n", "pgm.add_node(daft.Node(\"u\", r\"$\\inducingVector$\", 0.5, 2.5, fixed=False))\n", "pgm.add_node(daft.Node(\"ustar\", r\"$\\inducingVector^*$\", 1.5, 1.5, fixed=False, plot_params=reduce_alpha))\n", "pgm.add_plate([0.125, 0.125, 0.75, 1.75], label=r\"$i=1\\dots N$\", fontsize=18)\n", "\n", "pgm.add_edge(\"u\", \"f\", directed=False)\n", "pgm.add_edge(\"f\", \"y\")\n", "pgm.add_edge(\"ustar\", \"f\", directed=False, plot_params=reduce_alpha)\n", "pgm.add_edge(\"u\", \"ustar\", directed=False, plot_params=reduce_alpha)\n", "\n", "pgm.render().figure.savefig(\"./ml/u-to-f_i-to-y_i-ustar-to-f.svg\", transparent=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: The relationship between $\\mathbf{ f}$ and $\\mathbf{ y}$ is\n", "assumed to be factorized, which we indicate here using plate notation,\n", "$p(\\mathbf{ y}) = \\int \\prod_{i=1}^np(y_i|f_i) p(\\mathbf{ f}| \\mathbf{ u}, \\mathbf{ u}^*) p(\\mathbf{ u}, \\mathbf{ u}^*)\\text{d}\\mathbf{ u}\\text{d}\\mathbf{ u}^*$.\n", "\n", "We now decompose, without loss of generality, our joint distribution\n", "over inducing variables and fundamentals into the following parts $$\n", "p(\\mathbf{ u}, \\mathbf{ f}) = p(\\mathbf{ f}|\\mathbf{ u})p(\\mathbf{ u}),\n", "$$ where we assume that we have marginalized $\\mathbf{ f}^*$ and\n", "$\\mathbf{ u}^*$." 
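, "\n", "As a small illustrative sketch, assuming some made-up values for the fundamentals, the same vector $\mathbf{ f}$ can generate quite different observations through different factorized likelihoods: independent Gaussian noise, thresholding, or a comparison between two fundamentals in the spirit of the height example above.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "rng = np.random.default_rng(0)\n", "f = rng.standard_normal(5)               # fundamental variables (illustrative values)\n", "\n", "# Independent Gaussian noise likelihood: y_i = f_i + epsilon_i.\n", "y_noise = f + 0.1*rng.standard_normal(5)\n", "\n", "# Thresholding likelihood: we only observe the sign of each fundamental.\n", "y_sign = (f > 0).astype(int)\n", "\n", "# Relative observation: a single binary value comparing two fundamentals.\n", "y_compare = int(f[0] < f[1])\n", "\n", "print(y_noise, y_sign, y_compare)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In each case the likelihood factorizes across the observations, while any structure between data points is left to the distribution over the fundamentals, $p(\mathbf{ f})$."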
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import daft\n", "from matplotlib import rc\n", "\n", "rc(\"font\", **{'family':'sans-serif','sans-serif':['Helvetica']}, size=30)\n", "rc(\"text\", usetex=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pgm = daft.PGM(shape=[2, 3],\n", " origin=[0, 0], \n", " grid_unit=5, \n", " node_unit=1.9, \n", " observed_style='shaded',\n", " line_width=3)\n", "reduce_alpha={\"alpha\": 0.3}\n", "pgm.add_node(daft.Node(\"y\", r\"$\\dataScalar_i$\", 0.5, 0.5, fixed=False, observed=True))\n", "pgm.add_node(daft.Node(\"f\", r\"$\\mappingFunction_i$\", 0.5, 1.5, fixed=False))\n", "pgm.add_node(daft.Node(\"u\", r\"$\\inducingVector$\", 0.5, 2.5, fixed=False))\n", "pgm.add_node(daft.Node(\"ustar\", r\"$\\inducingVector^*$\", 1.5, 1.5, fixed=False, plot_params=reduce_alpha))\n", "pgm.add_plate([0.125, 0.125, 0.75, 1.75], label=r\"$i=1\\dots N$\", fontsize=18)\n", "\n", "pgm.add_edge(\"f\", \"y\")\n", "pgm.add_edge(\"u\", \"f\")\n", "pgm.add_edge(\"ustar\", \"f\", plot_params=reduce_alpha)\n", "\n", "pgm.render().figure.savefig(\"./ml/u-to-f_i-to-y_i.svg\", transparent=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: The model with future inducing points marginalized\n", "$p(\\mathbf{ y}) = \\int \\prod_{i=1}^np(y_i|f_i) p(\\mathbf{ f}| \\mathbf{ u}) p(\\mathbf{ u})\\text{d}\\mathbf{ u}$." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import daft\n", "from matplotlib import rc\n", "\n", "rc(\"font\", **{'family':'sans-serif','sans-serif':['Helvetica']}, size=30)\n", "rc(\"text\", usetex=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pgm = daft.PGM(shape=[2, 3],\n", " origin=[0, 0], \n", " grid_unit=5, \n", " node_unit=1.9, \n", " observed_style='shaded',\n", " line_width=3)\n", "reduce_alpha={\"alpha\": 0.3}\n", "pgm.add_node(daft.Node(\"y\", r\"$\\dataScalar_i$\", 0.5, 0.5, fixed=False, observed=True))\n", "pgm.add_node(daft.Node(\"f\", r\"$\\mappingFunction_i$\", 0.5, 1.5, fixed=False))\n", "pgm.add_node(daft.Node(\"u\", r\"$\\inducingVector$\", 0.5, 2.5, fixed=True))\n", "pgm.add_node(daft.Node(\"ustar\", r\"$\\inducingVector^*$\", 1.5, 1.5, fixed=True, plot_params=reduce_alpha))\n", "pgm.add_plate([0.125, 0.125, 0.75, 1.75], label=r\"$i=1\\dots N$\", fontsize=18)\n", "\n", "pgm.add_edge(\"f\", \"y\")\n", "pgm.add_edge(\"u\", \"f\")\n", "pgm.add_edge(\"ustar\", \"f\", plot_params=reduce_alpha)\n", "\n", "pgm.render().figure.savefig(\"./ml/given-u-to-f_i-to-y_i.svg\", transparent=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: The model conditioned on the inducing variables\n", "$p(\\mathbf{ y}|\\mathbf{ u}, \\mathbf{ u}^*) = \\int\\prod_{i=1}^np(y_i|f_i) p(\\mathbf{ f}|\\mathbf{ u}, \\mathbf{ u}^*)\\text{d}\\mathbf{ f}$." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import daft\n", "from matplotlib import rc\n", "\n", "rc(\"font\", **{'family':'sans-serif','sans-serif':['Helvetica']}, size=30)\n", "rc(\"text\", usetex=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pgm = daft.PGM(shape=[2, 3],\n", " origin=[0, 0], \n", " grid_unit=5, \n", " node_unit=1.9, \n", " observed_style='shaded',\n", " line_width=3)\n", "reduce_alpha={\"alpha\": 0.3}\n", "pgm.add_node(daft.Node(\"y\", r\"$\dataScalar_i$\", 0.5, 0.5, fixed=False, observed=True))\n", "pgm.add_node(daft.Node(\"f\", r\"$\mappingFunction_i$\", 0.5, 1.5, fixed=False))\n", "pgm.add_node(daft.Node(\"theta\", r\"$\parameterVector$\", 0.5, 2.5, fixed=True))\n", "pgm.add_plate([0.125, 0.125, 0.75, 1.75], label=r\"$i=1\dots N$\", fontsize=18)\n", "\n", "pgm.add_edge(\"f\", \"y\")\n", "pgm.add_edge(\"theta\", \"f\")\n", "\n", "pgm.render().figure.savefig(\"./ml/given-theta-to-f_i-to-y_i.svg\", transparent=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: The model as a classical parametric model with independence\n", "across data points indexed by $i$ that is conditional on parameters\n", "$\boldsymbol{ \theta}$,\n", "$p(\mathbf{ y}|\boldsymbol{ \theta}) = \int\prod_{i=1}^np(y_i|f_i) p(\mathbf{ f}|\boldsymbol{ \theta})\text{d}\mathbf{ f}$.\n", "The model is graphically the same as the nonparametric variant but here\n", "the dimension of $\boldsymbol{ \theta}$ has to be fixed for Kolmogorov\n", "consistency, whereas in the inducing vector variant the dimension of\n", "$\mathbf{ u}$ can vary.\n", "\n", "In Figure we visualise the graphical model of a classical parametric\n", "form. This model is very general; the deep neural network models for\n", "supervised learning tasks can be seen as variants of this model with\n", "very large dimensions for the parameter vector $\boldsymbol{ \theta}$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Instantiating the Model\n", "-----------------------\n", "\n", "So far, we haven’t made *any* assumptions about the data in our model,\n", "other than a factorization assumption between the fundamental variables\n", "and the observations, $\mathbf{ y}$. Even this assumption does not\n", "affect the generality of the model decomposition, because in the worst\n", "case the likelihood $p(\mathbf{ y}|\mathbf{ f})$ could be a Dirac\n", "$\delta$ function, implying $\mathbf{ y}=\mathbf{ f}$ and allowing us to\n", "include complex interrelations between $\mathbf{ y}$ directly in\n", "$p(\mathbf{ f})$. We have specified that $p(\mathbf{ f}, \mathbf{ u})$\n", "should be Kolmogorov consistent with $\mathbf{ f}^*$ and $\mathbf{ u}^*$\n", "being marginalized, and we have argued that nonparametric models are\n", "important in practice to ensure that all the information in our training\n", "data can be passed to the test data.\n", "\n", "For a model to be useful, we need to specify relationships between our\n", "data variables. Of course, this is the point at which a model also\n", "typically becomes wrong. Even if our model isn’t correct, it\n", "should at least be a useful abstraction of the system."
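, "\n", "As a toy sketch of what specifying such relationships can look like (assuming an exponentiated quadratic covariance over a one-dimensional index, anticipating the choice the next section develops), we can relate the fundamental variables through a covariance built from their indices and draw them jointly.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "rng = np.random.default_rng(1)\n", "index = np.linspace(0, 10, 50)        # the index (e.g. time) of each fundamental variable\n", "\n", "# Relationships are specified through a covariance built pairwise from the index.\n", "K = np.exp(-0.5*(index[:, None] - index[None, :])**2) + 1e-8*np.eye(50)\n", "\n", "# Instantiating the model: the fundamentals are now smoothly related through K.\n", "f = rng.multivariate_normal(np.zeros(50), K)\n", "print(f.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The moment we commit to a particular covariance we have made an assumption: the model can now be wrong, but it also becomes useful."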
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Gaussian Processes\n", "------------------\n", "\n", "A flexible class of models that fulfils the constraints of being\n", "non-parametric and Kolmogorov consistent is Gaussian processes. A\n", "Gaussian process prior for our fundamental variables, $\\mathbf{ f}$\n", "assumes that they are jointly Gaussian distributed. Each data point,\n", "$f_i$, is is jointly distributed with each other data point $f_j$ as a\n", "multivariate Gaussian. The covariance of this Gaussian is a function of\n", "the indices of the two data, in this case $i$ and $j$. But these indices\n", "are not just restricted to discrete values. The index can be a\n", "continuous value such as time, $t$, or spatial location, $\\mathbf{ x}$.\n", "The words index and indicate have a common etymology. This is\n", "appropriate because the index indicates the provenance of the data. In\n", "effect we have multivariate indices to account for the full provenance,\n", "so that our observations of the world are given as a function of, for\n", "example, the when, the where and the what. “When” is given by time,\n", "“where” is given by spatial location and “what” is given by a\n", "(potentially discrete) index indicating the further provenance of the\n", "data. To define a joint Gaussian density, we need to define the mean of\n", "the density and the covariance. Both this mean and the covariance also\n", "need to be indexed by the when, the where and the what." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Augmenting with Inducing Variables in Gaussian Processes\n", "--------------------------------------------------------\n", "\n", "To define our model, we need to describe the relationship between the\n", "fundamental variables, $\\mathbf{ f}$, and the inducing variables,\n", "$\\mathbf{ u}$. This needs to be done in such a way that the inducing\n", "variables are also Kolmogorov consistent. A straightforward way of\n", "achieving this is through a joint Gaussian process model over the\n", "inducing variables and the data mapping variables, so in other words we\n", "define a Gaussian process prior over $$\n", "\\begin{bmatrix}\n", "\\mathbf{ f}\\\\ \n", "\\mathbf{ u}\n", "\\end{bmatrix} \\sim \\mathcal{N}\\left(\\mathbf{m},\\mathbf{K}\\right)\n", "$$ where the covariance matrix has a block form, $$\n", "\\mathbf{K}= \\begin{bmatrix} \\mathbf{K}_{\\mathbf{ f}\\mathbf{ f}} & \\mathbf{K}_{\\mathbf{ f}\\mathbf{ u}} \\\\ \\mathbf{K}_{\\mathbf{ u}\\mathbf{ f}} & \\mathbf{K}_{\\mathbf{ u}\\mathbf{ u}}\\end{bmatrix}\n", "$$ and $\\mathbf{K}_{\\mathbf{ f}\\mathbf{ f}}$ gives the covariance\n", "between the fundamentals vector, $\\mathbf{K}_{\\mathbf{ u}\\mathbf{ u}}$\n", "gives the covariance matrix between the inducing variables and\n", "$\\mathbf{K}_{\\mathbf{ u}\\mathbf{ f}} = \\mathbf{K}_{\\mathbf{ f}\\mathbf{ u}}^\\top$\n", "gives the cross covariance between the inducing variables, $\\mathbf{ u}$\n", "and the mapping function variables, $\\mathbf{ f}$.\n", "\n", "The elements of $\\mathbf{K}_{\\mathbf{ f}\\mathbf{ f}}$ will be computed\n", "through a covariance function (or kernel) given by\n", "$k_f(\\mathbf{ x}, \\mathbf{ x}^\\prime)$ where $\\mathbf{ x}$ is a vector\n", "representing the *provenance* of the data, which as we discussed earlier\n", "could involve a spatial location, a time, or something about the nature\n", "of the data. 
In a Gaussian process most of the modelling decisions take\n", "place in the construction of $k_f(\\cdot)$.\n", "\n", "%" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The Mean Function\n", "-----------------\n", "\n", "The mean of the process is given by a vector $\\mathbf{m}$ which is\n", "derived from a mean function $m(\\mathbf{ x})$. There are many occasions\n", "when it is useful to include a mean function, but normally the mean\n", "function will have a parametric form,\n", "$m(\\mathbf{ x};\\boldsymbol{ \\theta})$, and be subject (in itself) to the\n", "same constraints that a standard parametric model has. Indeed, if we\n", "choose to model a function as a parametric form plus Gaussian noise, we\n", "can recast such a model as a simple Gaussian process with a covariance\n", "function $k_f(\\mathbf{ x}_i,\\mathbf{ x}_j) = \\sigma^2 \\delta_{i, j}$,\n", "where $\\delta_{i, j}$ is the *Kronecker* delta-function and a mean\n", "function that is given by the standard parametric form. In this case we\n", "see that the covariance function is mopping up the *residuals* that are\n", "not captured by the mean function. If we genuinely were interested in\n", "the form of a parametric mean function, as we often are in statistics,\n", "where the mean function may include a set of covariates and potential\n", "effects, often denoted by $$\n", "m(\\mathbf{ x}) = \\boldsymbol{\\beta}^\\top \\mathbf{ x},\n", "$$ where here the provenance of the data is known as the covariates, and\n", "the variable associated with $\\mathbf{ y}$ is typically known as a\n", "*response* variable. In this case the particular influence of each of\n", "the covariates is being encoded in a vector $\\boldsymbol{\\beta}$. To a\n", "statistician, the relative values of the elements of this vector are\n", "often important in making a judgement about the influence of the\n", "covariates. For example, in disease modelling the mean function might be\n", "used in a *generalized* linear model through a link function to\n", "represent a rate or risk of disease (e.g. Saul et al. (n.d.)). The\n", "covariates should *co-vary* (or move together) with the response\n", "variable. Appropriate covariates for malaria incidence rate might\n", "include known influencers of the disease. For example, if we are dealing\n", "with *malaria* then we might expect disease rates to be influenced by\n", "altitude, average temperature, average rainfall, local distribution of\n", "prophylactic measures (such as nets) etc. The covariance of the Gaussian\n", "process then has the role of taking care of the *residual* variance in\n", "the data: the data that is not explained by the mean function, i.e. the\n", "variance that cannot be explained by the parametric model. In a disease\n", "mapping model, it makes sense to assume that these residuals may not be\n", "independent. An underestimate of disease at one spatial location, may\n", "imply an underestimate of disease rates at a nearby location. The\n", "mismatch between the observed disease rate and that predicted by\n", "modeling the relationship with the covariates through the mean function\n", "is then given by the covariance function.\n", "\n", "The machine learner’s focus on prediction means that within that\n", "community the mean function is more often removed, with all the\n", "predictive power being incorporated within the Gaussian process\n", "covariance." 
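, "\n", "A minimal numpy sketch, assuming a zero mean function and an exponentiated quadratic $k_f(\cdot)$ over illustrative index values, shows the conditional $p(\mathbf{ f}|\mathbf{ u})$ implied by the block covariance defined above.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "def k_f(x, xprime, lengthscale=1.0):\n", "    # Exponentiated quadratic covariance over the data provenance x.\n", "    return np.exp(-0.5*(x[:, None] - xprime[None, :])**2/lengthscale**2)\n", "\n", "x = np.linspace(0, 10, 40)            # provenance of the fundamentals f\n", "z = np.linspace(0, 10, 5)             # provenance of the inducing variables u\n", "\n", "K_ff = k_f(x, x)\n", "K_fu = k_f(x, z)\n", "K_uu = k_f(z, z) + 1e-8*np.eye(len(z))\n", "\n", "# For the joint zero-mean Gaussian with block covariance [[K_ff, K_fu], [K_uf, K_uu]]\n", "# the conditional p(f|u) has mean K_fu K_uu^{-1} u and covariance K_ff - K_fu K_uu^{-1} K_uf.\n", "u = np.random.default_rng(2).multivariate_normal(np.zeros(len(z)), K_uu)\n", "A = np.linalg.solve(K_uu, K_fu.T).T   # K_fu K_uu^{-1}\n", "mean_f_given_u = A @ u\n", "cov_f_given_u = K_ff - A @ K_fu.T\n", "print(mean_f_given_u.shape, cov_f_given_u.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The number of inducing inputs in z is a run-time choice: augmenting z increases the bandwidth of the TT channel without changing the underlying model."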
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import GPy\n", "import pods" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = pods.datasets.mauna_loa()\n", "kern = GPy.kern.Linear(1) + GPy.kern.RBF(1) + GPy.kern.Bias(1)\n", "model = GPy.models.GPRegression(data['X'], data['Y'], kern)\n", "#model.optimize()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Plot the Mauna Loa model fit.\n", "model.plot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So we *could* interpret Gaussian process models as approaches to dealing\n", "with residuals." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Modelling $\mathbf{ f}$\n", "\n", "In conclusion, for a nonparametric framework, our model for\n", "$\mathbf{ f}$ is predominantly in the covariance function\n", "$\mathbf{K}_{\mathbf{ f}\mathbf{ f}}$. This is our data model. We are\n", "assuming the inducing variables are drawn from a joint Gaussian process\n", "with $\mathbf{ f}$. The cross covariance between $\mathbf{ u}$ and\n", "$\mathbf{ f}$ is given by $\mathbf{K}_{\mathbf{ f}\mathbf{ u}}$. This\n", "gives the relationship between the function and the inducing variables.\n", "There are a range of ways in which the inducing variables can interrelate\n", "with the fundamental variables, $\mathbf{ f}$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Illustrative Example\n", "\n", "For this illustrative example, we’ll consider a simple regression\n", "problem." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Back to a Simple Regression Problem\n", "-----------------------------------\n", "\n", "Here we set up a simple one-dimensional regression problem. The input\n", "locations, $\mathbf{X}$, are in two separate clusters. The response\n", "variable, $\mathbf{ y}$, is sampled from a Gaussian process with an\n", "exponentiated quadratic covariance." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%pip install gpy" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import GPy\n", "from scipy import optimize\n", "np.random.seed(101)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "N = 50\n", "noise_var = 0.01\n", "X = np.zeros((50, 1))\n", "X[:25, :] = np.linspace(0,3,25)[:,np.newaxis] # First cluster of inputs/covariates\n", "X[25:, :] = np.linspace(7,10,25)[:,np.newaxis] # Second cluster of inputs/covariates\n", "\n", "xlim = (-2,12)\n", "ylim = (-4, 0)\n", "\n", "# Sample response variables from a Gaussian process with exponentiated quadratic covariance.\n", "k = GPy.kern.RBF(1)\n", "y = np.random.multivariate_normal(np.zeros(N),k.K(X)+np.eye(N)*np.sqrt(noise_var)).reshape(-1,1)\n", "scale = np.sqrt(np.var(y))\n", "offset = np.mean(y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, we perform a full Gaussian process regression on the data. We\n", "create a GP model, `m_full`, and fit it to the data, plotting the\n", "resulting fit."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import urllib.request\n", "urllib.request.urlretrieve('https://raw.githubusercontent.com/lawrennd/talks/gh-pages/gp_tutorial.py','gp_tutorial.py')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "from gp_tutorial import ax_default, meanplot, gpplot" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def plot_model_output(model, output_dim=0, scale=1.0, offset=0.0, ax=None, xlabel='$x$', ylabel='$y$', fontsize=20, portion=0.2):\n", "    if ax is None:\n", "        fig, ax = plt.subplots(figsize=plot.big_figsize)\n", "    ax.plot(model.X.flatten(), model.Y[:, output_dim]*scale + offset, 'r.',markersize=10)\n", "    ax.set_xlabel(xlabel, fontsize=fontsize)\n", "    ax.set_ylabel(ylabel, fontsize=fontsize)\n", "    xt = plot.pred_range(model.X, portion=portion)\n", "    yt_mean, yt_var = model.predict(xt)\n", "    yt_mean = yt_mean*scale + offset\n", "    yt_var *= scale*scale\n", "    yt_sd=np.sqrt(yt_var)\n", "    if yt_sd.shape[1]>1:\n", "        yt_sd = yt_sd[:, output_dim]\n", "\n", "    _ = gpplot(xt.flatten(),\n", "               yt_mean[:, output_dim],\n", "               yt_mean[:, output_dim]-2*yt_sd.flatten(),\n", "               yt_mean[:, output_dim]+2*yt_sd.flatten(), \n", "               ax=ax, fillcol='#040404', edgecol='#101010')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "m_full = GPy.models.GPRegression(X,y)\n", "m_full.optimize() # Optimize parameters of covariance function" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "urllib.request.urlretrieve('https://raw.githubusercontent.com/lawrennd/talks/gh-pages/teaching_plots.py','teaching_plots.py')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "urllib.request.urlretrieve('https://raw.githubusercontent.com/lawrennd/talks/gh-pages/mlai.py','mlai.py')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import teaching_plots as plot\n", "import mlai" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=plot.big_wide_figsize)\n", "plot_model_output(m_full, scale=scale, offset=offset, ax=ax, xlabel='$x$', ylabel='$y$', fontsize=20, portion=0.2)\n", "ax.set_xlim(xlim)\n", "ax.set_ylim(ylim)\n", "mlai.write_figure(figure=fig,\n", " filename='./gp/sparse-demo-full-gp.svg', \n", " transparent=True, frameon=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: A full Gaussian process fit to the simulated data set.\n", "\n", "Now we set up the inducing variables, $\mathbf{ u}$. Each inducing\n", "variable has its own associated input index, $\mathbf{Z}$, which lives\n", "in the same space as $\mathbf{X}$. Here we are using the true covariance\n", "function parameters to generate the fit."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "kern = GPy.kern.RBF(1)\n", "Z = np.hstack(\n", " (np.linspace(2.5,4.,3),\n", " np.linspace(7,8.5,3)))[:,None]\n", "m = GPy.models.SparseGPRegression(X,y,kernel=kern,Z=Z)\n", "m.noise_var = noise_var\n", "m.inducing_inputs.constrain_fixed()\n", "#m.tie_params('.*variance')\n", "#m.ensure_default_constraints()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(m) # why is it not printing noise variance correctly?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=plot.big_wide_figsize)\n", "plot_model_output(m, scale=scale, offset=offset, ax=ax, xlabel='$x$', ylabel='$y$', fontsize=20, portion=0.2)\n", "ax.plot(m.Z, np.ones(m.Z.shape)*ylim[0], 'k^', markersize=30)\n", "ax.set_xlim(xlim)\n", "ax.set_ylim(ylim)\n", "mlai.write_figure(figure=fig,\n", " filename='./gp/sparse-demo-constrained-inducing-6-unlearned-gp.svg', \n", " transparent=True, frameon=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: Sparse Gaussian process with six constrained inducing\n", "variables and parameters learned." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "m.optimize()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=plot.big_wide_figsize)\n", "plot_model_output(m, scale=scale, offset=offset, ax=ax, xlabel='$x$', ylabel='$y$', fontsize=20, portion=0.2)\n", "ax.plot(m.Z, np.ones(m.Z.shape)*ylim[0], 'k^', markersize=30)\n", "ax.set_xlim(xlim)\n", "ax.set_ylim(ylim)\n", "mlai.write_figure(figure=fig,\n", " filename='./gp/sparse-demo-constrained-inducing-6-learned-gp.svg', \n", " transparent=True, frameon=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: Sparse Gaussian process with six constrained inducing\n", "variables and parameters learned." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(m)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "m.randomize()\n", "m.inducing_inputs.unconstrain()\n", "m.optimize()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=plot.big_wide_figsize)\n", "plot_model_output(m, scale=scale, offset=offset, ax=ax, xlabel='$x$', ylabel='$y$', fontsize=20, portion=0.2)\n", "ax.plot(m.Z, np.ones(m.Z.shape)*ylim[0], 'k^', markersize=30)\n", "ax.set_xlim(xlim)\n", "ax.set_ylim(ylim)\n", "mlai.write_figure(figure=fig,\n", " filename='./gp/sparse-demo-unconstrained-inducing-6-gp.svg', \n", " transparent=True, frameon=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: Sparse Gaussian process with six unconstrained inducing\n", "variables, initialized randomly and then optimized.\n", "\n", "Now we will vary the number of inducing points used to form the\n", "approximation." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "m.Z.values" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "m.num_inducing=8\n", "m.randomize()\n", "M = 8\n", "\n", "m.set_Z(np.random.rand(M,1)*12)\n", "\n", "m.optimize()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=plot.big_wide_figsize)\n", "plot_model_output(m, scale=scale, offset=offset, ax=ax, xlabel='$x$', ylabel='$y$', fontsize=20, portion=0.2)\n", "ax.plot(m.Z, np.ones(m.Z.shape)*ylim[0], 'k^', markersize=30)\n", "ax.set_xlim(xlim)\n", "ax.set_ylim(ylim)\n", "mlai.write_figure(figure=fig,\n", " filename='./gp/sparse-demo-sparse-inducing-8-gp.svg', \n", " transparent=True, frameon=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: Sparse Gaussian process with eight inducing variables,\n", "initialized randomly and then optimized." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(m.log_likelihood(), m_full.log_likelihood())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The use of inducing variables in Gaussian process models to make\n", "inference efficient is now commonplace. By exploiting the parametric\n", "form given in Figure Hensman et al. (n.d.) were able to adapt the\n", "stochastic variational inference approach of Hoffman et al. (2012) to\n", "the nonparametric formalism. This promising direction may allow us to\n", "bridge from a rigorous probabilistic formalism for predictive modeling\n", "as enabled by nonparametric methods to the very rich modeling frameworks\n", "provided by deep learning. In particular, work in composition of\n", "Gaussian processes by Damianou and Lawrence (2013) has been extended to\n", "incorporate variational inference formalisms (see e.g. Hensman and\n", "Lawrence (2014);Dai et al. (n.d.);Salimbeni and Deisenroth (2017)). The\n", "scale at which these models can operate means that they are now being\n", "deployed in some of the domains where deep neural networks have\n", "traditionally dominated (Dutordoir et al. (2020)).\n", "\n", "These methods have not yet been fully verified on the domain which has\n", "motivated much of the thinking this paper, that of *happenstance data*.\n", "But the hope is that the rigorous probabilistic underpinnings combined\n", "with the flexibility of these methods will allow these challenges to be\n", "tackled." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Conclusion\n", "==========\n", "\n", "Modern machine learning methods for prediction are based on highly\n", "overparameterized models that have empirically performed well in tasks\n", "that were previously considered challenging or impossible such as\n", "machine translation, object detection in images, natural language\n", "generation. These models raise new questions for our thinking about how\n", "models generalize their predictions. In particular, the conflate the\n", "conceptual separation between model and algorithm and our best\n", "understanding is that they regularize themselves implicitly through\n", "their optimization algorithms.\n", "\n", "Despite the range of questions these models raise for our classical view\n", "of generalization, in another sense, these models are very traditional.\n", "They operate on tables of data that have been curated through\n", "appropriate curation. 
These deep learning models operate on (very large)\n", "design matrices.\n", "\n", "We’ve argued that the new frontiers for the data sciences lie in the\n", "domain of what we term “happenstance data”. The data that hasn’t been\n", "explicitly collected with a purpose in mind, but is laid down through\n", "the rhythms of our modern lives. We’ve claimed that the traditional view\n", "of data as sitting in a table is restrictive for this new domain, and\n", "outlined how we might model such data through nonparametrics.\n", "\n", "Finally, we highlighted work where these ideas are beginning to be\n", "formulated and flexible non-parametric probabilistic models are being\n", "deployed on large scale data. The next horizon for these models is to\n", "move beyond the traditional data formats, in particular tabular data, on\n", "to the domain of massivel missing data where mere snippets of data are\n", "available, but the interactions between those snippets are of sufficient\n", "complexity to require the complex modeling formalisms inspired by the\n", "modern range of deep learning methodologies." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Acknowledgments\n", "---------------\n", "\n", "I’ve benefited over the years from conversations with a number of\n", "colleagues, among those I can identify that influenced the thinking in\n", "this paper are Tony O’Hagan, John Kent, David J. C. MacKay, Richard\n", "Wilkinson, Darren Wilkinson, Bernhard Schölkopf, Zoubin Ghahramani.\n", "Naturally, the responsibility for the sensible bits is theirs, the\n", "errors are all mine." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "References\n", "----------" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Aldrich, J., 2008. R. A. Fisher on Bayes and Bayes’ theorem. Bayesian\n", "Anal. 3, 161–170. \n", "\n", "Alvarez, R.M. (Ed.), 2016. Computational social science. Cambridge\n", "University Press.\n", "\n", "Andrés, L., Zentner, A., Zentner, J., 2014. Measuring the effect of\n", "internet adoption on paper consumption. The World Bank.\n", "\n", "Arora, S., Cohen, N., Golowich, N., Hu, W., 2019. A convergence analysis\n", "of gradient descent for deep linear neural networks, in: International\n", "Conference on Learning Representations.\n", "\n", "Arora, S., Cohen, N., Hu, W., Luo, Y., 2019. Implicit regularization in\n", "deep matrix factorization, in: Wallach, H., Larochelle, H., Beygelzimer,\n", "A., d’Alché-Buc, F., Fox, E., Garnett, R. (Eds.), Advances in Neural\n", "Information Processing Systems 32. Curran Associates, Inc., pp.\n", "7413–7424.\n", "\n", "Álvarez, M.A., Luengo, D., Lawrence, N.D., 2013. Linear latent force\n", "models using Gaussian processes. IEEE Transactions on Pattern Analysis\n", "and Machine Intelligence 35, 2693–2705.\n", "\n", "\n", "Belkin, M., Hsu, D., Ma, S., Soumik Mandal, 2019. Reconciling modern\n", "machine-learning practice and the classical bias-variance trade-off.\n", "Proc. Natl. Acad. Sci. USA 116, 15849–15854.\n", "\n", "Bernardo, J.M., Smith, A.F.M., 1994. Bayesian theory. wiley.\n", "\n", "Box, G.E.P., 1976. Science and statistics. Journal of the American\n", "Statistical Association 71.\n", "\n", "Breiman, L., 2001a. Statistical modeling: The two cultures. Statistical\n", "Science 16, 199–231.\n", "\n", "Breiman, L., 2001b. Random forests. Mach. Learn. 45, 5–32.\n", "\n", "\n", "Breiman, L., 1996. Bagging predictors. 
Machine Learning 24, 123–140.\n", "\n", "\n", "Carpenter, B., Gelman, A., Hoffman, M.D., Lee, D., Goodrich, B.,\n", "Betancourt, M., Brubaker, M., Guo, J., Li, P., Allen Riddell, 2017.\n", "Stan: A probabilistic programming language. Journal of Statistical\n", "Software 76. \n", "\n", "Carvalho, V.M., Hansen, S., Ortiz, Á., García, J.R., Rodrigo, T., Mora,\n", "S.R., Ruiz, J., 2020. Tracking the covid-19 crisis with high-resolution\n", "transaction data (No. DP14642). Center for Economic Policy Research.\n", "\n", "Dai, Z., Damianou, A., Gonzalez, J., Lawrence, N.D., n.d. Variationally\n", "auto-encoded deep Gaussian processes, in:.\n", "\n", "Damianou, A., Lawrence, N.D., 2013. Deep Gaussian processes, in:. pp.\n", "207–215.\n", "\n", "Dawid, A.P., 1984. Present position and potential developments: Some\n", "personal views: Statistical theory: The prequential approach. Journal of\n", "the Royal Statistical Society, A 147, 278–292.\n", "\n", "Dawid, A.P., 1982. The well-callibrated Bayesian. Journal of the\n", "American Statistical Association 77, 605–613.\n", "\n", "Devlin, J., Chang, M.-W., Lee, K., Toutanova, K., 2019. BERT:\n", "Pre-training of deep bidirectional transformers for language\n", "understanding, in: Proceedings of the 2019 Conference of the North\n", "American Chapter of the Association for Computational Linguistics: Human\n", "Language Technologies, Volume 1 (Long and Short Papers). Association for\n", "Computational Linguistics, Minneapolis, Minnesota, pp. 4171–4186.\n", "\n", "\n", "Dutordoir, V., Wilk, M. van der, Artemev, A., Hensman, J., 2020.\n", "Bayesian image classification with deep convolutional gaussian\n", "processes, in: Chiappa, S., Calandra, R. (Eds.), Proceedings of Machine\n", "Learning Research. PMLR, Online, pp. 1529–1539.\n", "\n", "Efron, B., 2020. Prediction, estimation, and attribution. Journal of the\n", "American Statistical Association 115, 636–655.\n", "\n", "\n", "Efron, B., 1979. Bootstrap methods: Another look at the jackkife. Annals\n", "of Statistics 7, 1–26.\n", "\n", "Fisher, R.A., 1950. Contributions to mathematical statistics. Chapman;\n", "Hall.\n", "\n", "Friedman, J., Hastie, T., Tibshirani, R., 2020. Discussion of\n", "“Prediction, estimation, and attribution” by Bradley Efron. Journal of\n", "the American Statistical Association 115, 665–666.\n", "\n", "\n", "Geman, S., Bienenstock, E., Doursat, R., 1992. Neural networks and the\n", "bias/variance dilemma. Neural Computation 4, 1–58.\n", "\n", "\n", "Geman, S., Bienenstock, E., Doursat, R., 1992. Neural networks and the\n", "bias/variance dilema. Neural Computation 4, 1–58.\n", "\n", "Ghahramani, Z., 2015. Probabilistic machine learning and artificial\n", "intelligence. Nature 452–459.\n", "\n", "Ginsberg, J., Mohebbi, M.H., Patel, R.S., Brammer, L., Smolinski, M.S.,\n", "Brilliant, L., 2009. Detecting influenza epdiemics using search engine\n", "query data. Nature 1012–1014.\n", "\n", "Halevy, A.Y., Norvig, P., Pereira, F., 2009. The unreasonable\n", "effectiveness of data. IEEE Intelligent Systems 24, 8–12.\n", "\n", "\n", "Hensman, J., Fusi, N., Lawrence, N.D., n.d. Gaussian processes for big\n", "data, in:.\n", "\n", "Hensman, J., Lawrence, N.D., 2014. Nested variational compression in\n", "deep Gaussian processes. University of Sheffield.\n", "\n", "Hinton, G., Deng, L., Yu, D., Dahl, G.E., Mohamed, A.-r., Jaitly, N.,\n", "Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T.N., Kingsbury, B.,\n", "2012. 
Deep neural networks for acoustic modeling in speech recognition:\n", "The shared views of four research groups. IEEE Signal Processing\n", "Magazine 29, 82–97. \n", "\n", "Hoffman, M., Blei, D.M., Wang, C., Paisley, J., 2012. Stochastic\n", "variational inference, arXiv preprint arXiv:1206.7051.\n", "\n", "Hotelling, H., 1933. Analysis of a complex of statistical variables into\n", "principal components. Journal of Educational Psychology 24, 417–441.\n", "\n", "Jennings, W., Wlezien, C., 2018. Election polling errors across time and\n", "space. Nature Human Behaviour 2, 276–283.\n", "\n", "\n", "Kahneman, D., 2011. Thinking fast and slow.\n", "\n", "Kohavi, R., Longbotham, R., 2017. Online controlled experiments and a/b\n", "testing, in: Sammut, C., Webb, G.I. (Eds.), Encyclopedia of Machine\n", "Learning and Data Mining. Springer US, Boston, MA, pp. 922–929.\n", "\n", "\n", "Krizhevsky, A., Sutskever, I., Hinton, G.E., n.d. ImageNet\n", "classification with deep convolutional neural networks, in:. pp.\n", "1097–1105.\n", "\n", "Lawrence, N.D., 2012. A unifying probabilistic perspective for spectral\n", "dimensionality reduction: Insights and new models. Journal of Machine\n", "Learning Research 13.\n", "\n", "Lawrence, N.D., 2010. Introduction to learning and inference in\n", "computational systems biology, in:.\n", "\n", "Lawrence, N.D., 2005. Probabilistic non-linear principal component\n", "analysis with Gaussian process latent variable models. Journal of\n", "Machine Learning Research 6, 1783–1816.\n", "\n", "Lawson, C.L., Hanson, R.J., 1995. Solving least squares problems. SIAM.\n", "\n", "\n", "Li, C., 2020. OpenAI’s gpt-3 language model: A technical overview.\n", "\n", "McCullagh, P., Nelder, J.A., 1989. Generalized linear models, 2nd ed.\n", "Chapman; Hall.\n", "\n", "Menni, C., Valdes, A.M., Freidin, M.B., Sudre, C.H., Nguyen, L.H., Drew,\n", "D.A., Ganesh, S., Varsavsky, T., Cardoso, M.J., Moustafa, J.S.E.-S.,\n", "Visconti, A., Hysi, P., Bowyer, R.C.E., Mangino, M., Falchi, M., Wolf,\n", "J., Ourselin, S., Chan, A.T., Steves, C.J., Spector, T.D., 2020.\n", "Real-time tracking of self-reported symptoms to predict potential\n", "covid-19. Nature Medicine 1037–1040.\n", "\n", "Mitchell, T.M., 1977. Version spaces: A candidate elimination approach\n", "to rule-learning (pp. 305–310), in: Proceedings of the Fifth\n", "International Joint Conference on Artificial Intelligence.\n", "\n", "Office for National Statistics, 2020. Coronavirus (covid-19) infection\n", "survey pilot: England and wales, 14 august 2020.\n", "\n", "Oliver, N., Lepri, B., Sterly, H., Lambiotte, R., Deletaille, S., De\n", "Nadai, M., Letouzé, E., Salah, A.A., Benjamins, R., Cattuto, C.,\n", "Colizza, V., Cordes, N. de, Fraiberger, S.P., Koebe, T., Lehmann, S.,\n", "Murillo, J., Pentland, A., Pham, P.N., Pivetta, F., Saramäki, J.,\n", "Scarpino, S.V., Tizzoni, M., Verhulst, S., Vinck, P., 2020. Mobile phone\n", "data for informing public health actions across the covid-19 pandemic\n", "life cycle. Science Advances 6. \n", "\n", "Pearson, K., 1901. On lines and planes of closest fit to systems of\n", "points in space. The London, Edinburgh and Dublin Philosophical Magazine\n", "and Journal of Science, Sixth Series 2, 559–572.\n", "\n", "Popper, K.R., 1963. Conjectures and refutations: The growth of\n", "scientific knowledge. Routledge, London.\n", "\n", "Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I.,\n", "2019. 
Language models are unsupervised multitask learners, in:.\n", "\n", "Robbins, H., Monro, S., 1951. A stochastic approximation method. Annals\n", "of Mathematical Statistics 22, 400–407.\n", "\n", "Roweis, S.T., Saul, L.K., 2000. Nonlinear dimensionality reduction by\n", "locally linear embedding. Science 290, 2323–2326.\n", "\n", "\n", "Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S.,\n", "Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei,\n", "L., 2015. ImageNet Large Scale Visual Recognition Challenge.\n", "International Journal of Computer Vision (IJCV) 115, 211–252.\n", "\n", "\n", "Salganik, M.J., 2018. Bit by bit: Social research in the digital age.\n", "Princeton University Press.\n", "\n", "Salimbeni, H., Deisenroth, M., 2017. Doubly stochastic variational\n", "inference for deep Gaussian processes, in: Guyon, I., Luxburg, U.V.,\n", "Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R.\n", "(Eds.), Advances in Neural Information Processing Systems 30. Curran\n", "Associates, Inc., pp. 4591–4602.\n", "\n", "Saul, A.D., Hensman, J., Vehtari, A., Lawrence, N.D., n.d. Chained\n", "Gaussian processes, in:. pp. 1431–1440.\n", "\n", "Schölkopf, B., Smola, A., Müller, K.-R., 1998. Nonlinear component\n", "analysis as a kernel eigenvalue problem. Neural Computation 10,\n", "1299–1319. \n", "\n", "Soudry, D., Hoffer, E., Nacson, M.S., Gunasekar, S., Srebro, N., 2018.\n", "The implicit bias of gradient descent on separable data. Journal of\n", "Machine Learning Research 19, 1–57.\n", "\n", "Sutskever, I., Vinyals, O., Le, Q.V., 2014. Sequence to sequence\n", "learning with neural networks, in: Ghahramani, Z., Welling, M., Cortes,\n", "C., Lawrence, N.D., Weinberger, K.Q. (Eds.), Advances in Neural\n", "Information Processing Systems 27. Curran Associates, Inc., pp.\n", "3104–3112.\n", "\n", "Taigman, Y., Yang, M., Ranzato, M., Wolf, L., 2014. DeepFace: Closing\n", "the gap to human-level performance in face verification, in: Proceedings\n", "of the IEEE Computer Society Conference on Computer Vision and Pattern\n", "Recognition. \n", "\n", "The DELVE Initiative, 2020. Economic aspects of the covid-19 crisis in\n", "the uk (No. 5). DELVE.\n", "\n", "Titsias, M.K., n.d. Variational learning of inducing variables in sparse\n", "Gaussian processes, in:. pp. 567–574.\n", "\n", "Tran, D., Kucukelbir, A., Dieng, A.B., Rudolph, M., Liang, D., Blei,\n", "D.M., 2016. Edward: A library for probabilistic modeling, inference, and\n", "criticism. arXiv preprint arXiv:1610.09787.\n", "\n", "Tukey, J.W., 1977. Exploratory data analysis. Addison-Wesley.\n", "\n", "Vapnik, V.N., 1998. Statistical learning theory. wiley, New York.\n", "\n", "Wang, W., Rothschild, D., Goel, S., Gelman, A., 2015. Forecasting\n", "elections with non-representative polls. International Journal of\n", "Forecasting.\n", "\n", "Wasserman, L.A., 2003. All of statistics. springer, New York.\n", "\n", "Weinberger, K.Q., Sha, F., Saul, L.K., n.d. Learning a kernel matrix for\n", "nonlinear dimensionality reduction, in:. pp. 839–846.\n", "\n", "Winn, J., Bishp, C.M., Diethe, T., Guiver, J., Zaykov, Y., n.d. Model\n", "based machine learning, publisher = , year = 2019, url =\n", "http://www.mbmlbook.com/.\n", "\n", "World Health Organization, 2020. 
International statistical\n", "classification of diseases and related health problems (11th edition).\n", "\n", "Zhang, C., Bengio, S., Hardt, M., Recht, B., Vinyals, O., 2017.\n", "Understanding deep learning requires rethinking generalization, in:\n", "https://openreview.net/forum?id=Sy8gdB9xx (Ed.), International\n", "Conference on Learning Representations." ] } ], "nbformat": 4, "nbformat_minor": 5, "metadata": {} }