{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data Quality and Data Readiness Levels\n", "### [Neil D. Lawrence](http://inverseprobability.com), University of Cambridge\n", "### 2019-10-15\n", "\n", "**Abstract**: In this talk we consider data readiness levels and how they may be\n", "deployed.\n", "\n", "$$\n", "\\newcommand{\\tk}[1]{}\n", "%\\newcommand{\\tk}[1]{\\textbf{TK}: #1}\n", "\\newcommand{\\Amatrix}{\\mathbf{A}}\n", "\\newcommand{\\KL}[2]{\\text{KL}\\left( #1\\,\\|\\,#2 \\right)}\n", "\\newcommand{\\Kaast}{\\kernelMatrix_{\\mathbf{ \\ast}\\mathbf{ \\ast}}}\n", "\\newcommand{\\Kastu}{\\kernelMatrix_{\\mathbf{ \\ast} \\inducingVector}}\n", "\\newcommand{\\Kff}{\\kernelMatrix_{\\mappingFunctionVector \\mappingFunctionVector}}\n", "\\newcommand{\\Kfu}{\\kernelMatrix_{\\mappingFunctionVector \\inducingVector}}\n", "\\newcommand{\\Kuast}{\\kernelMatrix_{\\inducingVector \\bf\\ast}}\n", "\\newcommand{\\Kuf}{\\kernelMatrix_{\\inducingVector \\mappingFunctionVector}}\n", "\\newcommand{\\Kuu}{\\kernelMatrix_{\\inducingVector \\inducingVector}}\n", "\\newcommand{\\Kuui}{\\Kuu^{-1}}\n", "\\newcommand{\\Qaast}{\\mathbf{Q}_{\\bf \\ast \\ast}}\n", "\\newcommand{\\Qastf}{\\mathbf{Q}_{\\ast \\mappingFunction}}\n", "\\newcommand{\\Qfast}{\\mathbf{Q}_{\\mappingFunctionVector \\bf \\ast}}\n", "\\newcommand{\\Qff}{\\mathbf{Q}_{\\mappingFunctionVector \\mappingFunctionVector}}\n", "\\newcommand{\\aMatrix}{\\mathbf{A}}\n", "\\newcommand{\\aScalar}{a}\n", "\\newcommand{\\aVector}{\\mathbf{a}}\n", "\\newcommand{\\acceleration}{a}\n", "\\newcommand{\\bMatrix}{\\mathbf{B}}\n", "\\newcommand{\\bScalar}{b}\n", "\\newcommand{\\bVector}{\\mathbf{b}}\n", "\\newcommand{\\basisFunc}{\\phi}\n", "\\newcommand{\\basisFuncVector}{\\boldsymbol{ \\basisFunc}}\n", "\\newcommand{\\basisFunction}{\\phi}\n", "\\newcommand{\\basisLocation}{\\mu}\n", "\\newcommand{\\basisMatrix}{\\boldsymbol{ \\Phi}}\n", "\\newcommand{\\basisScalar}{\\basisFunction}\n", 
"\\newcommand{\\basisVector}{\\boldsymbol{ \\basisFunction}}\n", "\\newcommand{\\activationFunction}{\\phi}\n", "\\newcommand{\\activationMatrix}{\\boldsymbol{ \\Phi}}\n", "\\newcommand{\\activationScalar}{\\basisFunction}\n", "\\newcommand{\\activationVector}{\\boldsymbol{ \\basisFunction}}\n", "\\newcommand{\\bigO}{\\mathcal{O}}\n", "\\newcommand{\\binomProb}{\\pi}\n", "\\newcommand{\\cMatrix}{\\mathbf{C}}\n", "\\newcommand{\\cbasisMatrix}{\\hat{\\boldsymbol{ \\Phi}}}\n", "\\newcommand{\\cdataMatrix}{\\hat{\\dataMatrix}}\n", "\\newcommand{\\cdataScalar}{\\hat{\\dataScalar}}\n", "\\newcommand{\\cdataVector}{\\hat{\\dataVector}}\n", "\\newcommand{\\centeredKernelMatrix}{\\mathbf{ \\MakeUppercase{\\centeredKernelScalar}}}\n", "\\newcommand{\\centeredKernelScalar}{b}\n", "\\newcommand{\\centeredKernelVector}{\\centeredKernelScalar}\n", "\\newcommand{\\centeringMatrix}{\\mathbf{H}}\n", "\\newcommand{\\chiSquaredDist}[2]{\\chi_{#1}^{2}\\left(#2\\right)}\n", "\\newcommand{\\chiSquaredSamp}[1]{\\chi_{#1}^{2}}\n", "\\newcommand{\\conditionalCovariance}{\\boldsymbol{ \\Sigma}}\n", "\\newcommand{\\coregionalizationMatrix}{\\mathbf{B}}\n", "\\newcommand{\\coregionalizationScalar}{b}\n", "\\newcommand{\\coregionalizationVector}{\\mathbf{ \\coregionalizationScalar}}\n", "\\newcommand{\\covDist}[2]{\\text{cov}_{#2}\\left(#1\\right)}\n", "\\newcommand{\\covSamp}[1]{\\text{cov}\\left(#1\\right)}\n", "\\newcommand{\\covarianceScalar}{c}\n", "\\newcommand{\\covarianceVector}{\\mathbf{ \\covarianceScalar}}\n", "\\newcommand{\\covarianceMatrix}{\\mathbf{C}}\n", "\\newcommand{\\covarianceMatrixTwo}{\\boldsymbol{ \\Sigma}}\n", "\\newcommand{\\croupierScalar}{s}\n", "\\newcommand{\\croupierVector}{\\mathbf{ \\croupierScalar}}\n", "\\newcommand{\\croupierMatrix}{\\mathbf{ \\MakeUppercase{\\croupierScalar}}}\n", "\\newcommand{\\dataDim}{p}\n", "\\newcommand{\\dataIndex}{i}\n", "\\newcommand{\\dataIndexTwo}{j}\n", "\\newcommand{\\dataMatrix}{\\mathbf{Y}}\n", 
"\\newcommand{\\dataScalar}{y}\n", "\\newcommand{\\dataSet}{\\mathcal{D}}\n", "\\newcommand{\\dataStd}{\\sigma}\n", "\\newcommand{\\dataVector}{\\mathbf{ \\dataScalar}}\n", "\\newcommand{\\decayRate}{d}\n", "\\newcommand{\\degreeMatrix}{\\mathbf{ \\MakeUppercase{\\degreeScalar}}}\n", "\\newcommand{\\degreeScalar}{d}\n", "\\newcommand{\\degreeVector}{\\mathbf{ \\degreeScalar}}\n", "% Already defined by latex\n", "%\\newcommand{\\det}[1]{\\left|#1\\right|}\n", "\\newcommand{\\diag}[1]{\\text{diag}\\left(#1\\right)}\n", "\\newcommand{\\diagonalMatrix}{\\mathbf{D}}\n", "\\newcommand{\\diff}[2]{\\frac{\\text{d}#1}{\\text{d}#2}}\n", "\\newcommand{\\diffTwo}[2]{\\frac{\\text{d}^2#1}{\\text{d}#2^2}}\n", "\\newcommand{\\displacement}{x}\n", "\\newcommand{\\displacementVector}{\\textbf{\\displacement}}\n", "\\newcommand{\\distanceMatrix}{\\mathbf{ \\MakeUppercase{\\distanceScalar}}}\n", "\\newcommand{\\distanceScalar}{d}\n", "\\newcommand{\\distanceVector}{\\mathbf{ \\distanceScalar}}\n", "\\newcommand{\\eigenvaltwo}{\\ell}\n", "\\newcommand{\\eigenvaltwoMatrix}{\\mathbf{L}}\n", "\\newcommand{\\eigenvaltwoVector}{\\mathbf{l}}\n", "\\newcommand{\\eigenvalue}{\\lambda}\n", "\\newcommand{\\eigenvalueMatrix}{\\boldsymbol{ \\Lambda}}\n", "\\newcommand{\\eigenvalueVector}{\\boldsymbol{ \\lambda}}\n", "\\newcommand{\\eigenvector}{\\mathbf{ \\eigenvectorScalar}}\n", "\\newcommand{\\eigenvectorMatrix}{\\mathbf{U}}\n", "\\newcommand{\\eigenvectorScalar}{u}\n", "\\newcommand{\\eigenvectwo}{\\mathbf{v}}\n", "\\newcommand{\\eigenvectwoMatrix}{\\mathbf{V}}\n", "\\newcommand{\\eigenvectwoScalar}{v}\n", "\\newcommand{\\entropy}[1]{\\mathcal{H}\\left(#1\\right)}\n", "\\newcommand{\\errorFunction}{E}\n", "\\newcommand{\\expDist}[2]{\\left<#1\\right>_{#2}}\n", "\\newcommand{\\expSamp}[1]{\\left<#1\\right>}\n", "\\newcommand{\\expectation}[1]{\\left\\langle #1 \\right\\rangle }\n", "\\newcommand{\\expectationDist}[2]{\\left\\langle #1 \\right\\rangle _{#2}}\n", 
"\\newcommand{\\expectedDistanceMatrix}{\\mathcal{D}}\n", "\\newcommand{\\eye}{\\mathbf{I}}\n", "\\newcommand{\\fantasyDim}{r}\n", "\\newcommand{\\fantasyMatrix}{\\mathbf{ \\MakeUppercase{\\fantasyScalar}}}\n", "\\newcommand{\\fantasyScalar}{z}\n", "\\newcommand{\\fantasyVector}{\\mathbf{ \\fantasyScalar}}\n", "\\newcommand{\\featureStd}{\\varsigma}\n", "\\newcommand{\\gammaCdf}[3]{\\mathcal{GAMMA CDF}\\left(#1|#2,#3\\right)}\n", "\\newcommand{\\gammaDist}[3]{\\mathcal{G}\\left(#1|#2,#3\\right)}\n", "\\newcommand{\\gammaSamp}[2]{\\mathcal{G}\\left(#1,#2\\right)}\n", "\\newcommand{\\gaussianDist}[3]{\\mathcal{N}\\left(#1|#2,#3\\right)}\n", "\\newcommand{\\gaussianSamp}[2]{\\mathcal{N}\\left(#1,#2\\right)}\n", "\\newcommand{\\given}{|}\n", "\\newcommand{\\half}{\\frac{1}{2}}\n", "\\newcommand{\\heaviside}{H}\n", "\\newcommand{\\hiddenMatrix}{\\mathbf{ \\MakeUppercase{\\hiddenScalar}}}\n", "\\newcommand{\\hiddenScalar}{h}\n", "\\newcommand{\\hiddenVector}{\\mathbf{ \\hiddenScalar}}\n", "\\newcommand{\\identityMatrix}{\\eye}\n", "\\newcommand{\\inducingInputScalar}{z}\n", "\\newcommand{\\inducingInputVector}{\\mathbf{ \\inducingInputScalar}}\n", "\\newcommand{\\inducingInputMatrix}{\\mathbf{Z}}\n", "\\newcommand{\\inducingScalar}{u}\n", "\\newcommand{\\inducingVector}{\\mathbf{ \\inducingScalar}}\n", "\\newcommand{\\inducingMatrix}{\\mathbf{U}}\n", "\\newcommand{\\inlineDiff}[2]{\\text{d}#1/\\text{d}#2}\n", "\\newcommand{\\inputDim}{q}\n", "\\newcommand{\\inputMatrix}{\\mathbf{X}}\n", "\\newcommand{\\inputScalar}{x}\n", "\\newcommand{\\inputSpace}{\\mathcal{X}}\n", "\\newcommand{\\inputVals}{\\inputVector}\n", "\\newcommand{\\inputVector}{\\mathbf{ \\inputScalar}}\n", "\\newcommand{\\iterNum}{k}\n", "\\newcommand{\\kernel}{\\kernelScalar}\n", "\\newcommand{\\kernelMatrix}{\\mathbf{K}}\n", "\\newcommand{\\kernelScalar}{k}\n", "\\newcommand{\\kernelVector}{\\mathbf{ \\kernelScalar}}\n", "\\newcommand{\\kff}{\\kernelScalar_{\\mappingFunction \\mappingFunction}}\n", 
"\\newcommand{\\kfu}{\\kernelVector_{\\mappingFunction \\inducingScalar}}\n", "\\newcommand{\\kuf}{\\kernelVector_{\\inducingScalar \\mappingFunction}}\n", "\\newcommand{\\kuu}{\\kernelVector_{\\inducingScalar \\inducingScalar}}\n", "\\newcommand{\\lagrangeMultiplier}{\\lambda}\n", "\\newcommand{\\lagrangeMultiplierMatrix}{\\boldsymbol{ \\Lambda}}\n", "\\newcommand{\\lagrangian}{L}\n", "\\newcommand{\\laplacianFactor}{\\mathbf{ \\MakeUppercase{\\laplacianFactorScalar}}}\n", "\\newcommand{\\laplacianFactorScalar}{m}\n", "\\newcommand{\\laplacianFactorVector}{\\mathbf{ \\laplacianFactorScalar}}\n", "\\newcommand{\\laplacianMatrix}{\\mathbf{L}}\n", "\\newcommand{\\laplacianScalar}{\\ell}\n", "\\newcommand{\\laplacianVector}{\\mathbf{ \\ell}}\n", "\\newcommand{\\latentDim}{q}\n", "\\newcommand{\\latentDistanceMatrix}{\\boldsymbol{ \\Delta}}\n", "\\newcommand{\\latentDistanceScalar}{\\delta}\n", "\\newcommand{\\latentDistanceVector}{\\boldsymbol{ \\delta}}\n", "\\newcommand{\\latentForce}{f}\n", "\\newcommand{\\latentFunction}{u}\n", "\\newcommand{\\latentFunctionVector}{\\mathbf{ \\latentFunction}}\n", "\\newcommand{\\latentFunctionMatrix}{\\mathbf{ \\MakeUppercase{\\latentFunction}}}\n", "\\newcommand{\\latentIndex}{j}\n", "\\newcommand{\\latentScalar}{z}\n", "\\newcommand{\\latentVector}{\\mathbf{ \\latentScalar}}\n", "\\newcommand{\\latentMatrix}{\\mathbf{Z}}\n", "\\newcommand{\\learnRate}{\\eta}\n", "\\newcommand{\\lengthScale}{\\ell}\n", "\\newcommand{\\rbfWidth}{\\ell}\n", "\\newcommand{\\likelihoodBound}{\\mathcal{L}}\n", "\\newcommand{\\likelihoodFunction}{L}\n", "\\newcommand{\\locationScalar}{\\mu}\n", "\\newcommand{\\locationVector}{\\boldsymbol{ \\locationScalar}}\n", "\\newcommand{\\locationMatrix}{\\mathbf{M}}\n", "\\newcommand{\\variance}[1]{\\text{var}\\left( #1 \\right)}\n", "\\newcommand{\\mappingFunction}{f}\n", "\\newcommand{\\mappingFunctionMatrix}{\\mathbf{F}}\n", "\\newcommand{\\mappingFunctionTwo}{g}\n", 
"\\newcommand{\\mappingFunctionTwoMatrix}{\\mathbf{G}}\n", "\\newcommand{\\mappingFunctionTwoVector}{\\mathbf{ \\mappingFunctionTwo}}\n", "\\newcommand{\\mappingFunctionVector}{\\mathbf{ \\mappingFunction}}\n", "\\newcommand{\\scaleScalar}{s}\n", "\\newcommand{\\mappingScalar}{w}\n", "\\newcommand{\\mappingVector}{\\mathbf{ \\mappingScalar}}\n", "\\newcommand{\\mappingMatrix}{\\mathbf{W}}\n", "\\newcommand{\\mappingScalarTwo}{v}\n", "\\newcommand{\\mappingVectorTwo}{\\mathbf{ \\mappingScalarTwo}}\n", "\\newcommand{\\mappingMatrixTwo}{\\mathbf{V}}\n", "\\newcommand{\\maxIters}{K}\n", "\\newcommand{\\meanMatrix}{\\mathbf{M}}\n", "\\newcommand{\\meanScalar}{\\mu}\n", "\\newcommand{\\meanTwoMatrix}{\\mathbf{M}}\n", "\\newcommand{\\meanTwoScalar}{m}\n", "\\newcommand{\\meanTwoVector}{\\mathbf{ \\meanTwoScalar}}\n", "\\newcommand{\\meanVector}{\\boldsymbol{ \\meanScalar}}\n", "\\newcommand{\\mrnaConcentration}{m}\n", "\\newcommand{\\naturalFrequency}{\\omega}\n", "\\newcommand{\\neighborhood}[1]{\\mathcal{N}\\left( #1 \\right)}\n", "\\newcommand{\\neilurl}{http://inverseprobability.com/}\n", "\\newcommand{\\noiseMatrix}{\\boldsymbol{ E}}\n", "\\newcommand{\\noiseScalar}{\\epsilon}\n", "\\newcommand{\\noiseVector}{\\boldsymbol{ \\epsilon}}\n", "\\newcommand{\\norm}[1]{\\left\\Vert #1 \\right\\Vert}\n", "\\newcommand{\\normalizedLaplacianMatrix}{\\hat{\\mathbf{L}}}\n", "\\newcommand{\\normalizedLaplacianScalar}{\\hat{\\ell}}\n", "\\newcommand{\\normalizedLaplacianVector}{\\hat{\\mathbf{ \\ell}}}\n", "\\newcommand{\\numActive}{m}\n", "\\newcommand{\\numBasisFunc}{m}\n", "\\newcommand{\\numComponents}{m}\n", "\\newcommand{\\numComps}{K}\n", "\\newcommand{\\numData}{n}\n", "\\newcommand{\\numFeatures}{K}\n", "\\newcommand{\\numHidden}{h}\n", "\\newcommand{\\numInducing}{m}\n", "\\newcommand{\\numLayers}{\\ell}\n", "\\newcommand{\\numNeighbors}{K}\n", "\\newcommand{\\numSequences}{s}\n", "\\newcommand{\\numSuccess}{s}\n", "\\newcommand{\\numTasks}{m}\n", 
"\\newcommand{\\numTime}{T}\n", "\\newcommand{\\numTrials}{S}\n", "\\newcommand{\\outputIndex}{j}\n", "\\newcommand{\\paramVector}{\\boldsymbol{ \\theta}}\n", "\\newcommand{\\parameterMatrix}{\\boldsymbol{ \\Theta}}\n", "\\newcommand{\\parameterScalar}{\\theta}\n", "\\newcommand{\\parameterVector}{\\boldsymbol{ \\parameterScalar}}\n", "\\newcommand{\\partDiff}[2]{\\frac{\\partial#1}{\\partial#2}}\n", "\\newcommand{\\precisionScalar}{j}\n", "\\newcommand{\\precisionVector}{\\mathbf{ \\precisionScalar}}\n", "\\newcommand{\\precisionMatrix}{\\mathbf{J}}\n", "\\newcommand{\\pseudotargetScalar}{\\widetilde{y}}\n", "\\newcommand{\\pseudotargetVector}{\\mathbf{ \\pseudotargetScalar}}\n", "\\newcommand{\\pseudotargetMatrix}{\\mathbf{ \\widetilde{Y}}}\n", "\\newcommand{\\rank}[1]{\\text{rank}\\left(#1\\right)}\n", "\\newcommand{\\rayleighDist}[2]{\\mathcal{R}\\left(#1|#2\\right)}\n", "\\newcommand{\\rayleighSamp}[1]{\\mathcal{R}\\left(#1\\right)}\n", "\\newcommand{\\responsibility}{r}\n", "\\newcommand{\\rotationScalar}{r}\n", "\\newcommand{\\rotationVector}{\\mathbf{ \\rotationScalar}}\n", "\\newcommand{\\rotationMatrix}{\\mathbf{R}}\n", "\\newcommand{\\sampleCovScalar}{s}\n", "\\newcommand{\\sampleCovVector}{\\mathbf{ \\sampleCovScalar}}\n", "\\newcommand{\\sampleCovMatrix}{\\mathbf{s}}\n", "\\newcommand{\\scalarProduct}[2]{\\left\\langle{#1},{#2}\\right\\rangle}\n", "\\newcommand{\\sign}[1]{\\text{sign}\\left(#1\\right)}\n", "\\newcommand{\\sigmoid}[1]{\\sigma\\left(#1\\right)}\n", "\\newcommand{\\singularvalue}{\\ell}\n", "\\newcommand{\\singularvalueMatrix}{\\mathbf{L}}\n", "\\newcommand{\\singularvalueVector}{\\mathbf{l}}\n", "\\newcommand{\\sorth}{\\mathbf{u}}\n", "\\newcommand{\\spar}{\\lambda}\n", "\\newcommand{\\trace}[1]{\\text{tr}\\left(#1\\right)}\n", "\\newcommand{\\BasalRate}{B}\n", "\\newcommand{\\DampingCoefficient}{C}\n", "\\newcommand{\\DecayRate}{D}\n", "\\newcommand{\\Displacement}{X}\n", "\\newcommand{\\LatentForce}{F}\n", "\\newcommand{\\Mass}{M}\n", 
"\\newcommand{\\Sensitivity}{S}\n", "\\newcommand{\\basalRate}{b}\n", "\\newcommand{\\dampingCoefficient}{c}\n", "\\newcommand{\\mass}{m}\n", "\\newcommand{\\sensitivity}{s}\n", "\\newcommand{\\springScalar}{\\kappa}\n", "\\newcommand{\\springVector}{\\boldsymbol{ \\kappa}}\n", "\\newcommand{\\springMatrix}{\\boldsymbol{ \\mathcal{K}}}\n", "\\newcommand{\\tfConcentration}{p}\n", "\\newcommand{\\tfDecayRate}{\\delta}\n", "\\newcommand{\\tfMrnaConcentration}{f}\n", "\\newcommand{\\tfVector}{\\mathbf{ \\tfConcentration}}\n", "\\newcommand{\\velocity}{v}\n", "\\newcommand{\\sufficientStatsScalar}{g}\n", "\\newcommand{\\sufficientStatsVector}{\\mathbf{ \\sufficientStatsScalar}}\n", "\\newcommand{\\sufficientStatsMatrix}{\\mathbf{G}}\n", "\\newcommand{\\switchScalar}{s}\n", "\\newcommand{\\switchVector}{\\mathbf{ \\switchScalar}}\n", "\\newcommand{\\switchMatrix}{\\mathbf{S}}\n", "\\newcommand{\\tr}[1]{\\text{tr}\\left(#1\\right)}\n", "\\newcommand{\\loneNorm}[1]{\\left\\Vert #1 \\right\\Vert_1}\n", "\\newcommand{\\ltwoNorm}[1]{\\left\\Vert #1 \\right\\Vert_2}\n", "\\newcommand{\\onenorm}[1]{\\left\\vert#1\\right\\vert_1}\n", "\\newcommand{\\twonorm}[1]{\\left\\Vert #1 \\right\\Vert}\n", "\\newcommand{\\vScalar}{v}\n", "\\newcommand{\\vVector}{\\mathbf{v}}\n", "\\newcommand{\\vMatrix}{\\mathbf{V}}\n", "\\newcommand{\\varianceDist}[2]{\\text{var}_{#2}\\left( #1 \\right)}\n", "% Already defined by latex\n", "%\\newcommand{\\vec}{#1:}\n", "\\newcommand{\\vecb}[1]{\\left(#1\\right):}\n", "\\newcommand{\\weightScalar}{w}\n", "\\newcommand{\\weightVector}{\\mathbf{ \\weightScalar}}\n", "\\newcommand{\\weightMatrix}{\\mathbf{W}}\n", "\\newcommand{\\weightedAdjacencyMatrix}{\\mathbf{A}}\n", "\\newcommand{\\weightedAdjacencyScalar}{a}\n", "\\newcommand{\\weightedAdjacencyVector}{\\mathbf{ \\weightedAdjacencyScalar}}\n", "\\newcommand{\\onesVector}{\\mathbf{1}}\n", "\\newcommand{\\zerosVector}{\\mathbf{0}}\n", "$$\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", 
"\n", "# Introduction\n", "\n", "## Machine Learning\n", "\n", "$$\\text{data} + \\text{model} \\rightarrow \\text{prediction}$$\n", "\n", "## Code and Data Separation\n", "\n", "- Classical computer science separates code and data.\n", "- Machine learning short-circuits this separation.\n", "\n", "## The Data Crisis \\[edit\\]\n", "\n", "Anecdotally, talking to data modelling scientists. Most say they spend\n", "80% of their time acquiring and cleaning data. This is precipitating\n", "what I refer to as the \"data crisis\". This is an analogy with software.\n", "The \"software crisis\" was the phenomenon of inability to deliver\n", "software solutions due to increasing complexity of implementation. There\n", "was no single shot solution for the software crisis, it involved better\n", "practice (scrum, test orientated development, sprints, code review),\n", "improved programming paradigms (object orientated, functional) and\n", "better tools (CVS, then SVN, then git).\n", "\n", "However, these challenges aren't new, they are merely taking a different\n", "form. From the computer's perspective software *is* data. The first wave\n", "of the data crisis was known as the *software crisis*.\n", "\n", "### The Software Crisis\n", "\n", "In the late sixties early software programmers made note of the\n", "increasing costs of software development and termed the challenges\n", "associated with it as the \"[Software\n", "Crisis](https://en.wikipedia.org/wiki/Software_crisis)\". Edsger Dijkstra\n", "referred to the crisis in his 1972 Turing Award winner's address.\n", "\n", "> The major cause of the software crisis is that the machines have\n", "> become several orders of magnitude more powerful! 
To put it quite\n", "> bluntly: as long as there were no machines, programming was no problem\n", "> at all; when we had a few weak computers, programming became a mild\n", "> problem, and now we have gigantic computers, programming has become an\n", "> equally gigantic problem.\n", ">\n", "> Edsger Dijkstra (1930-2002), The Humble Programmer\n", "\n", "> The major cause of the data crisis is that machines have become more\n", "> interconnected than ever before. Data access is therefore cheap, but\n", "> data quality is often poor. What we need is cheap high-quality data.\n", "> That implies that we develop processes for improving and verifying\n", "> data quality that are efficient.\n", ">\n", "> There would seem to be two ways for improving efficiency. Firstly, we\n", "> should not duplicate work. Secondly, where possible we should automate\n", "> work.\n", "\n", "What I term \"The Data Crisis\" is the modern equivalent of this problem.\n", "The quantity of modern data, the lack of attention paid to data as it is\n", "initially \"laid down\", and the costs of data cleaning are bringing\n", "about a crisis in data-driven decision making. This crisis is at the\n", "core of the challenge of *technical debt* in machine learning\n", "[@Sculley:debt15].\n", "\n", "Just as with software, the crisis is best addressed by 'scaling' the\n", "manner in which we process our data. Duplication of work occurs because\n", "the value of data cleaning is not correctly recognised in management\n", "decision-making processes. Automation of work is increasingly possible\n", "through techniques in \"artificial intelligence\", but this will also\n", "require better management of the data science pipeline so that data\n", "about data science (meta-data science) can be correctly assimilated and\n", "processed. 
The Alan Turing Institute has a program focussed on this\n", "area, [AI for Data\n", "Analytics](https://www.turing.ac.uk/research_projects/artificial-intelligence-data-analytics/).\n", "\n", "## Data Science as Debugging \[edit\]\n", "\n", "One challenge for existing information technology professionals is\n", "realizing the extent to which a software ecosystem based on data differs\n", "from a classical ecosystem. In particular, by ingesting data we bring\n", "unknowns/uncontrollables into our decision-making system. This presents\n", "opportunities for adversarial exploitation and unforeseen operation.\n", "\n", "You can also check my blog post on [\"Data Science as\n", "Debugging\"](http://inverseprobability.com/2017/03/14/data-science-as-debugging).\n", "\n", "Starting with the analysis of a data set, the nature of data science is\n", "somewhat different from classical software engineering.\n", "\n", "One analogy I find helpful for understanding the depth of change we need\n", "is the following. Imagine that, as a software engineer, you find a USB\n", "stick on the ground. And for some reason you *know* that on that USB\n", "stick is a particular API call that will enable you to make a\n", "significant positive difference on a business problem. You don't know\n", "which of the many library functions on the USB stick are the ones that\n", "will help. And it could be that some of those library functions will\n", "hinder, perhaps because they are just inappropriate or perhaps because\n", "they have been placed there maliciously. The most secure thing to do\n", "would be to *not* introduce this code into your production system at\n", "all. But what if your manager told you to do so? How would you go about\n", "incorporating this code base?\n", "\n", "The answer is *very* carefully. You would have to engage in a process\n", "more akin to debugging than regular software engineering. 
As you\n", "understand the code base, for your work to be reproducible, you should\n", "document not just what you discovered, but how you discovered it.\n", "\n", "An enormous amount of debugging would be required. As the nature of the\n", "code base is understood, software tests to verify it also need to be\n", "constructed. At the end of all your work, the lines of software you\n", "write to actually interact with the software on the USB stick are likely\n", "to be minimal. But more thought would be put into those lines than\n", "perhaps any other lines of code in the system.\n", "\n", "Even then, when your API code is introduced into your production system,\n", "it needs to be deployed in an environment that monitors it. We cannot\n", "rely on an individual's decision making to ensure the quality of all our\n", "systems. We need to create an environment that includes quality\n", "controls, checks and bounds, tests, all designed to ensure that\n", "assumptions made about this foreign code base remain valid.\n", "\n", "This situation is akin to what we are doing when we incorporate data\n", "into our production systems. When we are consuming data from others, we\n", "cannot assume that it has been produced in alignment with our goals for\n", "our own systems. In the worst case, it may have been adversarially\n", "produced. A further challenge is that data is dynamic. So, in effect,\n", "the code on the USB stick is evolving over time.\n", "\n", "It might seem that this process is easy to formalize now: we simply need\n", "to check what the formal software engineering process is for debugging,\n", "because that is the current software engineering activity that data\n", "science is closest to. 
But when we look for a formalization of\n", "debugging, we find that there is none. Indeed, modern software\n", "engineering mainly focusses on ensuring that code is written without\n", "bugs in the first place.\n", "\n", "**Recommendation**: Anecdotally, resolving a machine learning challenge\n", "requires 80% of the resource to be focused on the data and perhaps 20%\n", "to be focused on the model. But many companies are too keen to employ\n", "machine learning engineers who focus on the models, not the data. We\n", "should change our hiring priorities and training. Universities cannot\n", "provide the understanding of how to data-wrangle. Companies must fill\n", "this gap.\n", "\n", "## Data Readiness Levels \[edit\]\n", "\n", "[Data Readiness\n", "Levels](http://inverseprobability.com/2017/01/12/data-readiness-levels)\n", "[@Lawrence:drl17] are an attempt to develop a language around data\n", "quality that can bridge the gap between technical solutions and decision\n", "makers such as managers and project planners. They are inspired by\n", "Technology Readiness Levels, which attempt to quantify the readiness of\n", "technologies for deployment.\n", "\n", "### Three Grades of Data Readiness \[edit\]\n", "\n", "Data-readiness describes, at its coarsest level, three separate stages\n", "of data graduation.\n", "\n", "- Grade C - accessibility\n", "    - Transition: data becomes electronically available\n", "- Grade B - validity\n", "    - Transition: pose a question to the data\n", "- Grade A - usability\n", "\n", "The important definitions are at the transitions. The move from Grade C\n", "data to Grade B data is delimited by the *electronic availability* of\n", "the data. The move from Grade B to Grade A data is delimited by posing a\n", "question or task to the data [@Lawrence:drl17].\n", "\n", "## Accessibility: Grade C\n", "\n", "The first grade refers to the accessibility of data. 
Most data science\n", "practitioners will be used to working with data-providers who, perhaps\n", "having had little experience of data science before, state that they\n", "\"have the data\". More often than not, they have not verified this. A\n", "convenient term for this is \"Hearsay Data\": someone has *heard* that\n", "they have the data, so they *say* they have it. This is the lowest grade\n", "of data readiness.\n", "\n", "Progressing through Grade C involves ensuring that this data is\n", "accessible, not just in terms of digital accessibility, but also for\n", "regulatory, ethical and intellectual property reasons.\n", "\n", "## Validity: Grade B\n", "\n", "Data transits from Grade C to Grade B once we can begin digital analysis\n", "on the computer. Once the challenges of access to the data have been\n", "resolved, we can make the data available either via API, or for direct\n", "loading into analysis software (such as Python, R, Matlab, Mathematica\n", "or SPSS). Once this has occurred, the data is at B4 level. Grade B\n", "involves the *validity* of the data. Does the data really represent what\n", "it purports to? There are challenges such as missing values, outliers\n", "and record duplication. Each of these needs to be investigated.\n", "\n", "Grades B and C are important because, if the work done in these grades\n", "is well documented, it can be reused in other projects. Reuse of this\n", "labour is key to reducing the costs of data-driven automated decision\n", "making. There is a strong overlap between the work required in this\n", "grade and the statistical field of [*exploratory data\n", "analysis*](https://en.wikipedia.org/wiki/Exploratory_data_analysis)\n", "[@Tukey:exploratory77].\n", "\n", "The need for Grade B emerges due to the fundamental change in the\n", "availability of data. Classically, the scientific question came first,\n", "and the data came later. This is still the approach in a randomized\n", "controlled trial, e.g. 
in A/B testing or clinical trials for drugs. Today,\n", "data is being laid down by happenstance, and the question we wish to ask\n", "about the data often comes after the data has been created. The Grade B\n", "of data readiness ensures thought can be put into data quality *before*\n", "the question is defined. It is this work that is reusable across\n", "multiple teams. It is these processes that the team which is *standing\n", "up* the data must deliver.\n", "\n", "## Usability: Grade A\n", "\n", "Once the validity of the data is determined, the data set can be\n", "considered for use in a particular task. This stage of data readiness is\n", "more akin to what machine learning scientists are used to doing in\n", "universities: bringing an algorithm to bear on a well-understood data\n", "set.\n", "\n", "In Grade A we are concerned with the utility of the data given a\n", "particular task. Grade A may involve additional data collection\n", "(experimental design in statistics) to ensure that the task is\n", "fulfilled.\n", "\n", "This is the stage where the data and the model are brought together, so\n", "expertise in learning algorithms and their application is key. Further\n", "ethical considerations, such as the fairness of the resulting\n", "predictions, are required at this stage. 
At the end of this stage a\n", "prototype model is ready for deployment.\n", "\n", "Deployment and maintenance of machine learning models in production is\n", "another important issue, for which Data Readiness Levels are only part\n", "of the solution.\n", "\n", "## Recursive Effects\n", "\n", "To find out more, or to contribute ideas, go to\n", "\n", "\n", "You can also check my blog post on [\"Data Readiness\n", "Levels\"](http://inverseprobability.com/2017/01/12/data-readiness-levels).\n", "\n", "Throughout the data preparation pipeline, it is important to have close\n", "interaction between data scientists and application domain experts.\n", "Decisions on data preparation taken outside the context of the\n", "application have dangerous downstream consequences. This places an\n", "additional burden on the data scientist, as they are required for each\n", "project, but it should also be seen as a learning and familiarization\n", "exercise for the domain expert. Long term, just as biologists have found\n", "it necessary to assimilate the skills of the bioinformatician to be\n", "effective in their science, most domains will also require a familiarity\n", "with the nature of data-driven decision making and its application.\n", "Working closely with data scientists on data preparation is one way to\n", "begin this sharing of best practice.\n", "\n", "The processes involved in Grades C and B are often badly taught in\n", "courses on data science, perhaps not due to a lack of interest in the\n", "areas, but due to a lack of access to real-world examples where data\n", "quality is poor.\n", "\n", "These stages of data science are also ridden with ambiguity. 
In the long\n", "term they could do with more formalization and automation, but best\n", "practice needs to be understood by a wider community before that can\n", "happen.\n", "\n", "## Data Oriented Architectures \[edit\]\n", "\n", "In a streaming architecture we shift from management of services to\n", "management of data streams. Instead of worrying about the availability\n", "of the services, we worry about the quality of the data those services\n", "are producing.\n", "\n", "## Streaming System\n", "\n", "Characteristics of a streaming system include a move from *pull* updates\n", "to *push* updates, i.e. the computation is driven by a change in the\n", "input data rather than by the service calling for input data when it\n", "decides to run a computation. Streaming systems operate on 'rows' of the\n", "data rather than 'columns'. This is because the full column isn't\n", "normally available, as it changes over time. As an important design\n", "principle, the services themselves are stateless: they take their state\n", "from the streaming ecosystem. This ensures the inputs and outputs of\n", "given computations are easy to declare. As a result, persistence of the\n", "data is also handled by the streaming ecosystem, and decisions around\n", "data retention or recomputation can be taken at the systems level rather\n", "than the component level.\n", "\n", "## Apache Flink \[edit\]\n", "\n", "[Apache Flink](https://en.wikipedia.org/wiki/Apache_Flink) is a stream\n", "processing framework. Flink is a foundation for event-driven processing,\n", "giving a high-throughput, low-latency framework that operates on\n", "dataflows.\n", "\n", "Data storage is handled by other systems such as Apache Kafka or AWS\n", "Kinesis." 
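, "\n", "The push-driven, stateless pattern described above can be sketched in a few lines of plain Python (an illustrative toy, not Flink's actual API; the stream names and functions here are invented for the example):\n", "\n", "```python\n", "# State lives in the streaming ecosystem, not in the service.\n", "latest = {}    # latest value seen per (stream, key)\n", "outputs = []   # persisted output stream\n", "\n", "def join_service(key, value, other):\n", "    # Stateless business logic: emit a joined record when both sides exist.\n", "    return (key, value, other) if other is not None else None\n", "\n", "def push(stream_name, key, value):\n", "    # A new event *pushes* the computation; the service never pulls.\n", "    latest[(stream_name, key)] = value\n", "    other_name = 'prices' if stream_name == 'volumes' else 'volumes'\n", "    result = join_service(key, value, latest.get((other_name, key)))\n", "    if result is not None:\n", "        outputs.append(result)\n", "\n", "push('prices', 'ACME', 400.0)   # no output yet: the other side is missing\n", "push('volumes', 'ACME', 1200)   # join fires for key 'ACME'\n", "```\n", "\n", "Because `join_service` holds no state, decisions about retention and recomputation of `latest` and `outputs` can be taken by the ecosystem rather than by the component.\n"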
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "stream.join(otherStream)\n", " .where()\n", " .equalTo()\n", " .window()\n", " .apply()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Apache Flink allows operations on streams. For example, the join\n", "operation above. In a traditional data base management system, this join\n", "operation may be written in SQL and called on demand. In a streaming\n", "ecosystem, computations occur as and when the streams update.\n", "\n", "The join is handled by the ecosystem surrounding the business logic.\n", "\n", "## Trading System\n", "\n", "As a simple example we'll consider a high frequency trading system. Anne\n", "wishes to build a share trading system. She has access to a high\n", "frequency trading system which provides prices and allows trades at\n", "millisecond intervals. She wishes to build an automated trading system.\n", "\n", "Let's assume that price trading data is available as a data stream. But\n", "the price now is not the only information that Anne needs, she needs an\n", "estimate of the price in the future." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import os" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Generate an artificial trading stream\n", "days=pd.date_range(start='21/5/2017', end='21/05/2020')\n", "z = np.random.randn(len(days), 1)\n", "x = z.cumsum()+400" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "prices = pd.Series(x, index=days)\n", "hypothetical = prices.loc['21/5/2019':]\n", "real = prices.copy()\n", "real['21/5/2019':] = np.NaN" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: Anne has access to the share prices in the black stream but\n", "not in the blue stream. 
A hypothetical stream is the stream of future\n", "prices. Anne can define this hypothetical under constraints (latency,\n", "input, etc.). The need for a model is now exposed in the software\n", "infrastructure.\n", "\n", "## Hypothetical Streams\n", "\n", "We'll call the future price a hypothetical stream.\n", "\n", "A hypothetical stream is a desired stream of information which cannot\n", "be directly accessed. The lack of direct access may be because the\n", "events happen in the future, or there may be some latency between the\n", "event and the availability of the data.\n", "\n", "Any hypothetical stream will only be provided as a prediction, ideally\n", "with an error bar.\n", "\n", "The nature of the hypothetical Anne needs is dependent on her\n", "decision-making process. In Anne's case it will depend on the period\n", "over which she is expecting her returns. In MDOP Anne specifies a\n", "hypothetical that is derived from the pricing stream.\n", "\n", "It is not the price stream directly: Anne looks for predictions of\n", "*future* prices from the price stream, perhaps the price in $T$ days'\n", "time.\n", "\n", "At this stage, this stream is merely typed as a hypothetical.\n", "\n", "There are constraints on the hypothetical; they include the *input*\n", "information, the upper limit of latency between input and prediction,\n", "and the decision Anne needs to make (how far ahead, and what her upside\n", "and downside risks are). These three constraints mean that we can only\n", "recover an approximation to the hypothetical.\n", "\n", "## Hypothetical Advantage\n", "\n", "What is the advantage of defining things in this way? By clearly\n", "defining the two streams as real and hypothetical variants of each\n", "other, we now enable automation of the deployment and any redeployment\n", "process. 
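\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a minimal sketch of serving such a hypothetical, the future price\n", "can only be provided as a prediction with an error bar. The\n", "random-walk-with-drift model and the `hypothetical_price` helper below\n", "are assumptions for illustration, not Anne's actual model." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "\n", "def hypothetical_price(real, T=7, n=30):\n", "    # Estimate the price T days ahead from the last n observed values,\n", "    # returning a mean and an error bar (standard deviation).\n", "    recent = real.dropna().iloc[-n:]\n", "    drift = recent.diff().mean()\n", "    step_sd = recent.diff().std()\n", "    mean = recent.iloc[-1] + T * drift\n", "    std = step_sd * np.sqrt(T)  # uncertainty grows with the horizon\n", "    return mean, std\n", "\n", "rng = np.random.default_rng(0)\n", "days = pd.date_range(start='2017-05-21', periods=500)\n", "real = pd.Series(rng.standard_normal(500).cumsum() + 400, index=days)\n", "mean, std = hypothetical_price(real, T=7)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "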
The hypothetical can be *instantiated* against the real, and\n", "design criteria can be constantly evaluated, triggering retraining when\n", "necessary.\n", "\n", "## Ride Sharing System\n", "\n", "\n", "\n", "Figure: Some software components in a ride allocation system. Circled\n", "components are hypothetical, rectangles represent actual data.\n", "\n", "As a second example, we'll consider a ride sharing app.\n", "\n", "Anne is on her way home now; she wishes to hail a car using a ride\n", "sharing app.\n", "\n", "The app is designed in the following way. On opening her app Anne is\n", "notified about drivers in the nearby neighborhood. She is given an\n", "estimate of the time a ride would take to arrive.\n", "\n", "Given this information about driver availability, Anne may feel\n", "encouraged to enter a destination. Given this destination, a price\n", "estimate can be given. This price is conditioned on other riders that\n", "may wish to go in the same direction, but the price estimate needs to\n", "be made before the user agrees to the ride.\n", "\n", "Business customer service constraints dictate that this price may not\n", "change after Anne's order is confirmed.\n", "\n", "In this simple system, several decisions are being made, each of them\n", "on the basis of a hypothetical.\n", "\n", "When Anne calls for a ride, she is provided with an estimate of the\n", "expected time before a ride can be with her. But this estimate is made\n", "without knowing where Anne wants to go. Constraints on drivers, imposed\n", "by regional boundaries, the end of their shift, or their current\n", "passengers, mean that this estimate can only be a best guess.\n", "\n", "This best guess may well be driven by previous data.\n", "\n", "## Ride Sharing: Service Oriented to Data Oriented \\[edit\\]\n", "\n", "\n", "\n", "Figure: Service oriented architecture. The data access is buried in\n", "the cost allocation service. 
Data dependencies of the service cannot be\n", "found without trawling through the underlying code base.\n", "\n", "The modern approach to software systems design is known as\n", "*service-oriented architecture* (SOA). The idea is that software\n", "engineers are responsible for the availability and reliability of the\n", "API that accesses the service they own. Quality of service is\n", "maintained by rigorous standards around *testing* of software systems.\n", "\n", "\n", "\n", "Figure: Data oriented architecture. Now the joins and the updates are\n", "exposed within the streaming ecosystem. We can programmatically\n", "determine the factor graph which gives the thread through the model.\n", "\n", "In data-driven decision-making systems, the quality of decision-making\n", "is determined by the quality of the data. We need to extend the notion\n", "of *service*-oriented architecture to *data*-oriented architecture\n", "(DOA).\n", "\n", "The focus in SOA is eliminating *hard* failures. Hard failures can\n", "occur due to bugs or systems overload. This notion needs to be extended\n", "in ML systems to capture *soft failures* associated with declining data\n", "quality, incorrect modeling assumptions and inappropriate\n", "re-deployments of models. We need to focus on data quality assessments.\n", "In data-oriented architectures engineering teams are responsible for\n", "the *quality* of their output data streams in addition to the\n", "*availability* of the service they support [@Lawrence:drl17]. Quality\n", "here is not just accuracy, but fairness and explainability. This\n", "important cultural change would be capable of addressing both the\n", "challenge of *technical debt* [@Sculley:debt15] and the social\n", "responsibility of ML systems.\n", "\n", "Software development proceeds with a *test-oriented* culture, one where\n", "tests are written before software, and software is not incorporated in\n", "the wider system until all tests pass. 
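\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For data, the analogous check is statistical. As a minimal sketch (the\n", "distributions and the threshold are illustrative assumptions), a\n", "two-sample Kolmogorov-Smirnov test can flag when a live input stream\n", "has drifted away from the data a model was trained on." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "from scipy import stats\n", "\n", "rng = np.random.default_rng(0)\n", "training = rng.normal(0.0, 1.0, size=1000)  # data the model was fit on\n", "live = rng.normal(0.5, 1.0, size=1000)      # drifted live stream\n", "\n", "# A small p-value says the live data no longer looks like the training\n", "# data: a soft failure worth flagging for retraining.\n", "statistic, p_value = stats.ks_2samp(training, live)\n", "drift_detected = p_value < 0.01  # illustrative threshold" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "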
We must apply the same standards\n", "of care to our ML systems, although for ML we need statistical tests\n", "for quality, fairness and consistency within the environment.\n", "Fortunately, the main burden of this testing need not fall to the\n", "engineers themselves: through leveraging *classical statistics* and\n", "*emulation* we will automate the creation and redeployment of these\n", "tests across the software ecosystem; we call this *ML hypervision* (WP5\n", "\\textsection \\ref{sec:hypervision}).\n", "\n", "Modern AI can be based on ML models with many millions of parameters,\n", "trained on very large data sets. In ML, strong emphasis is placed on\n", "*predictive accuracy*, whereas sister fields such as statistics have a\n", "strong emphasis on *interpretability*. ML models are said to be 'black\n", "boxes' which make decisions that are not explainable.[^1]\n", "\n", "\n", "\n", "Figure: Data-oriented programming. There is a requirement for an\n", "estimate of the driver allocation to give a rough cost estimate before\n", "the user has confirmed the ride. In data-oriented programming, this is\n", "achieved through declaring a hypothetical stream which approximates the\n", "true driver allocation, but with restricted input information and\n", "constraints on the computational latency.\n", "\n", "For the ride sharing system, we start to see a common issue with a more\n", "complex algorithmic decision-making system. Several decisions are being\n", "made multiple times. Let's look at the decisions we need along with\n", "some design criteria.\n", "\n", "1. Driver Availability: Estimate time to arrival for Anne's ride using\n", "    Anne's location and the locations of locally available cars.\n", "    Latency: 50 milliseconds.\n", "2. Cost Estimate: Estimate the cost of the journey using Anne's\n", "    destination and location, and the current destinations and\n", "    availability of locally available cars. Latency: 50 milliseconds.\n", "3. 
Driver Allocation: Allocate car to minimize transport cost to\n", "    destination. Latency: 2 seconds.\n", "\n", "So we need:\n", "\n", "1. a hypothetical to estimate availability. It is constrained by\n", "    lacking destination information and a low latency requirement.\n", "2. a hypothetical to estimate cost. It is constrained by a low latency\n", "    requirement and by the need to quote the price before the ride is\n", "    confirmed.\n", "\n", "Simultaneously, drivers in this data ecosystem have an app which\n", "notifies them about new jobs and recommends where they should go.\n", "\n", "There are further advantages: strategies for data retention (when to\n", "snapshot) can be set globally.\n", "\n", "A few decisions need to be made in this system. First of all, when the\n", "user opens the app, the estimate of the time to the nearest ride may\n", "need to be computed quickly, to avoid latency in the service.\n", "\n", "This may require a quick estimate of the ride availability.\n", "\n", "## Information Dynamics \\[edit\\]\n", "\n", "With all the second-guessing within a complex automated decision-making\n", "system, there are potential problems with information dynamics: the\n", "'closed loop' problem, where the sub-systems are being approximated\n", "(second-guessed) and downstream predictions are affected.\n", "\n", "This leads to the need for a closed loop analysis, for example, see the\n", "[\"Closed Loop Data\n", "Science\"](https://www.gla.ac.uk/schools/computing/research/researchsections/ida-section/closedloop/)\n", "project led by Rod Murray-Smith at Glasgow.\n", "\n", "Our aim is to release our first version of a data-oriented programming\n", "environment by the end of June 2019 (pending internal approval).\n", "\n", "## Conclusions\n", "\n", "- Data is modern software\n", "- We need to revisit software engineering and computer science in this\n", "    context.\n", "\n", "# References {#references .unnumbered}\n", "\n", "[^1]: See for example [\"The Dark Secret at the Heart of AI\" in\n", "    Technology\n", "    
Review](https://www.technologyreview.com/s/604087/the-dark-secret-at-the-heart-of-ai/)." ] } ], "metadata": {}, "nbformat": 4, "nbformat_minor": 2 }