{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# What is Machine Learning?\n", "### [Neil D. Lawrence](http://inverseprobability.com), Amazon Cambridge and University of Sheffield\n", "### 2019-06-03\n", "\n", "**Abstract**: In this talk we will introduce the fundamental ideas in machine\n", "learning. We'll develop our exposition around the ideas of prediction\n", "function and the objective function. We don't so much focus on the\n", "derivation of particular algorithms, but more the general principles\n", "involved to give an idea of the machine learning *landscape*.\n", "\n", "$$\n", "\\newcommand{\\Amatrix}{\\mathbf{A}}\n", "\\newcommand{\\KL}[2]{\\text{KL}\\left( #1\\,\\|\\,#2 \\right)}\n", "\\newcommand{\\Kaast}{\\kernelMatrix_{\\mathbf{ \\ast}\\mathbf{ \\ast}}}\n", "\\newcommand{\\Kastu}{\\kernelMatrix_{\\mathbf{ \\ast} \\inducingVector}}\n", "\\newcommand{\\Kff}{\\kernelMatrix_{\\mappingFunctionVector \\mappingFunctionVector}}\n", "\\newcommand{\\Kfu}{\\kernelMatrix_{\\mappingFunctionVector \\inducingVector}}\n", "\\newcommand{\\Kuast}{\\kernelMatrix_{\\inducingVector \\bf\\ast}}\n", "\\newcommand{\\Kuf}{\\kernelMatrix_{\\inducingVector \\mappingFunctionVector}}\n", "\\newcommand{\\Kuu}{\\kernelMatrix_{\\inducingVector \\inducingVector}}\n", "\\newcommand{\\Kuui}{\\Kuu^{-1}}\n", "\\newcommand{\\Qaast}{\\mathbf{Q}_{\\bf \\ast \\ast}}\n", "\\newcommand{\\Qastf}{\\mathbf{Q}_{\\ast \\mappingFunction}}\n", "\\newcommand{\\Qfast}{\\mathbf{Q}_{\\mappingFunctionVector \\bf \\ast}}\n", "\\newcommand{\\Qff}{\\mathbf{Q}_{\\mappingFunctionVector \\mappingFunctionVector}}\n", "\\newcommand{\\aMatrix}{\\mathbf{A}}\n", "\\newcommand{\\aScalar}{a}\n", "\\newcommand{\\aVector}{\\mathbf{a}}\n", "\\newcommand{\\acceleration}{a}\n", "\\newcommand{\\bMatrix}{\\mathbf{B}}\n", "\\newcommand{\\bScalar}{b}\n", "\\newcommand{\\bVector}{\\mathbf{b}}\n", "\\newcommand{\\basisFunc}{\\phi}\n", "\\newcommand{\\basisFuncVector}{\\boldsymbol{ \\basisFunc}}\n", "\\newcommand{\\basisFunction}{\\phi}\n", "\\newcommand{\\basisLocation}{\\mu}\n", "\\newcommand{\\basisMatrix}{\\boldsymbol{ \\Phi}}\n", "\\newcommand{\\basisScalar}{\\basisFunction}\n", "\\newcommand{\\basisVector}{\\boldsymbol{ \\basisFunction}}\n", "\\newcommand{\\activationFunction}{\\phi}\n", "\\newcommand{\\activationMatrix}{\\boldsymbol{ \\Phi}}\n", "\\newcommand{\\activationScalar}{\\basisFunction}\n", "\\newcommand{\\activationVector}{\\boldsymbol{ \\basisFunction}}\n", "\\newcommand{\\bigO}{\\mathcal{O}}\n", "\\newcommand{\\binomProb}{\\pi}\n", "\\newcommand{\\cMatrix}{\\mathbf{C}}\n", "\\newcommand{\\cbasisMatrix}{\\hat{\\boldsymbol{ \\Phi}}}\n", "\\newcommand{\\cdataMatrix}{\\hat{\\dataMatrix}}\n", "\\newcommand{\\cdataScalar}{\\hat{\\dataScalar}}\n", "\\newcommand{\\cdataVector}{\\hat{\\dataVector}}\n", "\\newcommand{\\centeredKernelMatrix}{\\mathbf{ \\MakeUppercase{\\centeredKernelScalar}}}\n", "\\newcommand{\\centeredKernelScalar}{b}\n", "\\newcommand{\\centeredKernelVector}{\\centeredKernelScalar}\n", "\\newcommand{\\centeringMatrix}{\\mathbf{H}}\n", "\\newcommand{\\chiSquaredDist}[2]{\\chi_{#1}^{2}\\left(#2\\right)}\n", "\\newcommand{\\chiSquaredSamp}[1]{\\chi_{#1}^{2}}\n", "\\newcommand{\\conditionalCovariance}{\\boldsymbol{ \\Sigma}}\n", "\\newcommand{\\coregionalizationMatrix}{\\mathbf{B}}\n", "\\newcommand{\\coregionalizationScalar}{b}\n", "\\newcommand{\\coregionalizationVector}{\\mathbf{ \\coregionalizationScalar}}\n", "\\newcommand{\\covDist}[2]{\\text{cov}_{#2}\\left(#1\\right)}\n", 
"\\newcommand{\\covSamp}[1]{\\text{cov}\\left(#1\\right)}\n", "\\newcommand{\\covarianceScalar}{c}\n", "\\newcommand{\\covarianceVector}{\\mathbf{ \\covarianceScalar}}\n", "\\newcommand{\\covarianceMatrix}{\\mathbf{C}}\n", "\\newcommand{\\covarianceMatrixTwo}{\\boldsymbol{ \\Sigma}}\n", "\\newcommand{\\croupierScalar}{s}\n", "\\newcommand{\\croupierVector}{\\mathbf{ \\croupierScalar}}\n", "\\newcommand{\\croupierMatrix}{\\mathbf{ \\MakeUppercase{\\croupierScalar}}}\n", "\\newcommand{\\dataDim}{p}\n", "\\newcommand{\\dataIndex}{i}\n", "\\newcommand{\\dataIndexTwo}{j}\n", "\\newcommand{\\dataMatrix}{\\mathbf{Y}}\n", "\\newcommand{\\dataScalar}{y}\n", "\\newcommand{\\dataSet}{\\mathcal{D}}\n", "\\newcommand{\\dataStd}{\\sigma}\n", "\\newcommand{\\dataVector}{\\mathbf{ \\dataScalar}}\n", "\\newcommand{\\decayRate}{d}\n", "\\newcommand{\\degreeMatrix}{\\mathbf{ \\MakeUppercase{\\degreeScalar}}}\n", "\\newcommand{\\degreeScalar}{d}\n", "\\newcommand{\\degreeVector}{\\mathbf{ \\degreeScalar}}\n", "% Already defined by latex\n", "%\\newcommand{\\det}[1]{\\left|#1\\right|}\n", "\\newcommand{\\diag}[1]{\\text{diag}\\left(#1\\right)}\n", "\\newcommand{\\diagonalMatrix}{\\mathbf{D}}\n", "\\newcommand{\\diff}[2]{\\frac{\\text{d}#1}{\\text{d}#2}}\n", "\\newcommand{\\diffTwo}[2]{\\frac{\\text{d}^2#1}{\\text{d}#2^2}}\n", "\\newcommand{\\displacement}{x}\n", "\\newcommand{\\displacementVector}{\\textbf{\\displacement}}\n", "\\newcommand{\\distanceMatrix}{\\mathbf{ \\MakeUppercase{\\distanceScalar}}}\n", "\\newcommand{\\distanceScalar}{d}\n", "\\newcommand{\\distanceVector}{\\mathbf{ \\distanceScalar}}\n", "\\newcommand{\\eigenvaltwo}{\\ell}\n", "\\newcommand{\\eigenvaltwoMatrix}{\\mathbf{L}}\n", "\\newcommand{\\eigenvaltwoVector}{\\mathbf{l}}\n", "\\newcommand{\\eigenvalue}{\\lambda}\n", "\\newcommand{\\eigenvalueMatrix}{\\boldsymbol{ \\Lambda}}\n", "\\newcommand{\\eigenvalueVector}{\\boldsymbol{ \\lambda}}\n", "\\newcommand{\\eigenvector}{\\mathbf{ \\eigenvectorScalar}}\n", "\\newcommand{\\eigenvectorMatrix}{\\mathbf{U}}\n", "\\newcommand{\\eigenvectorScalar}{u}\n", "\\newcommand{\\eigenvectwo}{\\mathbf{v}}\n", "\\newcommand{\\eigenvectwoMatrix}{\\mathbf{V}}\n", "\\newcommand{\\eigenvectwoScalar}{v}\n", "\\newcommand{\\entropy}[1]{\\mathcal{H}\\left(#1\\right)}\n", "\\newcommand{\\errorFunction}{E}\n", "\\newcommand{\\expDist}[2]{\\left<#1\\right>_{#2}}\n", "\\newcommand{\\expSamp}[1]{\\left<#1\\right>}\n", "\\newcommand{\\expectation}[1]{\\left\\langle #1 \\right\\rangle }\n", "\\newcommand{\\expectationDist}[2]{\\left\\langle #1 \\right\\rangle _{#2}}\n", "\\newcommand{\\expectedDistanceMatrix}{\\mathcal{D}}\n", "\\newcommand{\\eye}{\\mathbf{I}}\n", "\\newcommand{\\fantasyDim}{r}\n", "\\newcommand{\\fantasyMatrix}{\\mathbf{ \\MakeUppercase{\\fantasyScalar}}}\n", "\\newcommand{\\fantasyScalar}{z}\n", "\\newcommand{\\fantasyVector}{\\mathbf{ \\fantasyScalar}}\n", "\\newcommand{\\featureStd}{\\varsigma}\n", "\\newcommand{\\gammaCdf}[3]{\\mathcal{GAMMA CDF}\\left(#1|#2,#3\\right)}\n", "\\newcommand{\\gammaDist}[3]{\\mathcal{G}\\left(#1|#2,#3\\right)}\n", "\\newcommand{\\gammaSamp}[2]{\\mathcal{G}\\left(#1,#2\\right)}\n", "\\newcommand{\\gaussianDist}[3]{\\mathcal{N}\\left(#1|#2,#3\\right)}\n", "\\newcommand{\\gaussianSamp}[2]{\\mathcal{N}\\left(#1,#2\\right)}\n", "\\newcommand{\\given}{|}\n", "\\newcommand{\\half}{\\frac{1}{2}}\n", "\\newcommand{\\heaviside}{H}\n", "\\newcommand{\\hiddenMatrix}{\\mathbf{ \\MakeUppercase{\\hiddenScalar}}}\n", "\\newcommand{\\hiddenScalar}{h}\n", 
"\\newcommand{\\hiddenVector}{\\mathbf{ \\hiddenScalar}}\n", "\\newcommand{\\identityMatrix}{\\eye}\n", "\\newcommand{\\inducingInputScalar}{z}\n", "\\newcommand{\\inducingInputVector}{\\mathbf{ \\inducingInputScalar}}\n", "\\newcommand{\\inducingInputMatrix}{\\mathbf{Z}}\n", "\\newcommand{\\inducingScalar}{u}\n", "\\newcommand{\\inducingVector}{\\mathbf{ \\inducingScalar}}\n", "\\newcommand{\\inducingMatrix}{\\mathbf{U}}\n", "\\newcommand{\\inlineDiff}[2]{\\text{d}#1/\\text{d}#2}\n", "\\newcommand{\\inputDim}{q}\n", "\\newcommand{\\inputMatrix}{\\mathbf{X}}\n", "\\newcommand{\\inputScalar}{x}\n", "\\newcommand{\\inputSpace}{\\mathcal{X}}\n", "\\newcommand{\\inputVals}{\\inputVector}\n", "\\newcommand{\\inputVector}{\\mathbf{ \\inputScalar}}\n", "\\newcommand{\\iterNum}{k}\n", "\\newcommand{\\kernel}{\\kernelScalar}\n", "\\newcommand{\\kernelMatrix}{\\mathbf{K}}\n", "\\newcommand{\\kernelScalar}{k}\n", "\\newcommand{\\kernelVector}{\\mathbf{ \\kernelScalar}}\n", "\\newcommand{\\kff}{\\kernelScalar_{\\mappingFunction \\mappingFunction}}\n", "\\newcommand{\\kfu}{\\kernelVector_{\\mappingFunction \\inducingScalar}}\n", "\\newcommand{\\kuf}{\\kernelVector_{\\inducingScalar \\mappingFunction}}\n", "\\newcommand{\\kuu}{\\kernelVector_{\\inducingScalar \\inducingScalar}}\n", "\\newcommand{\\lagrangeMultiplier}{\\lambda}\n", "\\newcommand{\\lagrangeMultiplierMatrix}{\\boldsymbol{ \\Lambda}}\n", "\\newcommand{\\lagrangian}{L}\n", "\\newcommand{\\laplacianFactor}{\\mathbf{ \\MakeUppercase{\\laplacianFactorScalar}}}\n", "\\newcommand{\\laplacianFactorScalar}{m}\n", "\\newcommand{\\laplacianFactorVector}{\\mathbf{ \\laplacianFactorScalar}}\n", "\\newcommand{\\laplacianMatrix}{\\mathbf{L}}\n", "\\newcommand{\\laplacianScalar}{\\ell}\n", "\\newcommand{\\laplacianVector}{\\mathbf{ \\ell}}\n", "\\newcommand{\\latentDim}{q}\n", "\\newcommand{\\latentDistanceMatrix}{\\boldsymbol{ \\Delta}}\n", "\\newcommand{\\latentDistanceScalar}{\\delta}\n", "\\newcommand{\\latentDistanceVector}{\\boldsymbol{ \\delta}}\n", "\\newcommand{\\latentForce}{f}\n", "\\newcommand{\\latentFunction}{u}\n", "\\newcommand{\\latentFunctionVector}{\\mathbf{ \\latentFunction}}\n", "\\newcommand{\\latentFunctionMatrix}{\\mathbf{ \\MakeUppercase{\\latentFunction}}}\n", "\\newcommand{\\latentIndex}{j}\n", "\\newcommand{\\latentScalar}{z}\n", "\\newcommand{\\latentVector}{\\mathbf{ \\latentScalar}}\n", "\\newcommand{\\latentMatrix}{\\mathbf{Z}}\n", "\\newcommand{\\learnRate}{\\eta}\n", "\\newcommand{\\lengthScale}{\\ell}\n", "\\newcommand{\\rbfWidth}{\\ell}\n", "\\newcommand{\\likelihoodBound}{\\mathcal{L}}\n", "\\newcommand{\\likelihoodFunction}{L}\n", "\\newcommand{\\locationScalar}{\\mu}\n", "\\newcommand{\\locationVector}{\\boldsymbol{ \\locationScalar}}\n", "\\newcommand{\\locationMatrix}{\\mathbf{M}}\n", "\\newcommand{\\variance}[1]{\\text{var}\\left( #1 \\right)}\n", "\\newcommand{\\mappingFunction}{f}\n", "\\newcommand{\\mappingFunctionMatrix}{\\mathbf{F}}\n", "\\newcommand{\\mappingFunctionTwo}{g}\n", "\\newcommand{\\mappingFunctionTwoMatrix}{\\mathbf{G}}\n", "\\newcommand{\\mappingFunctionTwoVector}{\\mathbf{ \\mappingFunctionTwo}}\n", "\\newcommand{\\mappingFunctionVector}{\\mathbf{ \\mappingFunction}}\n", "\\newcommand{\\scaleScalar}{s}\n", "\\newcommand{\\mappingScalar}{w}\n", "\\newcommand{\\mappingVector}{\\mathbf{ \\mappingScalar}}\n", "\\newcommand{\\mappingMatrix}{\\mathbf{W}}\n", "\\newcommand{\\mappingScalarTwo}{v}\n", "\\newcommand{\\mappingVectorTwo}{\\mathbf{ \\mappingScalarTwo}}\n", 
"\\newcommand{\\mappingMatrixTwo}{\\mathbf{V}}\n", "\\newcommand{\\maxIters}{K}\n", "\\newcommand{\\meanMatrix}{\\mathbf{M}}\n", "\\newcommand{\\meanScalar}{\\mu}\n", "\\newcommand{\\meanTwoMatrix}{\\mathbf{M}}\n", "\\newcommand{\\meanTwoScalar}{m}\n", "\\newcommand{\\meanTwoVector}{\\mathbf{ \\meanTwoScalar}}\n", "\\newcommand{\\meanVector}{\\boldsymbol{ \\meanScalar}}\n", "\\newcommand{\\mrnaConcentration}{m}\n", "\\newcommand{\\naturalFrequency}{\\omega}\n", "\\newcommand{\\neighborhood}[1]{\\mathcal{N}\\left( #1 \\right)}\n", "\\newcommand{\\neilurl}{http://inverseprobability.com/}\n", "\\newcommand{\\noiseMatrix}{\\boldsymbol{ E}}\n", "\\newcommand{\\noiseScalar}{\\epsilon}\n", "\\newcommand{\\noiseVector}{\\boldsymbol{ \\epsilon}}\n", "\\newcommand{\\norm}[1]{\\left\\Vert #1 \\right\\Vert}\n", "\\newcommand{\\normalizedLaplacianMatrix}{\\hat{\\mathbf{L}}}\n", "\\newcommand{\\normalizedLaplacianScalar}{\\hat{\\ell}}\n", "\\newcommand{\\normalizedLaplacianVector}{\\hat{\\mathbf{ \\ell}}}\n", "\\newcommand{\\numActive}{m}\n", "\\newcommand{\\numBasisFunc}{m}\n", "\\newcommand{\\numComponents}{m}\n", "\\newcommand{\\numComps}{K}\n", "\\newcommand{\\numData}{n}\n", "\\newcommand{\\numFeatures}{K}\n", "\\newcommand{\\numHidden}{h}\n", "\\newcommand{\\numInducing}{m}\n", "\\newcommand{\\numLayers}{\\ell}\n", "\\newcommand{\\numNeighbors}{K}\n", "\\newcommand{\\numSequences}{s}\n", "\\newcommand{\\numSuccess}{s}\n", "\\newcommand{\\numTasks}{m}\n", "\\newcommand{\\numTime}{T}\n", "\\newcommand{\\numTrials}{S}\n", "\\newcommand{\\outputIndex}{j}\n", "\\newcommand{\\paramVector}{\\boldsymbol{ \\theta}}\n", "\\newcommand{\\parameterMatrix}{\\boldsymbol{ \\Theta}}\n", "\\newcommand{\\parameterScalar}{\\theta}\n", "\\newcommand{\\parameterVector}{\\boldsymbol{ \\parameterScalar}}\n", "\\newcommand{\\partDiff}[2]{\\frac{\\partial#1}{\\partial#2}}\n", "\\newcommand{\\precisionScalar}{j}\n", "\\newcommand{\\precisionVector}{\\mathbf{ \\precisionScalar}}\n", "\\newcommand{\\precisionMatrix}{\\mathbf{J}}\n", "\\newcommand{\\pseudotargetScalar}{\\widetilde{y}}\n", "\\newcommand{\\pseudotargetVector}{\\mathbf{ \\pseudotargetScalar}}\n", "\\newcommand{\\pseudotargetMatrix}{\\mathbf{ \\widetilde{Y}}}\n", "\\newcommand{\\rank}[1]{\\text{rank}\\left(#1\\right)}\n", "\\newcommand{\\rayleighDist}[2]{\\mathcal{R}\\left(#1|#2\\right)}\n", "\\newcommand{\\rayleighSamp}[1]{\\mathcal{R}\\left(#1\\right)}\n", "\\newcommand{\\responsibility}{r}\n", "\\newcommand{\\rotationScalar}{r}\n", "\\newcommand{\\rotationVector}{\\mathbf{ \\rotationScalar}}\n", "\\newcommand{\\rotationMatrix}{\\mathbf{R}}\n", "\\newcommand{\\sampleCovScalar}{s}\n", "\\newcommand{\\sampleCovVector}{\\mathbf{ \\sampleCovScalar}}\n", "\\newcommand{\\sampleCovMatrix}{\\mathbf{s}}\n", "\\newcommand{\\scalarProduct}[2]{\\left\\langle{#1},{#2}\\right\\rangle}\n", "\\newcommand{\\sign}[1]{\\text{sign}\\left(#1\\right)}\n", "\\newcommand{\\sigmoid}[1]{\\sigma\\left(#1\\right)}\n", "\\newcommand{\\singularvalue}{\\ell}\n", "\\newcommand{\\singularvalueMatrix}{\\mathbf{L}}\n", "\\newcommand{\\singularvalueVector}{\\mathbf{l}}\n", "\\newcommand{\\sorth}{\\mathbf{u}}\n", "\\newcommand{\\spar}{\\lambda}\n", "\\newcommand{\\trace}[1]{\\text{tr}\\left(#1\\right)}\n", "\\newcommand{\\BasalRate}{B}\n", "\\newcommand{\\DampingCoefficient}{C}\n", "\\newcommand{\\DecayRate}{D}\n", "\\newcommand{\\Displacement}{X}\n", "\\newcommand{\\LatentForce}{F}\n", "\\newcommand{\\Mass}{M}\n", "\\newcommand{\\Sensitivity}{S}\n", "\\newcommand{\\basalRate}{b}\n", 
"\\newcommand{\\dampingCoefficient}{c}\n", "\\newcommand{\\mass}{m}\n", "\\newcommand{\\sensitivity}{s}\n", "\\newcommand{\\springScalar}{\\kappa}\n", "\\newcommand{\\springVector}{\\boldsymbol{ \\kappa}}\n", "\\newcommand{\\springMatrix}{\\boldsymbol{ \\mathcal{K}}}\n", "\\newcommand{\\tfConcentration}{p}\n", "\\newcommand{\\tfDecayRate}{\\delta}\n", "\\newcommand{\\tfMrnaConcentration}{f}\n", "\\newcommand{\\tfVector}{\\mathbf{ \\tfConcentration}}\n", "\\newcommand{\\velocity}{v}\n", "\\newcommand{\\sufficientStatsScalar}{g}\n", "\\newcommand{\\sufficientStatsVector}{\\mathbf{ \\sufficientStatsScalar}}\n", "\\newcommand{\\sufficientStatsMatrix}{\\mathbf{G}}\n", "\\newcommand{\\switchScalar}{s}\n", "\\newcommand{\\switchVector}{\\mathbf{ \\switchScalar}}\n", "\\newcommand{\\switchMatrix}{\\mathbf{S}}\n", "\\newcommand{\\tr}[1]{\\text{tr}\\left(#1\\right)}\n", "\\newcommand{\\loneNorm}[1]{\\left\\Vert #1 \\right\\Vert_1}\n", "\\newcommand{\\ltwoNorm}[1]{\\left\\Vert #1 \\right\\Vert_2}\n", "\\newcommand{\\onenorm}[1]{\\left\\vert#1\\right\\vert_1}\n", "\\newcommand{\\twonorm}[1]{\\left\\Vert #1 \\right\\Vert}\n", "\\newcommand{\\vScalar}{v}\n", "\\newcommand{\\vVector}{\\mathbf{v}}\n", "\\newcommand{\\vMatrix}{\\mathbf{V}}\n", "\\newcommand{\\varianceDist}[2]{\\text{var}_{#2}\\left( #1 \\right)}\n", "% Already defined by latex\n", "%\\newcommand{\\vec}{#1:}\n", "\\newcommand{\\vecb}[1]{\\left(#1\\right):}\n", "\\newcommand{\\weightScalar}{w}\n", "\\newcommand{\\weightVector}{\\mathbf{ \\weightScalar}}\n", "\\newcommand{\\weightMatrix}{\\mathbf{W}}\n", "\\newcommand{\\weightedAdjacencyMatrix}{\\mathbf{A}}\n", "\\newcommand{\\weightedAdjacencyScalar}{a}\n", "\\newcommand{\\weightedAdjacencyVector}{\\mathbf{ \\weightedAdjacencyScalar}}\n", "\\newcommand{\\onesVector}{\\mathbf{1}}\n", "\\newcommand{\\zerosVector}{\\mathbf{0}}\n", "$$\n", "\n", "\n", "\n", "\n", ".\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "# Introduction\n", "\n", "## Data Science Africa \\[edit\\]\n", "\n", "\n", "\n", "Figure: Data Science Africa is a\n", "ground up initiative for capacity building around data science, machine\n", "learning and artificial intelligence on the African continent.\n", "\n", "Data Science Africa is a bottom up initiative for capacity building in\n", "data science, machine learning and artificial intelligence on the\n", "African continent.\n", "\n", "As of 2019 there have been five workshops and five schools, located in\n", "Nyeri, Kenya (twice); Kampala, Uganda; Arusha, Tanzania; Abuja, Nigeria\n", "and Addis Ababa, Ethiopia. The next event is scheduled for October 2019\n", "in Accra, Ghana.\n", "\n", "The main notion is *end-to-end* data science. For example, going from\n", "data collection in the farmer's field to decision making in the Ministry\n", "of Agriculture. Or going from malaria disease counts in health centers,\n", "to medicine distribution.\n", "\n", "The philosophy is laid out in [@Lawrence:dsa15]. The key idea is that\n", "the modern *information infrastructure* presents new solutions to old\n", "problems. Modes of development change because less capital investment is\n", "required to take advantage of this infrastructure. The philosophy is\n", "that local capacity building is the right way to leverage these\n", "challenges in addressing data science problems in the African context.\n", "\n", "Data Science Africa is now a non-govermental organization registered in\n", "Kenya. 
"The organising board of the meeting is entirely made up of scientists\n", "and academics based on the African continent.\n", "\n", "\n", "\n", "Figure: The lack of existing physical infrastructure on the African\n", "continent makes it a particularly interesting environment for deploying\n", "solutions based on the *information infrastructure*. The idea is\n", "explored more in this\n", "Guardian Op-ed.\n", "\n", "## Example: Prediction of Malaria Incidence in Uganda \\[edit\\]\n", "\n", "[]{style=\"text-align:right\"}\n", "\n", "As an example of using Gaussian process models within the full pipeline\n", "from data to decision, we'll consider the prediction of malaria incidence\n", "in Uganda. For the purposes of this study malaria reports come in two\n", "forms, HMIS reports from health centres and Sentinel data, which is\n", "curated by the WHO. There are a limited number of sentinel sites and\n", "many HMIS sites.\n", "\n", "The work is from Ricardo Andrade Pacheco's PhD thesis, completed in\n", "collaboration with John Quinn and Martin Mubangizi\n", "[@Andrade:consistent14; @Mubangizi:malaria14]. John and Martin were\n", "initially from the AI-DEV group at Makerere University in Kampala and\n", "latterly they were based at UN Global Pulse in Kampala.\n", "\n", "Malaria data is spatial data. Uganda is split into districts, and health\n", "reports can be found for each district. This suggests that models such\n", "as conditional random fields could be used for spatial modelling, but\n", "there are two complexities with this. First of all, occasionally\n", "districts split into two. Secondly, sentinel sites are a specific\n", "location within a district, such as Nagongera, which is a sentinel site\n", "based in the Tororo district.\n", "\n", "\n", "\n", "Figure: Ugandan districts. Data SRTM/NASA.\n", "\n", "[[@Andrade:consistent14; @Mubangizi:malaria14]]{style=\"text-align:right\"}\n", "\n", "The common standard for collecting health data on the African continent\n", "is the Health Management Information System (HMIS). However, this\n", "data suffers from missing values [@Gething:hmis06] and diagnosis of\n", "diseases like typhoid and malaria may be confounded.\n", "\n", "\n", "\n", "Figure: The Tororo district, where the sentinel site, Nagongera, is\n", "located.\n", "\n", "[World Health Organization Sentinel Surveillance\n", "systems](https://www.who.int/immunization/monitoring_surveillance/burden/vpd/surveillance_type/sentinel/en/)\n", "are set up \"when high-quality data are needed about a particular disease\n", "that cannot be obtained through a passive system\".\n",
"Several sentinel sites give an accurate assessment of malaria disease\n", "levels in Uganda, including a site in Nagongera.\n", "\n", "\n", "\n", "Figure: Sentinel and HMIS data along with rainfall and temperature\n", "for the Nagongera sentinel station in the Tororo district.\n", "\n", "In collaboration with the AI Research Group at Makerere we chose to\n", "investigate whether Gaussian process models could be used to assimilate\n", "information from these two different sources of disease information.\n", "Further, we were interested in whether local information on rainfall and\n", "temperature could be used to improve malaria estimates.\n", "\n", "The aim of the project was to use WHO Sentinel sites, alongside rainfall\n", "and temperature, to improve predictions from HMIS data of levels of\n", "malaria.\n", "\n", "\n", "\n", "Figure: The Mubende District.\n", "\n", "\n", "\n", "Figure: Prediction of malaria incidence in Mubende.\n", "\n", "\n", "\n", "Figure: The project arose out of the Gaussian process summer school\n", "held at Makerere in Kampala in 2013. The school led, in turn, to the\n", "Data Science Africa initiative.\n", "\n", "## Early Warning Systems\n", "\n", "\n", "\n", "Figure: The Kabarole district in Uganda.\n", "\n", "\n", "\n", "Figure: Estimate of the current disease situation in the Kabarole\n", "district over time. The estimate is constructed with a Gaussian process\n", "with an additive covariance function.\n", "\n", "This is a health monitoring system for the Kabarole district. Here we\n", "have fitted the reports with a Gaussian process with an additive\n", "covariance function. It has two components: one is a long time scale\n", "component (in red above), the other is a short time scale component (in\n", "blue).\n", "\n", "Monitoring proceeds by considering two aspects of the curve. Is the blue\n", "line (the short term report signal) above the red (which represents the\n", "long term trend)? If so, we have higher than expected reports. If this is\n", "the case *and* the gradient is still positive (i.e. reports are going\n", "up) we encode this with a *red* color. If it is the case and the\n", "gradient of the blue line is negative (i.e. reports are going down) we\n", "encode this with an *amber* color. Conversely, if the blue line is below\n", "the red *and* decreasing, we color *green*. On the other hand if it is\n", "below red but increasing, we color *yellow*.\n", "\n", "This gives us an early warning system for disease. Red is a bad\n", "situation getting worse, amber is bad but improving. Green is good and\n", "getting better and yellow is good but degrading.\n", "\n", "Finally, there is a gray region which represents when the scale of the\n", "effect is small.\n",
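"\n", "As a sketch of how this color coding might be implemented, the function\n", "below encodes the rule just described. It is illustrative only: it assumes\n", "that the position of the short time scale signal relative to the trend and\n", "its gradient have already been extracted from the fitted Gaussian process,\n", "and it leaves the gray region for small effects to one side." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def warning_color(above_trend, gradient):\n", "    \"\"\"Encode the early warning state of the short time scale signal.\n", "\n", "    above_trend: is the short term signal above the long term trend?\n", "    gradient: gradient of the short term signal (positive means reports rising).\n", "    \"\"\"\n", "    if above_trend:\n", "        # Higher than expected reports: worsening is red, improving is amber.\n", "        return 'red' if gradient > 0 else 'amber'\n", "    else:\n", "        # Below the trend: improving is green, degrading is yellow.\n", "        return 'green' if gradient <= 0 else 'yellow'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [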
$$\n", "\\text{data} + \\text{model} \\xrightarrow{\\text{compute}} \\text{prediction}\n", "$$\n", "\n", "## Data Revolution\n", "\n", "Machine learning has risen in prominence due to the rise in data\n", "availability, and its interconnection with computers. The high bandwidth\n", "connection between data and computer leads to a new interaction between\n", "us and data via the computer. It is that channel that is being mediated\n", "by machine learning techniques.\n", "\n", "\n", "\n", "Figure: Large amounts of data and high interconnection bandwidth mean\n", "that we receive much of our information about the world around us\n", "through computers.\n", "\n", "## Machine Learning in Supply Chain \\[edit\\]\n", "\n", "\n", "\n", "Figure: Packhorse Bridge under Burbage Edge. This packhorse route\n", "climbs steeply out of Hathersage and heads towards Sheffield. Packhorses\n", "were the main route for transporting goods across the Peak District. The\n", "high cost of transport is one driver of the 'smith' model, where there\n", "is a local skilled person responsible for assembling or creating goods\n", "(e.g. a blacksmith). \n", "\n", "\n", "\n", "Figure: Richard Arkwright is regarded of the founder of the modern\n", "factory system. Factories exploit distribution networks to centralize\n", "production of goods. Arkwright located his factory in Cromford due to\n", "proximity to Nottingham Weavers (his market) and availability of water\n", "power from the tributaries of the Derwent river. When he first arrived\n", "there was almost no transportation network. Over the following 200 years\n", "The Cromford Canal (1790s), a Turnpike (now the A6, 1816-18) and the the\n", "High Peak Railway (now closed, 1820s) were all constructed to improve\n", "transportation access as the factory blossomed.\n", "\n", "## Containerization \\[edit\\]\n", "\n", "\n", "\n", "Figure: The container is one of the major drivers of globalization,\n", "and arguably the largest agent of social change in the last 100 years.\n", "It reduces the cost of transportation, significantly changing the\n", "appropriate topology of distribution networks. The container makes it\n", "possible to ship goods half way around the world for cheaper than it\n", "costs to process those goods, leading to an extended distribution\n", "topology.\n", "\n", "Containerization has had a dramatic effect on global economics, placing\n", "many people in the developing world at the end of the supply chain.\n", "\n", "\n", "\n", "\n", "\n", "\n", "
\n", "\n", "\n", "\n", "
\n", "Figure: Wild Alaskan Cod, being solid in the Pacific Northwest, that\n", "is a product of China. It is cheaper to ship the deep frozen fish\n", "thousands of kilometers for processing than to process locally.\n", "\n", "For example, you can buy Wild Alaskan Cod fished from Alaska, processed\n", "in China, sold in North America. This is driven by the low cost of\n", "transport for frozen cod vs the higher relative cost of cod processing\n", "in the US versus China. Similarly,\n", "Scottish\n", "prawns are also processed in China for sale in the UK.\n", "\n", "This effect on cost of transport vs cost of processing is the main\n", "driver of the topology of the modern supply chain and the associated\n", "effect of globalization. If transport is much cheaper than processing,\n", "then processing will tend to agglomerate in places where processing\n", "costs can be minimized.\n", "\n", "Supply chain is a large scale automated decision making network. Our aim\n", "is to make decisions not only based on our models of customer behavior\n", "(as observed through data), but also by accounting for the structure of\n", "our fulfilment center, and delivery network.\n", "\n", "Many of the most important questions in supply chain take the form of\n", "counterfactuals. E.g. \"What would happen if we opened a manufacturing\n", "facility in Cambridge?\" A counter factual is a question that implies a\n", "mechanistic understanding of a system. It goes beyond simple smoothness\n", "assumptions or translation invariants. It requires a physical, or\n", "*mechanistic* understanding of the supply chain network. For this reason\n", "the type of models we deploy in supply chain often involve simulations\n", "or more mechanistic understanding of the network.\n", "\n", "In supply chain Machine Learning alone is not enough, we need to bridge\n", "between models that contain real mechanisms and models that are entirely\n", "data driven.\n", "\n", "This is challenging, because as we introduce more mechanism to the\n", "models we use, it becomes harder to develop efficient algorithms to\n", "match those models to data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from IPython.lib.display import YouTubeVideo\n", "YouTubeVideo('ncwsr1Of6Cw')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Figure: The Supply Chain Optimization Team (SCOT) at Amazon is\n", "responsible for the automated decision making in (probably) the world's\n", "largest AI.\n", "\n", "## For Africa \\[edit\\]\n", "\n", "There is a large opportunity because infrastructures around automation\n", "are moving from physical infrastructure towards information\n", "infrastructures. How can African countries benefit from a modern\n", "information infrastructure? The aim of Data Science Africa is to answer\n", "this question, with the answers coming from the attendees.\n", "\n", "\n", "\n", "Figure: The Kapchorwa District, home district of Stephen\n", "Kiprotich.\n", "\n", "Stephen Kiprotich, the 2012 gold medal winner from the London Olympics,\n", "comes from Kapchorwa district, in eastern Uganda, near the border with\n", "Kenya.\n", "\n", "## Olympic Marathon Data \\[edit\\]\n", "\n", "\n", "\n", "\n", "\n", "\n", "
\n", "- Gold medal times for Olympic Marathon since 1896.\n", "- Marathons before 1924 didn't have a standardised distance.\n", "- Present results using pace per km.\n", "- In 1904 Marathon was badly organised leading to very slow times.\n", "\n", "\n", "\n", "Image from Wikimedia Commons \n", "
\n", "The first thing we will do is load a standard data set for regression\n", "modelling. The data consists of the pace of Olympic Gold Medal Marathon\n", "winners for the Olympics from 1896 to present. First we load in the data\n", "and plot." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pods" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = pods.datasets.olympic_marathon_men()\n", "x = data['X']\n", "y = data['Y']\n", "\n", "offset = y.mean()\n", "scale = np.sqrt(y.var())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import teaching_plots as plot\n", "import mlai" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "xlim = (1875,2030)\n", "ylim = (2.5, 6.5)\n", "yhat = (y-offset)/scale\n", "\n", "fig, ax = plt.subplots(figsize=plot.big_wide_figsize)\n", "_ = ax.plot(x, y, 'r.',markersize=10)\n", "ax.set_xlabel('year', fontsize=20)\n", "ax.set_ylabel('pace min/km', fontsize=20)\n", "ax.set_xlim(xlim)\n", "ax.set_ylim(ylim)\n", "\n", "mlai.write_figure(figure=fig, \n", " filename='../slides/diagrams/datasets/olympic-marathon.svg', \n", " transparent=True, \n", " frameon=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: Olympic marathon pace times since 1892.\n", "\n", "Things to notice about the data include the outlier in 1904, in this\n", "year, the olympics was in St Louis, USA. Organizational problems and\n", "challenges with dust kicked up by the cars following the race meant that\n", "participants got lost, and only very few participants completed.\n", "\n", "More recent years see more consistently quick marathons.\n", "\n", "## Polynomial Fits to Olympic Data \\[edit\\]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "from matplotlib import pyplot as plt\n", "import mlai\n", "import pods" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "basis = mlai.polynomial\n", "\n", "data = pods.datasets.olympic_marathon_men()\n", "\n", "x = data['X']\n", "y = data['Y']\n", "\n", "xlim = [1892, 2020]\n", "\n", "basis=mlai.Basis(mlai.polynomial, number=1, data_limits=xlim)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from ipywidgets import IntSlider" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pods.notebook.display_plots('olympic_LM_polynomial_number{num_basis:0>3}.svg',\n", " directory='../slides/diagrams/ml', \n", " num_basis=IntSlider(1,1,27,1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: Fit of a 1 degree polynomial to the olympic marathon\n", "data.\n", "\n", "\n", "\n", "Figure: Fit of a 2 degree polynomial to the olympic marathon\n", "data.\n", "\n", "## What does Machine Learning do? \\[edit\\]\n", "\n", "Any process of automation allows us to scale what we do by codifying a\n", "process in some way that makes it efficient and repeatable. Machine\n", "learning automates by emulating human (or other actions) found in data.\n", "Machine learning codifies in the form of a mathematical function that is\n", "learnt by a computer. 
"## What does Machine Learning do? \\[edit\\]\n", "\n", "Any process of automation allows us to scale what we do by codifying a\n", "process in some way that makes it efficient and repeatable. Machine\n", "learning automates by emulating human (or other) actions found in data.\n", "Machine learning codifies the process in the form of a mathematical\n", "function that is learnt by a computer. If we can create these\n", "mathematical functions in ways in which they can interconnect, then we\n", "can also build systems.\n", "\n", "Machine learning works through codifying a prediction of interest into a\n", "mathematical function. For example, we can try and predict the\n", "probability that a customer wants to buy a jersey given knowledge of\n", "their age and the latitude where they live. The technique known as\n", "logistic regression estimates the odds that someone will buy a jumper as\n", "a linear weighted sum of the features of interest.\n", "\n", "$$ \\text{odds} = \\frac{p(\\text{bought})}{p(\\text{not bought})} $$\n", "\n", "$$ \\log \\text{odds} = \\beta_0 + \\beta_1 \\text{age} + \\beta_2 \\text{latitude}.$$\n", "Here $\\beta_0$, $\\beta_1$ and $\\beta_2$ are the parameters of the model.\n", "If $\\beta_1$ and $\\beta_2$ are both positive, then the log-odds that\n", "someone will buy a jumper increase with increasing latitude and age, so\n", "the further north you are and the older you are the more likely you are\n", "to buy a jumper. The parameter $\\beta_0$ is an offset parameter, and\n", "gives the log-odds of buying a jumper at zero age and on the equator. It\n", "is likely to be negative[^1] indicating that the purchase is\n", "odds-against. This is actually a classical statistical model, and models\n", "like logistic regression are widely used to estimate probabilities from\n", "ad-click prediction to risk of disease.\n", "\n", "This is called a generalized linear model; we can also think of it as\n", "estimating the *probability* of a purchase as a nonlinear function of\n", "the features (age, latitude) and the parameters (the $\\beta$ values).\n", "The function is known as the *sigmoid* or [logistic\n", "function](https://en.wikipedia.org/wiki/Logistic_regression), thus the\n", "name *logistic* regression.\n", "\n", "$$ p(\\text{bought}) = \\sigmoid{\\beta_0 + \\beta_1 \\text{age} + \\beta_2 \\text{latitude}}.$$\n", "In the case where we have *features* to help us predict, we sometimes\n", "denote such features as a vector, $\\inputVector$, and we then use an\n", "inner product between the features and the parameters,\n", "$\\boldsymbol{\\beta}^\\top \\inputVector = \\beta_1 \\inputScalar_1 + \\beta_2 \\inputScalar_2 + \\beta_3 \\inputScalar_3 ...$,\n", "to represent the argument of the sigmoid.\n", "\n", "$$ p(\\text{bought}) = \\sigmoid{\\boldsymbol{\\beta}^\\top \\inputVector}.$$\n", "More generally, we aim to predict some aspect of our data,\n", "$\\dataScalar$, by relating it through a mathematical function,\n", "$\\mappingFunction(\\cdot)$, to the parameters, $\\boldsymbol{\\beta}$, and\n", "the data, $\\inputVector$.\n", "\n", "$$ \\dataScalar = \\mappingFunction\\left(\\inputVector, \\boldsymbol{\\beta}\\right).$$\n", "We call $\\mappingFunction(\\cdot)$ the *prediction function*.\n", "\n", "To obtain the fit to data, we use a separate function called the\n", "*objective function* that gives us a mathematical representation of the\n", "difference between our predictions and the real data.\n", "\n", "$$\\errorFunction(\\boldsymbol{\\beta}, \\dataMatrix, \\inputMatrix)$$ A\n", "commonly used example (in a regression problem, for instance) is least\n", "squares,\n", "$$\\errorFunction(\\boldsymbol{\\beta}, \\dataMatrix, \\inputMatrix) = \\sum_{i=1}^\\numData \\left(\\dataScalar_i - \\mappingFunction(\\inputVector_i, \\boldsymbol{\\beta})\\right)^2.$$\n", "\n", "If a linear prediction function is combined with the least squares\n", "objective function then that gives us classical *linear regression*,\n", "another classical statistical model.\n",
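"\n", "As a toy illustration of this pattern, a *prediction function* plus an\n", "*objective function*, the sketch below codes up the logistic prediction\n", "for the jumper example together with a least squares objective. The\n", "$\\beta$ values are invented for illustration; nothing here is fitted to\n", "data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "def predict_probability(age, latitude, beta):\n", "    \"\"\"Logistic prediction function for p(bought).\n", "    beta contains [beta_0, beta_1, beta_2].\"\"\"\n", "    log_odds = beta[0] + beta[1]*age + beta[2]*latitude\n", "    return 1./(1. + np.exp(-log_odds))\n", "\n", "def sum_of_squares_error(beta, X, y, f):\n", "    \"\"\"Least squares objective: the squared differences between the\n", "    predictions f(X, beta) and the observed data y (regression case).\"\"\"\n", "    return ((y - f(X, beta))**2).sum()\n", "\n", "# Invented parameters: a negative offset (purchase is odds-against) and\n", "# small positive effects for age and latitude.\n", "beta = np.array([-6., 0.02, 0.05])\n", "print(predict_probability(60., 52., beta))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [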
"Statistics often focusses on linear models because it makes\n", "interpretation of the model easier. Interpretation is key in statistics\n", "because the aim is normally to validate questions by analysis of data.\n", "Machine learning has typically focussed more on the prediction function\n", "itself and worried less about the interpretation of parameters, which\n", "are normally denoted by $\\mathbf{w}$ instead of $\\boldsymbol{\\beta}$. As\n", "a result *non-linear* functions are explored more often, as they tend to\n", "improve the quality of predictions but at the expense of\n", "interpretability.\n", "\n", "## What is Machine Learning? \\[edit\\]\n", "\n", "Machine learning allows us to extract knowledge from data to form a\n", "prediction.\n", "\n", "$$\\text{data} + \\text{model} \\xrightarrow{\\text{compute}} \\text{prediction}$$\n", "\n", "A machine learning prediction is made by combining a model with data to\n", "form the prediction. The manner in which this is done gives us the\n", "machine learning *algorithm*.\n", "\n", "Machine learning models are *mathematical models* which make weak\n", "assumptions about data, e.g. smoothness assumptions. By combining these\n", "assumptions with the data we observe we can interpolate between data\n", "points or, occasionally, extrapolate into the future.\n", "\n", "Machine learning is a technology which strongly overlaps with the\n", "methodology of statistics. From a historical/philosophical viewpoint,\n", "machine learning differs from statistics in that the focus in the\n", "machine learning community has been primarily on accuracy of prediction,\n", "whereas the focus in statistics is typically on the interpretability of\n", "a model and/or validating a hypothesis through data collection.\n", "\n", "The rapid increase in the availability of compute and data has led to\n", "the increased prominence of machine learning. This prominence is\n", "surfacing in two different, but overlapping domains: data science and\n", "artificial intelligence.\n", "\n", "The real challenge, however, is end-to-end decision making. Taking\n", "information from the environment and using it to drive decision making\n", "to achieve goals.\n", "\n", "## Artificial Intelligence and Data Science \\[edit\\]\n", "\n", "Artificial intelligence has the objective of endowing computers with\n", "human-like intelligent capabilities. For example, understanding an image\n", "(computer vision) or the contents of some speech (speech recognition),\n", "the meaning of a sentence (natural language processing) or the\n", "translation of a sentence (machine translation).\n", "\n", "### Supervised Learning for AI\n", "\n", "The machine learning approach to artificial intelligence is to collect\n", "and annotate a large data set from humans. The problem is characterized\n", "by input data (e.g. a particular image) and a label (e.g. is there a car\n", "in the image yes/no). The machine learning algorithm fits a mathematical\n", "function (I call this the *prediction function*) to map from the input\n", "image to the label. The parameters of the prediction function are set by\n", "minimizing an error between the function's predictions and the true\n", "data.\n",
"The mathematical function that encapsulates this error is known\n", "as the *objective function*.\n", "\n", "This approach to machine learning is known as *supervised learning*.\n", "Various approaches to supervised learning use different prediction\n", "functions, objective functions or different optimization algorithms to\n", "fit them.\n", "\n", "For example, *deep learning* makes use of *neural networks* to form the\n", "predictions. A neural network is a particular type of mathematical\n", "function that allows the algorithm designer to introduce invariances\n", "into the function.\n", "\n", "An invariance is an important way of including prior understanding in a\n", "machine learning model. For example, in an image, a car is still a car\n", "regardless of whether it's in the upper left or lower right corner of\n", "the image. This is known as translation invariance. A neural network\n", "encodes translation invariance in *convolutional layers*. Convolutional\n", "neural networks are widely used in image recognition tasks.\n", "\n", "An alternative structure is known as a recurrent neural network (RNN).\n", "RNNs encode temporal structure. They use auto-regressive\n", "connections in their hidden layers; they can be seen as time series\n", "models which have non-linear auto-regressive basis functions. They are\n", "widely used in speech recognition and machine translation.\n", "\n", "Machine learning has been deployed in speech recognition (e.g. Alexa,\n", "deep neural networks, convolutional neural networks for speech\n", "recognition) and in computer vision (e.g. Amazon Go, convolutional\n", "neural networks for person recognition and pose detection).\n", "\n", "The field of data science is related to AI, but philosophically\n", "different. It arises because we are increasingly creating large amounts\n", "of data through *happenstance* rather than active collection. In the\n", "modern era data is laid down by almost all our activities. The objective\n", "of data science is to extract insights from this data.\n", "\n", "Classically, in the field of statistics, data analysis proceeds by\n", "assuming that the question (or scientific hypothesis) comes before the\n", "data is created. E.g., if I want to determine the effectiveness of a\n", "particular drug, I perform a *design* for my data collection. I use\n", "foundational approaches such as randomization to account for\n", "confounders. This made a lot of sense in an era where data had to be\n", "actively collected. The reduction in the cost of data collection and\n", "storage now means that many data sets are available which weren't\n", "collected with a particular question in mind. This is a challenge\n", "because bias in the way data was acquired can corrupt the insights we\n", "derive. We can perform randomized control trials (or A/B tests in modern\n", "internet parlance) to verify our conclusions, but the opportunity is to\n", "use data science techniques to better guide our question selection, or\n", "even to answer a question without the expense of a full randomized\n", "control trial.\n", "\n", "## Neural Networks and Prediction Functions \\[edit\\]\n", "\n", "Neural networks are adaptive non-linear function models.\n",
"Originally, they were studied (by McCulloch and Pitts\n", "[@McCulloch:neuron43]) as simple models for neurons, but over the last\n", "decade they have become popular because they are a flexible approach to\n", "modelling complex data. A particular characteristic of neural network\n", "models is that they can be composed to form highly complex functions\n", "which encode many of our expectations of the real world. They allow us\n", "to encode our assumptions about how the world works.\n", "\n", "We will return to composition later, but for the moment, let's focus on\n", "a one hidden layer neural network. We are interested in the prediction\n", "function, so we'll ignore the objective function (which is often called\n", "an error function) for the moment, and just describe the mathematical\n", "object of interest,\n", "\n", "$$\n", "\\mappingFunction(\\inputVector) = \\mappingMatrix^\\top \\activationVector(\\mappingMatrixTwo, \\inputVector)\n", "$$\n", "\n", "where in this case $\\mappingFunction(\\cdot)$ is a scalar function with\n", "vector inputs, and $\\activationVector(\\cdot)$ is a vector function with\n", "vector inputs. The dimensionality of the vector function is known as the\n", "number of hidden units, or the number of neurons. The elements of this\n", "vector function are known as the *activation* functions of the neural\n", "network, and $\\mappingMatrixTwo$ are the parameters of the activation\n", "functions.\n",
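"\n", "As a minimal illustration, the numpy sketch below evaluates this\n", "prediction function once. It is a sketch only: tanh is one arbitrary\n", "choice of activation, the dimensions are invented, and the parameters\n", "are randomly initialized rather than learnt from data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "def prediction_function(x, W, W2):\n", "    \"\"\"One hidden layer network: f(x) = W^T phi(W2, x).\n", "    Here phi is tanh of a linear transform, one arbitrary choice.\"\"\"\n", "    phi = np.tanh(W2 @ x)  # activations of the hidden units\n", "    return W @ phi         # weighted sum gives the scalar output\n", "\n", "# Invented sizes: 4-dimensional input, 10 hidden units.\n", "rng = np.random.default_rng(0)\n", "W2 = rng.standard_normal((10, 4))  # parameters of the activation functions\n", "W = rng.standard_normal(10)        # output weights\n", "x = rng.standard_normal(4)\n", "print(prediction_function(x, W, W2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [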
"## Relations with Classical Statistics\n", "\n", "In statistics, activation functions are traditionally known as *basis\n", "functions*, and we would think of this as a *linear model*. It doesn't\n", "make linear predictions, but it's linear because in statistics\n", "estimation focuses on the parameters, $\\mappingMatrix$, not the\n", "parameters, $\\mappingMatrixTwo$. The linear model terminology refers to\n", "the fact that the model is *linear in the parameters*, but it is *not*\n", "linear in the data unless the activation functions are chosen to be\n", "linear.\n", "\n", "## Adaptive Basis Functions\n", "\n", "The first difference between the (early) neural network literature and\n", "the classical statistical literature is the decision to optimize these\n", "parameters, $\\mappingMatrixTwo$, as well as the parameters,\n", "$\\mappingMatrix$ (which would normally be denoted in statistics by\n", "$\\boldsymbol{\\beta}$)[^2].\n", "\n", "## Machine Learning\n", "\n", "The key idea in machine learning is to observe the system in practice,\n", "and then emulate its behavior with mathematics. That leads to a design\n", "challenge as to where to place the mathematical function. The placement\n", "of the mathematical function leads to the different domains of machine\n", "learning.\n", "\n", "1. Supervised learning\n", "2. Unsupervised learning\n", "3. Reinforcement learning\n", "\n", "# Supervised Learning \\[edit\\]\n", "\n", "Supervised learning is one of the most widely deployed machine learning\n", "technologies, and a particular domain of success has been\n", "*classification*. Classification is the process of taking an input\n", "(which might be an image) and categorizing it into one of a number of\n", "different classes (e.g. dog or cat). This simple idea underpins a lot of\n", "machine learning. By scanning across the image we can also determine\n", "where the animal is in the image.\n", "\n", "\n", "\n", "Figure: The perceptron algorithm.\n", "\n", "Classification is perhaps the technique most closely associated with\n", "machine learning. In speech-based agents, on-device classifiers are\n", "used to determine when the wake word is used. A wake word is a word that\n", "wakes up the device. For the Amazon Echo it is \"Alexa\", for Siri it is\n", "\"Hey Siri\". Once the wake word is detected with a classifier, the speech\n", "can be uploaded to the cloud for full processing by the speech\n", "recognition stages.\n", "\n", "A major breakthrough in image classification came in 2012 with the\n", "ImageNet result of [Alex Krizhevsky, Ilya Sutskever and Geoff\n", "Hinton](http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-)\n", "from the University of Toronto. ImageNet is a large database of 14\n", "million images with many thousands of classes. The data is used in a\n", "community-wide challenge for object categorization. Krizhevsky et al.\n", "used convolutional neural networks to outperform all previous approaches\n", "on the challenge. They formed a company which was purchased shortly\n", "after by Google. This challenge, known as object categorisation, was a\n", "major obstacle for practical computer vision systems. Modern object\n", "categorization systems are close to human performance.\n", "\n", "In supervised learning the inputs, $\\inputVector$, are mapped to a\n", "label, $\\dataScalar$, through a function $\\mappingFunction(\\cdot)$ that\n", "is dependent on a set of parameters, $\\weightVector$, $$\n", "\\dataScalar = \\mappingFunction(\\inputVector; \\weightVector).\n", "$$ The function $\\mappingFunction(\\cdot)$ is known as the *prediction\n", "function*. The key challenges are (1) choosing which features,\n", "$\\inputVector$, are relevant in the prediction, (2) defining the\n", "appropriate *class of function*, $\\mappingFunction(\\cdot)$, to use and\n", "(3) selecting the right parameters, $\\weightVector$.\n", "\n", "## Supervised Learning Challenges\n", "\n", "There are three principal challenges in constructing a problem for\n", "supervised learning.\n", "\n", "1. choosing which features, $\\inputVector$, are relevant in the\n", "   prediction\n", "2. defining the appropriate *class of function*,\n", "   $\\mappingFunction(\\cdot)$\n", "3. selecting the right parameters, $\\weightVector$\n", "\n", "## Feature Selection\n", "\n", "Feature selection is a critical stage in the algorithm design process.\n", "In the Olympic prediction example above we're only using time to predict\n", "the pace of the runners. In practice we might also want to use\n", "characteristics of the course: how hilly it is, or what the temperature\n", "was when the race was run. In 1904 the runners actually got lost during\n", "the race. Should we include 'lost' as a feature? It would certainly help\n", "explain the particularly slow time in 1904. The features we select\n", "should be ones we expect to correlate with the prediction. In\n", "statistics, these features are even called *predictors*, which\n", "highlights their role in developing the prediction function. For the\n", "Facebook newsfeed, we might use features that include how close your\n", "friendship is with the poster, or how often you react to that poster, or\n", "whether a photo is included in the post.\n", "\n", "Sometimes we use feature selection algorithms, algorithms that automate\n", "the process of finding the features that we need. Classification is\n", "often used to rank search results, to decide which adverts to serve or,\n", "at Facebook, to determine what appears at the top of your newsfeed.\n",
In\n", "the Facebook example features might include how many likes a post has\n", "had, whether it has an image in it, whether you regularly interact with\n", "the friend who has posted. A good newsfeed ranking algorithm is critical\n", "to Facebook's success, just as good ad serving choice is critical to\n", "Google's success. These algorithms are in turn highly dependent on the\n", "feature sets used. Facebook in particular has made heavy investments in\n", "machine learning pipelines for evaluation of the feature utility.\n", "\n", "## Class of Function, $\\mappingFunction(\\cdot)$\n", "\n", "By class of function we mean, what are the characteristics of the\n", "mapping between $\\mathbf{x}$ and $y$. Often, we might choose it to be a\n", "smooth function. Sometimes we will choose it to be a linear function. If\n", "the prediction is a forecast, for example the demand of a particular\n", "product, then the function would need some periodic components to\n", "reflect seasonal or weekly effects.\n", "\n", "## Analysis of US Birth Rates \\[edit\\]\n", "\n", "\n", "\n", "Figure: This is a retrospective analysis of US births by Aki Vehtari.\n", "The challenges of forecasting. Even with seasonal and weekly effects\n", "removed there are significant effects on holidays, weekends, etc.\n", "\n", "There's a nice analysis of US birth rates by Gaussian processes with\n", "additive covariances in @Gelman:bayesian13. A combination of covariance\n", "functions are used to take account of weekly and yearly trends. The\n", "analysis is summarized on the cover of the book.\n", "\n", "\n", "\n", "\n", "\n", "\n", "
\n", "\n", "\n", "\n", "
\n", "Figure: Two different editions of Bayesian Data Analysis\n", "[@Gelman:bayesian13].\n", "\n", "In the ImageNet challenge the input, $\\inputVector$, was in the form of\n", "an image. And the form of the prediction function was a *convolutional\n", "neural network* (more on this later). A convolutional neural network\n", "introduces *invariances* into the function that are particular to image\n", "classification. An invariance is a transformation of the input that we\n", "don't want to affect the output. For example, a cat in an image is still\n", "a cat no matter where it's located in the image (translation). The cat\n", "is also a cat regardless of how large it is (scale), or whether it's\n", "upside-down (rotation). Convolutional neural networks encode these\n", "invariances: scale invariance, rotation invariance and translation\n", "invariance; in the mathematical function.\n", "\n", "Encoding invariance in the prediction function is like encoding\n", "knowledge in the model. If we don't specify these invariances, then the\n", "model must learn them. This will require a lot more data to achieve the\n", "same performance, making the model less data efficient. Note that one\n", "invariance that is *not* encoded in a convolutional network is\n", "invariance to camera type. As a result, practitioners need to be careful\n", "to ensure that their training data is representative of the type of\n", "cameras that will be used when the model is deployed.\n", "\n", "In general the prediction function could be any set of parameterized\n", "functions. In the Olympic marathon data example above we used a\n", "polynomial fit,\n", "\n", "$$\n", "\\mappingFunction(\\inputScalar) = \\weightScalar_0 + \\weightScalar_1 \\inputScalar+ \\weightScalar_2 \\inputScalar^2 + \\weightScalar_3 \\inputScalar^3 + \\weightScalar_4 \\inputScalar^4.\n", "$$\n", "\n", "The Olympic example is also a supervised learning challenge. But it is a\n", "*regression* problem. A regression problem is one where the output is a\n", "continuous value (such as the pace in the marathon). In classification\n", "the output is constrained to be discrete. For example, classifying\n", "whether or not an image contains a dog implies the output is binary. An\n", "early example of a regression problem used in machine learning was [the\n", "Tecator data](http://lib.stat.cmu.edu/datasets/tecator), where the fat,\n", "water and protein content of meat samples was predicted as a function of\n", "the absorption of infrared light.\n", "\n", "# Deep Learning \\[edit\\]\n", "\n", "Classical statistical models and simple machine learning models have a\n", "great deal in common. The main difference between the fields is\n", "philosophical. Machine learning practitioners are typically more\n", "concerned with the quality of prediciton (e.g. measured by ROC curve)\n", "while statisticians tend to focus more on the interpretability of the\n", "model and the validity of any decisions drawn from that interpretation.\n", "For example, a statistical model may be used to validate whether a large\n", "scale intervention (such as the mass provision of mosquito nets) has had\n", "a long term effect on disease (such as malaria). In this case one of the\n", "covariates is likely to be the provision level of nets in a particular\n", "region. The response variable would be the rate of malaria disease in\n", "the region. 
"The parameter, $\\beta_1$, associated with that covariate will\n", "demonstrate a positive or negative effect which would be validated in\n", "answering the question. The focus in statistics would be less on the\n", "accuracy of the response variable and more on the validity of the\n", "interpretation of the effect variable, $\\beta_1$.\n", "\n", "A machine learning practitioner on the other hand would typically denote\n", "the parameter $w_1$ instead of $\\beta_1$, and would only be interested\n", "in the output of the prediction function, $\\mappingFunction(\\cdot)$,\n", "rather than the parameter itself. The general formalism of the\n", "prediction function allows for *non-linear* models. In machine learning,\n", "the emphasis on prediction over interpretability means that non-linear\n", "models are often used. The parameters, $\\mathbf{w}$, are a means to an\n", "end (good prediction) rather than an end in themselves (interpretable).\n", "\n", "\n", "### DeepFace \\[edit\\]\n", "\n", "\n", "\n", "Figure: The DeepFace architecture [@Taigman:deepface14], visualized\n", "through colors to represent the functional mappings at each layer. There\n", "are 120 million parameters in the model.\n", "\n", "The DeepFace architecture [@Taigman:deepface14] consists of layers that\n", "deal with *translation* and *rotational* invariances. These layers are\n", "followed by three locally-connected layers and two fully-connected\n", "layers. Color illustrates feature maps produced at each layer. The\n", "neural network includes more than 120 million parameters, where more\n", "than 95% come from the local and fully connected layers.\n", "\n", "### Deep Learning as Pinball \\[edit\\]\n", "\n", "\n", "\n", "Figure: Deep learning models are compositions of simple functions. We\n", "can think of a pinball machine as an analogy. Each layer of pins\n", "corresponds to one of the layers of functions in the model. Input data\n", "is represented by the location of the ball from left to right when it is\n", "dropped in from the top. Output class comes from the position of the\n", "ball as it leaves the pins at the bottom.\n", "\n", "Sometimes deep learning models are described as being like the brain, or\n", "too complex to understand, but one analogy I find useful for conveying\n", "the gist of these models is to think of them as being similar to early\n", "pinball machines.\n", "\n", "In a deep neural network, we input a number (or numbers), whereas in\n", "pinball, we input a ball.\n", "\n", "Think of the location of the ball on the left-right axis as a single\n", "number. Our simple pinball machine can only take one number at a time.\n", "As the ball falls through the machine, each layer of pins can be thought\n", "of as a different layer of 'neurons'. Each layer acts to move the ball\n", "from left to right.\n", "\n", "In a pinball machine, when the ball gets to the bottom it might fall\n", "into a hole defining a score; in a neural network, that is equivalent to\n", "the decision: a classification of the input object.\n", "\n", "An image has more than one number associated with it, so it is like\n", "playing pinball in a *hyper-space*."
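 ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To make the composition idea concrete, here is a toy numpy sketch\n", "(separate from the pinball plots below) of a deep model as a stack of\n", "simple functions, with each layer's output feeding the next layer's\n", "input. The depth, widths and random parameters are all invented for\n", "illustration." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "def layer(x, W):\n", "    # One simple function: a linear transform followed by tanh.\n", "    return np.tanh(W @ x)\n", "\n", "# Compose three layers: f(x) = f_3(f_2(f_1(x))).\n", "rng = np.random.default_rng(0)\n", "weights = [rng.standard_normal((5, 2)),\n", "           rng.standard_normal((5, 5)),\n", "           rng.standard_normal((1, 5))]\n", "x = rng.standard_normal(2)\n", "for W in weights:\n", "    x = layer(x, W)\n", "print(x)"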
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pods\n", "from ipywidgets import IntSlider" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pods.notebook.display_plots('pinball{sample:0>3}.svg', \n", "                            directory='../slides/diagrams',\n", "                            sample=IntSlider(1, 1, 2, 1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: At initialization, the pins, which represent the parameters\n", "of the function, aren't in the right place to bring the balls to the\n", "correct decisions.\n", "\n", "\n", "\n", "Figure: After learning the pins are now in the right place to bring\n", "the balls to the correct decisions.\n", "\n", "Learning involves moving all the pins to be in the correct position, so\n", "that the ball ends up in the right place when it's fallen through the\n", "machine. But moving all these pins in hyperspace can be difficult.\n", "\n", "In a hyper-space you have to put a lot of data through the machine to\n", "explore the positions of all the pins. Even when you feed many millions\n", "of data points through the machine, there are likely to be regions in\n", "the hyper-space where no ball has passed. When future test data passes\n", "through the machine along a new route, unusual things can happen.\n", "\n", "*Adversarial examples* exploit this high dimensional space. If you have\n", "access to the pinball machine, you can use gradient methods to find a\n", "position for the ball in the hyper-space where the image looks like one\n", "thing, but will be classified as another.\n", "\n", "Probabilistic methods explore more of the space by considering a range\n", "of possible paths for the ball through the machine. This helps to make\n", "them more data efficient and gives some robustness to adversarial\n", "examples.\n", "\n", "## Parameter Estimation: Objective Functions\n", "\n", "Once we have a set of features, and the class of functions we use is\n", "determined, we need to find the parameters of the model.\n", "\n", "The parameters of the model, $\\mathbf{w}$, are estimated by specifying\n", "an *objective function*. The objective function specifies the quality of\n", "the match between the prediction function and the *training data*. In\n", "supervised learning the objective function incorporates both the input\n", "data (in the ImageNet data the image, in the Olympic marathon data the\n", "year of the marathon) and a *label*.\n", "\n", "The label is where the term supervised learning comes from. The idea\n", "being that a supervisor, or annotator, has already looked at the data\n", "and given it labels. For a regression problem, a typical objective\n", "function is the *squared error*, $$\n", "\\errorFunction(\\weightVector) = \\sum_{i=1}^\\numData (\\dataScalar_i - \\mappingFunction(\\inputVector_i))^2\n", "$$ where the data is provided to us as a set of $\\numData$ inputs,\n", "$\\inputVector_1$, $\\inputVector_2$, $\\inputVector_3$, $\\dots$,\n", "$\\inputVector_\\numData$, each one with an associated label, $\\dataScalar_1$,\n", "$\\dataScalar_2$, $\\dataScalar_3$, $\\dots$, $\\dataScalar_\\numData$.\n",
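"\n", "As a minimal sketch of this objective (using made-up toy data rather\n", "than the Olympic data, so the cell is self-contained), the squared error\n", "can be written directly as a function of the parameters:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "def prediction(x, w):\n", "    \"\"\"Polynomial prediction function: w[0] + w[1]*x + w[2]*x**2 + ...\"\"\"\n", "    return sum(w_k * x**k for k, w_k in enumerate(w))\n", "\n", "def squared_error(w, x, y):\n", "    \"\"\"Sum of squared differences between labels and predictions.\"\"\"\n", "    return np.sum((y - prediction(x, w))**2)\n", "\n", "# Toy (input, label) pairs standing in for a real training set.\n", "x_toy = np.array([0.0, 1.0, 2.0, 3.0])\n", "y_toy = np.array([0.1, 0.9, 2.1, 2.9])\n", "\n", "# A linear model (w_0 = 0, w_1 = 1) fits this toy data quite well.\n", "print(squared_error(np.array([0.0, 1.0]), x_toy, y_toy))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Training then corresponds to finding the parameters, $\\weightVector$,\n", "that minimize this objective.\n", "\n", "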
Sometimes the label is cheap to acquire. For example, in Newsfeed\n", "ranking Facebook are acquiring a label each time a user clicks on a post\n", "in their Newsfeed. Similarly, in ad-click prediction labels are obtained\n", "whenever an advert is clicked. More generally though, we have to employ\n", "human annotators to label the data. For example, ImageNet, the data set\n", "behind the breakthrough deep learning results, was annotated using\n", "Amazon's Mechanical Turk. Without such large scale human input, we would\n", "not have the breakthrough results on image categorization we have\n", "today.\n", "\n", "Some tasks are easier to annotate than others. For example, in the\n", "Tecator data, to acquire the actual values of water, protein and fat\n", "content in the meat samples further experiments may be required. It is\n", "not simply a matter of human labelling. Even if the task is easy for\n", "humans to solve there can be problems. For example, humans will\n", "extrapolate the context of an image. A colleague once mentioned to me a\n", "challenge where humans were labelling images as containing swimming\n", "pools, even though none was visible, because they could infer there must\n", "be a pool nearby, perhaps because there are kids wearing bathing suits.\n", "But there is no swimming pool in the image for the computer to find. The\n", "quality of any machine learning solution is very sensitive to the\n", "quality of the annotated data we have. Investing in processes and tools\n", "to improve annotation of data is therefore a priority for improving the\n", "quality of machine learning solutions.\n", "\n", "There can also be significant problems with misrepresentation in the\n", "data set. If data isn't collected carefully, then it can reflect biases\n", "about the population that we don't want our models to have. For example,\n", "a face detector designed using images of Californians may not perform\n", "well when deployed in Kampala, Uganda.\n", "\n", "## Generalization and Overfitting\n", "\n", "Once a supervised learning system is trained it can be placed in a\n", "sequential pipeline to automate a process that used to be done manually.\n", "\n", "Supervised learning is one of the dominant approaches to learning. But\n", "the cost and time associated with labeling data is a major bottleneck\n", "for deploying machine learning systems. The process for creating\n", "training data requires significant human intervention. For example,\n", "internationalization of a speech recognition system would require large\n", "speech corpora in new languages.\n", "\n", "An important distinction in machine learning is the separation between\n", "training data and test data (or production data). Training data is the\n", "data that was used to find the model parameters. Test data (or\n", "production data) is the data that is used with the live system. The\n", "ability of a machine learning system to predict well on production\n", "systems given only its training data is known as its *generalization*\n", "ability.\n", "
This is the system's ability to predict in areas where it hasn't\n", "previously seen data.\n", "\n", "## Hold Out Validation on Olympic Marathon Data \\[edit\\]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pods\n", "from ipywidgets import IntSlider" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# 'max_basis' is assumed to have been set earlier in the notebook.\n", "pods.notebook.display_plots('olympic_val_extra_LM_polynomial_number{num_basis:0>3}.svg', \n", "                            directory='../slides/diagrams/ml', \n", "                            num_basis=IntSlider(1, 1, max_basis, 1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: Olympic marathon data with validation error for\n", "extrapolation.\n", "\n", "## Extrapolation\n", "\n", "## Interpolation" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pods\n", "from ipywidgets import IntSlider" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# 'max_basis' is assumed to have been set earlier in the notebook.\n", "pods.notebook.display_plots('olympic_val_inter_LM_polynomial_number{num_basis:0>3}.svg', \n", "                            directory='../slides/diagrams/ml', \n", "                            num_basis=IntSlider(1, 1, max_basis, 1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: Olympic marathon data with validation error for\n", "interpolation.\n", "\n", "## Choice of Validation Set\n", "\n", "## Hold Out Data\n", "\n", "You have a conclusion as to which model fits best under the training\n", "error, but how do the two models perform in terms of validation? In this\n", "section we consider *hold out* validation. In hold out validation we\n", "remove a portion of the training data for *validating* the model on. The\n", "remaining data is used for fitting the model (training). Because this is\n", "a time series prediction, it makes sense for us to hold out data at the\n", "end of the time series. This means that we are validating on future\n", "predictions. We will hold out data from after 1980 and fit the model to\n", "the data before 1980." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "# x and y are assumed to hold the Olympic marathon years and paces\n", "# loaded earlier in the notebook.\n", "\n", "# select indices of data to 'hold out'\n", "indices_hold_out = np.flatnonzero(x>1980)\n", "\n", "# Create a training set\n", "x_train = np.delete(x, indices_hold_out, axis=0)\n", "y_train = np.delete(y, indices_hold_out, axis=0)\n", "\n", "# Create a hold out set\n", "x_valid = np.take(x, indices_hold_out, axis=0)\n", "y_valid = np.take(y, indices_hold_out, axis=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Question 3\n", "\n", "For both the linear and quadratic models, fit the model to the data up\n", "until 1980 and then compute the error on the held out data (from 1980\n", "onwards).\n", "
Which model performs better on the validation data?\n", "\n", "*10 marks*" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Write code for your answer to Question 3 in this box\n", "# provide the answers so that the code runs correctly otherwise you will lose marks!\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Richer Basis Set\n", "\n", "Now that we have an approach for deciding which model to retain, we can\n", "consider the entire family of polynomial bases, with arbitrary degrees.\n", "\n", "### Question 4\n", "\n", "Now we are going to build a more sophisticated form of basis function,\n", "one that accepts additional arguments as well as its inputs (similar to\n", "those we used in [this lab](./week4.ipynb)). Here we will start with a\n", "polynomial basis." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "def polynomial(x, degree, loc, scale):\n", "    # x is assumed to be a column vector, so broadcasting against the\n", "    # vector of degrees gives one column per power of x.\n", "    degrees = np.arange(degree+1)\n", "    return ((x-loc)/scale)**degrees" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The basis as we've defined it has three arguments as well as the input:\n", "the degree of the polynomial, the scale of the polynomial and the\n", "offset. These arguments need to be passed to the basis functions\n", "whenever they are called. Modify your code to pass these additional\n", "arguments to the python function for creating the basis. Do this for\n", "each of your functions `predict`, `fit` and `objective`. You will find\n", "`*args` (or `**kwargs`) useful.\n", "\n", "Write code that tries to fit different models to the data with\n", "polynomial basis. Use a maximum degree for your basis from 0 to 17. For\n", "each polynomial store the *hold out validation error* and the *training\n", "error*. When you have finished the computation plot the hold out error\n", "and the training error for your models. When computing your polynomial\n", "basis use `loc=1956.` and `scale=120.` to ensure that the data is mapped\n", "(roughly) to the -1, 1 range.\n", "\n", "Which polynomial has the minimum training error?\n", "
Which polynomial has the minimum validation error?\n", "\n", "*25 marks*" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Write code for your answer to Question 4 in this box\n", "# provide the answers so that the code runs correctly otherwise you will lose marks!\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Bias Variance Decomposition \\[edit\\]\n", "\n", "The expected test error, averaged over different draws of the *training\n", "data* from $\\Pr(\\inputVector, \\dataScalar)$, is\n", "$$\\mathbb{E}\\left[ \\left(\\dataScalar - \\mappingFunction^*(\\inputVector)\\right)^2 \\right].$$\n", "This decomposes as\n", "$$\\mathbb{E}\\left[ \\left(\\dataScalar - \\mappingFunction^*(\\inputVector)\\right)^2 \\right] = \\text{bias}\\left[\\mappingFunction^*(\\inputVector)\\right]^2 + \\text{variance}\\left[\\mappingFunction^*(\\inputVector)\\right] + \\sigma^2.$$\n", "\n", "- The *bias* is given by\n", "  $$\\text{bias}\\left[\\mappingFunction^*(\\inputVector)\\right] = \\mathbb{E}\\left[\\mappingFunction^*(\\inputVector)\\right] - \\mappingFunction(\\inputVector).$$\n", "- Error due to bias comes from a model that's too simple.\n", "\n", "- The *variance* is given by\n", "  $$\\text{variance}\\left[\\mappingFunction^*(\\inputVector)\\right] = \\mathbb{E}\\left[\\left(\\mappingFunction^*(\\inputVector) - \\mathbb{E}\\left[\\mappingFunction^*(\\inputVector)\\right]\\right)^2\\right].$$\n", "- Slight variations in the training set cause changes in the\n", "  prediction. Error due to variance is error in the model due to an\n", "  overly complex model.\n",
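"\n", "The decomposition can be checked numerically. Below is a minimal\n", "simulation (my own sketch, with a made-up true function and noise\n", "level): we repeatedly sample training sets, fit polynomials of low and\n", "high degree, and estimate the squared bias and the variance of the\n", "fitted models at a test point." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "rng = np.random.default_rng(42)\n", "\n", "def true_function(x):\n", "    \"\"\"The underlying function the data is generated from.\"\"\"\n", "    return np.sin(2*np.pi*x)\n", "\n", "noise_std = 0.3\n", "x_test = 0.25     # the point at which we study the fitted models\n", "n_train = 20      # training points per data set\n", "n_repeats = 500   # number of independent training sets\n", "\n", "for degree in [1, 9]:\n", "    predictions = np.empty(n_repeats)\n", "    for r in range(n_repeats):\n", "        # Draw a fresh training set from Pr(x, y).\n", "        x = rng.uniform(0, 1, n_train)\n", "        y = true_function(x) + rng.normal(0, noise_std, n_train)\n", "        # Fit a polynomial of the given degree by least squares.\n", "        w = np.polyfit(x, y, degree)\n", "        predictions[r] = np.polyval(w, x_test)\n", "    bias_squared = (predictions.mean() - true_function(x_test))**2\n", "    variance = predictions.var()\n", "    print(f'degree {degree}: bias^2 = {bias_squared:.4f}, variance = {variance:.4f}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Typically the low-degree model shows high bias and low variance, and\n", "the high-degree model the reverse.\n", "\n", "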
## Bias vs Variance Error Plots \\[edit\\]\n", "\n", "Helper function for sampling data from two different classes." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def create_data(per_cluster=30):\n", "    \"\"\"Create a randomly sampled data set\n", "    \n", "    :param per_cluster: number of points in each cluster\n", "    \"\"\"\n", "    X = []\n", "    y = []\n", "    scale = 3\n", "    variance = 1/(scale*scale)  # variance of each cluster\n", "    pos_mean = [[-1, 0],[0,0.5],[1,0]]\n", "    pos_cov = [[variance, 0.], [0., variance]]\n", "    neg_mean = [[0, -0.5],[0,-0.5],[0,-0.5]]\n", "    neg_cov = [[variance, 0.], [0., variance]]\n", "    for mean in pos_mean:\n", "        X.append(np.random.multivariate_normal(mean=mean, cov=pos_cov, size=per_cluster))\n", "        y.append(np.ones((per_cluster, 1)))\n", "    for mean in neg_mean:\n", "        X.append(np.random.multivariate_normal(mean=mean, cov=neg_cov, size=per_cluster))\n", "        y.append(np.zeros((per_cluster, 1)))\n", "    return np.vstack(X), np.vstack(y).flatten()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Helper function for plotting the decision boundary of the SVM." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def plot_contours(ax, cl, xx, yy, **params):\n", "    \"\"\"Plot the decision boundaries for a classifier.\n", "\n", "    :param ax: matplotlib axes object\n", "    :param cl: a classifier\n", "    :param xx: meshgrid ndarray\n", "    :param yy: meshgrid ndarray\n", "    :param params: additional keyword arguments (unused here; the\n", "        contour levels and colors are fixed below)\n", "    \"\"\"\n", "    Z = cl.decision_function(np.c_[xx.ravel(), yy.ravel()])\n", "    Z = Z.reshape(xx.shape)\n", "    # Plot decision boundary and regions\n", "    out = ax.contour(xx, yy, Z, \n", "                     levels=[-1., 0., 1], \n", "                     colors='black', \n", "                     linestyles=['dashed', 'solid', 'dashed'])\n", "    out = ax.contourf(xx, yy, Z, \n", "                      levels=[Z.min(), 0, Z.max()], \n", "                      colors=[[0.5, 1.0, 0.5], [1.0, 0.5, 0.5]])\n", "    return out" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import mlai\n", "import os" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def decision_boundary_plot(models, X, y, axs, filename, titles, xlim, ylim):\n", "    \"\"\"Plot a decision boundary on the given axes\n", "    \n", "    :param axs: the axes to plot on.\n", "    :param models: the SVM models to plot\n", "    :param titles: the titles for each axis\n", "    :param X: input training data\n", "    :param y: target training data\"\"\"\n", "    for ax in axs.flatten():\n", "        ax.clear()\n", "    X0, X1 = X[:, 0], X[:, 1]\n", "    if xlim is None:\n", "        xlim = [X0.min()-1, X0.max()+1]\n", "    if ylim is None:\n", "        ylim = [X1.min()-1, X1.max()+1]\n", "    xx, yy = np.meshgrid(np.arange(xlim[0], xlim[1], 0.02),\n", "                         np.arange(ylim[0], ylim[1], 0.02))\n", "    for cl, title, ax in zip(models, titles, axs.flatten()):\n", "        plot_contours(ax, cl, xx, yy,\n", "                      cmap=plt.cm.coolwarm, alpha=0.8)\n", "        ax.plot(X0[y==1], X1[y==1], 'r.', markersize=10)\n", "        ax.plot(X0[y==0], X1[y==0], 'g.', markersize=10)\n", "        ax.set_xlim(xlim)\n", "        ax.set_ylim(ylim)\n", "        ax.set_xticks(())\n", "        ax.set_yticks(())\n", "        ax.set_title(title)\n", "    # 'fig' is the global figure created in the plotting cell below.\n", "    mlai.write_figure(filename,\n", "                      figure=fig,\n", "                      transparent=True)\n", "    return xlim, ylim" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib\n", "font = {'family' : 'sans',\n", "        'weight' : 'bold',\n", "        'size'   : 22}\n", "\n", "matplotlib.rc('font', **font)\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create instances of SVM and fit the data.\n", "
from sklearn import svm  # needed for svm.SVC below\n", "\n", "C = 100.0 # SVM regularization parameter\n", "gammas = [0.001, 0.01, 0.1, 1]\n", "\n", "\n", "per_class=30\n", "num_samps = 20\n", "# Set up a 1x4 grid for plotting.\n", "fig, ax = plt.subplots(1, 4, figsize=(10,3))\n", "xlim=None\n", "ylim=None\n", "for samp in range(num_samps):\n", "    X, y=create_data(per_class)\n", "    models = []\n", "    titles = []\n", "    for gamma in gammas:\n", "        models.append(svm.SVC(kernel='rbf', gamma=gamma, C=C))\n", "        titles.append('$\gamma={}$'.format(gamma))\n", "    models = (cl.fit(X, y) for cl in models)\n", "    xlim, ylim = decision_boundary_plot(models, X, y, \n", "                                        axs=ax, \n", "                                        filename='../slides/diagrams/ml/bias-variance{samp:0>3}.svg'.format(samp=samp), \n", "                                        titles=titles,\n", "                                        xlim=xlim,\n", "                                        ylim=ylim)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pods\n", "from ipywidgets import IntSlider" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pods.notebook.display_plots('bias-variance{samp:0>3}.svg', \n", "                            directory='../slides/diagrams/ml', \n", "                            samp=IntSlider(0,0,10,1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: In each figure the simpler model is on the left, and the more\n", "complex model is on the right. Each fit is done to a different version\n", "of the data set. The simpler model is more consistent in its errors\n", "(bias error), whereas the more complex model is varying in its errors\n", "(variance error).\n", "\n", "## Overfitting" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from IPython.lib.display import YouTubeVideo\n", "YouTubeVideo('py8QrZPT48s')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Figure: Alex Ihler discusses polynomials and overfitting.\n", "\n", "We can easily develop a simple prediction function that reconstructs the\n", "training data exactly: you can just use a lookup table. But how would\n", "the lookup table predict between the training data, where examples\n", "haven't been seen before? The choice of the class of prediction\n", "functions is critical in ensuring that the model generalizes well.\n", "\n", "The generalization error is normally estimated by applying the objective\n", "function to a set of data that the model *wasn't* trained on, the test\n", "data. To ensure good performance we normally want a model that gives us\n", "a low generalization error. If we weren't sure of the right prediction\n", "function to use, then we could try 1,000 different prediction functions.\n", "Then we could use the one that gives us the lowest error on the test\n", "data. But you have to be careful. Selecting a model in this way is like\n", "a further stage of training where you are using the test data in the\n", "training.[^3] So when this is done, the data used for this is no longer\n", "known as test data; it is known as *validation data*. And the associated\n", "error is the *validation error*. Using the validation error for model\n", "selection is a standard machine learning technique, but it can be\n", "misleading about the final generalization error.\n", "
Almost all machine learning practitioners know not to use the test data\n", "in their training procedure, but sometimes people forget that when\n", "validation data is used for model selection, the validation error cannot\n", "be used as an unbiased estimate of the generalization performance.\n", "\n", "## Olympic Data with Bayesian Polynomials \\[edit\\]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import mlai\n", "import pods" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pods\n", "from ipywidgets import IntSlider" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pods.notebook.display_plots('olympic_BLM_polynomial_number{num_basis:0>3}.svg', \n", "                            directory='../slides/diagrams/ml/', \n", "                            num_basis=IntSlider(1, 1, 27, 1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: Bayesian fit with 26th degree polynomial and negative\n", "marginal log likelihood.\n", "\n", "## Hold Out Validation\n", "\n", "For the polynomial fit, we will now look at *hold out* validation, where\n", "we are holding out some of the most recent points. This tests the\n", "ability of our model to *extrapolate*." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pods\n", "from ipywidgets import IntSlider" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pods.notebook.display_plots('olympic_val_BLM_polynomial_number{num_basis:0>3}.svg', \n", "                            directory='../slides/diagrams/ml', \n", "                            num_basis=IntSlider(1, 1, 27, 1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: Bayesian fit with 26th degree polynomial and hold out\n", "validation scores.\n", "\n", "## 5-fold Cross Validation\n", "\n", "Five fold cross validation tests the ability of the model to\n", "*interpolate*." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pods\n", "from ipywidgets import IntSlider" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pods.notebook.display_plots('olympic_5cv{part:0>2}_BLM_polynomial_number{num_basis:0>3}.svg', \n", "                            directory='../slides/diagrams/ml', \n", "                            part=(0, 5), \n", "                            num_basis=IntSlider(1, 1, 27, 1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: Bayesian fit with 26th degree polynomial and five fold cross\n", "validation scores.\n", "\n", "\n", "# Unsupervised Learning \\[edit\\]\n", "\n", "In unsupervised learning you have data, $\\inputVector$, but no labels,\n", "$\\dataScalar$. The aim in unsupervised learning is to extract structure\n", "from data. The type of structure you are interested in is dependent on\n", "the broader context of the task. In supervised learning that context is\n", "very much driven by the labels. Supervised learning algorithms try and\n", "focus on the aspects of the data which are relevant to predicting the\n", "labels. But in unsupervised learning there are no labels.\n", "\n", "## Context\n", "\n", "Humans can easily sort a number of objects into objects that share\n", "similar characteristics. We easily categorize animals or vehicles. But\n", "if the data is very large this is too slow.\n", "
Even for smaller data, it may be that it is presented in a form that is\n", "unintelligible to humans. We are good at dealing with high dimensional\n", "data when it's presented in images, but if it's presented as a series of\n", "numbers, we find it hard to interpret. In unsupervised learning we want\n", "the computer to do the sorting for us. For example, an e-commerce\n", "company might need an algorithm that can go through its entire list of\n", "products and automatically sort them into groups such that similar\n", "products are located together.\n", "\n", "## Discrete vs Continuous\n", "\n", "Supervised learning is broadly divided into classification (e.g. wake\n", "word detection in the Amazon Echo) and regression (e.g. shelf life\n", "prediction for perishable goods). Similarly, unsupervised learning can\n", "be broadly split into methods that cluster the data (i.e. provide a\n", "discrete label) and methods that represent the data as a continuous\n", "value.\n", "\n", "## Clustering \\[edit\\]\n", "\n", "Clustering methods associate each data point with a label. Unlike in\n", "classification, the label is not provided by a human annotator; it is\n", "allocated by the computer. Clustering is quite intuitive for humans, we\n", "do it naturally with our observations of the real world. For example, we\n", "cluster animals into different groups. If we encounter a new animal, we\n", "can immediately assign it to a group: bird, mammal, insect. These are\n", "certainly labels that can be provided by humans, but they were also\n", "originally invented by humans. With clustering we want the computer to\n", "recreate that process of inventing the label.\n", "\n", "Unsupervised learning enables computers to form similar categorizations\n", "on data that is too large scale for us to process. When the Greek\n", "philosopher, Plato, was thinking about ideas, he considered the concept\n", "of the Platonic ideal. The Platonic ideal bird is the bird that is most\n", "bird-like or the chair that is most chair-like. In some sense, the task\n", "in clustering is to define different clusters, by finding their Platonic\n", "ideal (known as the cluster center) and allocating each data point to\n", "the relevant cluster center. So, each animal is allocated to the class\n", "defined by its nearest cluster center.\n", "\n", "To perform clustering on a computer we need to define a notion of either\n", "similarity or distance between the objects and their Platonic ideal, the\n", "cluster center. We normally assume that our objects are represented by\n", "vectors of data, $\\inputVector_i$. Similarly, we represent our cluster\n", "center for category $j$ by a vector $\\meanVector_j$. This vector\n", "contains the ideal features of a bird, a chair, or whatever category $j$\n", "is. In clustering we can either think in terms of similarity of the\n", "objects, or distances. We want objects that are similar to each other to\n", "cluster together. We want objects that are distant from each other to\n", "cluster apart.\n", "\n", "This requires us to formalize our notion of similarity or distance.\n", "Let's focus on distances. A definition of distance between an object,\n", "$i$, and the cluster center of class $j$ is a function of two vectors,\n", "the data point, $\\inputVector_i$ and the cluster center,\n", "$\\meanVector_j$, $$\n", "d_{ij} = f(\\inputVector_i, \\meanVector_j).\n", "$$ Our objective is then to find cluster centers that are close to as\n", "many data points as possible.\n", "
For example, we might want to cluster customers into their different\n", "tastes. We could represent each customer by the products they've\n", "purchased in the past. This could be a binary vector $\\inputVector_i$.\n", "We can then define a distance between the cluster center and the\n", "customer.\n", "\n", "### Squared Distance\n", "\n", "A commonly used distance is the squared distance, $$\n", "\\distanceScalar_{ij} = (\\inputVector_i - \\meanVector_j)^2.\n", "$$ The squared distance comes up a lot in machine learning. In\n", "supervised learning it was used to measure dissimilarity between\n", "predictions and observed data. Here it's being used to measure the\n", "dissimilarity between a cluster center and the data.\n", "\n", "Once we have decided on the distance or similarity function, we can\n", "decide a number of cluster centers, $K$. We find their location by\n", "allocating each center to a sub-set of the points and minimizing the sum\n", "of the squared errors, $$\n", "\\errorFunction(\\meanMatrix) = \\sum_{j=1}^K \\sum_{i \\in \\mathbf{i}_j} (\\inputVector_i - \\meanVector_j)^2\n", "$$ where the notation $\\mathbf{i}_j$ represents all the indices of each\n", "data point which has been allocated to the $j$th cluster represented by\n", "the center $\\meanVector_j$.\n", "\n", "### $k$-Means Clustering\n", "\n", "One approach to minimizing this objective function is known as\n", "*$k$-means clustering*. It is simple and relatively quick to implement,\n", "but it is an initialization-sensitive algorithm. Initialization is the\n", "process of choosing an initial set of parameters before optimization.\n", "For $k$-means clustering you need to choose an initial set of centers.\n", "In $k$-means clustering your final set of clusters is very sensitive to\n", "the initial choice of centers. For more technical details on $k$-means\n", "clustering you can watch a video of Alex Ihler introducing the algorithm\n", "here.\n", "\n", "### $k$-Means Clustering\n", "\n", "\n", "\n", "Figure: Clustering with the $k$-means clustering algorithm." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from IPython.lib.display import YouTubeVideo\n", "YouTubeVideo('mfqmoUN-Cuw')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Figure: $k$-means clustering by Alex Ihler.\n",
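"\n", "The algorithm itself is short enough to sketch directly. Below is a\n", "minimal numpy implementation (my own illustration, with made-up toy\n", "data): it alternates between assigning each point to its nearest center\n", "and moving each center to the mean of its allocated points." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "def kmeans(X, k, n_iters=20, seed=0):\n", "    \"\"\"Minimal k-means: alternate assignment and update steps.\"\"\"\n", "    rng = np.random.default_rng(seed)\n", "    # Initialization: pick k data points at random as the initial centers.\n", "    centers = X[rng.choice(len(X), size=k, replace=False)]\n", "    for _ in range(n_iters):\n", "        # Assignment step: index of the nearest center for each point.\n", "        distances = ((X[:, None, :] - centers[None, :, :])**2).sum(axis=2)\n", "        assignments = distances.argmin(axis=1)\n", "        # Update step: move each center to the mean of its points.\n", "        for j in range(k):\n", "            if np.any(assignments == j):\n", "                centers[j] = X[assignments == j].mean(axis=0)\n", "    return centers, assignments\n", "\n", "# Two well-separated blobs as a toy data set.\n", "rng = np.random.default_rng(1)\n", "X = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(3, 0.5, (30, 2))])\n", "centers, assignments = kmeans(X, k=2)\n", "print(centers)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Different random initializations can give different final clusterings,\n", "which is the initialization sensitivity mentioned above.\n", "\n", "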
### Hierarchical Clustering\n", "\n", "Other approaches to clustering involve forming taxonomies of the cluster\n", "centers, like humans apply to animals, to form trees. You can learn more\n", "about agglomerative clustering in this video from Alex Ihler." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from IPython.lib.display import YouTubeVideo\n", "YouTubeVideo('OcoE7JlbXvY')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Figure: Hierarchical Clustering by Alex Ihler.\n", "\n", "### Phylogenetic Trees\n", "\n", "Indeed, one application of machine learning techniques is performing a\n", "hierarchical clustering based on genetic data, i.e. the actual contents\n", "of the genome. If we do this across a number of species then we can\n", "produce a *phylogeny*. The phylogeny aims to represent the actual\n", "evolution of the species and some phylogenies even estimate the timing\n", "of the common ancestor between two species[^4]. Similar methods are used\n", "to estimate the origin of viruses like HIV or bird flu, which mutate\n", "very quickly. Determining the origin of viruses can be important in\n", "containing or treating outbreaks.\n", "\n", "### Product Clustering\n", "\n", "An e-commerce company could apply hierarchical clustering to all its\n", "products. That would give a phylogeny of products. Each cluster of\n", "products would be split into sub-clusters of products until we got down\n", "to individual products. For example, we might expect a high level split\n", "to be Electronics/Clothing. Of course, a challenge with these tree-like\n", "structures is that many products belong in more than one parent cluster:\n", "for example, running shoes should be in more than one group; they are\n", "'sporting goods' and they are 'apparel'. A tree structure doesn't allow\n", "this allocation.\n", "\n", "### Hierarchical Clustering Challenge\n", "\n", "Our own psychological grouping capabilities are studied as a domain of\n", "cognitive science. Researchers like Josh Tenenbaum have developed\n", "algorithms that decompose data in more complex ways, but they can\n", "normally only be applied to smaller data sets.\n", "\n", "## Dimensionality Reduction \\[edit\\]\n", "\n", "Dimensionality reduction methods compress the data by replacing the\n", "original data with a reduced number of continuous variables. One way of\n", "thinking of these methods is to imagine a marionette.\n", "\n", "\n", "\n", "Figure: Thinking of dimensionality reduction as a marionette. We\n", "observe the high dimensional pose of the puppet, $\\inputVector$, but the\n", "movement of the puppeteer's hand, $\\latentVector$, remains hidden to us.\n", "Dimensionality reduction aims to recover those hidden movements which\n", "generated the observations.\n", "\n", "The position of each body part of a marionette could be thought of as\n", "our data, $\\inputVector_i$. So, each data point consists of the 3-D\n", "co-ordinates of all the different body parts of the marionette. Let's\n", "say there are 13 different body parts (2 each of feet, knees, hips,\n", "hands, elbows, shoulders, one head). Each body part has an x, y, z\n", "position in Cartesian coordinates. So that's 39 numbers associated with\n", "each observation.\n", "\n", "The movement of these 39 parts is determined by the puppeteer via\n", "strings. Let's assume it's a very simple puppet, with just one stick to\n", "control it. The puppeteer can move the stick up and down, left and\n", "right. And they can twist it. This gives three parameters in the\n", "puppeteer's control. This implies that the 39 variables we see moving\n", "are controlled by only 3 variables. These 3 variables are often called\n", "the hidden or *latent variables*.\n", "\n", "Dimensionality reduction assumes something similar for real world data.\n", "It assumes that the data we observe is generated from some lower\n", "dimensional underlying process. It then seeks to recover the values\n", "associated with this low dimensional process.\n",
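"\n", "As a toy version of the marionette picture (my own sketch, using the\n", "dimensions of the example above), we can generate 39-dimensional\n", "observations from 3 latent variables and check that the observations\n", "have only 3 underlying degrees of freedom:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "rng = np.random.default_rng(0)\n", "\n", "n_points, latent_dim, observed_dim = 200, 3, 39\n", "\n", "# Latent variables: the puppeteer's hand (up/down, left/right, twist).\n", "Z = rng.normal(size=(n_points, latent_dim))\n", "\n", "# A fixed linear mapping from hand movements to body-part coordinates.\n", "W = rng.normal(size=(latent_dim, observed_dim))\n", "X = Z @ W  # observed marionette poses: 39 numbers per observation\n", "\n", "# The singular values reveal the hidden dimensionality: only three\n", "# are (numerically) non-zero.\n", "singular_values = np.linalg.svd(X, compute_uv=False)\n", "print(np.round(singular_values[:6], 2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Dimensionality reduction methods aim to recover the latent values from\n", "the observations alone.\n", "\n", "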
### Examples in Social Sciences\n", "\n", "Dimensionality reduction techniques underpin a lot of psychological\n", "scoring tests such as IQ tests or personality tests. An IQ test can\n", "involve several hundred questions, potentially giving a rich, high\n", "dimensional, characterization of some aspects of your intelligence. It\n", "is then summarized by a single number. Similarly, the Myers-Briggs\n", "personality test involves answering questions about preferences which\n", "are reduced to a set of numbers reflecting personality.\n", "\n", "These tests are assuming that our intelligence is implicitly\n", "one-dimensional and that our personality is implicitly four dimensional.\n", "Other examples include political belief which is typically represented\n", "on a left to right scale. A one-dimensional distillation of an entire\n", "philosophy about how a country should be run. Our own leadership\n", "principles imply that our decisions have a fourteen-dimensional space\n", "underlying them. Each decision could be characterized by judging to what\n", "extent it embodies each of the principles.\n", "\n", "Political belief, personality, intelligence, leadership. None of these\n", "exist as a directly measurable quantity in the real world, rather they\n", "are inferred based on measurables. Dimensionality reduction is the\n", "process of allowing the computer to automatically find such underlying\n", "dimensions, automatically allowing us to characterize each data point\n", "according to those explanatory variables. Each of these characteristics\n", "can be scored, and individuals can then be turned into vectors.\n", "\n", "This doesn't only apply to individuals; in recent years work on language\n", "modeling has taken a similar approach to words. The\n", "[word2vec](https://arxiv.org/abs/1301.3781) algorithm performed a\n", "dimensionality reduction on words, so now you can take any word and map\n", "it to a latent space where similar words exhibit similar\n", "characteristics. A personality space for words.\n", "\n", "### Principal Component Analysis\n", "\n", "Principal component analysis (PCA) is arguably the queen of\n", "dimensionality reduction techniques. PCA was developed as an approach to\n", "dimensionality reduction in the 1930s by Hotelling as a method for the\n", "social sciences. In Hotelling's formulation of PCA it was assumed that\n", "any data point, $\\dataVector$, could be represented as a weighted sum of\n", "the latent factors of interest, so Hotelling described prediction\n", "functions (as in regression and classification above), only now the\n", "regression is *multiple output*. And instead of predicting a label,\n", "$y_i$, we now try and force the regression to predict the observed\n", "feature vector, $\\dataVector_i$. So, for example, on an IQ test we would\n", "try and predict subject $i$'s answer to the $j$th question with the\n", "following function $$\n", "\\dataScalar_{ij} = \\mappingFunction_j(\\latentScalar_i; \\weightVector).\n", "$$ Here $\\latentScalar_i$ would be the IQ of subject $i$ and\n", "$\\mappingFunction_j(\\cdot)$ would be a function representing the\n", "relationship between the subject's IQ and their score on the answer to\n", "question $j$. This function is the same for all subjects, but the\n", "subject's IQ is assumed to differ leading to different scores for each\n", "subject.\n", "\n", "\n", "\n", "Figure: Visualization of the first two principal components of an\n", "artificial data set.\n", "
The data was generated by taking an image of a handwritten digit, 6, and\n", "rotating it 360 times, one degree each time. The first two principal\n", "components have been extracted in the diagram. The underlying circular\n", "shape is derived from the rotation of the data. Each image in the data\n", "set is plotted at its projected location in the latent space.\n", "\n", "### Hotelling's PCA\n", "\n", "In Hotelling's formulation he assumed that the function was a linear\n", "function. This idea is taken from a wider field known as *factor\n", "analysis*, so Hotelling described the challenge as $$\n", "\\mappingFunction_j(\\latentScalar_i; \\weightVector) = \\weightScalar_j \\latentScalar_i\n", "$$ so the answer to the $j$th question is predicted to be a scaling of\n", "the subject's IQ. The scale factor is given by $\\weightScalar_j$. If\n", "there are more latent dimensions then a matrix of parameters,\n", "$\\weightMatrix$, is used, for example if there were two latent\n", "dimensions, we'd have $$\n", "\\mappingFunction_j(\\mathbf{\\latentScalar}_i; \\weightMatrix) = \\weightScalar_{1j} \\latentScalar_{1i} + \\weightScalar_{2j} \\latentScalar_{2i}\n", "$$ where, if this were a personality test, then $\\latentScalar_{1i}$\n", "might represent where the subject sits on the extrovert/introvert\n", "spectrum and $\\latentScalar_{2i}$ might represent where the subject was\n", "on the rational/perceptual scale. The function would make a prediction\n", "about the subject's answer to a particular question on the test\n", "(e.g. preference for an office job vs preference for an outdoor job). In\n", "factor analysis the parameters $\\weightMatrix$ are known as the factor\n", "*loadings* and in PCA they are known as the principal components.\n", "\n", "### Parameters\n", "\n", "Fitting the model involves finding estimates for the loadings,\n", "$\\weightMatrix$, and latent variables, $\\latentMatrix$. There are\n", "different approaches including least squares. The least squares approach\n", "is used, for example, in recommender systems. In recommender systems\n", "this method is called *matrix factorization*. The customer\n", "characteristics, $\\dataVector_i$, are the customer's ratings for each\n", "different product (or item), and the latent variables can be seen as a\n", "space of customer preferences. In the recommender system case, the\n", "loadings matrix also has an interpretation as product similarities.[^5]\n", "Recommender systems have a particular characteristic in that most of the\n", "entries of the vector $\\dataVector_i$ are missing most of the time.\n", "\n", "In PCA and factor analysis the unknown latent factors are dealt with\n", "through a probability distribution. They are each assumed to be drawn\n", "from a zero mean, unit variance normal distribution. This leaves the\n", "factor loadings to be estimated. For PCA the maximum likelihood solution\n", "for the factor loadings can be shown to be given by the *eigenvalue\n", "decomposition* of the data covariance matrix. This is algorithmically\n", "simple and convenient, although slow to compute for very large data sets\n", "with many features and many subjects. The eigenvalue problem can also be\n", "derived from many other starting points: e.g. the directions of maximum\n", "variance in the data, finding a latent space that best preserves\n", "inter-point distances between the data, or the optimal linear\n", "compression of the data given a linear reconstruction.\n",
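"\n", "For example, here is a minimal numpy sketch (my own illustration, on\n", "made-up data with a single latent factor) of PCA via the eigenvalue\n", "decomposition of the data covariance matrix:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "rng = np.random.default_rng(2)\n", "\n", "# Made-up data: 100 subjects, 5 'questions', driven by one latent factor.\n", "z = rng.normal(size=(100, 1))                     # latent variable per subject\n", "W_true = rng.normal(size=(1, 5))                  # true loadings\n", "Y = z @ W_true + 0.1*rng.normal(size=(100, 5))    # observations plus noise\n", "\n", "# Eigenvalue decomposition of the data covariance matrix.\n", "Y_centered = Y - Y.mean(axis=0)\n", "covariance = Y_centered.T @ Y_centered / len(Y)\n", "eigenvalues, eigenvectors = np.linalg.eigh(covariance)\n", "\n", "# eigh returns eigenvalues in ascending order: the last one dominates,\n", "# and its eigenvector is the principal component (the estimated loadings).\n", "print(np.round(eigenvalues, 3))\n", "principal_component = eigenvectors[:, -1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "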
These many and varied justifications for the eigenvalue decomposition\n", "may account for the popularity of PCA. Indeed, there is even an\n", "interpretation for Google's original PageRank algorithm (which computed\n", "the *smallest* eigenvector of the internet's linkage matrix) as seeking\n", "the dominant principal component of the web.[^6]\n", "\n", "Characterizing users according to past buying behavior and combining\n", "this with characteristics about products, is key to making good\n", "recommendations and returning useful search results. Further advances\n", "can be made if we understand the context of a particular session. For\n", "example, if a user is buying Christmas presents and searches for a\n", "dress, then it could be the case that the user is willing to spend a\n", "little more on the dress than in normal circumstances. Characterizing\n", "these effects requires more data and more complex algorithms. However,\n", "in domains such as search we are normally constrained by the speed with\n", "which we need to return results. Accounting for each of these factors\n", "while returning results with acceptable latency is a particular\n", "challenge.\n", "\n", "# Reinforcement Learning \\[edit\\]\n", "\n", "The final domain of learning we will review is known as reinforcement\n", "learning. The domain of reinforcement learning is one that many\n", "researchers seem to believe is offering a route to *general\n", "intelligence*. The idea of general intelligence is to develop algorithms\n", "that are adaptable to many different circumstances. Supervised learning\n", "algorithms are designed to resolve particular challenges. Data is\n", "annotated with those challenges in mind. Unsupervised learning attempts\n", "to build representations without any context. But normally the algorithm\n", "designer has an understanding of what the broader objective is and\n", "designs the algorithms accordingly (for example, characterizing users).\n", "In reinforcement learning some context is given, in the form of a\n", "reward, but the reward is normally delayed. There may have been many\n", "actions that affected the outcome, but which actions had a positive\n", "effect and which a negative effect?\n", "\n", "## \"Reward\"\n", "\n", "- In reinforcement learning some context is given, in the form of a\n", "  reward. But it is often *delayed*\n", "\n", "- Credit allocation problem: many actions that affected the outcome,\n", "  but which actions had a positive effect and which a negative effect?\n", "\n", "One issue for many companies is that the best way of testing the\n", "customer experience, A/B testing, prioritizes short term reward. The\n", "internet is currently being driven by short term rewards which make it\n", "distracting in the short term, but perhaps less useful in the long term.\n", "Click-bait is an example, but there are more subtle effects. The success\n", "of Facebook is driven by its ability to draw us in when we likely should\n", "be doing something else. This is driven by large scale A/B testing.\n", "\n", "One open question is how to drive non-visual interfaces through\n", "equivalents to A/B testing. Speech interfaces, such as those used in\n", "intelligent agents, are less amenable to A/B testing when determining\n", "the quality of the interface. Improving interaction with them is\n", "therefore a less exact science than for the visual interface.\n", "
Data efficient reinforcement learning methods are likely to be key to\n", "improving these agents' ability to interact with the user and understand\n", "intent. However, they are not yet mature enough to be deployed in this\n", "application.\n", "\n", "## Game Play\n", "\n", "An area where reinforcement learning methods have been deployed with\n", "high profile success is game play. In game play the reward is delayed to\n", "the end of the game, and it comes in the form of victory or defeat. A\n", "significant advantage of game play as an application area is that,\n", "through simulation of the game, it is possible to generate as much data\n", "as is required to solve the problem. For this reason, many of the recent\n", "advances in reinforcement learning have occurred with methods that are\n", "not data efficient.\n", "\n", "The company DeepMind is set up around reinforcement learning as an\n", "approach to general intelligence. All their most well-known achievements\n", "are centered around artificial intelligence in game play. In\n", "reinforcement learning a decision made at any given time has a\n", "downstream effect on the result. Whether the effect is beneficial or not\n", "is unknown until a future moment.\n", "\n", "We can think of reinforcement learning as providing a label, but the\n", "label is associated with a series of data involving a number of\n", "decisions taken. Each decision was taken given the understanding of game\n", "play at any given moment. Understanding which of these decisions was\n", "important in victory or defeat is a hard problem.\n", "\n", "In machine learning the process of understanding which decisions were\n", "beneficial and which were detrimental is known as the credit allocation\n", "problem. You wish to reward decisions that led to success to encourage\n", "them, but punish decisions that lead to failure.\n", "\n", "Broadly speaking, DeepMind uses an approach to Machine Learning where\n", "there are two mathematical functions at work. One determines the action\n", "to be taken at any given moment, the other estimates the quality of the\n", "board position at any given time. These are respectively known as the\n", "*policy network* and the *value network*.[^7] DeepMind made use of\n", "convolutional neural networks for both these models.\n", "\n", "## AlphaGo\n", "\n", "The ancient Chinese game of Go was considered a challenge for artificial\n", "intelligence for two reasons. Firstly, the game tree has a very high\n", "branching factor. The game tree is a discrete representation of the\n", "game. Every node in the game tree is associated with a board position.\n", "You can move through the game tree by making a legal move on the board\n", "to change the position. In Go, there are so many legal moves that the\n", "game tree grows exponentially. This challenge in Go was addressed by\n", "using stochastic game tree search. Rather than exploring the game tree\n", "exhaustively they explored it randomly.\n", "\n", "Secondly, evaluating the quality of any given board position was deemed\n", "to be very hard.[^8] The value function determines for each player\n", "whether they are winning or losing. Skilled Go players can assess a\n", "board position, but they do it by instinct, by intuition. Just as early\n", "AI researchers struggled to give rules for detecting cancer, it is\n", "challenging to give rules to assess a Go board.\n", "
The machine learning approach that AlphaGo took is to train a value\n", "function network to make this assessment.\n", "\n", "The approach that DeepMind took to conquering Go is a *model-free*\n", "approach known as *Q-learning*.[^9] The model-free approach refers to\n", "the fact that they don't directly include a model of how the world\n", "evolves in the reinforcement learning algorithm. They make extensive use\n", "of the game tree, but they don't model how it evolves. They do model the\n", "expected reward of each position in the game tree (the value function)\n", "but that is not the same as modeling how the game will proceed.\n",
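"\n", "To give a feel for what *Q-learning* looks like in its simplest, tabular\n", "form (a toy sketch of my own, far removed from the deep networks used\n", "for Go), consider an agent walking along a short corridor with a reward\n", "only at the right-hand end:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "rng = np.random.default_rng(3)\n", "\n", "n_states, n_actions = 5, 2   # corridor cells; actions: 0 = left, 1 = right\n", "Q = np.zeros((n_states, n_actions))\n", "alpha, gamma, epsilon = 0.1, 0.9, 0.1\n", "\n", "for episode in range(500):\n", "    state = 0\n", "    while state < n_states - 1:\n", "        # Epsilon-greedy action choice: mostly exploit, sometimes explore.\n", "        if rng.random() < epsilon:\n", "            action = int(rng.integers(n_actions))\n", "        else:\n", "            action = int(Q[state].argmax())\n", "        next_state = max(state - 1, 0) if action == 0 else state + 1\n", "        # The reward is delayed: only reaching the final state pays off.\n", "        reward = 1.0 if next_state == n_states - 1 else 0.0\n", "        # Q-learning update: move the value of (state, action) towards\n", "        # the reward plus the discounted value of the next state.\n", "        Q[state, action] += alpha*(reward + gamma*Q[next_state].max() - Q[state, action])\n", "        state = next_state\n", "\n", "print(np.round(Q, 2))  # 'right' should have the higher value in every state" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The delayed reward propagates backwards through the table, solving the\n", "credit allocation problem for this tiny world.\n", "\n", "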
## Reinforcement Learning and Classical Control\n", "\n", "An alternative approach to reinforcement learning is to use a prediction\n", "function to suggest how the world will evolve in response to your\n", "actions, i.e. to predict how the game tree will evolve. You can then use\n", "this prediction to indirectly infer the expected reward associated with\n", "any action. This is known as *model-based* reinforcement learning.\n", "\n", "This model-based approach is also closer to a control system. A\n", "classical control system is one where you give the system a set point.\n", "For example, a thermostat in the house. You set the temperature and the\n", "boiler switches off when it reaches it. Optimal control is about getting\n", "the house to the right temperature as quickly as possible. Classical\n", "control is widely used in robotic control and flight control.\n", "\n", "One interesting crossover between classical control and machine learning\n", "arises because classical optimal control can be seen as a form of\n", "model-based reinforcement learning, one where the reward is recovered\n", "when the set point is reached. In control engineering the prediction\n", "function is known as the *transfer function*. The process of fitting the\n", "transfer function in control is known as *system identification*.\n", "\n", "There is some exciting work emerging at the interface between the areas\n", "of control and reinforcement learning. Results at this interface could\n", "be very important for improving the quality of robotic and drone\n", "control.\n", "\n", "## Optimization Methods\n", "\n", "As we implied above, reinforcement learning can also be used to improve\n", "user experience. In that case the reward is gained when the user buys a\n", "product from us. This makes it closely allied to the area of\n", "optimization. Optimization of our user interfaces can be seen as a\n", "reinforcement learning task, but more commonly it is thought about\n", "separately in the domains of *Bayesian optimization* or *bandit\n", "learning*.\n", "\n", "We use optimization in machine learning to find the parameters of our\n", "models. We can do that because we have a mathematical representation of\n", "our objective function as a direct function of the parameters.\n", "\n", "Examples in this form of optimization include: what is the best user\n", "interface for presenting adverts? What is the best design for a front\n", "wing for an F1 racing car? Which product should I return top of the list\n", "in response to this user's search?\n", "\n", "Bayesian optimization arises when we can't directly relate the\n", "parameters in the system of interest to our objective through a\n", "mathematical function. For example, what is the mathematical function\n", "that relates a user's experience to the probability that they will buy a\n", "product?\n", "\n", "## Bayesian Optimization\n", "\n", "One approach to these problems is to use machine learning methods to\n", "develop a *surrogate model* for the optimization task. The surrogate\n", "model is a prediction function that attempts to recreate the process we\n", "are finding hard to model. We try to simultaneously fit the surrogate\n", "model and optimize the process.\n", "\n", "## Surrogate Models\n", "\n", "Bayesian optimization methods use a *surrogate model* (normally a\n", "specific form of regression model). They use this to predict how the\n", "real system will perform. The surrogate model makes a prediction (with\n", "an estimate of the uncertainty) of what the response will be to any\n", "given input. Parameters to test are chosen by considering this\n", "prediction. Similar to reinforcement learning, this can be viewed as a\n", "*model-based* approach because the surrogate model can be seen as a\n", "model of the real world. In bandit methods strategies are determined\n", "without turning to a model to motivate them. They are *model free*\n", "methods.\n",
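"\n", "A toy illustration of the surrogate idea (my own sketch; a real Bayesian\n", "optimization method would use a probabilistic surrogate such as a\n", "Gaussian process and a principled acquisition function) is given\n", "below:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "rng = np.random.default_rng(4)\n", "\n", "def expensive_system(x):\n", "    \"\"\"Stands in for the process we cannot write down mathematically.\"\"\"\n", "    return -(x - 0.3)**2 + 0.05*rng.normal()\n", "\n", "# Start with a few evaluated parameter settings.\n", "X = list(rng.uniform(0, 1, 3))\n", "y = [expensive_system(x) for x in X]\n", "\n", "for step in range(10):\n", "    # Fit a simple quadratic surrogate to everything seen so far.\n", "    coeffs = np.polyfit(X, y, deg=2)\n", "    # Choose the next setting by optimizing the surrogate over a grid,\n", "    # with a little random exploration mixed in.\n", "    grid = np.linspace(0, 1, 101)\n", "    candidate = grid[np.polyval(coeffs, grid).argmax()]\n", "    next_x = candidate if rng.random() > 0.2 else rng.uniform(0, 1)\n", "    X.append(next_x)\n", "    y.append(expensive_system(next_x))\n", "\n", "print(f'best input found: {X[int(np.argmax(y))]:.2f}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The surrogate is fitted to the observations, and the observations are in\n", "turn chosen using the surrogate.\n", "\n", "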
## Model-Based and Model Free: Performance\n", "\n", "Because of their different philosophies, if a class of prediction\n", "functions is chosen, then a model-based approach might have better\n", "average case performance, at least in terms of *data efficiency*. A\n", "model free approach may well have better worst-case performance though,\n", "because it makes fewer assumptions about the nature of the data. To put\n", "it another way, making assumptions about the data is helpful if they are\n", "right, and if the model is sensible they'll be right on average.\n", "However, it is unhelpful if the model is wrong. Indeed, it could be\n", "actively damaging. Since we can't usually guarantee the model is\n", "absolutely right, the worst-case performance of a model-based approach\n", "would be poor.\n", "\n", "We have introduced a range of machine learning approaches by focusing on\n", "their use of mathematical functions to replace manually coded systems of\n", "rules. The important characteristic of machine learning is that the form\n", "of these functions, as dictated by their parameters, is determined by\n", "acquiring data from the real world.\n", "\n", "## Deployment \\[edit\\]\n", "\n", "The methods we have introduced are, roughly speaking, presented in order\n", "of difficulty of deployment. While supervised learning is more involved\n", "in terms of collection of data, it is the most straightforward method to\n", "deploy once that data is recovered. For this reason, a major focus with\n", "supervised learning should always be on maintaining data quality,\n", "increasing the efficiency and accountability[^10] of the data collection\n", "pipeline and the quality of features used.\n", "\n", "You can also check my blog post on [\"Data Readiness\n", "Levels\"](http://inverseprobability.com/2017/01/12/data-readiness-levels)\n", "and my blog post on [\"The 3Ds of Machine Learning Systems\n", "Design\"](http://inverseprobability.com/2018/11/05/the-3ds-of-machine-learning-systems-design).\n", "\n", "## Where to Deploy?\n", "\n", "In relation to what AI can and can't do today, Andrew Ng is quoted as\n", "saying:\n", "\n", "> If a typical person can do a mental task with less than one second of\n", "> thought, we can probably automate it using AI either now or in the\n", "> near future.[^11] Andrew Ng\n", "\n", "## Is this Right?\n", "\n", "I would broadly agree with this quote but only in the context of\n", "supervised learning. If a human expert takes around that amount of time,\n", "then it's also likely we can acquire the data necessary to build a\n", "supervised learning algorithm that can emulate that human's response.\n", "\n", "The picture with regard to unsupervised learning and reinforcement\n", "learning is more clouded.\n", "\n", "One observation is that for *supervised* learning we seem to be moving\n", "beyond the era where very deep machine learning expertise is required to\n", "deploy methods. A solid understanding of machine learning (say to\n", "Masters level) is certainly required, but the quality of the final\n", "result is likely more dependent on domain expertise and the quality of\n", "the data and the information processing pipeline. This seems part of a\n", "wider trend where some of the big successes in machine learning are\n", "moving rapidly from the domain of science to that of engineering.[^12]\n", "\n", "You can also check my blog post on [\"New Directions in Kernels and\n", "Gaussian\n", "Processes\"](http://inverseprobability.com/2016/11/29/new-directions-in-kernels-and-gaussian-processes)\n", "\n", "So if we can only emulate tasks that humans take around a second to do,\n", "how are we managing to deliver on self driving cars? The answer is that\n", "we are constructing engineered systems from sub-components, each of\n", "which is a machine learning subsystem. But they are tied together as a\n", "component based system in line with our traditional engineering\n", "approach. This has an advantage that each component in the system can be\n", "verified before its inclusion. This is important for debugging and\n", "safety. But in practice we can expect these systems to be very brittle.\n", "A human adapts the way in which they drive the car across their\n", "lifetime. A human can react to other road users. In extreme situations,\n", "such as a carjacking, a human can set to one side normal patterns of\n", "behavior, and purposely crash their car to draw attention to the\n", "situation.\n", "\n", "Supervised machine learning solutions are normally trained offline. They\n", "do not adapt when deployed because this makes them less verifiable. But\n", "this compounds the brittleness of our solutions. By deploying our\n", "solutions we actually change the environment in which they operate.\n", "Therefore, it's important that they can be quickly updated to reflect\n", "changing circumstances. This updating happens offline.\n", "
For a complex mechanical system, such as a delivery drone, extensive\n", "testing of the system may be required when any component is updated. It\n", "is therefore imperative that these data processing pipelines are well\n", "documented so that they can be redeployed on demand.\n", "\n", "In practice there can be challenges with the false dichotomy between\n", "reproducibility and performance. It is likely that most of our data\n", "scientists care less about their ability to redeploy their pipelines\n", "than about their ability to produce an algorithm that achieves a\n", "particular performance. A key question is how reproducible is that\n", "process? There is a *false* dichotomy because ensuring reproducibility\n", "will typically improve performance as it will make it easier to run a\n", "rigorous set of explorative experiments. A worry is that, currently, we\n", "do not have a way to quantify the scale of this potential problem within\n", "companies.\n", "\n", "## Model Choice\n", "\n", "Common to all machine learning methods is the initial choice of useful\n", "classes of functions. The deep learning revolution is associated with a\n", "particular class of mathematical functions that is proving very\n", "successful in what were seen to be challenging domains: speech, vision,\n", "language. This has meant that significant advances in problems that have\n", "been seen as hard have occurred in artificial intelligence.\n", "\n", "# References {#references .unnumbered}\n", "\n", "[^1]: The logarithm of a number less than one is negative, for a number\n", "    greater than one the logarithm is positive. So if odds are greater\n", "    than evens (odds-on) the log-odds are positive, if the odds are less\n", "    than evens (odds-against) the log-odds will be negative.\n", "\n", "[^2]: In classical statistics we often interpret these parameters,\n", "    $\\beta$, whereas in machine learning we are normally more interested\n", "    in the result of the prediction, and less in the parameter itself.\n", "    Although this is changing with more need for accountability. In\n", "    honour of this I normally use $\\boldsymbol{\\beta}$ when I care about\n", "    the value of these parameters, and $\\mappingVector$ when I care more\n", "    about the quality of the prediction.\n", "\n", "[^3]: Using the test data in your training procedure is a major error in\n", "    any machine learning procedure. It is extremely dangerous as it\n", "    gives a misleading assessment of the model performance. The [Baidu\n", "    ImageNet\n", "    scandal](http://inverseprobability.com/2015/06/04/baidu-on-imagenet)\n", "    was an example of a team competing in the ImageNet challenge which\n", "    did this. The team had announced via the publication pre-print\n", "    server arXiv that they had a world-leading performance on the\n", "    ImageNet challenge. This was reported in the mainstream media. Two\n", "    weeks later the challenge organizers revealed that the team had\n", "    created multiple accounts for checking their test performance more\n", "    times than was permitted by the challenge rules. This was then\n", "    reported as \"AI's first doping scandal\". The team lead was fired by\n", "    Baidu.\n", "\n", "[^4]: These models are quite a lot more complex than the simple\n", "    clustering we describe here. They represent a common ancestor\n", "    through a cluster center that is then allowed to evolve over time\n", "    through a mutation rate. The time of separation between different\n", "    species is estimated via these mutation rates.\n", "\n", "
"\n",
"# References {#references .unnumbered}\n",
"\n",
"[^1]: The logarithm of a number less than one is negative; for a number\n",
"    greater than one the logarithm is positive. So if the odds are\n",
"    greater than evens (odds-on) the log-odds are positive, and if the\n",
"    odds are less than evens (odds-against) the log-odds will be\n",
"    negative. For example, odds of 4 to 1 on give log-odds of\n",
"    $\\log 4 \\approx 1.39$, while odds of 4 to 1 against give\n",
"    $\\log \\frac{1}{4} \\approx -1.39$.\n",
"\n",
"[^2]: In classical statistics we often interpret these parameters,\n",
"    $\\beta$, whereas in machine learning we are normally more\n",
"    interested in the result of the prediction, and less in the values\n",
"    of the parameters themselves. Although this is changing with the\n",
"    increasing need for accountability. In honour of this I normally\n",
"    use $\\boldsymbol{\\beta}$ when I care about the value of these\n",
"    parameters, and $\\mappingVector$ when I care more about the\n",
"    quality of the prediction.\n",
"\n",
"[^3]: Using the test data in your training procedure is a major error\n",
"    in any machine learning procedure. It is extremely dangerous as it\n",
"    gives a misleading assessment of the model performance. The [Baidu\n",
"    ImageNet\n",
"    scandal](http://inverseprobability.com/2015/06/04/baidu-on-imagenet)\n",
"    was an example of a team competing in the ImageNet challenge which\n",
"    did this. The team had announced, via the preprint server arXiv,\n",
"    that they had world-leading performance on the ImageNet challenge.\n",
"    This was reported in the mainstream media. Two weeks later the\n",
"    challenge organizers revealed that the team had created multiple\n",
"    accounts for checking their test performance more times than was\n",
"    permitted by the challenge rules. This was then reported as \"AI's\n",
"    first doping scandal\". The team lead was fired by Baidu.\n",
"\n",
"[^4]: These models are quite a lot more complex than the simple\n",
"    clustering we describe here. They represent a common ancestor\n",
"    through a cluster center that is then allowed to evolve over time\n",
"    through a mutation rate. The time of separation between different\n",
"    species is estimated via these mutation rates.\n",
"\n",
"[^5]: One way of thinking about this is to flip the model on its side.\n",
"    Instead of thinking about the $i$th subject and the $j$th\n",
"    characteristic, assume that each product is the subject. So the\n",
"    $j$th item is thought of as the subject, and each item's\n",
"    characteristic is given by the rating from a particular user. In\n",
"    this case symmetries in the model show that the matrix\n",
"    $\\weightMatrix$ can now be seen as a matrix of *latent variables*\n",
"    and the matrix $\\latentMatrix$ can be seen as *factor loadings*.\n",
"    So you can think of the method as simultaneously doing a\n",
"    dimensionality reduction on the products and the users. Recommender\n",
"    systems also use other approaches, some of them based on similarity\n",
"    measures. In a similarity measure-based recommender system the\n",
"    rating prediction is given by looking for similar products in the\n",
"    user profile and scoring the new product with a weighted sum of the\n",
"    ratings of those similar products.\n",
"\n",
"[^6]: The interpretation requires you to think of the web as a series\n",
"    of web pages in a high dimensional space where distances between\n",
"    web pages are computed by moving along the links (in either\n",
"    direction). The PageRank is the one-dimensional space that best\n",
"    preserves those distances in the sense of an L1 norm. The\n",
"    interpretation works because the smallest eigenvalue of the linkage\n",
"    matrix is the *largest* eigenvalue of the inverse of the linkage\n",
"    matrix. The inverse linkage matrix (which would be impossible to\n",
"    compute in practice) embeds similarities between pages according to\n",
"    how far apart they are via a random walk along the linkage matrix.\n",
"\n",
"[^7]: The approach was described early on in the history of machine\n",
"    learning by Chris Watkins, in his PhD thesis in the 1980s. It is\n",
"    known as Q-learning. Its recent success in the games domain is\n",
"    driven by the use of deep learning for the policy and value\n",
"    functions, as well as the use of fast compute to generate and\n",
"    process very large quantities of data. In its standard form it is\n",
"    not seen as a very data-efficient approach.\n",
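"\n",
"    As an illustration, here is a minimal sketch of the tabular\n",
"    Q-learning update on a toy five-state chain; the environment and\n",
"    the hyperparameters are illustrative assumptions:\n",
"\n",
"    ```python\n",
"    # Tabular Q-learning on a 5-state chain: move left or right,\n",
"    # reward 1 for reaching the right-hand end.\n",
"    import numpy as np\n",
"\n",
"    n_states, n_actions = 5, 2          # actions: 0 = left, 1 = right\n",
"    Q = np.zeros((n_states, n_actions)) # table of action values\n",
"    alpha, gamma, epsilon = 0.5, 0.9, 0.3\n",
"    rng = np.random.default_rng(0)\n",
"\n",
"    for episode in range(200):\n",
"        s = 0\n",
"        while s != n_states - 1:\n",
"            # Epsilon-greedy action selection.\n",
"            if rng.random() < epsilon:\n",
"                a = int(rng.integers(n_actions))\n",
"            else:\n",
"                a = int(Q[s].argmax())\n",
"            s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)\n",
"            r = 1.0 if s_next == n_states - 1 else 0.0\n",
"            # Move Q(s, a) towards r + gamma * max_a' Q(s', a').\n",
"            Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])\n",
"            s = s_next\n",
"\n",
"    print(Q.argmax(axis=1))  # greedy policy: should prefer \"right\"\n",
"    ```\n",
"\n",
"    The deep variants replace the table `Q` with a neural network,\n",
"    which is what drives the recent games results described above.\n",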
"\n",
"[^8]: The situation in chess is much easier: first, the number of\n",
"    possible moves at any time is about an order of magnitude lower,\n",
"    meaning the game tree doesn't grow as quickly. Second, in chess,\n",
"    there are well-defined value functions. For example, a value\n",
"    function could be based on adding together the points that are\n",
"    associated with each piece.\n",
"\n",
"[^9]: The approach used here is again Q-learning, as described in the\n",
"    earlier note on Chris Watkins' work; its success in the games\n",
"    domain is driven by deep learning for the policy and value\n",
"    functions, together with fast compute to generate and process very\n",
"    large quantities of data.\n",
"\n",
"[^10]: To try to better embody the state of data readiness in\n",
"    organizations I've been proposing \"Data Readiness Levels\". More\n",
"    needs to be done in this area to improve the efficiency of the\n",
"    data science pipeline.\n",
"\n",
"[^11]: The quote can be found in the Harvard Business Review article\n",
"    [\"What Artificial Intelligence Can and Can't Do Right\n",
"    Now\"](https://hbr.org/2016/11/what-artificial-intelligence-can-and-cant-do-right-now).\n",
"\n",
"[^12]: This trend was very clear at the time; [I spoke about\n",
"    it](http://inverseprobability.com/2016/11/29/new-directions-in-kernels-and-gaussian-processes)\n",
"    at a recent Dagstuhl workshop on new directions for kernel methods\n",
"    and Gaussian processes." ] } ], "metadata": {}, "nbformat": 4, "nbformat_minor": 2 }