{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Probabilistic Machine Learning\n", "### [Neil D. Lawrence](http://inverseprobability.com), Amazon Cambridge and University of Sheffield\n", "### 2018-08-25\n", "\n", "**Abstract**: In this talk we review the *probabilistic* approach to machine learning.\n", "We start with a review of probability, and introduce the concepts of\n", "probabilistic modelling. We then apply the approach in practice to Naive\n", "Bayesian classification. In this session we review the Bayesian\n", "formalism in the context of linear models, reviewing initially maximum\n", "likelihood and introducing basis functions as a way of driving\n", "non-linearity in the model.\n", "\n", "$$\n", "\\newcommand{\\Amatrix}{\\mathbf{A}}\n", "\\newcommand{\\KL}[2]{\\text{KL}\\left( #1\\,\\|\\,#2 \\right)}\n", "\\newcommand{\\Kaast}{\\kernelMatrix_{\\mathbf{ \\ast}\\mathbf{ \\ast}}}\n", "\\newcommand{\\Kastu}{\\kernelMatrix_{\\mathbf{ \\ast} \\inducingVector}}\n", "\\newcommand{\\Kff}{\\kernelMatrix_{\\mappingFunctionVector \\mappingFunctionVector}}\n", "\\newcommand{\\Kfu}{\\kernelMatrix_{\\mappingFunctionVector \\inducingVector}}\n", "\\newcommand{\\Kuast}{\\kernelMatrix_{\\inducingVector \\bf\\ast}}\n", "\\newcommand{\\Kuf}{\\kernelMatrix_{\\inducingVector \\mappingFunctionVector}}\n", "\\newcommand{\\Kuu}{\\kernelMatrix_{\\inducingVector \\inducingVector}}\n", "\\newcommand{\\Kuui}{\\Kuu^{-1}}\n", "\\newcommand{\\Qaast}{\\mathbf{Q}_{\\bf \\ast \\ast}}\n", "\\newcommand{\\Qastf}{\\mathbf{Q}_{\\ast \\mappingFunction}}\n", "\\newcommand{\\Qfast}{\\mathbf{Q}_{\\mappingFunctionVector \\bf \\ast}}\n", "\\newcommand{\\Qff}{\\mathbf{Q}_{\\mappingFunctionVector \\mappingFunctionVector}}\n", "\\newcommand{\\aMatrix}{\\mathbf{A}}\n", "\\newcommand{\\aScalar}{a}\n", "\\newcommand{\\aVector}{\\mathbf{a}}\n", "\\newcommand{\\acceleration}{a}\n", "\\newcommand{\\bMatrix}{\\mathbf{B}}\n", "\\newcommand{\\bScalar}{b}\n", "\\newcommand{\\bVector}{\\mathbf{b}}\n", "\\newcommand{\\basisFunc}{\\phi}\n", "\\newcommand{\\basisFuncVector}{\\boldsymbol{ \\basisFunc}}\n", "\\newcommand{\\basisFunction}{\\phi}\n", "\\newcommand{\\basisLocation}{\\mu}\n", "\\newcommand{\\basisMatrix}{\\boldsymbol{ \\Phi}}\n", "\\newcommand{\\basisScalar}{\\basisFunction}\n", "\\newcommand{\\basisVector}{\\boldsymbol{ \\basisFunction}}\n", "\\newcommand{\\activationFunction}{\\phi}\n", "\\newcommand{\\activationMatrix}{\\boldsymbol{ \\Phi}}\n", "\\newcommand{\\activationScalar}{\\basisFunction}\n", "\\newcommand{\\activationVector}{\\boldsymbol{ \\basisFunction}}\n", "\\newcommand{\\bigO}{\\mathcal{O}}\n", "\\newcommand{\\binomProb}{\\pi}\n", "\\newcommand{\\cMatrix}{\\mathbf{C}}\n", "\\newcommand{\\cbasisMatrix}{\\hat{\\boldsymbol{ \\Phi}}}\n", "\\newcommand{\\cdataMatrix}{\\hat{\\dataMatrix}}\n", "\\newcommand{\\cdataScalar}{\\hat{\\dataScalar}}\n", "\\newcommand{\\cdataVector}{\\hat{\\dataVector}}\n", "\\newcommand{\\centeredKernelMatrix}{\\mathbf{ \\MakeUppercase{\\centeredKernelScalar}}}\n", "\\newcommand{\\centeredKernelScalar}{b}\n", "\\newcommand{\\centeredKernelVector}{\\centeredKernelScalar}\n", "\\newcommand{\\centeringMatrix}{\\mathbf{H}}\n", "\\newcommand{\\chiSquaredDist}[2]{\\chi_{#1}^{2}\\left(#2\\right)}\n", "\\newcommand{\\chiSquaredSamp}[1]{\\chi_{#1}^{2}}\n", "\\newcommand{\\conditionalCovariance}{\\boldsymbol{ \\Sigma}}\n", "\\newcommand{\\coregionalizationMatrix}{\\mathbf{B}}\n", "\\newcommand{\\coregionalizationScalar}{b}\n", "\\newcommand{\\coregionalizationVector}{\\mathbf{ 
\\coregionalizationScalar}}\n", "\\newcommand{\\covDist}[2]{\\text{cov}_{#2}\\left(#1\\right)}\n", "\\newcommand{\\covSamp}[1]{\\text{cov}\\left(#1\\right)}\n", "\\newcommand{\\covarianceScalar}{c}\n", "\\newcommand{\\covarianceVector}{\\mathbf{ \\covarianceScalar}}\n", "\\newcommand{\\covarianceMatrix}{\\mathbf{C}}\n", "\\newcommand{\\covarianceMatrixTwo}{\\boldsymbol{ \\Sigma}}\n", "\\newcommand{\\croupierScalar}{s}\n", "\\newcommand{\\croupierVector}{\\mathbf{ \\croupierScalar}}\n", "\\newcommand{\\croupierMatrix}{\\mathbf{ \\MakeUppercase{\\croupierScalar}}}\n", "\\newcommand{\\dataDim}{p}\n", "\\newcommand{\\dataIndex}{i}\n", "\\newcommand{\\dataIndexTwo}{j}\n", "\\newcommand{\\dataMatrix}{\\mathbf{Y}}\n", "\\newcommand{\\dataScalar}{y}\n", "\\newcommand{\\dataSet}{\\mathcal{D}}\n", "\\newcommand{\\dataStd}{\\sigma}\n", "\\newcommand{\\dataVector}{\\mathbf{ \\dataScalar}}\n", "\\newcommand{\\decayRate}{d}\n", "\\newcommand{\\degreeMatrix}{\\mathbf{ \\MakeUppercase{\\degreeScalar}}}\n", "\\newcommand{\\degreeScalar}{d}\n", "\\newcommand{\\degreeVector}{\\mathbf{ \\degreeScalar}}\n", "% Already defined by latex\n", "%\\newcommand{\\det}[1]{\\left|#1\\right|}\n", "\\newcommand{\\diag}[1]{\\text{diag}\\left(#1\\right)}\n", "\\newcommand{\\diagonalMatrix}{\\mathbf{D}}\n", "\\newcommand{\\diff}[2]{\\frac{\\text{d}#1}{\\text{d}#2}}\n", "\\newcommand{\\diffTwo}[2]{\\frac{\\text{d}^2#1}{\\text{d}#2^2}}\n", "\\newcommand{\\displacement}{x}\n", "\\newcommand{\\displacementVector}{\\textbf{\\displacement}}\n", "\\newcommand{\\distanceMatrix}{\\mathbf{ \\MakeUppercase{\\distanceScalar}}}\n", "\\newcommand{\\distanceScalar}{d}\n", "\\newcommand{\\distanceVector}{\\mathbf{ \\distanceScalar}}\n", "\\newcommand{\\eigenvaltwo}{\\ell}\n", "\\newcommand{\\eigenvaltwoMatrix}{\\mathbf{L}}\n", "\\newcommand{\\eigenvaltwoVector}{\\mathbf{l}}\n", "\\newcommand{\\eigenvalue}{\\lambda}\n", "\\newcommand{\\eigenvalueMatrix}{\\boldsymbol{ \\Lambda}}\n", "\\newcommand{\\eigenvalueVector}{\\boldsymbol{ \\lambda}}\n", "\\newcommand{\\eigenvector}{\\mathbf{ \\eigenvectorScalar}}\n", "\\newcommand{\\eigenvectorMatrix}{\\mathbf{U}}\n", "\\newcommand{\\eigenvectorScalar}{u}\n", "\\newcommand{\\eigenvectwo}{\\mathbf{v}}\n", "\\newcommand{\\eigenvectwoMatrix}{\\mathbf{V}}\n", "\\newcommand{\\eigenvectwoScalar}{v}\n", "\\newcommand{\\entropy}[1]{\\mathcal{H}\\left(#1\\right)}\n", "\\newcommand{\\errorFunction}{E}\n", "\\newcommand{\\expDist}[2]{\\left<#1\\right>_{#2}}\n", "\\newcommand{\\expSamp}[1]{\\left<#1\\right>}\n", "\\newcommand{\\expectation}[1]{\\left\\langle #1 \\right\\rangle }\n", "\\newcommand{\\expectationDist}[2]{\\left\\langle #1 \\right\\rangle _{#2}}\n", "\\newcommand{\\expectedDistanceMatrix}{\\mathcal{D}}\n", "\\newcommand{\\eye}{\\mathbf{I}}\n", "\\newcommand{\\fantasyDim}{r}\n", "\\newcommand{\\fantasyMatrix}{\\mathbf{ \\MakeUppercase{\\fantasyScalar}}}\n", "\\newcommand{\\fantasyScalar}{z}\n", "\\newcommand{\\fantasyVector}{\\mathbf{ \\fantasyScalar}}\n", "\\newcommand{\\featureStd}{\\varsigma}\n", "\\newcommand{\\gammaCdf}[3]{\\mathcal{GAMMA CDF}\\left(#1|#2,#3\\right)}\n", "\\newcommand{\\gammaDist}[3]{\\mathcal{G}\\left(#1|#2,#3\\right)}\n", "\\newcommand{\\gammaSamp}[2]{\\mathcal{G}\\left(#1,#2\\right)}\n", "\\newcommand{\\gaussianDist}[3]{\\mathcal{N}\\left(#1|#2,#3\\right)}\n", "\\newcommand{\\gaussianSamp}[2]{\\mathcal{N}\\left(#1,#2\\right)}\n", "\\newcommand{\\given}{|}\n", "\\newcommand{\\half}{\\frac{1}{2}}\n", "\\newcommand{\\heaviside}{H}\n", "\\newcommand{\\hiddenMatrix}{\\mathbf{ 
\\MakeUppercase{\\hiddenScalar}}}\n", "\\newcommand{\\hiddenScalar}{h}\n", "\\newcommand{\\hiddenVector}{\\mathbf{ \\hiddenScalar}}\n", "\\newcommand{\\identityMatrix}{\\eye}\n", "\\newcommand{\\inducingInputScalar}{z}\n", "\\newcommand{\\inducingInputVector}{\\mathbf{ \\inducingInputScalar}}\n", "\\newcommand{\\inducingInputMatrix}{\\mathbf{Z}}\n", "\\newcommand{\\inducingScalar}{u}\n", "\\newcommand{\\inducingVector}{\\mathbf{ \\inducingScalar}}\n", "\\newcommand{\\inducingMatrix}{\\mathbf{U}}\n", "\\newcommand{\\inlineDiff}[2]{\\text{d}#1/\\text{d}#2}\n", "\\newcommand{\\inputDim}{q}\n", "\\newcommand{\\inputMatrix}{\\mathbf{X}}\n", "\\newcommand{\\inputScalar}{x}\n", "\\newcommand{\\inputSpace}{\\mathcal{X}}\n", "\\newcommand{\\inputVals}{\\inputVector}\n", "\\newcommand{\\inputVector}{\\mathbf{ \\inputScalar}}\n", "\\newcommand{\\iterNum}{k}\n", "\\newcommand{\\kernel}{\\kernelScalar}\n", "\\newcommand{\\kernelMatrix}{\\mathbf{K}}\n", "\\newcommand{\\kernelScalar}{k}\n", "\\newcommand{\\kernelVector}{\\mathbf{ \\kernelScalar}}\n", "\\newcommand{\\kff}{\\kernelScalar_{\\mappingFunction \\mappingFunction}}\n", "\\newcommand{\\kfu}{\\kernelVector_{\\mappingFunction \\inducingScalar}}\n", "\\newcommand{\\kuf}{\\kernelVector_{\\inducingScalar \\mappingFunction}}\n", "\\newcommand{\\kuu}{\\kernelVector_{\\inducingScalar \\inducingScalar}}\n", "\\newcommand{\\lagrangeMultiplier}{\\lambda}\n", "\\newcommand{\\lagrangeMultiplierMatrix}{\\boldsymbol{ \\Lambda}}\n", "\\newcommand{\\lagrangian}{L}\n", "\\newcommand{\\laplacianFactor}{\\mathbf{ \\MakeUppercase{\\laplacianFactorScalar}}}\n", "\\newcommand{\\laplacianFactorScalar}{m}\n", "\\newcommand{\\laplacianFactorVector}{\\mathbf{ \\laplacianFactorScalar}}\n", "\\newcommand{\\laplacianMatrix}{\\mathbf{L}}\n", "\\newcommand{\\laplacianScalar}{\\ell}\n", "\\newcommand{\\laplacianVector}{\\mathbf{ \\ell}}\n", "\\newcommand{\\latentDim}{q}\n", "\\newcommand{\\latentDistanceMatrix}{\\boldsymbol{ \\Delta}}\n", "\\newcommand{\\latentDistanceScalar}{\\delta}\n", "\\newcommand{\\latentDistanceVector}{\\boldsymbol{ \\delta}}\n", "\\newcommand{\\latentForce}{f}\n", "\\newcommand{\\latentFunction}{u}\n", "\\newcommand{\\latentFunctionVector}{\\mathbf{ \\latentFunction}}\n", "\\newcommand{\\latentFunctionMatrix}{\\mathbf{ \\MakeUppercase{\\latentFunction}}}\n", "\\newcommand{\\latentIndex}{j}\n", "\\newcommand{\\latentScalar}{z}\n", "\\newcommand{\\latentVector}{\\mathbf{ \\latentScalar}}\n", "\\newcommand{\\latentMatrix}{\\mathbf{Z}}\n", "\\newcommand{\\learnRate}{\\eta}\n", "\\newcommand{\\lengthScale}{\\ell}\n", "\\newcommand{\\rbfWidth}{\\ell}\n", "\\newcommand{\\likelihoodBound}{\\mathcal{L}}\n", "\\newcommand{\\likelihoodFunction}{L}\n", "\\newcommand{\\locationScalar}{\\mu}\n", "\\newcommand{\\locationVector}{\\boldsymbol{ \\locationScalar}}\n", "\\newcommand{\\locationMatrix}{\\mathbf{M}}\n", "\\newcommand{\\variance}[1]{\\text{var}\\left( #1 \\right)}\n", "\\newcommand{\\mappingFunction}{f}\n", "\\newcommand{\\mappingFunctionMatrix}{\\mathbf{F}}\n", "\\newcommand{\\mappingFunctionTwo}{g}\n", "\\newcommand{\\mappingFunctionTwoMatrix}{\\mathbf{G}}\n", "\\newcommand{\\mappingFunctionTwoVector}{\\mathbf{ \\mappingFunctionTwo}}\n", "\\newcommand{\\mappingFunctionVector}{\\mathbf{ \\mappingFunction}}\n", "\\newcommand{\\scaleScalar}{s}\n", "\\newcommand{\\mappingScalar}{w}\n", "\\newcommand{\\mappingVector}{\\mathbf{ \\mappingScalar}}\n", "\\newcommand{\\mappingMatrix}{\\mathbf{W}}\n", "\\newcommand{\\mappingScalarTwo}{v}\n", 
"\\newcommand{\\mappingVectorTwo}{\\mathbf{ \\mappingScalarTwo}}\n", "\\newcommand{\\mappingMatrixTwo}{\\mathbf{V}}\n", "\\newcommand{\\maxIters}{K}\n", "\\newcommand{\\meanMatrix}{\\mathbf{M}}\n", "\\newcommand{\\meanScalar}{\\mu}\n", "\\newcommand{\\meanTwoMatrix}{\\mathbf{M}}\n", "\\newcommand{\\meanTwoScalar}{m}\n", "\\newcommand{\\meanTwoVector}{\\mathbf{ \\meanTwoScalar}}\n", "\\newcommand{\\meanVector}{\\boldsymbol{ \\meanScalar}}\n", "\\newcommand{\\mrnaConcentration}{m}\n", "\\newcommand{\\naturalFrequency}{\\omega}\n", "\\newcommand{\\neighborhood}[1]{\\mathcal{N}\\left( #1 \\right)}\n", "\\newcommand{\\neilurl}{http://inverseprobability.com/}\n", "\\newcommand{\\noiseMatrix}{\\boldsymbol{ E}}\n", "\\newcommand{\\noiseScalar}{\\epsilon}\n", "\\newcommand{\\noiseVector}{\\boldsymbol{ \\epsilon}}\n", "\\newcommand{\\norm}[1]{\\left\\Vert #1 \\right\\Vert}\n", "\\newcommand{\\normalizedLaplacianMatrix}{\\hat{\\mathbf{L}}}\n", "\\newcommand{\\normalizedLaplacianScalar}{\\hat{\\ell}}\n", "\\newcommand{\\normalizedLaplacianVector}{\\hat{\\mathbf{ \\ell}}}\n", "\\newcommand{\\numActive}{m}\n", "\\newcommand{\\numBasisFunc}{m}\n", "\\newcommand{\\numComponents}{m}\n", "\\newcommand{\\numComps}{K}\n", "\\newcommand{\\numData}{n}\n", "\\newcommand{\\numFeatures}{K}\n", "\\newcommand{\\numHidden}{h}\n", "\\newcommand{\\numInducing}{m}\n", "\\newcommand{\\numLayers}{\\ell}\n", "\\newcommand{\\numNeighbors}{K}\n", "\\newcommand{\\numSequences}{s}\n", "\\newcommand{\\numSuccess}{s}\n", "\\newcommand{\\numTasks}{m}\n", "\\newcommand{\\numTime}{T}\n", "\\newcommand{\\numTrials}{S}\n", "\\newcommand{\\outputIndex}{j}\n", "\\newcommand{\\paramVector}{\\boldsymbol{ \\theta}}\n", "\\newcommand{\\parameterMatrix}{\\boldsymbol{ \\Theta}}\n", "\\newcommand{\\parameterScalar}{\\theta}\n", "\\newcommand{\\parameterVector}{\\boldsymbol{ \\parameterScalar}}\n", "\\newcommand{\\partDiff}[2]{\\frac{\\partial#1}{\\partial#2}}\n", "\\newcommand{\\precisionScalar}{j}\n", "\\newcommand{\\precisionVector}{\\mathbf{ \\precisionScalar}}\n", "\\newcommand{\\precisionMatrix}{\\mathbf{J}}\n", "\\newcommand{\\pseudotargetScalar}{\\widetilde{y}}\n", "\\newcommand{\\pseudotargetVector}{\\mathbf{ \\pseudotargetScalar}}\n", "\\newcommand{\\pseudotargetMatrix}{\\mathbf{ \\widetilde{Y}}}\n", "\\newcommand{\\rank}[1]{\\text{rank}\\left(#1\\right)}\n", "\\newcommand{\\rayleighDist}[2]{\\mathcal{R}\\left(#1|#2\\right)}\n", "\\newcommand{\\rayleighSamp}[1]{\\mathcal{R}\\left(#1\\right)}\n", "\\newcommand{\\responsibility}{r}\n", "\\newcommand{\\rotationScalar}{r}\n", "\\newcommand{\\rotationVector}{\\mathbf{ \\rotationScalar}}\n", "\\newcommand{\\rotationMatrix}{\\mathbf{R}}\n", "\\newcommand{\\sampleCovScalar}{s}\n", "\\newcommand{\\sampleCovVector}{\\mathbf{ \\sampleCovScalar}}\n", "\\newcommand{\\sampleCovMatrix}{\\mathbf{s}}\n", "\\newcommand{\\scalarProduct}[2]{\\left\\langle{#1},{#2}\\right\\rangle}\n", "\\newcommand{\\sign}[1]{\\text{sign}\\left(#1\\right)}\n", "\\newcommand{\\sigmoid}[1]{\\sigma\\left(#1\\right)}\n", "\\newcommand{\\singularvalue}{\\ell}\n", "\\newcommand{\\singularvalueMatrix}{\\mathbf{L}}\n", "\\newcommand{\\singularvalueVector}{\\mathbf{l}}\n", "\\newcommand{\\sorth}{\\mathbf{u}}\n", "\\newcommand{\\spar}{\\lambda}\n", "\\newcommand{\\trace}[1]{\\text{tr}\\left(#1\\right)}\n", "\\newcommand{\\BasalRate}{B}\n", "\\newcommand{\\DampingCoefficient}{C}\n", "\\newcommand{\\DecayRate}{D}\n", "\\newcommand{\\Displacement}{X}\n", "\\newcommand{\\LatentForce}{F}\n", "\\newcommand{\\Mass}{M}\n", 
"\\newcommand{\\Sensitivity}{S}\n", "\\newcommand{\\basalRate}{b}\n", "\\newcommand{\\dampingCoefficient}{c}\n", "\\newcommand{\\mass}{m}\n", "\\newcommand{\\sensitivity}{s}\n", "\\newcommand{\\springScalar}{\\kappa}\n", "\\newcommand{\\springVector}{\\boldsymbol{ \\kappa}}\n", "\\newcommand{\\springMatrix}{\\boldsymbol{ \\mathcal{K}}}\n", "\\newcommand{\\tfConcentration}{p}\n", "\\newcommand{\\tfDecayRate}{\\delta}\n", "\\newcommand{\\tfMrnaConcentration}{f}\n", "\\newcommand{\\tfVector}{\\mathbf{ \\tfConcentration}}\n", "\\newcommand{\\velocity}{v}\n", "\\newcommand{\\sufficientStatsScalar}{g}\n", "\\newcommand{\\sufficientStatsVector}{\\mathbf{ \\sufficientStatsScalar}}\n", "\\newcommand{\\sufficientStatsMatrix}{\\mathbf{G}}\n", "\\newcommand{\\switchScalar}{s}\n", "\\newcommand{\\switchVector}{\\mathbf{ \\switchScalar}}\n", "\\newcommand{\\switchMatrix}{\\mathbf{S}}\n", "\\newcommand{\\tr}[1]{\\text{tr}\\left(#1\\right)}\n", "\\newcommand{\\loneNorm}[1]{\\left\\Vert #1 \\right\\Vert_1}\n", "\\newcommand{\\ltwoNorm}[1]{\\left\\Vert #1 \\right\\Vert_2}\n", "\\newcommand{\\onenorm}[1]{\\left\\vert#1\\right\\vert_1}\n", "\\newcommand{\\twonorm}[1]{\\left\\Vert #1 \\right\\Vert}\n", "\\newcommand{\\vScalar}{v}\n", "\\newcommand{\\vVector}{\\mathbf{v}}\n", "\\newcommand{\\vMatrix}{\\mathbf{V}}\n", "\\newcommand{\\varianceDist}[2]{\\text{var}_{#2}\\left( #1 \\right)}\n", "% Already defined by latex\n", "%\\newcommand{\\vec}{#1:}\n", "\\newcommand{\\vecb}[1]{\\left(#1\\right):}\n", "\\newcommand{\\weightScalar}{w}\n", "\\newcommand{\\weightVector}{\\mathbf{ \\weightScalar}}\n", "\\newcommand{\\weightMatrix}{\\mathbf{W}}\n", "\\newcommand{\\weightedAdjacencyMatrix}{\\mathbf{A}}\n", "\\newcommand{\\weightedAdjacencyScalar}{a}\n", "\\newcommand{\\weightedAdjacencyVector}{\\mathbf{ \\weightedAdjacencyScalar}}\n", "\\newcommand{\\onesVector}{\\mathbf{1}}\n", "\\newcommand{\\zerosVector}{\\mathbf{0}}\n", "$$\n", "\n", "\n", "\n", "\n", "\n", "## What is Machine Learning?\n", "\n", "### What is Machine Learning?\n", "\n", ". . .\n", "\n", "$$ \\text{data} + \\text{model} \\xrightarrow{\\text{compute}} \\text{prediction}$$\n", "\n", ". . .\n", "\n", "- **data** : observations, could be actively or passively acquired\n", " (meta-data).\n", "\n", ". . .\n", "\n", "- **model** : assumptions, based on previous experience (other data!\n", " transfer learning etc), or beliefs about the regularities of the\n", " universe. Inductive bias.\n", "\n", ". . .\n", "\n", "- **prediction** : an action to be taken or a categorization or a\n", " quality score.\n", "\n", ". . .\n", "\n", "- Royal Society Report: [Machine Learning: Power and Promise of\n", " Computers that Learn by\n", " Example](https://royalsociety.org/~/media/policy/projects/machine-learning/publications/machine-learning-report.pdf)\n", "\n", "### What is Machine Learning?\n", "\n", "$$\\text{data} + \\text{model} \\xrightarrow{\\text{compute}} \\text{prediction}$$\n", "\n", ". . .\n", "\n", "- To combine data with a model need:\n", "\n", ". . .\n", "\n", "- **a prediction function** $\\mappingFunction(\\cdot)$ includes our\n", " beliefs about the regularities of the universe\n", "\n", ". . .\n", "\n", "- **an objective function** $\\errorFunction(\\cdot)$ defines the cost\n", " of misprediction.\n", "\n", "## Probabilities\n", "\n", "## Movie Body Count Example\n", "\n", "### `pods`\n", "\n", "The `pods` library is a library for supporting open data science (python\n", "open data science). 
It allows you to load in various data sets and\n", "provides tools for helping teach in the notebook.\n", "\n", "To install pods you can use pip:\n", "\n", "`pip install pods`\n", "\n", "The code is also available on github: \n", "\n", "Once `pods` is installed, it can be imported in the usual manner.\n", "\n", "## Conditioning\n", "\n", "### Probability Review\n", "\n", "- We are interested in trials which result in two random variables,\n", " $X$ and $Y$, each of which has an ‘outcome’ denoted by $x$ or $y$.\n", "- We summarise the notation and terminology for these distributions in\n", " the following table.\n", "\n", "### \n", "\n", " Terminology Mathematical notation Description\n", " ------------- ----------------------- ----------------------------------\n", " joint $P(X=x, Y=y)$ prob. that X=x *and* Y=y\n", " marginal $P(X=x)$ prob. that X=x *regardless of* Y\n", " conditional $P(X=x\\vert Y=y)$ prob. that X=x *given that* Y=y\n", "\n", "
\n", "The different basic probability distributions.\n", "
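\n",
"We can make these definitions concrete with a quick numerical sketch. Using a made-up table of counts $n_{X=x,Y=y}$, the joint, marginal and conditional probabilities are simply different normalizations of the same counts.\n"
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"import numpy as np\n",
"\n",
"# made-up counts n_{X=x, Y=y}: rows index x, columns index y\n",
"n = np.array([[10,  5],\n",
"              [20, 15],\n",
"              [30, 20]])\n",
"N = n.sum()\n",
"\n",
"joint = n / N                       # P(X=x, Y=y)\n",
"marginal_x = n.sum(axis=1) / N      # P(X=x), summing out Y\n",
"cond_x_given_y = n / n.sum(axis=0)  # P(X=x | Y=y), one column per value of y\n",
"\n",
"print(joint.sum())                  # normalization: should be 1.0\n",
"print(marginal_x)\n",
"print(cond_x_given_y[:, 1])         # P(X=x | Y=y_2)\n",
"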
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import teaching_plots as plot" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plot.prob_diagram(diagrams='../slides/diagrams/mlai')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### A Pictorial Definition of Probability\n", "\n", "\n", "\n", "[Inspired by lectures from Christopher Bishop]{align=\"right\"}\n", "\n", "### Definition of probability distributions.\n", "\n", " Terminology Definition Probability Notation\n", " -------------------------------------------------------- ------------------------------------------------------------------------------------------------------------------------------ -------------------------------------------------------------------------------------------\n", " Joint Probability $\\lim_{N\\rightarrow\\infty}\\frac{n_{X=3,Y=4}}{N}$ $P\\left(X=3,Y=4\\right)$\n", " Marginal Probability $\\lim_{N\\rightarrow\\infty}\\frac{n_{X=5}}{N}$ $P\\left(X=5\\right)$\n", " Conditional Probability $\\lim_{N\\rightarrow\\infty}\\frac{n_{X=3,Y=4}}{n_{Y=4}}$ $P\\left(X=3\\vert Y=4\\right)$\n", "\n", "### Notational Details\n", "\n", "- Typically we should write out $P\\left(X=x,Y=y\\right)$.\n", "\n", "- In practice, we often use $P\\left(x,y\\right)$.\n", "- This looks very much like we might write a multivariate function,\n", " *e.g.* $f\\left(x,y\\right)=\\frac{x}{y}$.\n", "- For a multivariate function though,\n", " $f\\left(x,y\\right)\\neq f\\left(y,x\\right)$.\n", "- However $P\\left(x,y\\right)=P\\left(y,x\\right)$ because\n", " $P\\left(X=x,Y=y\\right)=P\\left(Y=y,X=x\\right)$.\n", "- We now quickly review the ‘rules of probability’.\n", "\n", "### Normalization\n", "\n", "*All* distributions are normalized. This is clear from the fact that\n", "$\\sum_{x}n_{x}=N$, which gives\n", "$$\\sum_{x}P\\left(x\\right)={\\lim_{N\\rightarrow\\infty}}\\frac{\\sum_{x}n_{x}}{N}={\\lim_{N\\rightarrow\\infty}}\\frac{N}{N}=1.$$\n", "A similar result can be derived for the marginal and conditional\n", "distributions.\n", "\n", "### The Product Rule\n", "\n", "- $P\\left(x|y\\right)$ is $$\n", " {\\lim_{N\\rightarrow\\infty}}\\frac{n_{x,y}}{n_{y}}.\n", " $$\n", "- $P\\left(x,y\\right)$ is $$\n", " {\\lim_{N\\rightarrow\\infty}}\\frac{n_{x,y}}{N}={\\lim_{N\\rightarrow\\infty}}\\frac{n_{x,y}}{n_{y}}\\frac{n_{y}}{N}\n", " $$ or in other words $$\n", " P\\left(x,y\\right)=P\\left(x|y\\right)P\\left(y\\right).\n", " $$ This is known as the product rule of probability.\n", "\n", "### The Sum Rule\n", "\n", "Ignoring the limit in our definitions: \\* The marginal probability\n", "$P\\left(y\\right)$ is ${\\lim_{N\\rightarrow\\infty}}\\frac{n_{y}}{N}$ . \\*\n", "The joint distribution $P\\left(x,y\\right)$ is\n", "${\\lim_{N\\rightarrow\\infty}}\\frac{n_{x,y}}{N}$. 
\\*\n",
"$n_{y}=\\sum_{x}n_{x,y}$ so $$\n",
" {\\lim_{N\\rightarrow\\infty}}\\frac{n_{y}}{N}={\\lim_{N\\rightarrow\\infty}}\\sum_{x}\\frac{n_{x,y}}{N},\n",
" $$ in other words $$\n",
" P\\left(y\\right)=\\sum_{x}P\\left(x,y\\right).\n",
" $$ This is known as the sum rule of probability.\n",
"\n",
"### Bayes’ Rule\n",
"\n",
"- From the product rule, $$\n",
" P\\left(y,x\\right)=P\\left(x,y\\right)=P\\left(x|y\\right)P\\left(y\\right),$$\n",
" so $$\n",
" P\\left(y|x\\right)P\\left(x\\right)=P\\left(x|y\\right)P\\left(y\\right)\n",
" $$ which leads to Bayes’ rule, $$\n",
" P\\left(y|x\\right)=\\frac{P\\left(x|y\\right)P\\left(y\\right)}{P\\left(x\\right)}.\n",
" $$\n",
"\n",
"### Bayes’ Theorem Example\n",
"\n",
"- There are two barrels in front of you. Barrel One contains 20 apples\n",
" and 4 oranges. Barrel Two contains 4 apples and 8 oranges. You\n",
" choose a barrel randomly and select a fruit. It is an apple. What is\n",
" the probability that the barrel was Barrel One?\n",
"\n",
"### Bayes’ Theorem Example: Answer I\n",
"\n",
"- We are given that: $$\\begin{aligned}\n",
" P(\\text{F}=\\text{A}|\\text{B}=1) = & 20/24 \\\\\n",
" P(\\text{F}=\\text{A}|\\text{B}=2) = & 4/12 \\\\\n",
" P(\\text{B}=1) = & 0.5 \\\\\n",
" P(\\text{B}=2) = & 0.5\n",
" \\end{aligned}$$\n",
"\n",
"### Bayes’ Theorem Example: Answer II\n",
"\n",
"- We use the sum rule to compute: $$\\begin{aligned}\n",
" P(\\text{F}=\\text{A}) = & P(\\text{F}=\\text{A}|\\text{B}=1)P(\\text{B}=1) \\\\& + P(\\text{F}=\\text{A}|\\text{B}=2)P(\\text{B}=2) \\\\\n",
" = & 20/24\\times 0.5 + 4/12 \\times 0.5 = 7/12\n",
" \\end{aligned}$$\n",
"- And Bayes’ theorem tells us that: $$\\begin{aligned}\n",
" P(\\text{B}=1|\\text{F}=\\text{A}) = & \\frac{P(\\text{F} = \\text{A}|\\text{B}=1)P(\\text{B}=1)}{P(\\text{F}=\\text{A})}\\\\ \n",
" = & \\frac{20/24 \\times 0.5}{7/12} = 5/7\n",
" \\end{aligned}$$\n",
"\n",
"### Reading & Exercises\n",
"\n",
"- @Bishop:book06 on probability distributions: pages 12–17 (Section\n",
" 1.2).\n",
"- Complete Exercise 1.3 in @Bishop:book06.\n",
"\n",
"### Computing Expectations Example\n",
"\n",
"- Consider the following distribution.\n",
"\n",
" $y$ 1 2 3 4\n",
" ------------------- ----- ----- ----- -----\n",
" $P\\left(y\\right)$ 0.3 0.2 0.1 0.4\n",
"\n",
"- What is the mean of the distribution?\n",
"- What is the standard deviation of the distribution?\n",
"- Are the mean and standard deviation representative of the\n",
" distribution form?\n",
"- What is the expected value of $-\\log P(y)$?\n",
"\n",
"### Expectations Example: Answer\n",
"\n",
"- We are given that:\n",
"\n",
" $y$ 1 2 3 4\n",
" ------------------- ------- ------- ------- -------\n",
" $P\\left(y\\right)$ 0.3 0.2 0.1 0.4\n",
" $y^2$ 1 4 9 16\n",
" $-\\log(P(y))$ 1.204 1.609 2.302 0.916\n",
"\n",
"- Mean:\n",
" $1\\times 0.3 + 2\\times 0.2 + 3 \\times 0.1 + 4 \\times 0.4 = 2.6$\n",
"- Second moment:\n",
" $1 \\times 0.3 + 4 \\times 0.2 + 9 \\times 0.1 + 16 \\times 0.4 = 8.4$\n",
"- Variance: $8.4 - 2.6\\times 2.6 = 1.64$\n",
"- Standard deviation: $\\sqrt{1.64} = 1.2806$\n",
"\n",
"### Expectations Example: Answer II\n",
"\n",
"- We are given that:\n",
"\n",
" $y$ 1 2 3 4\n",
" ------------------- ------- ------- ------- -------\n",
" $P\\left(y\\right)$ 0.3 0.2 0.1 0.4\n",
" $y^2$ 1 4 9 16\n",
" $-\\log(P(y))$ 1.204 1.609 2.302 0.916\n",
"\n",
"- Expectation $-\\log(P(y))$:\n",
" $0.3\\times 1.204 + 0.2\\times 1.609 + 0.1\\times 2.302 +0.4\\times 0.916 = 1.280$\n",
"\n",
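"These values are straightforward to check numerically. The cell below is a quick check, using only the numbers in the table above, that recomputes the mean, the variance and the expectation of $-\\log P(y)$.\n"
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"import numpy as np\n",
"\n",
"y = np.array([1, 2, 3, 4])\n",
"p = np.array([0.3, 0.2, 0.1, 0.4])\n",
"\n",
"mean = (y*p).sum()                  # 2.6\n",
"second_moment = (y**2*p).sum()      # 8.4\n",
"variance = second_moment - mean**2  # 1.64\n",
"std = np.sqrt(variance)             # 1.2806...\n",
"neg_log_expectation = -(p*np.log(p)).sum()  # expectation of -log P(y), about 1.280\n",
"\n",
"print(mean, variance, std, neg_log_expectation)"
] }, { "cell_type": "markdown", "metadata": {}, "source": [
"### Sample Based Approximation Example\n",
"\n",
"- You are 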
given the following values samples of heights of students," ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "$i$ 1 2 3 4 5 6\n", "------- ------ ------ ------ ------ ------ ------\n", "$y_i$ 1.76 1.73 1.79 1.81 1.85 1.80" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- What is the sample mean?\n", "- What is the sample variance?\n", "- Can you compute sample approximation expected value of $-\\log P(y)$?\n", "\n", "### Sample Based Approximation Example: Answer\n", "\n", "- We can compute:\n", "\n", " $i$ 1 2 3 4 5 6\n", " --------- -------- -------- -------- -------- -------- --------\n", " $y_i$ 1.76 1.73 1.79 1.81 1.85 1.80\n", " $y^2_i$ 3.0976 2.9929 3.2041 3.2761 3.4225 3.2400\n", "\n", "- Mean: $\\frac{1.76 + 1.73 + 1.79 + 1.81 + 1.85 + 1.80}{6} = 1.79$\n", "- Second moment: \\$\n", " \\frac{3.0976 + 2.9929 + 3.2041 + 3.2761 + 3.4225 + 3.2400}{6} =\n", " 3.2055\\$\n", "- Variance: $3.2055 - 1.79\\times1.79 = 1.43\\times 10^{-3}$\n", "- Standard deviation: $0.0379$\n", "- No, you can’t compute it. You don’t have access to $P(y)$ directly.\n", "\n", "### Sample Based Approximation Example\n", "\n", "- You are given the following values samples of heights of students," ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "$i$ 1 2 3 4 5 6\n", "------- ------ ------ ------ ------ ------ ------\n", "$y_i$ 1.76 1.73 1.79 1.81 1.85 1.80" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Actually these “data” were sampled from a Gaussian with mean 1.7 and\n", " standard deviation 0.15. Are your estimates close to the real\n", " values? If not why not?\n", "\n", "### Probabilistic Modelling\n", "\n", "- Probabilistically we want, $$\n", " p(\\dataScalar_*|\\dataVector, \\inputMatrix, \\inputVector_*),\n", " $$ $\\dataScalar_*$ is a test output $\\inputVector_*$ is a test input\n", " $\\inputMatrix$ is a training input matrix $\\dataVector$ is training\n", " outputs\n", "\n", "### Joint Model of World\n", "\n", "$$\n", "p(\\dataScalar_*|\\dataVector, \\inputMatrix, \\inputVector_*) = \\int p(\\dataScalar_*|\\inputVector_*, \\mappingMatrix) p(\\mappingMatrix | \\dataVector, \\inputMatrix) \\text{d} \\mappingMatrix\n", "$$\n", "\n", ". . .\n", "\n", "$\\mappingMatrix$ contains $\\mappingMatrix_1$ and $\\mappingMatrix_2$\n", "\n", "$p(\\mappingMatrix | \\dataVector, \\inputMatrix)$ is posterior density\n", "\n", "### Likelihood\n", "\n", "$p(\\dataScalar|\\inputVector, \\mappingMatrix)$ is the *likelihood* of\n", "data point\n", "\n", ". . 
.\n",
"\n",
"Normally assume independence: $$\n",
"p(\\dataVector|\\inputMatrix, \\mappingMatrix) = \\prod_{i=1}^\\numData p(\\dataScalar_i|\\inputVector_i, \\mappingMatrix),$$\n",
"\n",
"### Likelihood and Prediction Function\n",
"\n",
"$$\n",
"p(\\dataScalar_i | \\mappingFunction(\\inputVector_i)) = \\frac{1}{\\sqrt{2\\pi \\dataStd^2}} \\exp\\left(-\\frac{\\left(\\dataScalar_i - \\mappingFunction(\\inputVector_i)\\right)^2}{2\\dataStd^2}\\right)\n",
"$$\n",
"\n",
"### Unsupervised Learning\n",
"\n",
"- Can also consider priors over latents $$\n",
" p(\\dataVector_*|\\dataVector) = \\int p(\\dataVector_*|\\inputMatrix_*, \\mappingMatrix) p(\\mappingMatrix | \\dataVector, \\inputMatrix) p(\\inputMatrix) p(\\inputMatrix_*) \\text{d} \\mappingMatrix \\text{d} \\inputMatrix \\text{d}\\inputMatrix_*\n",
" $$\n",
"\n",
"- This gives *unsupervised learning*.\n",
"\n",
"### Probabilistic Inference\n",
"\n",
"- Data: $\\dataVector$\n",
"\n",
"- Model: $p(\\dataVector, \\dataVector^*)$\n",
"\n",
"- Prediction: $p(\\dataVector^*| \\dataVector)$\n",
"\n",
"### Graphical Models\n",
"\n",
"- Represent joint distribution through *conditional dependencies*.\n",
"- E.g. Markov chain\n",
"\n",
"$$p(\\dataVector) = p(\\dataScalar_\\numData | \\dataScalar_{\\numData-1}) p(\\dataScalar_{\\numData-1}|\\dataScalar_{\\numData-2}) \\dots p(\\dataScalar_{2} | \\dataScalar_{1}) p(\\dataScalar_{1})$$"
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"import daft\n",
"from matplotlib import rc\n",
"\n",
"rc(\"font\", **{'family':'sans-serif','sans-serif':['Helvetica']}, size=30)\n",
"rc(\"text\", usetex=True)"
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"pgm = daft.PGM(shape=[3, 1],\n",
"               origin=[0, 0], \n",
"               grid_unit=5, \n",
"               node_unit=1.9, \n",
"               observed_style='shaded',\n",
"               line_width=3)\n",
"\n",
"\n",
"pgm.add_node(daft.Node(\"y_1\", r\"$y_1$\", 0.5, 0.5, fixed=False))\n",
"pgm.add_node(daft.Node(\"y_2\", r\"$y_2$\", 1.5, 0.5, fixed=False))\n",
"pgm.add_node(daft.Node(\"y_3\", r\"$y_3$\", 2.5, 0.5, fixed=False))\n",
"pgm.add_edge(\"y_1\", \"y_2\")\n",
"pgm.add_edge(\"y_2\", \"y_3\")\n",
"\n",
"pgm.render().figure.savefig(\"../slides/diagrams/ml/markov.svg\", transparent=True)"
] }, { "cell_type": "markdown", "metadata": {}, "source": [
"\n",
"\n",
"### \n",
"\n",
"Predict Perioperative Risk of Clostridium Difficile Infection Following\n",
"Colon Surgery [@Steele:predictive12]\n",
"\n",
"\n",
"\n",
"### Classification\n",
"\n",
"- We are given a data set containing 'inputs', $\\inputMatrix$ and\n",
" 'targets', $\\dataVector$.\n",
"- Each data point consists of an input vector $\\inputVector_i$ and a\n",
" class label, $\\dataScalar_i$.\n",
"- For binary classification assume $\\dataScalar_i$ should be either\n",
" $1$ (yes) or $-1$ (no).\n",
"- Input vector can be thought of as features.\n",
"\n",
"### Discrete Probability\n",
"\n",
"- Algorithms based on *prediction* function and *objective* function.\n",
"- For regression the *codomain* of the functions, $f(\\inputMatrix)$,\n",
" was the real numbers or sometimes real vectors.\n",
"- In classification we are given an input vector, $\\inputVector$, and\n",
" an associated label, $\\dataScalar$, which either takes the value $-1$\n",
" or $1$.\n",
"\n",
"### Classification Examples\n",
"\n",
"- Classifying handwritten digits from binary images (automatic zip\n",
" code reading)\n",
"- Detecting faces in images (e.g. digital cameras).\n",
"- Who a detected face belongs to (e.g. 
Picasa, Facebook, DeepFace,\n", " GaussianFace)\n", "- Classifying type of cancer given gene expression data.\n", "- Categorization of document types (different types of news article on\n", " the internet)\n", "\n", "### Reminder on the Term \"Bayesian\"\n", "\n", "- We use Bayes' rule to invert probabilities in the Bayesian approach.\n", "- Bayesian is not named after Bayes' rule (v. common confusion).\n", "- The term Bayesian refers to the treatment of the parameters as\n", " stochastic variables.\n", "- Proposed by @Laplace:memoire74 and @Bayes:doctrine63 independently.\n", "- For early statisticians this was very controversial (Fisher et al).\n", "\n", "### Reminder on the Term \"Bayesian\"\n", "\n", "- The use of Bayes' rule does *not* imply you are being Bayesian.\n", "- It is just an application of the product rule of probability.\n", "\n", "### Bernoulli Distribution\n", "\n", "- Binary classification: need a probability distribution for discrete\n", " variables.\n", "- Discrete probability is in some ways easier:\n", " $P(\\dataScalar=1) = \\pi$ & specify distribution as a table.\n", "- Instead of $\\dataScalar=-1$ for negative class we take\n", " $\\dataScalar=0$." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "$\\dataScalar$ 0 1\n", " ------------------ ----------- -------\n", " $P(\\dataScalar)$ $(1-\\pi)$ $\\pi$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is the [Bernoulli\n", "distribution](http://en.wikipedia.org/wiki/Bernoulli_distribution).\n", "\n", "### Mathematical Switch\n", "\n", "- The Bernoulli distribution $$\n", " P(\\dataScalar) = \\pi^\\dataScalar (1-\\pi)^{(1-\\dataScalar)}\n", " $$\n", "\n", "- Is a clever trick for switching probabilities, as code it would be" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def bernoulli(y_i, pi):\n", " if y_i == 1:\n", " return pi\n", " else:\n", " return 1-pi" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Jacob Bernoulli's Bernoulli\n", "\n", "- Bernoulli described the Bernoulli distribution in terms of an 'urn'\n", " filled with balls.\n", "- There are red and black balls. 
There is a fixed number of balls in\n",
" the urn.\n",
"- The proportion of red balls is given by $\\pi$.\n",
"- For this reason in Bernoulli's distribution there is *epistemic*\n",
" uncertainty about the distribution parameter.\n",
"\n",
"###"
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"import pods\n",
"pods.notebook.display_google_book(id='CF4UAAAAQAAJ', page='PA87')"
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"import matplotlib.pyplot as plt\n",
"import teaching_plots as plot"
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"fig, ax = plt.subplots(figsize=plot.one_figsize)\n",
"plot.bernoulli_urn(ax, diagrams='../slides/diagrams/ml/')"
] }, { "cell_type": "markdown", "metadata": {}, "source": [
"### Jacob Bernoulli's Bernoulli\n",
"\n",
"\n",
"\n",
"### Thomas Bayes's Bernoulli\n",
"\n",
"- Bayes described the Bernoulli distribution (he didn't call it that!)\n",
" in terms of a table and two balls.\n",
"- Each ball is rolled so it comes to rest at a position that is\n",
" uniformly distributed across the table.\n",
"- The first ball comes to rest at a position that is a proportion $\\pi$\n",
" of the width of the table.\n",
"- After placing the first ball you consider whether a second would\n",
" land to the left or the right.\n",
"- For this reason in Bayes's distribution there is considered to be\n",
" *aleatoric* uncertainty about the distribution parameter.\n",
"\n",
"### Thomas Bayes' Bernoulli"
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"import matplotlib.pyplot as plt\n",
"import teaching_plots as plot"
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"fig, ax = plt.subplots(figsize=plot.one_figsize)\n",
"plot.bayes_billiard(ax, diagrams='../slides/diagrams/ml/')"
] }, { "cell_type": "markdown", "metadata": {}, "source": [
""
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"import pods\n",
"from ipywidgets import IntSlider"
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"pods.notebook.display_plots('bayes-billiard{counter:0>3}.svg', \n",
"                            directory='../slides/diagrams/ml', \n",
"                            counter=IntSlider(0,0,9,1))"
] }, { "cell_type": "markdown", "metadata": {}, "source": [
"### Maximum Likelihood in the Bernoulli\n",
"\n",
"- Assume data, $\\dataVector$, is a binary vector of length $\\numData$.\n",
"- Assume each value was sampled independently from the Bernoulli\n",
" distribution, given probability $\\pi$ $$\n",
" p(\\dataVector|\\pi) = \\prod_{i=1}^{\\numData} \\pi^{\\dataScalar_i} (1-\\pi)^{1-\\dataScalar_i}.\n",
" $$\n",
"\n",
"### Negative Log Likelihood\n",
"\n",
"- Minimize the negative log likelihood $$\\begin{align*}\n",
" \\errorFunction(\\pi)& = -\\log p(\\dataVector|\\pi)\\\\ \n",
" & = -\\sum_{i=1}^{\\numData} \\dataScalar_i \\log \\pi - \\sum_{i=1}^{\\numData} (1-\\dataScalar_i) \\log(1-\\pi),\n",
" \\end{align*}$$\n",
"- Take gradient with respect to the parameter $\\pi$.\n",
" $$\\frac{\\text{d}\\errorFunction(\\pi)}{\\text{d}\\pi} = -\\frac{\\sum_{i=1}^{\\numData} \\dataScalar_i}{\\pi} + \\frac{\\sum_{i=1}^{\\numData} (1-\\dataScalar_i)}{1-\\pi},$$\n",
"\n",
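"Before setting this gradient to zero analytically, it is worth a quick numerical sanity check. The cell below is a small sketch on synthetic data (47 ones in a vector of length 100, echoing the coin-toss question further down): it evaluates the negative log likelihood over a grid of $\\pi$ values, and the minimum falls at the sample mean.\n"
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"import numpy as np\n",
"\n",
"# synthetic binary data: 47 ones out of 100 observations\n",
"y = np.zeros(100)\n",
"y[:47] = 1\n",
"\n",
"def neg_log_likelihood(pi, y):\n",
"    return -(y*np.log(pi) + (1-y)*np.log(1-pi)).sum()\n",
"\n",
"pi_grid = np.linspace(0.01, 0.99, 99)\n",
"nll = np.array([neg_log_likelihood(pi, y) for pi in pi_grid])\n",
"print(pi_grid[nll.argmin()])  # minimum of the grid search, close to y.mean()\n",
"print(y.mean())"
] }, { "cell_type": "markdown", "metadata": {}, "source": [
"### Fixed Point\n",
"\n",
"- Stationary point: set derivative to zero\n",
" $$0 = 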
-\\frac{\\sum_{i=1}^{\\numData} \\dataScalar_i}{\\pi} + \\frac{\\sum_{i=1}^{\\numData} (1-\\dataScalar_i)}{1-\\pi},$$\n", "\n", "- Rearrange to form\n", " $$(1-\\pi)\\sum_{i=1}^{\\numData} \\dataScalar_i = \\pi\\sum_{i=1}^{\\numData} (1-\\dataScalar_i),$$\n", "\n", "- Giving\n", " $$\\sum_{i=1}^{\\numData} \\dataScalar_i = \\pi\\left(\\sum_{i=1}^{\\numData} (1-\\dataScalar_i) + \\sum_{i=1}^{\\numData} \\dataScalar_i\\right),$$\n", "\n", "### Solution\n", "\n", "- Recognise that\n", " $\\sum_{i=1}^{\\numData} (1-\\dataScalar_i) + \\sum_{i=1}^{\\numData} \\dataScalar_i = n$\n", " so we have\n", " $$\\pi = \\frac{\\sum_{i=1}^{\\numData} \\dataScalar_i}{\\numData}$$\n", "\n", "- Estimate the probability associated with the Bernoulli by setting it\n", " to the number of observed positives, divided by the total length of\n", " $\\dataScalar$.\n", "- Makes intiutive sense.\n", "- What's your best guess of probability for coin toss is heads when\n", " you get 47 heads from 100 tosses?\n", "\n", "### Exercise\n", "\n", "Show that the maximum likelihood solution we have found is a *minimum*\n", "for our objective.\n", "\n", "### Bayes' Rule Reminder\n", "\n", "$$\n", "\\text{posterior} =\n", "\\frac{\\text{likelihood}\\times\\text{prior}}{\\text{marginal likelihood}}\n", "$$\n", "\n", "Four components:\n", "\n", "1. Prior distribution\n", "2. Likelihood\n", "3. Posterior distribution\n", "4. Marginal likelihood\n", "\n", "### Naive Bayes Classifiers\n", "\n", "- Probabilistic Machine Learning: place probability distributions (or\n", " densities) over all the variables of interest.\n", "- In *naive Bayes* this is exactly what we do.\n", "\n", "- Form a classification algorithm by modelling the *joint* density of\n", " our observations.\n", "\n", "- Need to make assumption about joint density.\n", "\n", "### Assumptions about Density\n", "\n", "- Make assumptions to reduce the number of parameters we need to\n", " optimise.\n", "- Given label data $\\dataVector$ and the inputs $\\inputMatrix$ could\n", " specify joint density of all potential values of $\\dataVector$ and\n", " $\\inputMatrix$, $p(\\dataVector, \\inputMatrix)$.\n", "- If $\\inputMatrix$ and $\\dataVector$ are training data.\n", "- If $\\inputVector^*$ is a test input and $\\dataScalar^*$ a test\n", " location we want $$\n", " p(\\dataScalar^*|\\inputMatrix, \\dataVector, \\inputVector^*),\n", " $$\n", "\n", "### Answer from Rules of Probability\n", "\n", "- Compute this distribution using the product and sum rules.\n", "- Need the probability associated with all possible combinations of\n", " $\\dataVector$ and $\\inputMatrix$.\n", "- There are $2^{\\numData}$ possible combinations for the vector\n", " $\\dataVector$\n", "- Probability for each of these combinations must be jointly specified\n", " along with the joint density of the matrix $\\inputMatrix$,\n", "- Also need to *extend* the density for any chosen test location\n", " $\\inputVector^*$.\n", "\n", "### Naive Bayes Assumptions\n", "\n", "- In *naive Bayes* we make certain simplifying assumptions that allow\n", " us to perform all of the above in practice.\n", "\n", "1. Data Conditional Independence\n", "2. Feature conditional independence\n", "3. Marginal density for $\\dataScalar$.\n", "\n", "### Data Conditional Independence\n", "\n", "- Given model parameters $\\paramVector$ we assume that all data points\n", " in the model are independent. 
$$\n", " p(\\dataScalar^*, \\inputVector^*, \\dataVector, \\inputMatrix|\\paramVector) = p(\\dataScalar^*, \\inputVector^*|\\paramVector)\\prod_{i=1}^{\\numData} p(\\dataScalar_i, \\inputVector_i | \\paramVector).\n", " $$\n", "\n", "- This is a conditional independence assumption.\n", "\n", "- We also make similar assumptions for regression (where\n", " $\\paramVector = \\left\\{\\mappingVector,\\dataStd^2\\right\\}$).\n", "\n", "- Here we assume *joint* density of $\\dataVector$ and $\\inputMatrix$\n", " is independent across the data given the parameters.\n", "\n", "### Bayes Classifier\n", "\n", "Computing posterior distribution in this case becomes easier, this is\n", "known as the 'Bayes classifier'.\n", "\n", "### Feature Conditional Independence\n", "\n", "- Particular to naive Bayes: assume *features* are also conditionally\n", " independent, given param *and* the label.\n", " $$p(\\inputVector_i | \\dataScalar_i, \\paramVector) = \\prod_{j=1}^{\\dataDim} p(\\inputScalar_{i,j}|\\dataScalar_i,\\paramVector)$$\n", " where $\\dataDim$ is the dimensionality of our inputs.\n", "- This is known as the *naive Bayes* assumption.\n", "- Bayes classifier + feature conditional independence.\n", "\n", "### Marginal Density for $\\dataScalar_i$\n", "\n", "- To specify the joint distribution we also need the marginal for\n", " $p(\\dataScalar_i)$\n", " $$p(\\inputScalar_{i,j},\\dataScalar_i| \\paramVector) = p(\\inputScalar_{i,j}|\\dataScalar_i, \\paramVector)p(\\dataScalar_i).$$\n", "\n", "- Because $\\dataScalar_i$ is binary the *Bernoulli* density makes a\n", " suitable choice for our prior over $\\dataScalar_i$,\n", " $$p(\\dataScalar_i|\\pi) = \\pi^{\\dataScalar_i} (1-\\pi)^{1-\\dataScalar_i}$$\n", " where $\\pi$ now has the interpretation as being the *prior*\n", " probability that the classification should be positive.\n", "\n", "### Joint Density for Naive Bayes\n", "\n", "- This allows us to write down the full joint density of the training\n", " data, $$\n", " p(\\dataVector, \\inputMatrix|\\paramVector, \\pi) = \\prod_{i=1}^{\\numData} \\prod_{j=1}^{\\dataDim} p(\\inputScalar_{i,j}|\\dataScalar_i, \\paramVector)p(\\dataScalar_i|\\pi)\n", " $$ which can now be fit by maximum likelihood.\n", "\n", "### Objective Function\n", "\n", "$$\\begin{align*}\n", "\\errorFunction(\\paramVector, \\pi)& = -\\log p(\\dataVector, \\inputMatrix|\\paramVector, \\pi) \\\\ &= -\\sum_{i=1}^{\\numData} \\sum_{j=1}^{\\dataDim} \\log p(\\inputScalar_{i, j}|\\dataScalar_i, \\paramVector) - \\sum_{i=1}^{\\numData} \\log p(\\dataScalar_i|\\pi),\n", "\\end{align*}$$\n", "\n", "### Maximum Likelihood\n", "\n", "### Fit Prior\n", "\n", "- We can minimize prior. For Bernoulli likelihood over the labels we\n", " have, $$\\begin{align*}\n", " \\errorFunction(\\pi) & = - \\sum_{i=1}^{\\numData}\\log p(\\dataScalar_i|\\pi)\\\\ & = -\\sum_{i=1}^{\\numData} \\dataScalar_i \\log \\pi - \\sum_{i=1}^{\\numData} (1-\\dataScalar_i) \\log (1-\\pi)\n", " \\end{align*}$$\n", "- Solution from above is $$\n", " \\pi = \\frac{\\sum_{i=1}^{\\numData} \\dataScalar_i}{\\numData}.\n", " $$\n", "\n", "### Fit Conditional\n", "\n", "- Minimize conditional distribution: $$\n", " \\errorFunction(\\paramVector) = -\\sum_{i=1}^{\\numData} \\sum_{j=1}^{\\dataDim} \\log p(\\inputScalar_{i, j} |\\dataScalar_i, \\paramVector),\n", " $$\n", "- Implies making an assumption about it's form.\n", "- The right assumption will depend on the data.\n", "- E.g. 
for real valued data, use a Gaussian $$\n", " p(\\inputScalar_{i, j} | \\dataScalar_i,\\paramVector) =\n", " \\frac{1}{\\sqrt{2\\pi \\dataStd_{\\dataScalar_i,j}^2}} \\exp \\left(-\\frac{(\\inputScalar_{i,j} - \\mu_{\\dataScalar_i,\n", " j})^2}{\\dataStd_{\\dataScalar_i,j}^2}\\right),\n", " $$\n", "\n", "### Compute Posterior for Test Point Label\n", "\n", "- We know that $$\n", " P(\\dataScalar^*| \\dataVector, \\inputMatrix, \\inputVector^*, \\paramVector)p(\\dataVector,\\inputMatrix, \\inputVector^*|\\paramVector) = p(\\dataScalar*, \\dataVector, \\inputMatrix,\\inputVector^*| \\paramVector)\n", " $$\n", "- This implies $$\n", " P(\\dataScalar^*| \\dataVector, \\inputMatrix, \\inputVector^*, \\paramVector) = \\frac{p(\\dataScalar*, \\dataVector, \\inputMatrix, \\inputVector^*| \\paramVector)}{p(\\dataVector, \\inputMatrix, \\inputVector^*|\\paramVector)}\n", " $$\n", "\n", "### Compute Posterior for Test Point Label\n", "\n", "- From conditional independence assumptions $$\n", " p(\\dataScalar^*, \\dataVector, \\inputMatrix, \\inputVector^*| \\paramVector) = \\prod_{j=1}^{\\dataDim} p(\\inputScalar^*_{j}|\\dataScalar^*, \\paramVector)p(\\dataScalar^*|\\pi)\\prod_{i=1}^{\\numData} \\prod_{j=1}^{\\dataDim} p(\\inputScalar_{i,j}|\\dataScalar_i, \\paramVector)p(\\dataScalar_i|\\pi)\n", " $$\n", "- We also need $$\n", " p(\\dataVector, \\inputMatrix, \\inputVector^*|\\paramVector)$$ which\n", " can be found from\n", " $$p(\\dataScalar^*, \\dataVector, \\inputMatrix, \\inputVector^*| \\paramVector)\n", " $$\n", "- Using the *sum rule* of probability, $$\n", " p(\\dataVector, \\inputMatrix, \\inputVector^*|\\paramVector) = \\sum_{\\dataScalar^*=0}^1 p(\\dataScalar^*, \\dataVector, \\inputMatrix, \\inputVector^*| \\paramVector).\n", " $$\n", "\n", "### Independence Assumptions\n", "\n", "- From independence assumptions $$\n", " p(\\dataVector, \\inputMatrix, \\inputVector^*| \\paramVector) = \\sum_{\\dataScalar^*=0}^1 \\prod_{j=1}^{\\dataDim} p(\\inputScalar^*_{j}|\\dataScalar^*_i, \\paramVector)p(\\dataScalar^*|\\pi)\\prod_{i=1}^{\\numData} \\prod_{j=1}^{\\dataDim} p(\\inputScalar_{i,j}|\\dataScalar_i, \\paramVector)p(\\dataScalar_i|\\pi).\n", " $$\n", "- Substitute both forms to recover, $$\n", " P(\\dataScalar^*| \\dataVector, \\inputMatrix, \\inputVector^*, \\paramVector) = \\frac{\\prod_{j=1}^{\\dataDim} p(\\inputScalar^*_{j}|\\dataScalar^*_i, \\paramVector)p(\\dataScalar^*|\\pi)\\prod_{i=1}^{\\numData} \\prod_{j=1}^{\\dataDim} p(\\inputScalar_{i,j}|\\dataScalar_i, \\paramVector)p(\\dataScalar_i|\\pi)}{\\sum_{\\dataScalar^*=0}^1 \\prod_{j=1}^{\\dataDim} p(\\inputScalar^*_{j}|\\dataScalar^*_i, \\paramVector)p(\\dataScalar^*|\\pi)\\prod_{i=1}^{\\numData} \\prod_{j=1}^{\\dataDim} p(\\inputScalar_{i,j}|\\dataScalar_i, \\paramVector)p(\\dataScalar_i|\\pi)}\n", " $$\n", "\n", "### Cancelation\n", "\n", "- Note training data terms cancel. 
$$\n", " p(\\dataScalar^*| \\inputVector^*, \\paramVector) = \\frac{\\prod_{j=1}^{\\dataDim} p(\\inputScalar^*_{j}|\\dataScalar^*_i, \\paramVector)p(\\dataScalar^*|\\pi)}{\\sum_{\\dataScalar^*=0}^1 \\prod_{j=1}^{\\dataDim} p(\\inputScalar^*_{j}|\\dataScalar^*_i, \\paramVector)p(\\dataScalar^*|\\pi)}\n", " $$\n", "- This formula is also fairly straightforward to implement for\n", " different class conditional distributions.\n", "\n", "### Laplace Smoothing" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pods\n", "pods.notebook.display_google_book(id='1YQPAAAAQAAJ', page='PA16')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Pseudo Counts\n", "\n", "$$\n", "\\pi = \\frac{\\sum_{i=1}^{\\numData} \\dataScalar_i + 1}{\\numData + 2}\n", "$$\n", "\n", "### Exercise\n", "\n", "How can you improve your classification, are all the features equally\n", "valid? Are some features more helpful than others? What happens if you\n", "remove features that appear to be less helpful. How might you select\n", "such features?\n", "\n", "### Exercise\n", "\n", "We have decided to classify positive if probability of R rating is\n", "greater than 0.5. This has led us to accidentally classify some films as\n", "'safe for children' when the aren't in actuallity. Imagine you wish to\n", "ensure that the film is safe for children. With your test set how low do\n", "you have to set the threshold to avoid all the false negatives (i.e.\n", "films where you said it wasn't R-rated, but in actuality it was?\n", "\n", "### Naive Bayes Summary\n", "\n", "- Model *full* joint distribution of data,\n", " $p(\\dataVector, \\inputMatrix | \\paramVector, \\pi)$\n", "- Make conditional independence assumptions about the data.\n", "- feature conditional independence\n", "- data conditional independence\n", "- Fast to implement, works on very large data.\n", "- Despite simple assumptions can perform better than expected.\n", "\n", "### Other Reading\n", "\n", "- Chapter 5 of @Rogers:book11 up to pg 179 (Section 5.1, and 5.2 up to\n", " 5.2.2).\n", "\n", "### References {#references .unnumbered}" ] } ], "metadata": {}, "nbformat": 4, "nbformat_minor": 2 }