{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# \n", "\n", "\n", "$$\n", "\\newcommand{\\tk}[1]{}\n", "%\\newcommand{\\tk}[1]{\\textbf{TK}: #1}\n", "\\newcommand{\\Amatrix}{\\mathbf{A}}\n", "\\newcommand{\\KL}[2]{\\text{KL}\\left( #1\\,\\|\\,#2 \\right)}\n", "\\newcommand{\\Kaast}{\\kernelMatrix_{\\mathbf{ \\ast}\\mathbf{ \\ast}}}\n", "\\newcommand{\\Kastu}{\\kernelMatrix_{\\mathbf{ \\ast} \\inducingVector}}\n", "\\newcommand{\\Kff}{\\kernelMatrix_{\\mappingFunctionVector \\mappingFunctionVector}}\n", "\\newcommand{\\Kfu}{\\kernelMatrix_{\\mappingFunctionVector \\inducingVector}}\n", "\\newcommand{\\Kuast}{\\kernelMatrix_{\\inducingVector \\bf\\ast}}\n", "\\newcommand{\\Kuf}{\\kernelMatrix_{\\inducingVector \\mappingFunctionVector}}\n", "\\newcommand{\\Kuu}{\\kernelMatrix_{\\inducingVector \\inducingVector}}\n", "\\newcommand{\\Kuui}{\\Kuu^{-1}}\n", "\\newcommand{\\Qaast}{\\mathbf{Q}_{\\bf \\ast \\ast}}\n", "\\newcommand{\\Qastf}{\\mathbf{Q}_{\\ast \\mappingFunction}}\n", "\\newcommand{\\Qfast}{\\mathbf{Q}_{\\mappingFunctionVector \\bf \\ast}}\n", "\\newcommand{\\Qff}{\\mathbf{Q}_{\\mappingFunctionVector \\mappingFunctionVector}}\n", "\\newcommand{\\aMatrix}{\\mathbf{A}}\n", "\\newcommand{\\aScalar}{a}\n", "\\newcommand{\\aVector}{\\mathbf{a}}\n", "\\newcommand{\\acceleration}{a}\n", "\\newcommand{\\bMatrix}{\\mathbf{B}}\n", "\\newcommand{\\bScalar}{b}\n", "\\newcommand{\\bVector}{\\mathbf{b}}\n", "\\newcommand{\\basisFunc}{\\phi}\n", "\\newcommand{\\basisFuncVector}{\\boldsymbol{ \\basisFunc}}\n", "\\newcommand{\\basisFunction}{\\phi}\n", "\\newcommand{\\basisLocation}{\\mu}\n", "\\newcommand{\\basisMatrix}{\\boldsymbol{ \\Phi}}\n", "\\newcommand{\\basisScalar}{\\basisFunction}\n", "\\newcommand{\\basisVector}{\\boldsymbol{ \\basisFunction}}\n", "\\newcommand{\\activationFunction}{\\phi}\n", "\\newcommand{\\activationMatrix}{\\boldsymbol{ \\Phi}}\n", "\\newcommand{\\activationScalar}{\\basisFunction}\n", "\\newcommand{\\activationVector}{\\boldsymbol{ \\basisFunction}}\n", "\\newcommand{\\bigO}{\\mathcal{O}}\n", "\\newcommand{\\binomProb}{\\pi}\n", "\\newcommand{\\cMatrix}{\\mathbf{C}}\n", "\\newcommand{\\cbasisMatrix}{\\hat{\\boldsymbol{ \\Phi}}}\n", "\\newcommand{\\cdataMatrix}{\\hat{\\dataMatrix}}\n", "\\newcommand{\\cdataScalar}{\\hat{\\dataScalar}}\n", "\\newcommand{\\cdataVector}{\\hat{\\dataVector}}\n", "\\newcommand{\\centeredKernelMatrix}{\\mathbf{ \\MakeUppercase{\\centeredKernelScalar}}}\n", "\\newcommand{\\centeredKernelScalar}{b}\n", "\\newcommand{\\centeredKernelVector}{\\centeredKernelScalar}\n", "\\newcommand{\\centeringMatrix}{\\mathbf{H}}\n", "\\newcommand{\\chiSquaredDist}[2]{\\chi_{#1}^{2}\\left(#2\\right)}\n", "\\newcommand{\\chiSquaredSamp}[1]{\\chi_{#1}^{2}}\n", "\\newcommand{\\conditionalCovariance}{\\boldsymbol{ \\Sigma}}\n", "\\newcommand{\\coregionalizationMatrix}{\\mathbf{B}}\n", "\\newcommand{\\coregionalizationScalar}{b}\n", "\\newcommand{\\coregionalizationVector}{\\mathbf{ \\coregionalizationScalar}}\n", "\\newcommand{\\covDist}[2]{\\text{cov}_{#2}\\left(#1\\right)}\n", "\\newcommand{\\covSamp}[1]{\\text{cov}\\left(#1\\right)}\n", "\\newcommand{\\covarianceScalar}{c}\n", "\\newcommand{\\covarianceVector}{\\mathbf{ \\covarianceScalar}}\n", "\\newcommand{\\covarianceMatrix}{\\mathbf{C}}\n", "\\newcommand{\\covarianceMatrixTwo}{\\boldsymbol{ \\Sigma}}\n", "\\newcommand{\\croupierScalar}{s}\n", "\\newcommand{\\croupierVector}{\\mathbf{ \\croupierScalar}}\n", "\\newcommand{\\croupierMatrix}{\\mathbf{ \\MakeUppercase{\\croupierScalar}}}\n", 
"\\newcommand{\\dataDim}{p}\n", "\\newcommand{\\dataIndex}{i}\n", "\\newcommand{\\dataIndexTwo}{j}\n", "\\newcommand{\\dataMatrix}{\\mathbf{Y}}\n", "\\newcommand{\\dataScalar}{y}\n", "\\newcommand{\\dataSet}{\\mathcal{D}}\n", "\\newcommand{\\dataStd}{\\sigma}\n", "\\newcommand{\\dataVector}{\\mathbf{ \\dataScalar}}\n", "\\newcommand{\\decayRate}{d}\n", "\\newcommand{\\degreeMatrix}{\\mathbf{ \\MakeUppercase{\\degreeScalar}}}\n", "\\newcommand{\\degreeScalar}{d}\n", "\\newcommand{\\degreeVector}{\\mathbf{ \\degreeScalar}}\n", "% Already defined by latex\n", "%\\newcommand{\\det}[1]{\\left|#1\\right|}\n", "\\newcommand{\\diag}[1]{\\text{diag}\\left(#1\\right)}\n", "\\newcommand{\\diagonalMatrix}{\\mathbf{D}}\n", "\\newcommand{\\diff}[2]{\\frac{\\text{d}#1}{\\text{d}#2}}\n", "\\newcommand{\\diffTwo}[2]{\\frac{\\text{d}^2#1}{\\text{d}#2^2}}\n", "\\newcommand{\\displacement}{x}\n", "\\newcommand{\\displacementVector}{\\textbf{\\displacement}}\n", "\\newcommand{\\distanceMatrix}{\\mathbf{ \\MakeUppercase{\\distanceScalar}}}\n", "\\newcommand{\\distanceScalar}{d}\n", "\\newcommand{\\distanceVector}{\\mathbf{ \\distanceScalar}}\n", "\\newcommand{\\eigenvaltwo}{\\ell}\n", "\\newcommand{\\eigenvaltwoMatrix}{\\mathbf{L}}\n", "\\newcommand{\\eigenvaltwoVector}{\\mathbf{l}}\n", "\\newcommand{\\eigenvalue}{\\lambda}\n", "\\newcommand{\\eigenvalueMatrix}{\\boldsymbol{ \\Lambda}}\n", "\\newcommand{\\eigenvalueVector}{\\boldsymbol{ \\lambda}}\n", "\\newcommand{\\eigenvector}{\\mathbf{ \\eigenvectorScalar}}\n", "\\newcommand{\\eigenvectorMatrix}{\\mathbf{U}}\n", "\\newcommand{\\eigenvectorScalar}{u}\n", "\\newcommand{\\eigenvectwo}{\\mathbf{v}}\n", "\\newcommand{\\eigenvectwoMatrix}{\\mathbf{V}}\n", "\\newcommand{\\eigenvectwoScalar}{v}\n", "\\newcommand{\\entropy}[1]{\\mathcal{H}\\left(#1\\right)}\n", "\\newcommand{\\errorFunction}{E}\n", "\\newcommand{\\expDist}[2]{\\left<#1\\right>_{#2}}\n", "\\newcommand{\\expSamp}[1]{\\left<#1\\right>}\n", "\\newcommand{\\expectation}[1]{\\left\\langle #1 \\right\\rangle }\n", "\\newcommand{\\expectationDist}[2]{\\left\\langle #1 \\right\\rangle _{#2}}\n", "\\newcommand{\\expectedDistanceMatrix}{\\mathcal{D}}\n", "\\newcommand{\\eye}{\\mathbf{I}}\n", "\\newcommand{\\fantasyDim}{r}\n", "\\newcommand{\\fantasyMatrix}{\\mathbf{ \\MakeUppercase{\\fantasyScalar}}}\n", "\\newcommand{\\fantasyScalar}{z}\n", "\\newcommand{\\fantasyVector}{\\mathbf{ \\fantasyScalar}}\n", "\\newcommand{\\featureStd}{\\varsigma}\n", "\\newcommand{\\gammaCdf}[3]{\\mathcal{GAMMA CDF}\\left(#1|#2,#3\\right)}\n", "\\newcommand{\\gammaDist}[3]{\\mathcal{G}\\left(#1|#2,#3\\right)}\n", "\\newcommand{\\gammaSamp}[2]{\\mathcal{G}\\left(#1,#2\\right)}\n", "\\newcommand{\\gaussianDist}[3]{\\mathcal{N}\\left(#1|#2,#3\\right)}\n", "\\newcommand{\\gaussianSamp}[2]{\\mathcal{N}\\left(#1,#2\\right)}\n", "\\newcommand{\\given}{|}\n", "\\newcommand{\\half}{\\frac{1}{2}}\n", "\\newcommand{\\heaviside}{H}\n", "\\newcommand{\\hiddenMatrix}{\\mathbf{ \\MakeUppercase{\\hiddenScalar}}}\n", "\\newcommand{\\hiddenScalar}{h}\n", "\\newcommand{\\hiddenVector}{\\mathbf{ \\hiddenScalar}}\n", "\\newcommand{\\identityMatrix}{\\eye}\n", "\\newcommand{\\inducingInputScalar}{z}\n", "\\newcommand{\\inducingInputVector}{\\mathbf{ \\inducingInputScalar}}\n", "\\newcommand{\\inducingInputMatrix}{\\mathbf{Z}}\n", "\\newcommand{\\inducingScalar}{u}\n", "\\newcommand{\\inducingVector}{\\mathbf{ \\inducingScalar}}\n", "\\newcommand{\\inducingMatrix}{\\mathbf{U}}\n", "\\newcommand{\\inlineDiff}[2]{\\text{d}#1/\\text{d}#2}\n", 
"\\newcommand{\\inputDim}{q}\n", "\\newcommand{\\inputMatrix}{\\mathbf{X}}\n", "\\newcommand{\\inputScalar}{x}\n", "\\newcommand{\\inputSpace}{\\mathcal{X}}\n", "\\newcommand{\\inputVals}{\\inputVector}\n", "\\newcommand{\\inputVector}{\\mathbf{ \\inputScalar}}\n", "\\newcommand{\\iterNum}{k}\n", "\\newcommand{\\kernel}{\\kernelScalar}\n", "\\newcommand{\\kernelMatrix}{\\mathbf{K}}\n", "\\newcommand{\\kernelScalar}{k}\n", "\\newcommand{\\kernelVector}{\\mathbf{ \\kernelScalar}}\n", "\\newcommand{\\kff}{\\kernelScalar_{\\mappingFunction \\mappingFunction}}\n", "\\newcommand{\\kfu}{\\kernelVector_{\\mappingFunction \\inducingScalar}}\n", "\\newcommand{\\kuf}{\\kernelVector_{\\inducingScalar \\mappingFunction}}\n", "\\newcommand{\\kuu}{\\kernelVector_{\\inducingScalar \\inducingScalar}}\n", "\\newcommand{\\lagrangeMultiplier}{\\lambda}\n", "\\newcommand{\\lagrangeMultiplierMatrix}{\\boldsymbol{ \\Lambda}}\n", "\\newcommand{\\lagrangian}{L}\n", "\\newcommand{\\laplacianFactor}{\\mathbf{ \\MakeUppercase{\\laplacianFactorScalar}}}\n", "\\newcommand{\\laplacianFactorScalar}{m}\n", "\\newcommand{\\laplacianFactorVector}{\\mathbf{ \\laplacianFactorScalar}}\n", "\\newcommand{\\laplacianMatrix}{\\mathbf{L}}\n", "\\newcommand{\\laplacianScalar}{\\ell}\n", "\\newcommand{\\laplacianVector}{\\mathbf{ \\ell}}\n", "\\newcommand{\\latentDim}{q}\n", "\\newcommand{\\latentDistanceMatrix}{\\boldsymbol{ \\Delta}}\n", "\\newcommand{\\latentDistanceScalar}{\\delta}\n", "\\newcommand{\\latentDistanceVector}{\\boldsymbol{ \\delta}}\n", "\\newcommand{\\latentForce}{f}\n", "\\newcommand{\\latentFunction}{u}\n", "\\newcommand{\\latentFunctionVector}{\\mathbf{ \\latentFunction}}\n", "\\newcommand{\\latentFunctionMatrix}{\\mathbf{ \\MakeUppercase{\\latentFunction}}}\n", "\\newcommand{\\latentIndex}{j}\n", "\\newcommand{\\latentScalar}{z}\n", "\\newcommand{\\latentVector}{\\mathbf{ \\latentScalar}}\n", "\\newcommand{\\latentMatrix}{\\mathbf{Z}}\n", "\\newcommand{\\learnRate}{\\eta}\n", "\\newcommand{\\lengthScale}{\\ell}\n", "\\newcommand{\\rbfWidth}{\\ell}\n", "\\newcommand{\\likelihoodBound}{\\mathcal{L}}\n", "\\newcommand{\\likelihoodFunction}{L}\n", "\\newcommand{\\locationScalar}{\\mu}\n", "\\newcommand{\\locationVector}{\\boldsymbol{ \\locationScalar}}\n", "\\newcommand{\\locationMatrix}{\\mathbf{M}}\n", "\\newcommand{\\variance}[1]{\\text{var}\\left( #1 \\right)}\n", "\\newcommand{\\mappingFunction}{f}\n", "\\newcommand{\\mappingFunctionMatrix}{\\mathbf{F}}\n", "\\newcommand{\\mappingFunctionTwo}{g}\n", "\\newcommand{\\mappingFunctionTwoMatrix}{\\mathbf{G}}\n", "\\newcommand{\\mappingFunctionTwoVector}{\\mathbf{ \\mappingFunctionTwo}}\n", "\\newcommand{\\mappingFunctionVector}{\\mathbf{ \\mappingFunction}}\n", "\\newcommand{\\scaleScalar}{s}\n", "\\newcommand{\\mappingScalar}{w}\n", "\\newcommand{\\mappingVector}{\\mathbf{ \\mappingScalar}}\n", "\\newcommand{\\mappingMatrix}{\\mathbf{W}}\n", "\\newcommand{\\mappingScalarTwo}{v}\n", "\\newcommand{\\mappingVectorTwo}{\\mathbf{ \\mappingScalarTwo}}\n", "\\newcommand{\\mappingMatrixTwo}{\\mathbf{V}}\n", "\\newcommand{\\maxIters}{K}\n", "\\newcommand{\\meanMatrix}{\\mathbf{M}}\n", "\\newcommand{\\meanScalar}{\\mu}\n", "\\newcommand{\\meanTwoMatrix}{\\mathbf{M}}\n", "\\newcommand{\\meanTwoScalar}{m}\n", "\\newcommand{\\meanTwoVector}{\\mathbf{ \\meanTwoScalar}}\n", "\\newcommand{\\meanVector}{\\boldsymbol{ \\meanScalar}}\n", "\\newcommand{\\mrnaConcentration}{m}\n", "\\newcommand{\\naturalFrequency}{\\omega}\n", "\\newcommand{\\neighborhood}[1]{\\mathcal{N}\\left( #1 
\\right)}\n", "\\newcommand{\\neilurl}{http://inverseprobability.com/}\n", "\\newcommand{\\noiseMatrix}{\\boldsymbol{ E}}\n", "\\newcommand{\\noiseScalar}{\\epsilon}\n", "\\newcommand{\\noiseVector}{\\boldsymbol{ \\epsilon}}\n", "\\newcommand{\\norm}[1]{\\left\\Vert #1 \\right\\Vert}\n", "\\newcommand{\\normalizedLaplacianMatrix}{\\hat{\\mathbf{L}}}\n", "\\newcommand{\\normalizedLaplacianScalar}{\\hat{\\ell}}\n", "\\newcommand{\\normalizedLaplacianVector}{\\hat{\\mathbf{ \\ell}}}\n", "\\newcommand{\\numActive}{m}\n", "\\newcommand{\\numBasisFunc}{m}\n", "\\newcommand{\\numComponents}{m}\n", "\\newcommand{\\numComps}{K}\n", "\\newcommand{\\numData}{n}\n", "\\newcommand{\\numFeatures}{K}\n", "\\newcommand{\\numHidden}{h}\n", "\\newcommand{\\numInducing}{m}\n", "\\newcommand{\\numLayers}{\\ell}\n", "\\newcommand{\\numNeighbors}{K}\n", "\\newcommand{\\numSequences}{s}\n", "\\newcommand{\\numSuccess}{s}\n", "\\newcommand{\\numTasks}{m}\n", "\\newcommand{\\numTime}{T}\n", "\\newcommand{\\numTrials}{S}\n", "\\newcommand{\\outputIndex}{j}\n", "\\newcommand{\\paramVector}{\\boldsymbol{ \\theta}}\n", "\\newcommand{\\parameterMatrix}{\\boldsymbol{ \\Theta}}\n", "\\newcommand{\\parameterScalar}{\\theta}\n", "\\newcommand{\\parameterVector}{\\boldsymbol{ \\parameterScalar}}\n", "\\newcommand{\\partDiff}[2]{\\frac{\\partial#1}{\\partial#2}}\n", "\\newcommand{\\precisionScalar}{j}\n", "\\newcommand{\\precisionVector}{\\mathbf{ \\precisionScalar}}\n", "\\newcommand{\\precisionMatrix}{\\mathbf{J}}\n", "\\newcommand{\\pseudotargetScalar}{\\widetilde{y}}\n", "\\newcommand{\\pseudotargetVector}{\\mathbf{ \\pseudotargetScalar}}\n", "\\newcommand{\\pseudotargetMatrix}{\\mathbf{ \\widetilde{Y}}}\n", "\\newcommand{\\rank}[1]{\\text{rank}\\left(#1\\right)}\n", "\\newcommand{\\rayleighDist}[2]{\\mathcal{R}\\left(#1|#2\\right)}\n", "\\newcommand{\\rayleighSamp}[1]{\\mathcal{R}\\left(#1\\right)}\n", "\\newcommand{\\responsibility}{r}\n", "\\newcommand{\\rotationScalar}{r}\n", "\\newcommand{\\rotationVector}{\\mathbf{ \\rotationScalar}}\n", "\\newcommand{\\rotationMatrix}{\\mathbf{R}}\n", "\\newcommand{\\sampleCovScalar}{s}\n", "\\newcommand{\\sampleCovVector}{\\mathbf{ \\sampleCovScalar}}\n", "\\newcommand{\\sampleCovMatrix}{\\mathbf{s}}\n", "\\newcommand{\\scalarProduct}[2]{\\left\\langle{#1},{#2}\\right\\rangle}\n", "\\newcommand{\\sign}[1]{\\text{sign}\\left(#1\\right)}\n", "\\newcommand{\\sigmoid}[1]{\\sigma\\left(#1\\right)}\n", "\\newcommand{\\singularvalue}{\\ell}\n", "\\newcommand{\\singularvalueMatrix}{\\mathbf{L}}\n", "\\newcommand{\\singularvalueVector}{\\mathbf{l}}\n", "\\newcommand{\\sorth}{\\mathbf{u}}\n", "\\newcommand{\\spar}{\\lambda}\n", "\\newcommand{\\trace}[1]{\\text{tr}\\left(#1\\right)}\n", "\\newcommand{\\BasalRate}{B}\n", "\\newcommand{\\DampingCoefficient}{C}\n", "\\newcommand{\\DecayRate}{D}\n", "\\newcommand{\\Displacement}{X}\n", "\\newcommand{\\LatentForce}{F}\n", "\\newcommand{\\Mass}{M}\n", "\\newcommand{\\Sensitivity}{S}\n", "\\newcommand{\\basalRate}{b}\n", "\\newcommand{\\dampingCoefficient}{c}\n", "\\newcommand{\\mass}{m}\n", "\\newcommand{\\sensitivity}{s}\n", "\\newcommand{\\springScalar}{\\kappa}\n", "\\newcommand{\\springVector}{\\boldsymbol{ \\kappa}}\n", "\\newcommand{\\springMatrix}{\\boldsymbol{ \\mathcal{K}}}\n", "\\newcommand{\\tfConcentration}{p}\n", "\\newcommand{\\tfDecayRate}{\\delta}\n", "\\newcommand{\\tfMrnaConcentration}{f}\n", "\\newcommand{\\tfVector}{\\mathbf{ \\tfConcentration}}\n", "\\newcommand{\\velocity}{v}\n", "\\newcommand{\\sufficientStatsScalar}{g}\n", 
"\\newcommand{\\sufficientStatsVector}{\\mathbf{ \\sufficientStatsScalar}}\n", "\\newcommand{\\sufficientStatsMatrix}{\\mathbf{G}}\n", "\\newcommand{\\switchScalar}{s}\n", "\\newcommand{\\switchVector}{\\mathbf{ \\switchScalar}}\n", "\\newcommand{\\switchMatrix}{\\mathbf{S}}\n", "\\newcommand{\\tr}[1]{\\text{tr}\\left(#1\\right)}\n", "\\newcommand{\\loneNorm}[1]{\\left\\Vert #1 \\right\\Vert_1}\n", "\\newcommand{\\ltwoNorm}[1]{\\left\\Vert #1 \\right\\Vert_2}\n", "\\newcommand{\\onenorm}[1]{\\left\\vert#1\\right\\vert_1}\n", "\\newcommand{\\twonorm}[1]{\\left\\Vert #1 \\right\\Vert}\n", "\\newcommand{\\vScalar}{v}\n", "\\newcommand{\\vVector}{\\mathbf{v}}\n", "\\newcommand{\\vMatrix}{\\mathbf{V}}\n", "\\newcommand{\\varianceDist}[2]{\\text{var}_{#2}\\left( #1 \\right)}\n", "% Already defined by latex\n", "%\\newcommand{\\vec}{#1:}\n", "\\newcommand{\\vecb}[1]{\\left(#1\\right):}\n", "\\newcommand{\\weightScalar}{w}\n", "\\newcommand{\\weightVector}{\\mathbf{ \\weightScalar}}\n", "\\newcommand{\\weightMatrix}{\\mathbf{W}}\n", "\\newcommand{\\weightedAdjacencyMatrix}{\\mathbf{A}}\n", "\\newcommand{\\weightedAdjacencyScalar}{a}\n", "\\newcommand{\\weightedAdjacencyVector}{\\mathbf{ \\weightedAdjacencyScalar}}\n", "\\newcommand{\\onesVector}{\\mathbf{1}}\n", "\\newcommand{\\zerosVector}{\\mathbf{0}}\n", "$$\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "# What is Machine Learning? \\[edit\\]\n", "\n", "What is machine learning? At its most basic level machine learning is a\n", "combination of\n", "\n", "$$\\text{data} + \\text{model} \\xrightarrow{\\text{compute}} \\text{prediction}$$\n", "\n", "where *data* is our observations. They can be actively or passively\n", "acquired (meta-data). The *model* contains our assumptions, based on\n", "previous experience. That experience can be other data, it can come from\n", "transfer learning, or it can merely be our beliefs about the\n", "regularities of the universe. In humans our models include our inductive\n", "biases. The *prediction* is an action to be taken or a categorization or\n", "a quality score. The reason that machine learning has become a mainstay\n", "of artificial intelligence is the importance of predictions in\n", "artificial intelligence. The data and the model are combined through\n", "computation.\n", "\n", "In practice we normally perform machine learning using two functions. To\n", "combine data with a model we typically make use of:\n", "\n", "**a prediction function** a function which is used to make the\n", "predictions. It includes our beliefs about the regularities of the\n", "universe, our assumptions about how the world works, e.g. smoothness,\n", "spatial similarities, temporal similarities.\n", "\n", "**an objective function** a function which defines the cost of\n", "misprediction. Typically it includes knowledge about the world's\n", "generating processes (probabilistic objectives) or the costs we pay for\n", "mispredictions (empiricial risk minimization).\n", "\n", "The combination of data and model through the prediction function and\n", "the objectie function leads to a *learning algorithm*. The class of\n", "prediction functions and objective functions we can make use of is\n", "restricted by the algorithms they lead to. If the prediction function or\n", "the objective function are too complex, then it can be difficult to find\n", "an appropriate learning algorithm. 
Much of the academic field of machine\n", "learning is the quest for new learning algorithms that allow us to bring\n", "different types of models and data together.\n", "\n", "A useful reference for the state of the art in machine learning is the UK\n", "Royal Society Report, [Machine Learning: Power and Promise of Computers\n", "that Learn by\n", "Example](https://royalsociety.org/~/media/policy/projects/machine-learning/publications/machine-learning-report.pdf).\n", "\n", "You can also check my blog post on [What is Machine\n", "Learning?](http://inverseprobability.com/2017/07/17/what-is-machine-learning).\n", "\n", "## Olympic Marathon Data \\[edit\\]\n", "\n", "\n", "\n", "\n", "\n", "\n", "
\n", "- Gold medal times for Olympic Marathon since 1896.\n", "- Marathons before 1924 didn't have a standardised distance.\n", "- Present results using pace per km.\n", "- In 1904 Marathon was badly organised leading to very slow times.\n", "\n", "\n", "\n", "Image from Wikimedia Commons \n", "
\n", "The first thing we will do is load a standard data set for regression\n", "modelling. The data consists of the pace of Olympic Gold Medal Marathon\n", "winners for the Olympics from 1896 to present. First we load in the data\n", "and plot." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pods" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = pods.datasets.olympic_marathon_men()\n", "x = data['X']\n", "y = data['Y']\n", "\n", "offset = y.mean()\n", "scale = np.sqrt(y.var())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import teaching_plots as plot\n", "import mlai" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "xlim = (1875,2030)\n", "ylim = (2.5, 6.5)\n", "yhat = (y-offset)/scale\n", "\n", "fig, ax = plt.subplots(figsize=plot.big_wide_figsize)\n", "_ = ax.plot(x, y, 'r.',markersize=10)\n", "ax.set_xlabel('year', fontsize=20)\n", "ax.set_ylabel('pace min/km', fontsize=20)\n", "ax.set_xlim(xlim)\n", "ax.set_ylim(ylim)\n", "\n", "mlai.write_figure(figure=fig, \n", " filename='../slides/diagrams/datasets/olympic-marathon.svg', \n", " transparent=True, \n", " frameon=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: Olympic marathon pace times since 1892.\n", "\n", "Things to notice about the data include the outlier in 1904, in this\n", "year, the olympics was in St Louis, USA. Organizational problems and\n", "challenges with dust kicked up by the cars following the race meant that\n", "participants got lost, and only very few participants completed.\n", "\n", "More recent years see more consistently quick marathons.\n", "\n", "## Regression: Linear Releationship \\[edit\\]\n", "\n", "For many their first encounter with what might be termed a machine\n", "learning method is fitting a straight line. A straight line is\n", "characterized by two parameters, the scale, $m$, and the offset $c$.\n", "\n", "$$\\dataScalar_i = m \\inputScalar_i + c$$\n", "\n", "For the olympic marathon example $\\dataScalar_i$ is the winning pace and\n", "it is given as a function of the year which is represented by\n", "$\\inputScalar_i$. There are two further parameters of the prediction\n", "function. For the olympics example we can interpret these parameters,\n", "the scale $m$ is the rate of improvement of the olympic marathon pace on\n", "a yearly basis. And $c$ is the winning pace as estimated at year 0.\n", "\n", "\\defined{overdeterminedInaugural}\n", "## Overdetermined System \\[edit\\]\n", "\n", "The challenge with a linear model is that it has two unknowns, $m$, and\n", "$c$. Observing data allows us to write down a system of simultaneous\n", "linear equations. 
So, for example, if we observe two data points, the\n", "first with input value $\\inputScalar_1 = 1$ and output value\n", "$\\dataScalar_1 = 3$, and a second data point with $\\inputScalar_2 = 3$ and\n", "$\\dataScalar_2 = 1$, then we can write two simultaneous linear equations of\n", "the form:\n", "\n", "point 1: $\\inputScalar = 1$, $\\dataScalar=3$ $$3 = m + c$$ point 2:\n", "$\\inputScalar = 3$, $\\dataScalar=1$ $$1 = 3m + c$$\n", "\n", "Subtracting the first equation from the second gives $-2 = 2m$, so\n", "$m = -1$ and $c = 4$. The solution to these two simultaneous equations can be\n", "represented graphically as\n", "\n", "\n", "\n", "Figure: The solution of two linear equations represented as the fit\n", "of a straight line through two data points.\n", "\n", "The challenge comes when a third data point is observed and it doesn't\n", "naturally fit on the straight line.\n", "\n", "point 3: $\\inputScalar = 2$, $\\dataScalar=2.5$ $$2.5 = 2m + c$$\n", "\n", "\n", "\n", "Figure: A third observation of data is inconsistent with the solution\n", "dictated by the first two observations.\n", "\n", "Now there are three candidate lines, each consistent with two of the\n", "three observations.\n", "\n", "\n", "\n", "Figure: Three solutions to the problem, each consistent with two\n", "points of the three observations.\n", "\n", "This is known as an *overdetermined* system because there are more data\n", "than we need to determine our parameters. The problem arises because the\n", "model is a simplification of the real world, and the data we observe is\n", "therefore inconsistent with our model." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pods\n", "from ipywidgets import IntSlider\n", "\n", "pods.notebook.display_plots('over_determined_system{samp:0>3}.svg',\n", "                            directory='../slides/diagrams/ml', \n", "                            samp=IntSlider(1, 1, 7, 1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The solution was proposed by Pierre-Simon Laplace. His idea was to\n", "accept that the model was an incomplete representation of the real\n", "world, and the manner in which it was incomplete is *unknown*. He\n", "suggested that such unknowns could be dealt with through probability.\n", "\n", "### Pierre-Simon Laplace \\[edit\\]\n", "\n", "\n", "\n", "Figure: Pierre-Simon Laplace 1749-1827." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pods\n", "pods.notebook.display_google_book(id='1YQPAAAAQAAJ', page='PR17-IA2')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Famously, Laplace considered the idea of a deterministic Universe, one\n", "in which the model is *known*, or as the translation below refers to it,\n", "\"an intelligence which could comprehend all the forces by which nature\n", "is animated\". 
He speculates on an \"intelligence\" that can submit this\n", "vast data to analysis and proposes that such an entity would be able to\n", "predict the future.\n", "\n", "> Given for one instant an intelligence which could comprehend all the\n", "> forces by which nature is animated and the respective situation of the\n", "> beings who compose it---an intelligence sufficiently vast to submit\n", "> these data to analysis---it would embrace in the same formula the\n", "> movements of the greatest bodies of the universe and those of the\n", "> lightest atom; for it, nothing would be uncertain and the future, as\n", "> the past, would be present in its eyes.\n", "\n", "This notion is known as *Laplace's demon* or *Laplace's superman*.\n", "\n", "\n", "\n", "Figure: Laplace's determinism in English translation.\n", "\n", "Unfortunately, most analyses of his ideas stop at that point, whereas\n", "his real point is that such a notion is unreachable. Not so much\n", "*superman* as *strawman*. Just three pages later in the \"Philosophical\n", "Essay on Probabilities\" [@Laplace:essai14], Laplace goes on to observe:\n", "\n", "> The curve described by a simple molecule of air or vapor is regulated\n", "> in a manner just as certain as the planetary orbits; the only\n", "> difference between them is that which comes from our ignorance.\n", ">\n", "> Probability is relative, in part to this ignorance, in part to our\n", "> knowledge." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pods\n", "pods.notebook.display_google_book(id='1YQPAAAAQAAJ', page='PR17-IA4')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: To Laplace, determinism is a strawman. Ignorance of mechanism\n", "and data leads to uncertainty which should be dealt with through\n", "probability.\n", "\n", "In other words, we can never make use of the idealistic deterministic\n", "Universe due to our ignorance about the world. Laplace's suggestion, and\n", "his focus in this essay, is that we turn to probability to deal with this\n", "uncertainty. This is also our inspiration for using probability in\n", "machine learning.\n", "\n", "The \"forces by which nature is animated\" is our *model*, the \"situation\n", "of beings that compose it\" is our *data*, and the \"intelligence\n", "sufficiently vast to submit these data to analysis\" is our\n", "compute. The fly in the ointment is our *ignorance* about these aspects.\n", "And *probability* is the tool we use to incorporate this ignorance\n", "leading to uncertainty or *doubt* in our predictions.\n", "\n", "Laplace's concept was that the reason that the data doesn't match up to\n", "the model is because of unconsidered factors, and that these might be\n", "well represented through probability densities. He tackles the challenge\n", "of the unknown factors by adding a variable, $\\noiseScalar$, that\n", "represents the unknown. In modern parlance we would call this a *latent*\n", "variable. 
But in the context Laplace uses it, the variable is so common\n", "that it has other names such as a \"slack\" variable or the *noise* in the\n", "system.\n", "\n", "point 1: $\\inputScalar = 1$, $\\dataScalar=3$ $$\n", "3 = m + c + \\noiseScalar_1\n", "$$ point 2: $\\inputScalar = 3$, $\\dataScalar=1$ $$\n", "1 = 3m + c + \\noiseScalar_2\n", "$$ point 3: $\\inputScalar = 2$, $\\dataScalar=2.5$ $$\n", "2.5 = 2m + c + \\noiseScalar_3\n", "$$\n", "\n", "Laplace's trick has converted the *overdetermined* system into an\n", "*underdetermined* system. He has now added three variables,\n", "$\\{\\noiseScalar_i\\}_{i=1}^3$, which represent the unknown corruptions of\n", "the real world. Laplace's idea is that we should represent that unknown\n", "corruption with a *probability distribution*.\n", "\n", "## A Probabilistic Process \\[edit\\]\n", "\n", "However, it was left to an admirer of Gauss to develop a practical\n", "probability density for that purpose. It was Carl Friederich Gauss who\n", "suggested that the *Gaussian* density (which at the time was unnamed!)\n", "should be used to represent this error.\n", "\n", "The result is a *noisy* function, a function which has a deterministic\n", "part, and a stochastic part. This type of function is sometimes known as\n", "a probabilistic or stochastic process, to distinguish it from a\n", "deterministic process.\n", "\n", "## The Gaussian Density \\[edit\\]\n", "\n", "The Gaussian density is perhaps the most commonly used probability\n", "density. It is defined by a *mean*, $\\meanScalar$, and a *variance*,\n", "$\\dataStd^2$. The variance is taken to be the square of the *standard\n", "deviation*, $\\dataStd$.\n", "\n", "$$\\begin{align}\n", " p(\\dataScalar| \\meanScalar, \\dataStd^2) & = \\frac{1}{\\sqrt{2\\pi\\dataStd^2}}\\exp\\left(-\\frac{(\\dataScalar - \\meanScalar)^2}{2\\dataStd^2}\\right)\\\\& \\buildrel\\triangle\\over = \\gaussianDist{\\dataScalar}{\\meanScalar}{\\dataStd^2}\n", " \\end{align}$$\n", "\n", "\n", "\n", "Figure: The Gaussian PDF with ${\\meanScalar}=1.7$ and variance\n", "${\\dataStd}^2=0.0225$. Mean shown as red line. It could represent the\n", "heights of a population of students.\n", "\n", "## Two Important Gaussian Properties \\[edit\\]\n", "\n", "The Gaussian density has many important properties, but for the moment\n", "we'll review two of them.\n", "\n", "## Sum of Gaussians\n", "\n", "If we assume that a variable, $\\dataScalar_i$, is sampled from a\n", "Gaussian density,\n", "\n", "$$\\dataScalar_i \\sim \\gaussianSamp{\\meanScalar_i}{\\sigma_i^2}$$\n", "\n", "Then we can show that the sum of a set of variables, each drawn\n", "independently from such a density is also distributed as Gaussian. The\n", "mean of the resulting density is the sum of the means, and the variance\n", "is the sum of the variances,\n", "\n", "$$\\sum_{i=1}^{\\numData} \\dataScalar_i \\sim \\gaussianSamp{\\sum_{i=1}^\\numData \\meanScalar_i}{\\sum_{i=1}^\\numData \\sigma_i^2}$$\n", "\n", "Since we are very familiar with the Gaussian density and its properties,\n", "it is not immediately apparent how unusual this is. Most random\n", "variables, when you add them together, change the family of density they\n", "are drawn from. For example, the Gaussian is exceptional in this regard.\n", "Indeed, other random variables, if they are independently drawn and\n", "summed together tend to a Gaussian density. 
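" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can check the summing property numerically. The short sketch below is an illustrative addition (it is not part of the original lab code): it draws samples from two independent Gaussians and compares the empirical mean and variance of their sum with the sum of the individual means and variances.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "# empirically check that means add and variances add for a sum of independent Gaussians\n", "mu_1, sigma2_1 = 1.0, 0.5\n", "mu_2, sigma2_2 = -2.0, 2.0\n", "num_samples = 100000\n", "\n", "y_1 = np.random.normal(loc=mu_1, scale=np.sqrt(sigma2_1), size=num_samples)\n", "y_2 = np.random.normal(loc=mu_2, scale=np.sqrt(sigma2_2), size=num_samples)\n", "y_sum = y_1 + y_2\n", "\n", "print('empirical mean:', y_sum.mean(), 'theoretical mean:', mu_1 + mu_2)\n", "print('empirical variance:', y_sum.var(), 'theoretical variance:', sigma2_1 + sigma2_2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As noted above, sums of independent variables drawn from other families also tend towards a Gaussian density. 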
That is the [*central limit\n", "theorem*](https://en.wikipedia.org/wiki/Central_limit_theorem) which is\n", "a major justification for the use of a Gaussian density.\n", "\n", "## Scaling a Gaussian\n", "\n", "Less unusual is the *scaling* property of a Gaussian density. If a\n", "variable, $\\dataScalar$, is sampled from a Gaussian density,\n", "\n", "$$\\dataScalar \\sim \\gaussianSamp{\\meanScalar}{\\sigma^2}$$ and we choose\n", "to scale that variable by a *deterministic* value, $\\mappingScalar$,\n", "then the *scaled variable* is distributed as\n", "\n", "$$\\mappingScalar \\dataScalar \\sim \\gaussianSamp{\\mappingScalar\\meanScalar}{\\mappingScalar^2 \\sigma^2}.$$\n", "Unlike the summing properties, where adding two or more random variables\n", "independently sampled from a family of densitites typically brings the\n", "summed variable *outside* that family, scaling many densities leaves the\n", "distribution of that variable in the same *family* of densities. Indeed,\n", "many densities include a *scale* parameter (e.g. the [Gamma\n", "density](https://en.wikipedia.org/wiki/Gamma_distribution)) which is\n", "purely for this purpose. In the Gaussian the standard deviation,\n", "$\\dataStd$, is the scale parameter. To see why this makes sense, let's\n", "consider, $$z \\sim \\gaussianSamp{0}{1},$$ then if we scale by $\\dataStd$\n", "so we have, $\\dataScalar=\\dataStd z$, we can write,\n", "$$\\dataScalar =\\dataStd z \\sim \\gaussianSamp{0}{\\dataStd^2}$$\n", "\n", "# Laplace's Idea\n", "\n", "Laplace had the idea to augment the observations by noise, that is\n", "equivalent to considering a probability density whose mean is given by\n", "the *prediction function*\n", "$$p\\left(\\dataScalar_i|\\inputScalar_i\\right)=\\frac{1}{\\sqrt{2\\pi\\dataStd^2}}\\exp\\left(-\\frac{\\left(\\dataScalar_i-f\\left(\\inputScalar_i\\right)\\right)^{2}}{2\\dataStd^2}\\right).$$\n", "\n", "This is known as *stochastic process*. It is a function that is\n", "corrupted by noise. Laplace didn't suggest the Gaussian density for that\n", "purpose, that was an innovation from Carl Friederich Gauss, which is\n", "what gives the Gaussian density its name.\n", "\n", "## Height as a Function of Weight\n", "\n", "In the standard Gaussian, parametized by mean and variance.\n", "\n", "Make the mean a linear function of an *input*.\n", "\n", "This leads to a regression model. $$\n", "\\begin{align*}\n", " \\dataScalar_i=&\\mappingFunction\\left(\\inputScalar_i\\right)+\\noiseScalar_i,\\\\\n", " \\noiseScalar_i \\sim & \\gaussianSamp{0}{\\dataStd^2}.\n", " \\end{align*}\n", "$$\n", "\n", "Assume $\\dataScalar_i$ is height and $\\inputScalar_i$ is weight.\n", "\n", "\\sides{\n", "If the noise, $\\epsilon_i$ is sampled independently for each data point.\n", "Each data point is independent (given $m$ and $c$).\n", "For *independent* variables:\n", "$$\n", "p(\\dataVector) = \\prod_{i=1}^\\numData p(\\dataScalar_i)\n", "$$\n", "$$\n", "p(\\dataVector|\\inputVector, m, c) = \\prod_{i=1}^\\numData p(\\dataScalar_i|\\inputScalar_i, m, c)\n", "$$\n", "}\n", "# Sum of Squares Error\n", "\n", "Minimizing the sum of squares error was first proposed by\n", "[Legendre](http://en.wikipedia.org/wiki/Adrien-Marie_Legendre) in 1805.\n", "His book, which was on the orbit of comets, is available on google\n", "books, we can take a look at the relevant page by calling the code\n", "below." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pods\n", "pods.notebook.display_google_book(id='spcAAAAAMAAJ', page='PA72')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Of course, the main text is in French, but the key part we are\n", "interested in can be roughly translated as\n", "\n", "> In most matters where we take measures data through observation, the\n", "> most accurate results they can offer, it is almost always leads to a\n", "> system of equations of the form $$E = a + bx + cy + fz + etc .$$ where\n", "> $a$, $b$, $c$, $f$ etc are the known coefficients and $x$, $y$, $z$\n", "> etc are unknown and must be determined by the condition that the value\n", "> of E is reduced, for each equation, to an amount or zero or very\n", "> small.\n", "\n", "He continues\n", "\n", "> Of all the principles that we can offer for this item, I think it is\n", "> not broader, more accurate, nor easier than the one we have used in\n", "> previous research application, and that is to make the minimum sum of\n", "> the squares of the errors. By this means, it is between the errors a\n", "> kind of balance that prevents extreme to prevail, is very specific to\n", "> make known the state of the closest to the truth system. The sum of\n", "> the squares of the errors\n", "> $E^2 + \\left.E^\\prime\\right.^2 + \\left.E^{\\prime\\prime}\\right.^2 + etc$\n", "> being \\begin{align*} &(a + bx + cy + fz + etc)^2 \\\\\n", "> + &(a^\\prime +\n", "> b^\\prime x + c^\\prime y + f^\\prime z + etc ) ^2\\\\\n", "> + &(a^{\\prime\\prime} +\n", "> b^{\\prime\\prime}x + c^{\\prime\\prime}y + f^{\\prime\\prime}z + etc )^2 \\\\\n", "> + & etc\n", "> \\end{align*} if we wanted a minimum, by varying x alone, we will have\n", "> the equation ...\n", "\n", "This is the earliest know printed version of the problem of least\n", "squares. The notation, however, is a little awkward for mordern eyes. In\n", "particular Legendre doesn't make use of the sum sign, $$\n", "\\sum_{i=1}^3 z_i = z_1\n", "+ z_2 + z_3\n", "$$ nor does he make use of the inner product.\n", "\n", "In our notation, if we were to do linear regression, we would need to\n", "subsititue: $$\\begin{align*}\n", "a &\\leftarrow \\dataScalar_1-c, \\\\ a^\\prime &\\leftarrow \\dataScalar_2-c,\\\\ a^{\\prime\\prime} &\\leftarrow\n", "\\dataScalar_3 -c,\\\\ \n", "\\text{etc.} \n", "\\end{align*}$$ to introduce the data observations\n", "$\\{\\dataScalar_i\\}_{i=1}^{\\numData}$ alongside $c$, the offset. We would\n", "then introduce the input locations $$\\begin{align*}\n", "b & \\leftarrow \\inputScalar_1,\\\\\n", "b^\\prime & \\leftarrow \\inputScalar_2,\\\\\n", "b^{\\prime\\prime} & \\leftarrow \\inputScalar_3\\\\\n", "\\text{etc.}\n", "\\end{align*}$$ and finally the gradient of the function\n", "$$x \\leftarrow -m.$$ The remaining coefficients ($c$ and $f$) would then\n", "be zero. 
That would give us $$\\begin{align*} &(\\dataScalar_1 -\n", "(m\\inputScalar_1+c))^2 \\\\\n", "+ &(\\dataScalar_2 -(m\\inputScalar_2 + c))^2\\\\\n", "+ &(\\dataScalar_3 -(m\\inputScalar_3 + c))^2 \\\\\n", "+ &\n", "\\text{etc.}\n", "\\end{align*}$$ which we would write in the modern notation for sums as\n", "$$\n", "\\sum_{i=1}^\\numData (\\dataScalar_i-(m\\inputScalar_i + c))^2\n", "$$ which is recognised as the sum of squares error for a linear\n", "regression.\n", "\n", "This shows the advantage of modern [summation\n", "operator](http://en.wikipedia.org/wiki/Summation), $\\sum$, in keeping\n", "our mathematical notation compact. Whilst it may look more complicated\n", "the first time you see it, understanding the mathematical rules that go\n", "around it, allows us to go much further with the notation.\n", "\n", "Inner products (or [dot\n", "products](http://en.wikipedia.org/wiki/Dot_product)) are similar. They\n", "allow us to write $$\n", "\\sum_{i=1}^q u_i v_i\n", "$$ in a more compact notation, $\\mathbf{u}\\cdot\\mathbf{v}.$\n", "\n", "Here we are using bold face to represent vectors, and we assume that the\n", "individual elements of a vector $\\mathbf{z}$ are given as a series of\n", "scalars $$\n", "\\mathbf{z} = \\begin{bmatrix} z_1\\\\ z_2\\\\ \\vdots\\\\ z_\\numData\n", "\\end{bmatrix}\n", "$$ which are each indexed by their position in the vector.\n", "\n", "# Linear Algebra\n", "\n", "Linear algebra provides a very similar role, when we introduce [linear\n", "algebra](http://en.wikipedia.org/wiki/Linear_algebra), it is because we\n", "are faced with a large number of addition and multiplication operations.\n", "These operations need to be done together and would be very tedious to\n", "write down as a group. So the first reason we reach for linear algebra\n", "is for a more compact representation of our mathematical formulae.\n", "\n", "## Running Example: Olympic Marathons\n", "\n", "Now we will load in the Olympic marathon data. This is data of the\n", "olympic marath times for the men's marathon from the first olympics in\n", "1896 up until the London 2012 olympics." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pods" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = pods.datasets.olympic_marathon_men()\n", "x = data['X']\n", "y = data['Y']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can see what these values are by typing:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(x)\n", "print(y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that they are not `pandas` data frames for this example, they are\n", "just arrays of dimensionality $\\numData\\times 1$, where $\\numData$ is\n", "the number of data.\n", "\n", "The aim of this lab is to have you coding linear regression in python.\n", "We will do it in two ways, once using iterative updates (coordinate\n", "ascent) and then using linear algebra. 
The linear algebra approach will\n", "not only work much better, it is easy to extend to multiple input linear\n", "regression and *non-linear* regression using basis functions.\n", "\n", "## Plotting the Data\n", "\n", "You can make a plot of $\\dataScalar$ vs $\\inputScalar$ with the\n", "following command:\n", "\n", "## Maximum Likelihood: Iterative Solution\n", "\n", "Now we will take the maximum likelihood approach we derived in the\n", "lecture to fit a line, $\\dataScalar_i=m\\inputScalar_i + c$, to the data\n", "you've plotted. We are trying to minimize the error function: $$\n", "\\errorFunction(m, c) = \\sum_{i=1}^\\numData(\\dataScalar_i-m\\inputScalar_i-c)^2\n", "$$ with respect to $m$, $c$ and $\\sigma^2$. We can start with an initial\n", "guess for $m$," ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "m = -0.4\n", "c = 80" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then we use the maximum likelihood update to find an estimate for the\n", "offset, $c$.\n", "\n", "## Coordinate Descent\n", "\n", "In the movie recommender system example, we minimised the objective\n", "function by steepest descent based gradient methods. Our updates\n", "required us to compute the gradient at the position we were located,\n", "then to update the gradient according to the direction of steepest\n", "descent. This time, we will take another approach. It is known as\n", "*coordinate descent*. In coordinate descent, we choose to move one\n", "parameter at a time. Ideally, we design an algorithm that at each step\n", "moves the parameter to its minimum value. At each step we choose to move\n", "the individual parameter to its minimum.\n", "\n", "To find the minimum, we look for the point in the curve where the\n", "gradient is zero. This can be found by taking the gradient of\n", "$\\errorFunction(m,c)$ with respect to the parameter.\n", "\n", "### Update for Offset\n", "\n", "Let's consider the parameter $c$ first. The gradient goes nicely through\n", "the summation operator, and we obtain $$\n", "\\frac{\\text{d}\\errorFunction(m,c)}{\\text{d}c} = -\\sum_{i=1}^\\numData 2(\\dataScalar_i-m\\inputScalar_i-c).\n", "$$ Now we want the point that is a minimum. A minimum is an example of a\n", "[*stationary point*](http://en.wikipedia.org/wiki/Stationary_point), the\n", "stationary points are those points of the function where the gradient is\n", "zero. They are found by solving the equation for\n", "$\\frac{\\text{d}\\errorFunction(m,c)}{\\text{d}c} = 0$. Substituting in to\n", "our gradient, we can obtain the following equation, $$\n", "0 = -\\sum_{i=1}^\\numData 2(\\dataScalar_i-m\\inputScalar_i-c)\n", "$$ which can be reorganised as follows, $$\n", "c^* = \\frac{\\sum_{i=1}^\\numData(\\dataScalar_i-m^*\\inputScalar_i)}{\\numData}.\n", "$$ The fact that the stationary point is easily extracted in this manner\n", "implies that the solution is *unique*. There is only one stationary\n", "point for this system. Traditionally when trying to determine the type\n", "of stationary point we have encountered we now compute the *second\n", "derivative*, $$\n", "\\frac{\\text{d}^2\\errorFunction(m,c)}{\\text{d}c^2} = 2n.\n", "$$ The second derivative is positive, which in turn implies that we have\n", "found a minimum of the function. This means that setting $c$ in this way\n", "will take us to the lowest point along that axes." 
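 ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before applying the update it is convenient to have a function that evaluates the sum of squares error for any candidate $m$ and $c$. The helper below is an illustrative addition (the function `objective` is not part of the original lab code); calling it before and after each update lets us confirm that the error decreases, and it will also be useful when computing the objective in Question 3.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def objective(m, c, x, y):\n", "    # sum of squares error for the linear fit y = m*x + c\n", "    return ((y - m*x - c)**2).sum()\n", "\n", "# error at the initial guesses for m and c\n", "print(objective(m, c, x, y))"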
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# set c to the minimum\n", "c = (y - m*x).mean()\n", "print(c)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Update for Slope\n", "\n", "Now we have the offset set to the minimum value, in coordinate descent,\n", "the next step is to optimise another parameter. Only one further\n", "parameter remains. That is the slope of the system.\n", "\n", "Now we can turn our attention to the slope. We once again peform the\n", "same set of computations to find the minima. We end up with an update\n", "equation of the following form.\n", "\n", "$$m^* = \\frac{\\sum_{i=1}^\\numData (\\dataScalar_i - c)\\inputScalar_i}{\\sum_{i=1}^\\numData \\inputScalar_i^2}$$\n", "\n", "Communication of mathematics in data science is an essential skill, in a\n", "moment, you will be asked to rederive the equation above. Before we do\n", "that, however, we will briefly review how to write mathematics in the\n", "notebook.\n", "\n", "## $\\LaTeX$ for Maths\n", "\n", "These cells use [Markdown\n", "format](http://en.wikipedia.org/wiki/Markdown). You can include maths in\n", "your markdown using [$\\LaTeX$\n", "syntax](http://en.wikipedia.org/wiki/LaTeX), all you have to do is write\n", "your answer inside dollar signs, as follows:\n", "\n", "To write a fraction, we write `$\\frac{a}{b}$`, and it will display like\n", "this $\\frac{a}{b}$. To write a subscript we write `$a_b$` which will\n", "appear as $a_b$. To write a superscript (for example in a polynomial) we\n", "write `$a^b$` which will appear as $a^b$. There are lots of other macros\n", "as well, for example we can do greek letters such as\n", "`$\\alpha, \\beta, \\gamma$` rendering as $\\alpha, \\beta, \\gamma$. And we\n", "can do sum and intergral signs as `$\\sum \\int \\int$`.\n", "\n", "You can combine many of these operations together for composing\n", "expressions.\n", "\n", "### Question 1\n", "\n", "Convert the following python code expressions into $\\LaTeX$j, writing\n", "your answers below. In each case write your answer as a single equality\n", "(i.e. your maths should only contain one expression, not several lines\n", "of expressions). For the purposes of your $\\LaTeX$ please assume that\n", "`x` and `w` are $n$ dimensional vectors.\n", "\n", "`(a) f = x.sum()`\n", "\n", "`(b) m = x.mean()`\n", "\n", "`(c) g = (x*w).sum()`\n", "\n", "*15 marks*\n", "\n", "### Write your answer to Question 1 here\n", "\n", "## Fixed Point Updates\n", "\n", "[Worked example.]{style=\"text-align:left\"} $$\n", "\\begin{aligned}\n", " c^{*}=&\\frac{\\sum\n", "_{i=1}^{\\numData}\\left(\\dataScalar_i-m^{*}\\inputScalar_i\\right)}{\\numData},\\\\\n", " m^{*}=&\\frac{\\sum\n", "_{i=1}^{\\numData}\\inputScalar_i\\left(\\dataScalar_i-c^{*}\\right)}{\\sum _{i=1}^{\\numData}\\inputScalar_i^{2}},\\\\\n", "\\left.\\dataStd^2\\right.^{*}=&\\frac{\\sum\n", "_{i=1}^{\\numData}\\left(\\dataScalar_i-m^{*}\\inputScalar_i-c^{*}\\right)^{2}}{\\numData}\n", "\\end{aligned}\n", "$$\n", "\n", "## Gradient With Respect to the Slope\n", "\n", "Now that you've had a little training in writing maths with $\\LaTeX$, we\n", "will be able to use it to answer questions. The next thing we are going\n", "to do is a little differentiation practice.\n", "\n", "### Question 2\n", "\n", "Derive the the gradient of the objective function with respect to the\n", "slope, $m$. 
Rearrange it to show that the update equation written above\n", "does find the stationary points of the objective function. By computing\n", "its derivative show that it's a minimum.\n", "\n", "*20 marks*\n", "\n", "### Write your answer to Question 2 here" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "m = ((y - c)*x).sum()/(x**2).sum()\n", "print(m)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can have a look at how good our fit is by computing the prediction\n", "across the input space. First create a vector of 'test points'," ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x_test = np.linspace(1890, 2020, 130)[:, None]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now use this vector to compute some test predictions," ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "f_test = m*x_test + c" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now plot those test predictions with a blue line on the same plot as the\n", "data,\n", "\n", "The fit isn't very good, we need to iterate between these parameter\n", "updates in a loop to improve the fit, we have to do this several times," ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for i in np.arange(10):\n", " m = ((y - c)*x).sum()/(x*x).sum()\n", " c = (y-m*x).sum()/y.shape[0]\n", "print(m)\n", "print(c)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And let's try plotting the result again" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "f_test = m*x_test + c\n", "plt.plot(x_test, f_test, 'b-')\n", "plt.plot(x, y, 'rx')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Clearly we need more iterations than 10! In the next question you will\n", "add more iterations and report on the error as optimisation proceeds.\n", "\n", "### Question 3\n", "\n", "There is a problem here, we seem to need many interations to get to a\n", "good solution. Let's explore what's going on. Write code which\n", "alternates between updates of `c` and `m`. Include the following\n", "features in your code.\n", "\n", "1. Initialise with `m=-0.4` and `c=80`.\n", "2. Every 10 iterations compute the value of the objective function for\n", " the training data and print it to the screen (you'll find hints on\n", " this in [the lab from last week](./week2.ipynb)).\n", "3. Cause the code to stop running when the error change over less than\n", " 10 iterations is smaller than $1\\times10^{-4}$. 
This is known as a\n", " stopping criterion.\n", "\n", "Why do we need so many iterations to get to the solution?\n", "\n", "*25 marks*" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Write code for your answer to Question 3 in this box\n", "# provide the answers so that the code runs correctly otherwise you will loose marks!\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Important Concepts Not Covered\n", "\n", "- Other optimization methods:\n", " - Second order methods, conjugate gradient, quasi-Newton and\n", " Newton.\n", "- Effective heuristics such as momentum.\n", "- Local vs global solutions.\n", "\n", "## Reading\n", "\n", "- Section 1.1-1.2 of @Rogers:book11 for fitting linear models.\n", "- Section 1.2.5 of @Bishop:book06 up to equation 1.65.\n", "\n", "## Objective Functions and Regression \\[edit\\]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import mlai" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x = np.random.normal(size=4)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "m_true = 1.4\n", "c_true = -3.1" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "y = m_true*x+c_true" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: A simple linear regression.\n", "\n", "## Noise Corrupted Plot" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "noise = np.random.normal(scale=0.5, size=4) # standard deviation of the noise is 0.5\n", "y = m_true*x + c_true + noise\n", "plt.plot(x, y, 'r.', markersize=10)\n", "plt.xlim([-3, 3])\n", "mlai.write_figure(filename=\"../slides/diagrams/ml/regression_noise.svg\", transparent=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: A simple linear regression with noise.\n", "\n", "## Contour Plot of Error Function\n", "\n", "- Visualise the error function surface, create vectors of values." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# create an array of linearly separated values around m_true\n", "m_vals = np.linspace(m_true-3, m_true+3, 100) \n", "# create an array of linearly separated values ae\n", "c_vals = np.linspace(c_true-3, c_true+3, 100)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- create a grid of values to evaluate the error function in 2D." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "m_grid, c_grid = np.meshgrid(m_vals, c_vals)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- compute the error function at each combination of $c$ and $m$." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "E_grid = np.zeros((100, 100))\n", "for i in range(100):\n", " for j in range(100):\n", " E_grid[i, j] = ((y - m_grid[i, j]*x - c_grid[i, j])**2).sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Contour Plot of Error" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%load -s regression_contour teaching_plots.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: Contours of the objective function for linear regression by\n", "minimizing least squares.\n", "\n", "## Steepest Descent\n", "\n", "## Algorithm\n", "\n", "- We start with a guess for $m$ and $c$." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "m_star = 0.0\n", "c_star = -5.0" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Offset Gradient\n", "\n", "- Now we need to compute the gradient of the error function, firstly\n", " with respect to $c$," ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "$$\\frac{\\text{d}\\errorFunction(m, c)}{\\text{d} c} =\n", "-2\\sum_{i=1}^\\numData (\\dataScalar_i - m\\inputScalar_i - c)$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- This is computed in python as follows" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "c_grad = -2*(y-m_star*x - c_star).sum()\n", "print(\"Gradient with respect to c is \", c_grad)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Deriving the Gradient\n", "\n", "To see how the gradient was derived, first note that the $c$ appears in\n", "every term in the sum. So we are just differentiating\n", "$(\\dataScalar_i - m\\inputScalar_i - c)^2$ for each term in the sum. The\n", "gradient of this term with respect to $c$ is simply the gradient of the\n", "outer quadratic, multiplied by the gradient with respect to $c$ of the\n", "part inside the quadratic. The gradient of a quadratic is two times the\n", "argument of the quadratic, and the gradient of the inside linear term is\n", "just minus one. 
This is true for all terms in the sum, so we are left\n", "with the sum in the gradient.\n", "\n", "## Slope Gradient\n", "\n", "The gradient with respect tom $m$ is similar, but now the gradient of\n", "the quadratic's argument is $-\\inputScalar_i$ so the gradient with\n", "respect to $m$ is\n", "\n", "$$\\frac{\\text{d}\\errorFunction(m, c)}{\\text{d} m} = -2\\sum_{i=1}^\\numData \\inputScalar_i(\\dataScalar_i - m\\inputScalar_i -\n", "c)$$\n", "\n", "which can be implemented in python (numpy) as" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "m_grad = -2*(x*(y-m_star*x - c_star)).sum()\n", "print(\"Gradient with respect to m is \", m_grad)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Update Equations\n", "\n", "- Now we have gradients with respect to $m$ and $c$.\n", "- Can update our inital guesses for $m$ and $c$ using the gradient.\n", "- We don't want to just subtract the gradient from $m$ and $c$,\n", "- We need to take a *small* step in the gradient direction.\n", "- Otherwise we might overshoot the minimum.\n", "- We want to follow the gradient to get to the minimum, the gradient\n", " changes all the time.\n", "\n", "## Move in Direction of Gradient\n", "\n", "\n", "\n", "Figure: Single update descending the contours of the error surface\n", "for regression.\n", "\n", "## Update Equations\n", "\n", "- The step size has already been introduced, it's again known as the\n", " learning rate and is denoted by $\\learnRate$. $$\n", " c_\\text{new}\\leftarrow c_{\\text{old}} - \\learnRate \\frac{\\text{d}\\errorFunction(m, c)}{\\text{d}c}\n", " $$\n", "\n", "- gives us an update for our estimate of $c$ (which in the code we've\n", " been calling `c_star` to represent a common way of writing a\n", " parameter estimate, $c^*$) and $$\n", " m_\\text{new} \\leftarrow m_{\\text{old}} - \\learnRate \\frac{\\text{d}\\errorFunction(m, c)}{\\text{d}m}\n", " $$\n", "- Giving us an update for $m$.\n", "\n", "## Update Code\n", "\n", "- These updates can be coded as" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"Original m was\", m_star, \"and original c was\", c_star)\n", "learn_rate = 0.01\n", "c_star = c_star - learn_rate*c_grad\n", "m_star = m_star - learn_rate*m_grad\n", "print(\"New m is\", m_star, \"and new c is\", c_star)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Iterating Updates\n", "\n", "- Fit model by descending gradient.\n", "\n", "## Gradient Descent Algorithm" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "num_plots = plot.regression_contour_fit(x, y, diagrams='../slides/diagrams/ml')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pods\n", "from ipywidgets import IntSlider" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pods.notebook.display_plots('regression_contour_fit{num:0>3}.svg', directory='../slides/diagrams/ml', num=IntSlider(0, 0, num_plots, 1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "{\n", "\n", "\n", "Figure: Stochastic gradient descent for linear regression\n", "\n", "## Stochastic Gradient Descent\n", "\n", "- If $\\numData$ is small, gradient descent is fine.\n", "- But sometimes (e.g. 
on the internet) $\numData$ could be a billion.\n", "- Stochastic gradient descent is more similar to the perceptron.\n", "- Look at the gradient of one data point at a time rather than summing\n", " across *all* data points.\n", "- This gives a stochastic estimate of the gradient.\n", "\n", "## Stochastic Gradient Descent\n", "\n", "- The real gradient with respect to $m$ is given by" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$\frac{\text{d}\errorFunction(m, c)}{\text{d} m} = -2\sum_{i=1}^\numData \inputScalar_i(\dataScalar_i -\n", "m\inputScalar_i - c)$$\n", "\n", "but it has $\numData$ terms in the sum. Substituting in the gradient\n", "we can see that the full update is of the form\n", "\n", "$$m_\text{new} \leftarrow\n", "m_\text{old} + 2\learnRate \left[\inputScalar_1 (\dataScalar_1 - m_\text{old}\inputScalar_1 - c_\text{old}) + \inputScalar_2 (\dataScalar_2 - m_\text{old}\inputScalar_2 - c_\text{old}) + \dots + \inputScalar_n (\dataScalar_n - m_\text{old}\inputScalar_n - c_\text{old})\right]$$\n", "\n", "This could be split up into lots of individual updates\n", "$$m_1 \leftarrow m_\text{old} + 2\learnRate \left[\inputScalar_1 (\dataScalar_1 - m_\text{old}\inputScalar_1 -\n", "c_\text{old})\right]$$\n", "$$m_2 \leftarrow m_1 + 2\learnRate \left[\inputScalar_2 (\dataScalar_2 -\n", "m_\text{old}\inputScalar_2 - c_\text{old})\right]$$\n", "$$m_3 \leftarrow m_2 + 2\learnRate\n", "\left[\dots\right]$$\n", "$$m_n \leftarrow m_{n-1} + 2\learnRate \left[\inputScalar_n (\dataScalar_n -\n", "m_\text{old}\inputScalar_n - c_\text{old})\right]$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "which would lead to the same final update.\n", "\n", "## Updating $c$ and $m$\n", "\n", "- In the sum we don't change the $m$ and $c$ we use for computing the gradient\n", " term at each update.\n", "- In stochastic gradient descent we *do* change them.\n", "- This means it's not quite the same as steepest descent.\n", "- But we can present each data point in a random order, like we did\n", " for the perceptron.\n", "- This makes the algorithm suitable for large scale web use (recently\n", " this domain is known as 'Big Data') and algorithms like this are\n", " widely used by Google, Microsoft, Amazon, Twitter and Facebook.\n", "\n", "## Stochastic Gradient Descent\n", "\n", "- Or more accurately, since the data is normally presented in a random\n", " order, we can just write $$\n", " m_\text{new} = m_\text{old} + 2\learnRate\left[\inputScalar_i (\dataScalar_i - m_\text{old}\inputScalar_i - c_\text{old})\right]\n", " $$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# choose a random point for the update \n", "i = np.random.randint(x.shape[0])\n", "# update m\n", "m_star = m_star + 2*learn_rate*(x[i]*(y[i]-m_star*x[i] - c_star))\n", "# update c\n", "c_star = c_star + 2*learn_rate*(y[i]-m_star*x[i] - c_star)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## SGD for Linear Regression\n", "\n", "Putting it all together in an algorithm, we can do stochastic gradient\n", "descent for our regression data."
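] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below is a small sketch of what that loop might look like in code. The number of passes through the data is an illustrative choice (as is the learning rate we set earlier), not a tuned value." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# a sketch of the full stochastic gradient descent loop\n", "num_passes = 100 # illustrative number of passes through the data\n", "for iteration in range(num_passes):\n", "    # present the data points in a random order\n", "    for i in np.random.permutation(x.shape[0]):\n", "        # update m then c using the gradient of a single data point\n", "        m_star = m_star + 2*learn_rate*(x[i]*(y[i] - m_star*x[i] - c_star))\n", "        c_star = c_star + 2*learn_rate*(y[i] - m_star*x[i] - c_star)\n", "print(\"After\", num_passes, \"passes m is\", m_star, \"and c is\", c_star)"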
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pods\n", "from ipywidgets import IntSlider" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pods.notebook.display_plots('regression_sgd_contour_fit{num:0>3}.svg', \n", " directory='../slides/diagrams/mlai', num=IntSlider(0, 0, num_plots, 1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: Stochastic gradient descent for linear regression.\n", "\n", "## Reflection on Linear Regression and Supervised Learning\n", "\n", "Think about:\n", "\n", "1. What effect does the learning rate have in the optimization? What's\n", " the effect of making it too small, what's the effect of making it\n", " too big? Do you get the same result for both stochastic and steepest\n", " gradient descent?\n", "\n", "2. The stochastic gradient descent doesn't help very much for such a\n", " small data set. It's real advantage comes when there are many,\n", " you'll see this in the lab.\n", "\n", "# Multiple Input Solution with Linear Algebra\n", "\n", "You've now seen how slow it can be to perform a coordinate ascent on a\n", "system. Another approach to solving the system (which is not always\n", "possible, particularly in *non-linear* systems) is to go direct to the\n", "minimum. To do this we need to introduce *linear algebra*. We will\n", "represent all our errors and functions in the form of linear algebra. As\n", "we mentioned above, linear algebra is just a shorthand for performing\n", "lots of multiplications and additions simultaneously. What does it have\n", "to do with our system then? Well the first thing to note is that the\n", "linear function we were trying to fit has the following form: $$\n", "\\mappingFunction(x) = mx + c\n", "$$ the classical form for a straight line. From a linear algebraic\n", "perspective we are looking for multiplications and additions. We are\n", "also looking to separate our parameters from our data. The data is the\n", "*givens* remember, in French the word is données literally translated\n", "means *givens* that's great, because we don't need to change the data,\n", "what we need to change are the parameters (or variables) of the model.\n", "In this function the data comes in through $x$, and the parameters are\n", "$m$ and $c$.\n", "\n", "What we'd like to create is a vector of parameters and a vector of data.\n", "Then we could represent the system with vectors that represent the data,\n", "and vectors that represent the parameters.\n", "\n", "We look to turn the multiplications and additions into a linear\n", "algebraic form, we have one multiplication ($m\\times c$) and one\n", "addition ($mx + c$). But we can turn this into a inner product by\n", "writing it in the following way, $$\n", "\\mappingFunction(x) = m \\times x +\n", "c \\times 1,\n", "$$ in other words we've extracted the unit value, from the offset, $c$.\n", "We can think of this unit value like an extra item of data, because it\n", "is always given to us, and it is always set to 1 (unlike regular data,\n", "which is likely to vary!). We can therefore write each input data\n", "location, $\\inputVector$, as a vector $$\n", "\\inputVector = \\begin{bmatrix} 1\\\\ x\\end{bmatrix}.\n", "$$\n", "\n", "Now we choose to also turn our parameters into a vector. 
The parameter\n", "vector will be defined to contain $$\n", "\mappingVector = \begin{bmatrix} c \\ m\end{bmatrix}\n", "$$ because if we now take the inner product between these two vectors we\n", "recover $$\n", "\inputVector\cdot\mappingVector = 1 \times c + x \times m = mx + c\n", "$$ In `numpy` we can define this vector as follows" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# define the vector w, with the offset c first to match the definition above\n", "w = np.zeros(shape=(2, 1))\n", "w[0] = c\n", "w[1] = m" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This gives us the equivalence between the original operation and an\n", "operation in vector space. Whilst the notation here isn't a lot shorter,\n", "the beauty is that we will be able to add as many features as we like\n", "and still keep the same representation. In general, we are now moving\n", "to a system where each of our predictions is given by an inner product.\n", "When we want to represent an inner product in linear algebra, we tend to\n", "do it with the transpose operation, so since we have\n", "$\mathbf{a}\cdot\mathbf{b} = \mathbf{a}^\top\mathbf{b}$ we can write $$\n", "\mappingFunction(\inputVector_i) = \inputVector_i^\top\mappingVector.\n", "$$ where we've assumed that each data point, $\inputVector_i$, is now\n", "written by appending a 1 onto the original vector $$\n", "\inputVector_i = \begin{bmatrix} \n", "1 \\\n", "\inputScalar_i\n", "\end{bmatrix}\n", "$$\n", "\n", "# Design Matrix\n", "\n", "We can do this for the entire data set to form a [*design\n", "matrix*](http://en.wikipedia.org/wiki/Design_matrix) $\inputMatrix$,\n", "\n", "$$\inputMatrix\n", "= \begin{bmatrix} \n", "\inputVector_1^\top \\\ \n", "\inputVector_2^\top \\\ \n", "\vdots \\\\n", "\inputVector_\numData^\top\n", "\end{bmatrix} = \begin{bmatrix}\n", "1 & \inputScalar_1 \\\\n", "1 & \inputScalar_2 \\\\n", "\vdots\n", "& \vdots \\\\n", "1 & \inputScalar_\numData \n", "\end{bmatrix},$$\n", "\n", "which in `numpy` can be done with the following commands:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X = np.hstack((np.ones_like(x), x))\n", "print(X)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Writing the Objective with Linear Algebra\n", "\n", "When we think of the objective function, we can think of it as the\n", "errors where the error is defined in a similar way to what it was in\n", "Legendre's day, $\dataScalar_i - \mappingFunction(\inputVector_i)$; in\n", "statistics these errors are also sometimes called\n", "[*residuals*](http://en.wikipedia.org/wiki/Errors_and_residuals_in_statistics).\n", "So we can think of the objective and the prediction function as two\n", "separate parts, first we have, $$\n", "\errorFunction(\mappingVector) = \sum_{i=1}^\numData (\dataScalar_i - \mappingFunction(\inputVector_i; \mappingVector))^2,\n", "$$ where we've made the function $\mappingFunction(\cdot)$'s dependence\n", "on the parameters $\mappingVector$ explicit in this equation.
Then we\n", "have the definition of the function itself, $$\n", "\\mappingFunction(\\inputVector_i; \\mappingVector) = \\inputVector_i^\\top \\mappingVector.\n", "$$ Let's look again at these two equations and see if we can identify\n", "any inner products. The first equation is a sum of squares, which is\n", "promising. Any sum of squares can be represented by an inner product, $$\n", "a = \\sum_{i=1}^{k} b^2_i = \\mathbf{b}^\\top\\mathbf{b},\n", "$$ so if we wish to represent $\\errorFunction(\\mappingVector)$ in this\n", "way, all we need to do is convert the sum operator to an inner product.\n", "We can get a vector from that sum operator by placing both\n", "$\\dataScalar_i$ and $\\mappingFunction(\\inputVector_i; \\mappingVector)$\n", "into vectors, which we do by defining $$\n", "\\dataVector = \\begin{bmatrix}\\dataScalar_1\\\\ \\dataScalar_2\\\\ \\vdots \\\\ \\dataScalar_\\numData\\end{bmatrix}\n", "$$ and defining $$\n", "\\mappingFunctionVector(\\inputVector_1; \\mappingVector) = \\begin{bmatrix}\\mappingFunction(\\inputVector_1; \\mappingVector)\\\\ \\mappingFunction(\\inputVector_2; \\mappingVector)\\\\ \\vdots \\\\ \\mappingFunction(\\inputVector_\\numData; \\mappingVector)\\end{bmatrix}.\n", "$$ The second of these is actually a vector-valued function. This term\n", "may appear intimidating, but the idea is straightforward. A vector\n", "valued function is simply a vector whose elements are themselves defined\n", "as *functions*, i.e. it is a vector of functions, rather than a vector\n", "of scalars. The idea is so straightforward, that we are going to ignore\n", "it for the moment, and barely use it in the derivation. But it will\n", "reappear later when we introduce *basis functions*. So we will, for the\n", "moment, ignore the dependence of $\\mappingFunctionVector$ on\n", "$\\mappingVector$ and $\\inputMatrix$ and simply summarise it by a vector\n", "of numbers $$\n", "\\mappingFunctionVector = \\begin{bmatrix}\\mappingFunction_1\\\\\\mappingFunction_2\\\\\n", "\\vdots \\\\ \\mappingFunction_\\numData\\end{bmatrix}.\n", "$$ This allows us to write our objective in the folowing, linear\n", "algebraic form, $$\n", "\\errorFunction(\\mappingVector) = (\\dataVector - \\mappingFunctionVector)^\\top(\\dataVector - \\mappingFunctionVector)\n", "$$ from the rules of inner products. But what of our matrix\n", "$\\inputMatrix$ of input data? At this point, we need to dust off\n", "[*matrix-vector\n", "multiplication*](http://en.wikipedia.org/wiki/Matrix_multiplication).\n", "Matrix multiplication is simply a convenient way of performing many\n", "inner products together, and it's exactly what we need to summarise the\n", "operation $$\n", "f_i = \\inputVector_i^\\top\\mappingVector.\n", "$$ This operation tells us that each element of the vector\n", "$\\mappingFunctionVector$ (our vector valued function) is given by an\n", "inner product between $\\inputVector_i$ and $\\mappingVector$. In other\n", "words it is a series of inner products. 
Let's look at the definition of\n", "matrix multiplication; it takes the form $$\n", "\mathbf{c} = \mathbf{B}\mathbf{a}\n", "$$ where $\mathbf{c}$ might be a $k$ dimensional vector (which we can\n", "interpret as a $k\times 1$ dimensional matrix), and $\mathbf{B}$ is a\n", "$k\times k$ dimensional matrix and $\mathbf{a}$ is a $k$ dimensional\n", "vector ($k\times 1$ dimensional matrix).\n", "\n", "The result of this multiplication is of the form $$\n", "\begin{bmatrix}c_1\\c_2 \\ \vdots \\\n", "c_k\end{bmatrix} = \n", "\begin{bmatrix} b_{1,1} & b_{1, 2} & \dots & b_{1, k} \\\n", "b_{2, 1} & b_{2, 2} & \dots & b_{2, k} \\\n", "\vdots & \vdots & \ddots & \vdots \\\n", "b_{k, 1} & b_{k, 2} & \dots & b_{k, k} \end{bmatrix} \begin{bmatrix}a_1\\a_2 \\\n", "\vdots\\ a_k\end{bmatrix} = \begin{bmatrix} b_{1, 1}a_1 + b_{1, 2}a_2 + \dots +\n", "b_{1, k}a_k\\\n", "b_{2, 1}a_1 + b_{2, 2}a_2 + \dots + b_{2, k}a_k \\ \n", "\vdots\\\n", "b_{k, 1}a_1 + b_{k, 2}a_2 + \dots + b_{k, k}a_k\end{bmatrix}\n", "$$ so we see that each element of the result, $\mathbf{c}$, is simply the\n", "inner product between each *row* of $\mathbf{B}$ and the vector\n", "$\mathbf{a}$. Because we have defined each element of\n", "$\mappingFunctionVector$ to be given by the inner product between each\n", "*row* of the design matrix and the vector $\mappingVector$ we can now\n", "write the full operation in one matrix multiplication, $$\n", "\mappingFunctionVector = \inputMatrix\mappingVector.\n", "$$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "f = np.dot(X, w) # np.dot does matrix multiplication in python" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Combining this result with our objective function, $$\n", "\errorFunction(\mappingVector) = (\dataVector - \mappingFunctionVector)^\top(\dataVector - \mappingFunctionVector)\n", "$$ we find we have defined the *model* with two equations. One equation\n", "tells us the form of our predictive function and how it depends on its\n", "parameters; the other tells us the form of our objective function." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "resid = (y-f)\n", "E = np.dot(resid.T, resid) # matrix multiplication on a single vector is equivalent to a dot product.\n", "print(\"Error function is:\", E)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Question 4\n", "\n", "The prediction for our movie recommender system had the form $$\n", "f_{i,j} = \mathbf{u}_i^\top \mathbf{v}_j\n", "$$ and the objective function was then $$\n", "E = \sum_{i,j} s_{i,j}(\dataScalar_{i,j} - f_{i, j})^2\n", "$$ Try writing this down in matrix and vector form. How many of the\n", "terms can you do? For each variable and parameter carefully think about\n", "whether it should be represented as a matrix or vector. Do as many of\n", "the terms as you can. Use $\LaTeX$ to give your answers and give the\n", "*dimensions* of any matrices you create.\n", "\n", "*20 marks*\n", "\n", "### Write your answer to Question 4 here\n", "\n", "# Objective Optimisation\n", "\n", "Our *model* has now been defined with two equations, the prediction\n", "function and the objective function. Next we will use multivariate\n", "calculus to define an *algorithm* to fit the model.
The separation\n", "between model and algorithm is important and is often overlooked. Our\n", "model contains a function that shows how it will be used for prediction,\n", "and a function that describes the objective function we need to optimise\n", "to obtain a good set of parameters.\n", "\n", "The linear regression model we have described is still the same as\n", "the one we fitted above with a coordinate ascent algorithm. We have only\n", "played with the notation to obtain the same model in a matrix and vector\n", "notation. However, we will now fit this model with a different\n", "algorithm, one that is much faster. It is such a widely used algorithm\n", "that from the end user's perspective it doesn't even look like an\n", "algorithm; it just appears to be a single operation (or function).\n", "However, underneath, the computer calls an algorithm to find the\n", "solution. Further, the algorithm we obtain is very widely used, and\n", "because of this it turns out to be highly optimised.\n", "\n", "Once again we are going to try and find the minimum of our\n", "objective by finding its *stationary points*. However, the stationary\n", "points of a multivariate function are a little bit more complex to\n", "find. Once again we need to find the point at which the derivative is\n", "zero, but now we need to use *multivariate calculus* to find it. This\n", "involves learning a few additional rules of differentiation (that allow\n", "you to do the derivatives of a function with respect to a vector), but in\n", "the end it makes things quite a bit easier. We define vectorial\n", "derivatives as follows, $$\n", "\frac{\text{d}\errorFunction(\mappingVector)}{\text{d}\mappingVector} =\n", "\begin{bmatrix}\frac{\text{d}\errorFunction(\mappingVector)}{\text{d}\mappingScalar_1}\\\frac{\text{d}\errorFunction(\mappingVector)}{\text{d}\mappingScalar_2}\end{bmatrix}.\n", "$$ where\n", "$\frac{\text{d}\errorFunction(\mappingVector)}{\text{d}\mappingScalar_1}$\n", "is the [partial\n", "derivative](http://en.wikipedia.org/wiki/Partial_derivative) of the\n", "error function with respect to $\mappingScalar_1$.\n", "\n", "Differentiation through multiplications and additions is relatively\n", "straightforward, and since linear algebra is just multiplication and\n", "addition, its rules of differentiation are quite straightforward too,\n", "but slightly more complex than regular derivatives.\n", "\n", "## Multivariate Derivatives\n", "\n", "We will need two rules of multivariate or *matrix* differentiation. The\n", "first is differentiation of an inner product. By remembering that the\n", "inner product is made up of multiplication and addition, we can hope\n", "that its derivative is quite straightforward, and so it proves to be.
We\n", "can start by thinking about the definition of the inner product, $$\n", "\\mathbf{a}^\\top\\mathbf{z} = \\sum_{i} a_i\n", "z_i,\n", "$$ which if we were to take the derivative with respect to $z_k$ would\n", "simply return the gradient of the one term in the sum for which the\n", "derivative was non zero, that of $a_k$, so we know that $$\n", "\\frac{\\text{d}}{\\text{d}z_k} \\mathbf{a}^\\top \\mathbf{z} = a_k\n", "$$ and by our definition of multivariate derivatives we can simply stack\n", "all the partial derivatives of this form in a vector to obtain the\n", "result that $$\n", "\\frac{\\text{d}}{\\text{d}\\mathbf{z}}\n", "\\mathbf{a}^\\top \\mathbf{z} = \\mathbf{a}.\n", "$$ The second rule that's required is differentiation of a 'matrix\n", "quadratic'. A scalar quadratic in $z$ with coefficient $c$ has the form\n", "$cz^2$. If $\\mathbf{z}$ is a $k\\times 1$ vector and $\\mathbf{C}$ is a\n", "$k \\times k$ *matrix* of coefficients then the matrix quadratic form is\n", "written as $\\mathbf{z}^\\top \\mathbf{C}\\mathbf{z}$, which is itself a\n", "*scalar* quantity, but it is a function of a *vector*.\n", "\n", "### Matching Dimensions in Matrix Multiplications\n", "\n", "There's a trick for telling that it's a scalar result. When you are\n", "doing maths with matrices, it's always worth pausing to perform a quick\n", "sanity check on the dimensions. Matrix multplication only works when the\n", "dimensions match. To be precise, the 'inner' dimension of the matrix\n", "must match. What is the inner dimension. If we multiply two matrices\n", "$\\mathbf{A}$ and $\\mathbf{B}$, the first of which has $k$ rows and\n", "$\\ell$ columns and the second of which has $p$ rows and $q$ columns,\n", "then we can check whether the multiplication works by writing the\n", "dimensionalities next to each other, $$\n", "\\mathbf{A} \\mathbf{B} \\rightarrow (k \\times\n", "\\underbrace{\\ell)(p}_\\text{inner dimensions} \\times q) \\rightarrow (k\\times q).\n", "$$ The inner dimensions are the two inside dimensions, $\\ell$ and $p$.\n", "The multiplication will only work if $\\ell=p$. The result of the\n", "multiplication will then be a $k\\times q$ matrix: this dimensionality\n", "comes from the 'outer dimensions'. Note that matrix multiplication is\n", "not [*commutative*](http://en.wikipedia.org/wiki/Commutative_property).\n", "And if you change the order of the multiplication, $$\n", "\\mathbf{B} \\mathbf{A} \\rightarrow (\\ell \\times \\underbrace{k)(q}_\\text{inner dimensions} \\times p) \\rightarrow (\\ell \\times p).\n", "$$ firstly it may no longer even work, because now the condition is that\n", "$k=q$, and secondly the result could be of a different dimensionality.\n", "An exception is if the matrices are square matrices (e.g. same number of\n", "rows as columns) and they are both *symmetric*. A symmetric matrix is\n", "one for which $\\mathbf{A}=\\mathbf{A}^\\top$, or equivalently,\n", "$a_{i,j} = a_{j,i}$ for all $i$ and $j$.\n", "\n", "You will need to get used to working with matrices and vectors applying\n", "and developing new machine learning techniques. You should have come\n", "across them before, but you may not have used them as extensively as we\n", "will now do in this course. You should get used to using this trick to\n", "check your work and ensure you know what the dimension of an output\n", "matrix should be. For our matrix quadratic form, it turns out that we\n", "can see it as a special type of inner product. 
$$\n", "\\mathbf{z}^\\top\\mathbf{C}\\mathbf{z} \\rightarrow (1\\times\n", "\\underbrace{k) (k}_\\text{inner dimensions}\\times k) (k\\times 1) \\rightarrow\n", "\\mathbf{b}^\\top\\mathbf{z}\n", "$$ where $\\mathbf{b} = \\mathbf{C}\\mathbf{z}$ so therefore the result is\n", "a scalar, $$\n", "\\mathbf{b}^\\top\\mathbf{z} \\rightarrow\n", "(1\\times \\underbrace{k) (k}_\\text{inner dimensions}\\times 1) \\rightarrow\n", "(1\\times 1)\n", "$$ where a $(1\\times 1)$ matrix is recognised as a scalar.\n", "\n", "This implies that we should be able to differentiate this form, and\n", "indeed the rule for its differentiation is slightly more complex than\n", "the inner product, but still quite simple, $$\n", "\\frac{\\text{d}}{\\text{d}\\mathbf{z}}\n", "\\mathbf{z}^\\top\\mathbf{C}\\mathbf{z}= \\mathbf{C}\\mathbf{z} + \\mathbf{C}^\\top\n", "\\mathbf{z}.\n", "$$ Note that in the special case where $\\mathbf{C}$ is symmetric then we\n", "have $\\mathbf{C} = \\mathbf{C}^\\top$ and the derivative simplifies to $$\n", "\\frac{\\text{d}}{\\text{d}\\mathbf{z}} \\mathbf{z}^\\top\\mathbf{C}\\mathbf{z}=\n", "2\\mathbf{C}\\mathbf{z}.\n", "$$\n", "\n", "## Differentiate the Objective\n", "\n", "First, we need to compute the full objective by substituting our\n", "prediction function into the objective function to obtain the objective\n", "in terms of $\\mappingVector$. Doing this we obtain $$\n", "\\errorFunction(\\mappingVector)= (\\dataVector - \\inputMatrix\\mappingVector)^\\top (\\dataVector - \\inputMatrix\\mappingVector).\n", "$$ We now need to differentiate this *quadratic form* to find the\n", "minimum. We differentiate with respect to the *vector* $\\mappingVector$.\n", "But before we do that, we'll expand the brackets in the quadratic form\n", "to obtain a series of scalar terms. 
The rules for bracket expansion\n", "across the vectors are similar to those for the scalar system giving, $$\n", "(\\mathbf{a} - \\mathbf{b})^\\top\n", "(\\mathbf{c} - \\mathbf{d}) = \\mathbf{a}^\\top \\mathbf{c} - \\mathbf{a}^\\top\n", "\\mathbf{d} - \\mathbf{b}^\\top \\mathbf{c} + \\mathbf{b}^\\top \\mathbf{d}\n", "$$ which substituting for $\\mathbf{a} = \\mathbf{c} = \\dataVector$ and\n", "$\\mathbf{b}=\\mathbf{d} = \\inputMatrix\\mappingVector$ gives $$\n", "\\errorFunction(\\mappingVector)=\n", "\\dataVector^\\top\\dataVector - 2\\dataVector^\\top\\inputMatrix\\mappingVector +\n", "\\mappingVector^\\top\\inputMatrix^\\top\\inputMatrix\\mappingVector\n", "$$ where we used the fact that\n", "$\\dataVector^\\top\\inputMatrix\\mappingVector=\\mappingVector^\\top\\inputMatrix^\\top\\dataVector$.\n", "Now we can use our rules of differentiation to compute the derivative of\n", "this form, which is, $$\n", "\\frac{\\text{d}}{\\text{d}\\mappingVector}\\errorFunction(\\mappingVector)=- 2\\inputMatrix^\\top \\dataVector +\n", "2\\inputMatrix^\\top\\inputMatrix\\mappingVector,\n", "$$ where we have exploited the fact that $\\inputMatrix^\\top\\inputMatrix$\n", "is symmetric to obtain this result.\n", "\n", "### Question 5\n", "\n", "Use the equivalence between our vector and our matrix formulations of\n", "linear regression, alongside our definition of vector derivates, to\n", "match the gradients we've computed directly for\n", "$\\frac{\\text{d}\\errorFunction(c, m)}{\\text{d}c}$ and\n", "$\\frac{\\text{d}\\errorFunction(c, m)}{\\text{d}m}$ to those for\n", "$\\frac{\\text{d}\\errorFunction(\\mappingVector)}{\\text{d}\\mappingVector}$.\n", "\n", "*20 marks*\n", "\n", "### Write your answer to Question 5 here\n", "\n", "# Update Equation for Global Optimum\n", "\n", "Once again, we need to find the minimum of our objective function. Using\n", "our likelihood for multiple input regression we can now minimize for our\n", "parameter vector $\\mappingVector$. Firstly, just as in the single input\n", "case, we seek stationary points by find parameter vectors that solve for\n", "when the gradients are zero, $$\n", "\\mathbf{0}=- 2\\inputMatrix^\\top\n", "\\dataVector + 2\\inputMatrix^\\top\\inputMatrix\\mappingVector,\n", "$$ where $\\mathbf{0}$ is a *vector* of zeros. Rearranging this equation\n", "we find the solution to be $$\n", "\\mappingVector = \\left[\\inputMatrix^\\top \\inputMatrix\\right]^{-1} \\inputMatrix^\\top\n", "\\dataVector\n", "$$ where $\\mathbf{A}^{-1}$ denotes [*matrix\n", "inverse*](http://en.wikipedia.org/wiki/Invertible_matrix).\n", "\n", "## Solving the Multivariate System\n", "\n", "The solution for $\\mappingVector$ is given in terms of a matrix inverse,\n", "but computation of a matrix inverse requires, in itself, an algorithm to\n", "resolve it. You'll know this if you had to invert, by hand, a\n", "$3\\times 3$ matrix in high school. From a numerical stability\n", "perspective, it is also best not to compute the matrix inverse directly,\n", "but rather to ask the computer to *solve* the system of linear equations\n", "given by\n", "$$\\inputMatrix^\\top\\inputMatrix \\mappingVector = \\inputMatrix^\\top\\dataVector$$\n", "for $\\mappingVector$. This can be done in `numpy` using the command" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.linalg.solve?" 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "so we can obtain the solution using" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "w = np.linalg.solve(np.dot(X.T, X), np.dot(X.T, y))\n", "print(w)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can map it back to the liner regression and plot the fit as follows\n", "\n", "## Multivariate Linear Regression\n", "\n", "A major advantage of the new system is that we can build a linear\n", "regression on a multivariate system. The matrix calculus didn't specify\n", "what the length of the vector $\\inputVector$ should be, or equivalently\n", "the size of the design matrix.\n", "\n", "## Movie Body Count Data\n", "\n", "Let's consider the movie body count data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pods" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = pods.datasets.movie_body_count()\n", "movies = data['Y']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's remind ourselves of the features we've been provided with." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(', '.join(movies.columns))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we will build a design matrix based on the numeric features: year,\n", "Body\\_Count, Length\\_Minutes in an effort to predict the rating. We\n", "build the design matrix as follows:\n", "\n", "## Relation to Single Input System\n", "\n", "Bias as an additional feature." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "select_features = ['Year', 'Body_Count', 'Length_Minutes']\n", "X = movies[select_features]\n", "X['Eins'] = 1 # add a column for the offset\n", "y = movies[['IMDB_Rating']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's perform a linear regression. But this time, we will create a\n", "pandas data frame for the result so we can store it in a form that we\n", "can visualise easily." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "w = pd.DataFrame(data=np.linalg.solve(np.dot(X.T, X), np.dot(X.T, y)), # solve linear regression here\n", " index = X.columns, # columns of X become rows of w\n", " columns=['regression_coefficient']) # the column of X is the value of regression coefficient" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can check the residuals to see how good our estimates are" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "(y - np.dot(X, w)).hist()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Which shows our model *hasn't* yet done a great job of representation,\n", "because the spread of values is large. We can check what the rating is\n", "dominated by in terms of regression coefficients." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "w" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Although we have to be a little careful about interpretation because our\n", "input values live on different scales, however it looks like we are\n", "dominated by the bias, with a small negative effect for later films (but\n", "bear in mind the years are large, so this effect is probably larger than\n", "it looks) and a positive effect for length. So it looks like long\n", "earlier films generally do better, but the residuals are so high that we\n", "probably haven't modelled the system very well." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from IPython.lib.display import YouTubeVideo\n", "YouTubeVideo('ui-uNlFHoms')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Figure: MLAI Lecture 15 from 2014 on Multivariate Regression." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from IPython.lib.display import YouTubeVideo\n", "YouTubeVideo('78YNphT90-k')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Figure: MLAI Lecture 3 from 2012 on Maximum Likelihood }\n", "\n", "## Solution with QR Decomposition\n", "\n", "Performing a solve instead of a matrix inverse is the more numerically\n", "stable approach, but we can do even better. A\n", "[QR-decomposition](http://en.wikipedia.org/wiki/QR_decomposition) of a\n", "matrix factorises it into a matrix which is an orthogonal matrix\n", "$\\mathbf{Q}$, so that $\\mathbf{Q}^\\top \\mathbf{Q} = \\eye$. And a matrix\n", "which is upper triangular, $\\mathbf{R}$. $$\n", "\\inputMatrix^\\top \\inputMatrix \\boldsymbol{\\beta} =\n", "\\inputMatrix^\\top \\dataVector\n", "$$ $$\n", "(\\mathbf{Q}\\mathbf{R})^\\top\n", "(\\mathbf{Q}\\mathbf{R})\\boldsymbol{\\beta} = (\\mathbf{Q}\\mathbf{R})^\\top\n", "\\dataVector\n", "$$ $$\n", "\\mathbf{R}^\\top (\\mathbf{Q}^\\top \\mathbf{Q}) \\mathbf{R}\n", "\\boldsymbol{\\beta} = \\mathbf{R}^\\top \\mathbf{Q}^\\top \\dataVector\n", "$$ $$\n", "\\mathbf{R}^\\top \\mathbf{R} \\boldsymbol{\\beta} = \\mathbf{R}^\\top \\mathbf{Q}^\\top\n", "\\dataVector\n", "$$ $$\n", "\\mathbf{R} \\boldsymbol{\\beta} = \\mathbf{Q}^\\top \\dataVector\n", "$$ This is a more numerically stable solution because it removes the\n", "need to compute $\\inputMatrix^\\top\\inputMatrix$ as an intermediate.\n", "Computing $\\inputMatrix^\\top\\inputMatrix$ is a bad idea because it\n", "involves squaring all the elements of $\\inputMatrix$ and thereby\n", "potentially reducing the numerical precision with which we can represent\n", "the solution. Operating on $\\inputMatrix$ directly preserves the\n", "numerical precision of the model.\n", "\n", "This can be more particularly seen when we begin to work with *basis\n", "functions* in the next session. Some systems that can be resolved with\n", "the QR decomposition can not be resolved by using solve directly." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import scipy as sp" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "Q, R = np.linalg.qr(X)\n", "w = sp.linalg.solve_triangular(R, np.dot(Q.T, y)) \n", "w = pd.DataFrame(w, index=X.columns)\n", "w" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reading\n", "\n", "- Section 1.3 of @Rogers:book11 for Matrix & Vector Review.\n", "\n", "## Basis Functions \\[edit\\]\n", "\n", "Here's the idea, instead of working directly on the original input\n", "space, $\\inputVector$, we build models in a new space,\n", "$\\basisVector(\\inputVector)$ where $\\basisVector(\\cdot)$ is a\n", "*vector-valued* function that is defined on the space $\\inputVector$.\n", "\n", "## Quadratic Basis\n", "\n", "Remember, that a *vector-valued function* is just a vector that contains\n", "functions instead of values. Here's an example for a one dimensional\n", "input space, $x$, being projected to a *quadratic* basis. First we\n", "consider each basis function in turn, we can think of the elements of\n", "our vector as being indexed so that we have $$\n", "\\begin{align*}\n", "\\basisFunc_1(\\inputScalar) = 1, \\\\\n", "\\basisFunc_2(\\inputScalar) = x, \\\\\n", "\\basisFunc_3(\\inputScalar) = \\inputScalar^2.\n", "\\end{align*}\n", "$$ Now we can consider them together by placing them in a vector, $$\n", "\\basisVector(\\inputScalar) = \\begin{bmatrix} 1\\\\ x \\\\ \\inputScalar^2\\end{bmatrix}.\n", "$$ For the vector-valued function, we have simply collected the\n", "different functions together in the same vector making them notationally\n", "easier to deal with in our mathematics.\n", "\n", "When we consider the vector-valued function for each data point, then we\n", "place all the data into a matrix. The result is a matrix valued\n", "function, $$\n", "\\basisMatrix(\\inputVector) = \n", "\\begin{bmatrix} 1 & \\inputScalar_1 &\n", "\\inputScalar_1^2 \\\\\n", "1 & \\inputScalar_2 & \\inputScalar_2^2\\\\\n", "\\vdots & \\vdots & \\vdots \\\\\n", "1 & \\inputScalar_n & \\inputScalar_n^2\n", "\\end{bmatrix}\n", "$$ where we are still in the one dimensional input setting so\n", "$\\inputVector$ here represents a vector of our inputs with $\\numData$\n", "elements.\n", "\n", "Let's try constructing such a matrix for a set of inputs. 
First of all,\n", "we create a function that returns the matrix valued function" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def quadratic(x, **kwargs):\n", " \"\"\"Take in a vector of input values and return the design matrix associated \n", " with the basis functions.\"\"\"\n", " return np.hstack([np.ones((x.shape[0], 1)), x, x**2])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Functions Derived from Quadratic Basis\n", "\n", "$$\n", "\mappingFunction(\inputScalar) = {\color{cyan}\mappingScalar_0} + {\color{green}\mappingScalar_1 \inputScalar} + {\color{yellow}\mappingScalar_2 \inputScalar^2}\n", "$$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pods\n", "from ipywidgets import IntSlider" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pods.notebook.display_plots('\basisfunction{num_basis:0>3}.svg', \n", " directory='../slides/diagrams/ml', \n", " num_basis=IntSlider(0,0,2,1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This function takes in an $\numData \times 1$ dimensional vector and\n", "returns an $\numData \times 3$ dimensional *design matrix* containing\n", "the basis functions. We can plot those basis functions against their\n", "input as follows.\n", "\n", "The actual function we observe is then made up of a sum of these\n", "functions. This is the reason for the name basis. The term *basis* means\n", "'the underlying support or foundation for an idea, argument, or\n", "process', and in this context they form the underlying support for our\n", "prediction function. Our prediction function can only be composed of a\n", "weighted linear sum of our basis functions.\n", "\n", "## Quadratic Functions\n", "\n", "\n", "\n", "Figure: Functions constructed by weighted sum of the components of a\n", "quadratic basis."
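] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To make this concrete, here is a small sketch that evaluates the `quadratic` function defined above on a few inputs and forms one weighted sum of the basis functions (the inputs and weights are illustrative values)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# sketch: evaluate the quadratic basis and form a weighted sum of its columns\n", "x_demo = np.linspace(-1, 1, 5)[:, None]\n", "Phi_demo = quadratic(x_demo) # columns are 1, x and x**2\n", "w_demo = np.asarray([[0.5], [-1.0], [2.0]]) # illustrative weights\n", "print(Phi_demo)\n", "print(np.dot(Phi_demo, w_demo)) # the weighted sum of the basis functions"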
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pods\n", "from ipywidgets import IntSlider" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pods.notebook.display_plots('quadratic_function{num_function:0>3}.svg', \n", " directory='../slides/diagrams/ml', \n", " num_basis=IntSlider(0,0,2,1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Polynomial Fits to Olympic Data \\[edit\\]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "from matplotlib import pyplot as plt\n", "import mlai\n", "import pods" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "basis = mlai.polynomial\n", "\n", "data = pods.datasets.olympic_marathon_men()\n", "\n", "x = data['X']\n", "y = data['Y']\n", "\n", "xlim = [1892, 2020]\n", "\n", "basis=mlai.Basis(mlai.polynomial, number=1, data_limits=xlim)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from ipywidgets import IntSlider" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pods.notebook.display_plots('olympic_LM_polynomial_number{num_basis:0>3}.svg',\n", " directory='../slides/diagrams/ml', \n", " num_basis=IntSlider(1,1,27,1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: Fit of a 1 degree polynomial to the olympic marathon\n", "data.\n", "\n", "\n", "\n", "Figure: Fit of a 2 degree polynomial to the olympic marathon\n", "data.\n", "\n", "# Underdetermined System \\[edit\\]\n", "\n", "What about the situation where you have more parameters than data in\n", "your simultaneous equation? This is known as an *underdetermined*\n", "system. In fact this set up is in some sense *easier* to solve, because\n", "we don't need to think about introducing a slack variable (although it\n", "might make a lot of sense from a *modelling* perspective to do so).\n", "\n", "The way Laplace proposed resolving an overdetermined system, was to\n", "introduce slack variables, $\\noiseScalar_i$, which needed to be\n", "estimated for each point. The slack variable represented the difference\n", "between our actual prediction and the true observation. This is known as\n", "the *residual*. By introducing the slack variable we now have an\n", "additional $n$ variables to estimate, one for each data point,\n", "$\\{\\noiseScalar_i\\}$. This actually turns the overdetermined system into\n", "an underdetermined system. Introduction of $n$ variables, plus the\n", "original $m$ and $c$ gives us $\\numData+2$ parameters to be estimated\n", "from $n$ observations, which actually makes the system\n", "*underdetermined*. However, we then made a probabilistic assumption\n", "about the slack variables, we assumed that the slack variables were\n", "distributed according to a probability density. And for the moment we\n", "have been assuming that density was the Gaussian,\n", "$$\\noiseScalar_i \\sim \\gaussianSamp{0}{\\dataStd^2},$$ with zero mean and\n", "variance $\\dataStd^2$.\n", "\n", "The follow up question is whether we can do the same thing with the\n", "parameters. If we have two parameters and only one unknown can we place\n", "a probability distribution over the parameters, as we did with the slack\n", "variables? 
The answer is yes.\n", "\n", "## Underdetermined System" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pods\n", "from ipywidgets import IntSlider" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pods.notebook.display_plots('under_determined_system{samp:0>3}.svg', \n", " directory='../slides/diagrams/ml', samp=IntSlider(0, 0, 10, 1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: An underdetermined system can be fit by considering\n", "uncertainty. Multiple solutions are consistent with one specified\n", "point.\n", "\n", "## Alan Turing \\[edit\\]\n", "\n", "\n", "\n", "\n", "\n", "\n", "
\n", "\n", "\n", "\n", "
\n", "Figure: Alan Turing, in 1946 he was only 11 minutes slower than the\n", "winner of the 1948 games. Would he have won a hypothetical games held in\n", "1946? Source: [Alan Turing Internet\n", "Scrapbook](http://www.turing.org.uk/scrapbook/run.html).\n", "\n", "If we had to summarise the objectives of machine learning in one word, a\n", "very good candidate for that word would be *generalization*. What is\n", "generalization? From a human perspective it might be summarised as the\n", "ability to take lessons learned in one domain and apply them to another\n", "domain. If we accept the definition given in the first session for\n", "machine learning, $$\n", "\\text{data} + \\text{model} \\xrightarrow{\\text{compute}} \\text{prediction}\n", "$$ then we see that without a model we can't generalise: we only have\n", "data. Data is fine for answering very specific questions, like \"Who won\n", "the Olympic Marathon in 2012?\", because we have that answer stored,\n", "however, we are not given the answer to many other questions. For\n", "example, Alan Turing was a formidable marathon runner, in 1946 he ran a\n", "time 2 hours 46 minutes (just under four minutes per kilometer, faster\n", "than I and most of the other [Endcliffe Park\n", "Run](http://www.parkrun.org.uk/sheffieldhallam/) runners can do 5 km).\n", "What is the probability he would have won an Olympics if one had been\n", "held in 1946?\n", "\n", "To answer this question we need to generalize, but before we formalize\n", "the concept of generalization let's introduce some formal representation\n", "of what it means to generalize in machine learning.\n", "\n", "## Prior Distribution \\[edit\\]\n", "\n", "The tradition in Bayesian inference is to place a probability density\n", "over the parameters of interest in your model. This choice is made\n", "regardless of whether you generally believe those parameters to be\n", "stochastic or deterministic in origin. In other words, to a Bayesian,\n", "the modelling treatment does not differentiate between epistemic and\n", "aleatoric uncertainty. For linear regression we could consider the\n", "following Gaussian prior on the intercept parameter,\n", "$$c \\sim \\gaussianSamp{0}{\\alpha_1}$$ where $\\alpha_1$ is the variance\n", "of the prior distribution, its mean being zero.\n", "\n", "## Posterior Distribution\n", "\n", "The prior distribution is combined with the likelihood of the data given\n", "the parameters $p(\\dataScalar|c)$ to give the posterior via *Bayes'\n", "rule*, $$\n", " p(c|\\dataScalar) = \\frac{p(\\dataScalar|c)p(c)}{p(\\dataScalar)}\n", " $$ where $p(\\dataScalar)$ is the marginal probability of the data,\n", "obtained through integration over the joint density,\n", "$p(\\dataScalar, c)=p(\\dataScalar|c)p(c)$. 
Overall the equation can be\n", "summarized as, $$\n", " \\text{posterior} = \\frac{\\text{likelihood}\\times \\text{prior}}{\\text{marginal likelihood}}.\n", " $$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from ipywidgets import IntSlider\n", "import pods" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pods.notebook.display_plots('dem_gaussian{stage:0>2}.svg', \n", " diagrams='../slides/diagrams/ml', \n", " stage=IntSlider(1, 1, 3, 1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: Combining a Gaussian likelihood with a Gaussian prior to form\n", "a Gaussian posterior\n", "\n", "Another way of seeing what's going on is to note that the numerator of\n", "Bayes' rule merely multiplies the likelihood by the prior. The\n", "denominator, is not a function of $c$. So the functional form is\n", "entirely determined by the multiplication of prior and likelihood. This\n", "has the effect of ensuring that the posterior only has probability mass\n", "in regions where both the prior and the likelihood have probability\n", "mass.\n", "\n", "The marginal likelihood, $p(\\dataScalar)$, operates to ensure that the\n", "distribution is normalised.\n", "\n", "For the Gaussian case, the normalisation of the posterior can be\n", "performed analytically. This is because both the prior and the\n", "likelihood have the form of an *exponentiated quadratic*, $$\n", "\\exp(a^2)\\exp(b^2) = \\exp(a^2 + b^2),\n", "$$ and the properties of the exponential mean that the product of two\n", "exponentiated quadratics is also an exponentiated quadratic. That\n", "implies that the posterior is also Gaussian, because a normalized\n", "exponentiated quadratic is a Gaussian distribution.[^1]\n", "\n", "$$p(c) = \\frac{1}{\\sqrt{2\\pi\\alpha_1}} \\exp\\left(-\\frac{1}{2\\alpha_1}c^2\\right)$$\n", "$$p(\\dataVector|\\inputVector, c, m, \\dataStd^2) = \\frac{1}{\\left(2\\pi\\dataStd^2\\right)^{\\frac{\\numData}{2}}} \\exp\\left(-\\frac{1}{2\\dataStd^2}\\sum_{i=1}^\\numData(\\dataScalar_i - m\\inputScalar_i - c)^2\\right)$$\n", "\n", "$$p(c| \\dataVector, \\inputVector, m, \\dataStd^2) = \\frac{p(\\dataVector|\\inputVector, c, m, \\dataStd^2)p(c)}{p(\\dataVector|\\inputVector, m, \\dataStd^2)}$$\n", "\n", "$$p(c| \\dataVector, \\inputVector, m, \\dataStd^2) = \\frac{p(\\dataVector|\\inputVector, c, m, \\dataStd^2)p(c)}{\\int p(\\dataVector|\\inputVector, c, m, \\dataStd^2)p(c) \\text{d} c}$$\n", "\n", "$$p(c| \\dataVector, \\inputVector, m, \\dataStd^2) \\propto p(\\dataVector|\\inputVector, c, m, \\dataStd^2)p(c)$$\n", "\n", "$$\\begin{aligned}\n", " \\log p(c | \\dataVector, \\inputVector, m, \\dataStd^2) =&-\\frac{1}{2\\dataStd^2} \\sum_{i=1}^\\numData(\\dataScalar_i-c - m\\inputScalar_i)^2-\\frac{1}{2\\alpha_1} c^2 + \\text{const}\\\\\n", " = &-\\frac{1}{2\\dataStd^2}\\sum_{i=1}^\\numData(\\dataScalar_i-m\\inputScalar_i)^2 -\\left(\\frac{\\numData}{2\\dataStd^2} + \\frac{1}{2\\alpha_1}\\right)c^2\\\\\n", " & + c\\frac{\\sum_{i=1}^\\numData(\\dataScalar_i-m\\inputScalar_i)}{\\dataStd^2},\n", " \\end{aligned}$$\n", "\n", "complete the square of the quadratic form to obtain\n", "$$\\log p(c | \\dataVector, \\inputVector, m, \\dataStd^2) = -\\frac{1}{2\\tau^2}(c - \\mu)^2 +\\text{const},$$\n", "where $\\tau^2 = \\left(\\numData\\dataStd^{-2} +\\alpha_1^{-1}\\right)^{-1}$\n", "and\n", "$\\mu = \\frac{\\tau^2}{\\dataStd^2} \\sum_{i=1}^\\numData(\\dataScalar_i-m\\inputScalar_i)$.\n", "\n", "## 
Two Dimensional Gaussian \\[edit\\]\n", "\n", "Consider the distribution of height (in meters) of an adult male human\n", "population. We will approximate the marginal density of heights as a\n", "Gaussian density with mean given by $1.7\\text{m}$ and a standard\n", "deviation of $0.15\\text{m}$, implying a variance of $\\dataStd^2=0.0225$,\n", "$$\n", " p(h) \\sim \\gaussianSamp{1.7}{0.0225}.\n", " $$ Similarly, we assume that weights of the population are distributed\n", "a Gaussian density with a mean of $75 \\text{kg}$ and a standard\n", "deviation of $6 kg$ (implying a variance of 36), $$\n", " p(w) \\sim \\gaussianSamp{75}{36}.\n", " $$\n", "\n", "\n", "\n", "Figure: Gaussian distributions for height and weight.\n", "\n", "## Independence Assumption\n", "\n", "First of all, we make an independence assumption, we assume that height\n", "and weight are independent. The definition of probabilistic independence\n", "is that the joint density, $p(w, h)$, factorizes into its marginal\n", "densities, $$\n", " p(w, h) = p(w)p(h).\n", " $$ Given this assumption we can sample from the joint distribution by\n", "independently sampling weights and heights." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pods\n", "from ipywidgets import IntSlider" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pods.notebook.display_plots('independent_height_weight{fig:0>3}.svg', \n", " directory='../slides/diagrams/ml', \n", " fig=IntSlider(0, 0, 7, 1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: Samples from independent Gaussian variables that might\n", "represent heights and weights.\n", "\n", "In reality height and weight are *not* independent. Taller people tend\n", "on average to be heavier, and heavier people are likely to be taller.\n", "This is reflected by the *body mass index*. A ratio suggested by one of\n", "the fathers of statistics, Adolphe Quetelet. Quetelet was interested in\n", "the notion of the *average man* and collected various statistics about\n", "people. 
He defined the BMI to be, $$\n", "\\text{BMI} = \\frac{w}{h^2}\n", "$$To deal with this dependence we now introduce the notion of\n", "*correlation* to the multivariate Gaussian density.\n", "\n", "## Sampling Two Dimensional Variables \\[edit\\]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pods\n", "from ipywidgets import IntSlider" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "pods.notebook.display_plots('correlated_height_weight{fig:0>3}.svg', \n", " directory='../slides/diagrams/ml', \n", " fig=IntSlider(0, 0, 7, 1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: Samples from *correlated* Gaussian variables that might\n", "represent heights and weights.\n", "\n", "## Independent Gaussians \\[edit\\]\n", "\n", "$$\n", "p(w, h) = p(w)p(h)\n", "$$\n", "\n", "$$\n", "p(w, h) = \\frac{1}{\\sqrt{2\\pi \\dataStd_1^2}\\sqrt{2\\pi\\dataStd_2^2}} \\exp\\left(-\\frac{1}{2}\\left(\\frac{(w-\\meanScalar_1)^2}{\\dataStd_1^2} + \\frac{(h-\\meanScalar_2)^2}{\\dataStd_2^2}\\right)\\right)\n", "$$\n", "\n", "$$\n", "p(w, h) = \\frac{1}{\\sqrt{2\\pi\\dataStd_1^22\\pi\\dataStd_2^2}} \\exp\\left(-\\frac{1}{2}\\left(\\begin{bmatrix}w \\\\ h\\end{bmatrix} - \\begin{bmatrix}\\meanScalar_1 \\\\ \\meanScalar_2\\end{bmatrix}\\right)^\\top\\begin{bmatrix}\\dataStd_1^2& 0\\\\0&\\dataStd_2^2\\end{bmatrix}^{-1}\\left(\\begin{bmatrix}w \\\\ h\\end{bmatrix} - \\begin{bmatrix}\\meanScalar_1 \\\\ \\meanScalar_2\\end{bmatrix}\\right)\\right)\n", "$$\n", "\n", "$$\n", "p(\\dataVector) = \\frac{1}{\\det{2\\pi \\mathbf{D}}^{\\frac{1}{2}}} \\exp\\left(-\\frac{1}{2}(\\dataVector - \\meanVector)^\\top\\mathbf{D}^{-1}(\\dataVector - \\meanVector)\\right)\n", "$$\n", "\n", "## Correlated Gaussian\n", "\n", "Form correlated from original by rotating the data space using matrix\n", "$\\rotationMatrix$.\n", "\n", "$$\n", "p(\\dataVector) = \\frac{1}{\\det{2\\pi\\mathbf{D}}^{\\frac{1}{2}}} \\exp\\left(-\\frac{1}{2}(\\dataVector - \\meanVector)^\\top\\mathbf{D}^{-1}(\\dataVector - \\meanVector)\\right)\n", "$$\n", "\n", "$$\n", "p(\\dataVector) = \\frac{1}{\\det{2\\pi\\mathbf{D}}^{\\frac{1}{2}}} \\exp\\left(-\\frac{1}{2}(\\rotationMatrix^\\top\\dataVector - \\rotationMatrix^\\top\\meanVector)^\\top\\mathbf{D}^{-1}(\\rotationMatrix^\\top\\dataVector - \\rotationMatrix^\\top\\meanVector)\\right)\n", "$$\n", "\n", "$$\n", "p(\\dataVector) = \\frac{1}{\\det{2\\pi\\mathbf{D}}^{\\frac{1}{2}}} \\exp\\left(-\\frac{1}{2}(\\dataVector - \\meanVector)^\\top\\rotationMatrix\\mathbf{D}^{-1}\\rotationMatrix^\\top(\\dataVector - \\meanVector)\\right)\n", "$$ this gives a covariance matrix: $$\n", "\\covarianceMatrix^{-1} = \\rotationMatrix \\mathbf{D}^{-1} \\rotationMatrix^\\top\n", "$$\n", "\n", "$$\n", "p(\\dataVector) = \\frac{1}{\\det{2\\pi\\covarianceMatrix}^{\\frac{1}{2}}} \\exp\\left(-\\frac{1}{2}(\\dataVector - \\meanVector)^\\top\\covarianceMatrix^{-1} (\\dataVector - \\meanVector)\\right)\n", "$$ this gives a covariance matrix: $$\n", "\\covarianceMatrix = \\rotationMatrix \\mathbf{D} \\rotationMatrix^\\top\n", "$$\n", "\n", "## Generating from the Model \\[edit\\]\n", "\n", "A very important aspect of probabilistic modelling is to *sample* from\n", "your model to see what type of assumptions you are making about your\n", "data. In this case that involves a two stage process.\n", "\n", "1. Sample a candiate parameter vector from the prior.\n", "2. 
Place the candidate parameter vector in the likelihood and sample\n", " functions conditiond on that candidate vector.\n", "3. Repeat to try and characterise the type of functions you are\n", " generating.\n", "\n", "Given a prior variance (as defined above) we can now sample from the\n", "prior distribution and combine with a basis set to see what assumptions\n", "we are making about the functions *a priori* (i.e. before we've seen the\n", "data). Firstly we compute the basis function matrix. We will do it both\n", "for our training data, and for a range of prediction locations\n", "(`x_pred`)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pods" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = pods.datasets.olympic_marathon_men()\n", "x = data['X']\n", "y = data['Y']\n", "num_data = x.shape[0]\n", "num_pred_data = 100 # how many points to use for plotting predictions\n", "x_pred = np.linspace(1890, 2016, num_pred_data)[:, None] # input locations for predictions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "now let's build the basis matrices. We define the polynomial basis as\n", "follows." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def polynomial(x, num_basis=2, loc=0., scale=1.):\n", " degree=num_basis-1\n", " degrees = np.arange(degree+1)\n", " return ((x-loc)/scale)**degrees" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import mlai" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "loc=1950\n", "scale=1\n", "degree=4\n", "basis = mlai.Basis(polynomial, number=degree+1, loc=loc, scale=scale)\n", "Phi_pred = basis.Phi(x_pred)\n", "Phi = basis.Phi(x)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Sampling from the Prior\n", "\n", "Now we will sample from the prior to produce a vector $\\mappingVector$\n", "and use it to plot a function which is representative of our belief\n", "*before* we fit the data. To do this we are going to use the properties\n", "of the Gaussian density and a sample from a *standard normal* using the\n", "function `np.random.normal`.\n", "\n", "## Scaling Gaussian-distributed Variables\n", "\n", "First, let's consider the case where we have one data point and one\n", "feature in our basis set. In otherwords $\\mappingFunctionVector$ would\n", "be a scalar, $\\mappingVector$ would be a scalar and $\\basisMatrix$ would\n", "be a scalar. In this case we have $$\n", "\\mappingFunction = \\basisScalar \\mappingScalar\n", "$$ If $\\mappingScalar$ is drawn from a normal density, $$\n", "\\mappingScalar \\sim \\gaussianSamp{\\meanScalar_\\mappingScalar}{c_\\mappingScalar}\n", "$$ and $\\basisScalar$ is a scalar value which we are given, then\n", "properties of the Gaussian density tell us that $$\n", "\\basisScalar \\mappingScalar \\sim \\gaussianSamp{\\basisScalar\\meanScalar_\\mappingScalar}{\\basisScalar^2c_\\mappingScalar}\n", "$$ Let's test this out numerically. 
First we will draw 200 samples from\n", "a standard normal," ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "w_vec = np.random.normal(size=200)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can compute the mean of these samples and their variance" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print('w sample mean is ', w_vec.mean())\n", "print('w sample variance is ', w_vec.var())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These are close to zero (the mean) and one (the variance) as you'd\n", "expect. Now compute the mean and variance of the scaled version," ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "phi = 7\n", "f_vec = phi*w_vec\n", "print('True mean should be phi*0 = 0.')\n", "print('True variance should be phi*phi*1 = ', phi*phi)\n", "print('f sample mean is ', f_vec.mean())\n", "print('f sample variance is ', f_vec.var())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you increase the number of samples then you will see that the sample\n", "mean and the sample variance begin to converge towards the true mean and\n", "the true variance. Obviously adding an offset to a sample from\n", "`np.random.normal` will change the mean. So if you want to sample from a\n", "Gaussian with a given mean and variance, one way of doing it is to\n", "sample from the standard normal and then scale and shift the result. To\n", "sample a set of $\\mappingScalar$ from a Gaussian with mean\n", "$\\meanScalar$ and variance $\\alpha$,\n", "$$\\mappingScalar \\sim \\gaussianSamp{\\meanScalar}{\\alpha}$$ we can simply\n", "scale and offset samples from the *standard normal*." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mu = 4 # mean of the distribution\n", "alpha = 2 # variance of the distribution\n", "w_vec = np.random.normal(size=200)*np.sqrt(alpha) + mu\n", "print('w sample mean is ', w_vec.mean())\n", "print('w sample variance is ', w_vec.var())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here the `np.sqrt` is necessary because we need to multiply by the\n", "standard deviation and we specified the variance as `alpha`. So scaling\n", "and offsetting a Gaussian-distributed variable keeps the variable\n", "Gaussian, but it affects the mean and variance of the resulting\n", "variable.\n", "\n", "To get an idea of the overall shape of the resulting distribution, let's\n", "do the same thing with a histogram of the results." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline" ] }
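, { "cell_type": "markdown", "metadata": {}, "source": [ "One possible way of doing this is sketched below (a minimal example using matplotlib; the bin count and the side-by-side layout are arbitrary choices): it plots a histogram of a fresh set of standard normal samples alongside a histogram of the scaled and shifted samples." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "\n", "z_vec = np.random.normal(size=200)  # standard normal samples\n", "w_vec = z_vec*np.sqrt(alpha) + mu   # scaled and shifted samples\n", "fig, ax = plt.subplots(1, 2, figsize=(10, 4))\n", "ax[0].hist(z_vec, bins=30, density=True)\n", "ax[0].set_title('standard normal samples')\n", "ax[1].hist(w_vec, bins=30, density=True)\n", "ax[1].set_title('scaled and shifted samples')\n", "plt.show()" ] }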
, { "cell_type": "markdown", "metadata": {}, "source": [ "Now re-run this histogram with 100,000 samples and check that both\n", "histograms look qualitatively Gaussian.\n", "\n", "## Sampling from the Prior\n", "\n", "Let's use this way of constructing samples from a Gaussian to check what\n", "functions look like *a priori*. The process will be as follows. First,\n", "we sample a $K$-dimensional random vector from `np.random.normal`. Then\n", "we scale it by $\\sqrt{\\alpha}$ to obtain a prior sample of\n", "$\\mappingVector$." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "K = degree + 1\n", "z_vec = np.random.normal(size=K)\n", "w_sample = z_vec*np.sqrt(alpha)\n", "print(w_sample)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can combine our sample from the prior with the basis functions to\n", "create a function by computing `Phi_pred@w_sample`.\n", "\n", "Plotting the result shows the recurring problem with the polynomial\n", "basis (note the scale on the vertical axis!). Our prior allows relatively large\n", "coefficients for the basis associated with high polynomial degrees.\n", "Because we are operating with input values of around 2000, this leads to\n", "output functions of very high values. The fix we have used for this\n", "before is to rescale our data before we apply the polynomial basis to\n", "it. Above, we set the scale of the basis to 1. Here let's set it to 100\n", "and try again." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "scale = 100.\n", "basis = mlai.Basis(polynomial, number=degree+1, loc=loc, scale=scale)\n", "Phi_pred = basis.Phi(x_pred)\n", "Phi = basis.Phi(x)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Having recomputed the basis functions with the new scale, let's loop\n", "through some samples from the prior and plot the functions they\n", "generate." ] }
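, { "cell_type": "markdown", "metadata": {}, "source": [ "A minimal sketch of that loop is given below (matplotlib for plotting; the number of samples drawn is an arbitrary choice). Each pass draws a standard normal vector, scales it by $\\sqrt{\\alpha}$ to give a prior sample of $\\mappingVector$, and multiplies by `Phi_pred` to give a function at the prediction inputs." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "\n", "num_samples = 10  # arbitrary number of prior functions to draw\n", "fig, ax = plt.subplots(figsize=(8, 5))\n", "for i in range(num_samples):\n", "    z_vec = np.random.normal(size=K)  # standard normal sample\n", "    w_sample = z_vec*np.sqrt(alpha)   # prior sample of the weights\n", "    f_sample = Phi_pred@w_sample      # function values at the prediction inputs\n", "    ax.plot(x_pred.flatten(), f_sample)\n", "ax.set_xlabel('year')\n", "ax.set_ylabel('$f(x)$')\n", "plt.show()" ] }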
, { "cell_type": "markdown", "metadata": {}, "source": [ "The predictions for the mean output can now be computed. We want the\n", "expected value of the predictions under the posterior distribution. In\n", "matrix form, the predictions can be computed as $$\n", "\\mappingFunctionVector = \\basisMatrix \\mappingVector.\n", "$$ This involves a matrix multiplication between a fixed matrix\n", "$\\basisMatrix$ and a vector that is drawn from a distribution\n", "$\\mappingVector$. Because $\\mappingVector$ is drawn from a distribution,\n", "this implies that $\\mappingFunctionVector$ should also be drawn from a\n", "distribution. There are two distributions we are interested in though.\n", "We have just been sampling from the *prior* distribution to see what\n", "sort of functions we get *before* looking at the data. In Bayesian\n", "inference, we need to compute the *posterior* distribution and sample\n", "from that density.\n", "\n", "## Computing the Posterior \[edit\]\n", "\n", "We will now attempt to compute the *posterior distribution*. In the\n", "lecture we went through the maths that allows us to compute the\n", "posterior distribution for $\\mappingVector$. This distribution is also\n", "Gaussian, $$\n", "p(\\mappingVector | \\dataVector, \\inputVector, \\dataStd^2) = \\gaussianDist{\\mappingVector}{\\meanVector_\\mappingScalar}{\\covarianceMatrix_\\mappingScalar}\n", "$$ with covariance, $\\covarianceMatrix_\\mappingScalar$, given by $$\n", "\\covarianceMatrix_\\mappingScalar = \\left(\\dataStd^{-2}\\basisMatrix^\\top \\basisMatrix + \\alpha^{-1}\\eye\\right)^{-1}\n", "$$ whilst the mean is given by $$\n", "\\meanVector_\\mappingScalar = \\covarianceMatrix_\\mappingScalar \\dataStd^{-2}\\basisMatrix^\\top \\dataVector\n", "$$ Let's compute the posterior covariance and mean, then we'll sample\n", "from these densities to have a look at the posterior belief about\n", "$\\mappingVector$ once the data has been accounted for. Remember, the\n", "process of Bayesian inference involves combining the prior,\n", "$p(\\mappingVector)$ with the likelihood,\n", "$p(\\dataVector|\\inputVector, \\mappingVector)$ to form the posterior,\n", "$p(\\mappingVector | \\dataVector, \\inputVector)$ through Bayes' rule, $$\n", "p(\\mappingVector|\\dataVector, \\inputVector) = \\frac{p(\\dataVector|\\inputVector, \\mappingVector)p(\\mappingVector)}{p(\\dataVector)}\n", "$$ We've looked at the samples for our function\n", "$\\mappingFunctionVector = \\basisMatrix\\mappingVector$, which forms the\n", "mean of the Gaussian likelihood, under the prior distribution. I.e.\n", "we've sampled from $p(\\mappingVector)$ and multiplied the result by the\n", "basis matrix. Now we will sample from the posterior density,\n", "$p(\\mappingVector|\\dataVector, \\inputVector)$, and check that the new\n", "samples do correspond to the data, i.e. we want to check that the\n", "updated distribution includes information from the data set. First we\n", "need to compute the posterior mean and *covariance*.\n", "\n", "## Bayesian Inference in the Univariate Case\n", "\n", "This video talks about Bayesian inference across the single parameter,\n", "the offset $c$, illustrating how the prior and the likelihood combine in\n", "one dimension to form a posterior." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from IPython.lib.display import YouTubeVideo\n", "YouTubeVideo('AvlnFnvFw_0')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Figure: Univariate Bayesian inference. Lecture 10 from 2012 MLAI\n", "course.\n", "\n", "## Multivariate Bayesian Inference\n", "\n", "This section of the lecture talks about how we extend the idea of\n", "Bayesian inference to the multivariate case. It goes through the\n", "multivariate Gaussian and how to complete the square in the linear\n", "algebra, as we do below." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from IPython.lib.display import YouTubeVideo\n", "YouTubeVideo('Os1iqgpelPw')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Figure: Multivariate Bayesian inference. Lecture 11 from 2012 MLAI\n", "course.\n", "\n", "The lecture informs us that the posterior density for $\\mappingVector$ is\n", "given by a Gaussian density with covariance $$\n", "\\covarianceMatrix_w = \\left(\\dataStd^{-2}\\basisMatrix^\\top \\basisMatrix + \\alpha^{-1}\\eye\\right)^{-1}\n", "$$ and mean $$\n", "\\meanVector_w = \\covarianceMatrix_w\\dataStd^{-2}\\basisMatrix^\\top \\dataVector.\n", "$$\n", "\n", "### Question 1\n", "\n", "Compute the covariance for $\\mappingVector$ given the training data and\n", "call the resulting variable `w_cov`. Compute the mean for\n", "$\\mappingVector$ given the training data. Call the resulting variable\n", "`w_mean`. Assume that $\\dataStd^2 = 0.01$.\n", "\n", "*10 marks*" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Write code for your answer to Question 1 in this box\n", "# provide the answers so that the code runs correctly, otherwise you will lose marks!\n", "\n", "sigma2 = \n", "w_cov = \n", "w_mean = \n" ] }
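, { "cell_type": "markdown", "metadata": {}, "source": [ "Once `w_cov` and `w_mean` have been computed, we can sample from the posterior and plot the resulting functions against the data. The following is a minimal sketch of that check; it assumes Question 1 has been answered so that `w_cov` and `w_mean` exist, and the number of samples drawn is an arbitrary choice." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "\n", "num_samples = 10  # arbitrary number of posterior samples to draw\n", "fig, ax = plt.subplots(figsize=(8, 5))\n", "for i in range(num_samples):\n", "    # draw a weight vector from the posterior density\n", "    w_sample = np.random.multivariate_normal(np.asarray(w_mean).flatten(), w_cov)\n", "    f_sample = Phi_pred@w_sample  # corresponding function at the prediction inputs\n", "    ax.plot(x_pred.flatten(), f_sample)\n", "ax.plot(x, y, 'r.', markersize=10)  # overlay the observed data\n", "ax.set_xlabel('year')\n", "plt.show()" ] }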
, { "cell_type": "markdown", "metadata": {}, "source": [ "## Olympic Data with Bayesian Polynomials \[edit\]\n", "\n", "The plots below show the Bayesian polynomial fit to the olympic\n", "marathon data, together with the negative marginal log likelihood, as\n", "the number of basis functions is varied." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import mlai\n", "import pods" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pods\n", "from ipywidgets import IntSlider" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pods.notebook.display_plots('olympic_BLM_polynomial_number{num_basis:0>3}.svg', \n", " directory='../slides/diagrams/ml/', \n", " num_basis=IntSlider(1, 1, 27, 1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: Bayesian fit with 26th degree polynomial and negative\n", "marginal log likelihood.\n", "\n", "## Hold Out Validation\n", "\n", "For the polynomial fit, we will now look at *hold out* validation, where\n", "we are holding out some of the most recent points. This tests the\n", "ability of our model to *extrapolate*." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pods\n", "from ipywidgets import IntSlider" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pods.notebook.display_plots('olympic_val_BLM_polynomial_number{num_basis:0>3}.svg', \n", " directory='../slides/diagrams/ml', \n", " num_basis=IntSlider(1, 1, 27, 1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: Bayesian fit with 26th degree polynomial and hold out\n", "validation scores.\n", "\n", "## 5-fold Cross Validation\n", "\n", "Five fold cross validation tests the ability of the model to\n", "*interpolate*." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pods\n", "from ipywidgets import IntSlider" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pods.notebook.display_plots('olympic_5cv{part:0>2}_BLM_polynomial_number{num_basis:0>3}.svg', \n", " directory='../slides/diagrams/ml', \n", " part=(0, 5), \n", " num_basis=IntSlider(1, 1, 27, 1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: Bayesian fit with 26th degree polynomial and five fold cross\n", "validation scores.\n", "\n", "## Marginal Likelihood \[edit\]\n", "\n", "- The marginal likelihood can also be computed; it has the form: $$\n", "  p(\\dataVector|\\inputMatrix, \\dataStd^2, \\alpha) = \\frac{1}{(2\\pi)^\\frac{\\numData}{2}\\left|\\kernelMatrix\\right|^\\frac{1}{2}} \\exp\\left(-\\frac{1}{2} \\dataVector^\\top \\kernelMatrix^{-1} \\dataVector\\right)\n", "  $$ where\n", "  $\\kernelMatrix = \\alpha \\basisMatrix\\basisMatrix^\\top + \\dataStd^2 \\eye$.\n", "\n", "- So it is a zero mean $\\numData$-dimensional Gaussian with covariance\n", "  matrix $\\kernelMatrix$.\n", "\n", "# References {#references .unnumbered}\n", "\n", "[^1]: Note not all exponentiated quadratics can be normalized; to do so,\n", "    the coefficient associated with the variable squared,\n", "    $\\dataScalar^2$, must be strictly positive." ] } ], "metadata": {}, "nbformat": 4, "nbformat_minor": 2 }