{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Probabilistic Machine Learning\n", "## AI Saturdays\n", "### [Neil D. Lawrence](http://inverseprobability.com), Amazon Cambridge and University of Sheffield\n", "### 2018-08-25\n", "\n", "**Abstract**: In this session we review the *probabilistic* approach to machine learning.\n", "We start with a review of probability, and introduce the concepts of\n", "probabilistic modelling. We then apply the approach in practice to Naive\n", "Bayesian classification. In this lecture we review the Bayesian\n", "formalism in the context of linear models, reviewing initially maximum\n", "likelihood and introducing basis functions as a way of driving\n", "non-linearity in the model." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$\\newcommand{\\Amatrix}{\\mathbf{A}}\n", "\\newcommand{\\KL}[2]{\\text{KL}\\left( #1\\,\\|\\,#2 \\right)}\n", "\\newcommand{\\Kaast}{\\kernelMatrix_{\\mathbf{ \\ast}\\mathbf{ \\ast}}}\n", "\\newcommand{\\Kastu}{\\kernelMatrix_{\\mathbf{ \\ast} \\inducingVector}}\n", "\\newcommand{\\Kff}{\\kernelMatrix_{\\mappingFunctionVector \\mappingFunctionVector}}\n", "\\newcommand{\\Kfu}{\\kernelMatrix_{\\mappingFunctionVector \\inducingVector}}\n", "\\newcommand{\\Kuast}{\\kernelMatrix_{\\inducingVector \\bf\\ast}}\n", "\\newcommand{\\Kuf}{\\kernelMatrix_{\\inducingVector \\mappingFunctionVector}}\n", "\\newcommand{\\Kuu}{\\kernelMatrix_{\\inducingVector \\inducingVector}}\n", "\\newcommand{\\Kuui}{\\Kuu^{-1}}\n", "\\newcommand{\\Qaast}{\\mathbf{Q}_{\\bf \\ast \\ast}}\n", "\\newcommand{\\Qastf}{\\mathbf{Q}_{\\ast \\mappingFunction}}\n", "\\newcommand{\\Qfast}{\\mathbf{Q}_{\\mappingFunctionVector \\bf \\ast}}\n", "\\newcommand{\\Qff}{\\mathbf{Q}_{\\mappingFunctionVector \\mappingFunctionVector}}\n", "\\newcommand{\\aMatrix}{\\mathbf{A}}\n", "\\newcommand{\\aScalar}{a}\n", "\\newcommand{\\aVector}{\\mathbf{a}}\n", "\\newcommand{\\acceleration}{a}\n", 
"\\newcommand{\\bMatrix}{\\mathbf{B}}\n", "\\newcommand{\\bScalar}{b}\n", "\\newcommand{\\bVector}{\\mathbf{b}}\n", "\\newcommand{\\basisFunc}{\\phi}\n", "\\newcommand{\\basisFuncVector}{\\boldsymbol{ \\basisFunc}}\n", "\\newcommand{\\basisFunction}{\\phi}\n", "\\newcommand{\\basisLocation}{\\mu}\n", "\\newcommand{\\basisMatrix}{\\boldsymbol{ \\Phi}}\n", "\\newcommand{\\basisScalar}{\\basisFunction}\n", "\\newcommand{\\basisVector}{\\boldsymbol{ \\basisFunction}}\n", "\\newcommand{\\activationFunction}{\\phi}\n", "\\newcommand{\\activationMatrix}{\\boldsymbol{ \\Phi}}\n", "\\newcommand{\\activationScalar}{\\basisFunction}\n", "\\newcommand{\\activationVector}{\\boldsymbol{ \\basisFunction}}\n", "\\newcommand{\\bigO}{\\mathcal{O}}\n", "\\newcommand{\\binomProb}{\\pi}\n", "\\newcommand{\\cMatrix}{\\mathbf{C}}\n", "\\newcommand{\\cbasisMatrix}{\\hat{\\boldsymbol{ \\Phi}}}\n", "\\newcommand{\\cdataMatrix}{\\hat{\\dataMatrix}}\n", "\\newcommand{\\cdataScalar}{\\hat{\\dataScalar}}\n", "\\newcommand{\\cdataVector}{\\hat{\\dataVector}}\n", "\\newcommand{\\centeredKernelMatrix}{\\mathbf{ \\MakeUppercase{\\centeredKernelScalar}}}\n", "\\newcommand{\\centeredKernelScalar}{b}\n", "\\newcommand{\\centeredKernelVector}{\\centeredKernelScalar}\n", "\\newcommand{\\centeringMatrix}{\\mathbf{H}}\n", "\\newcommand{\\chiSquaredDist}[2]{\\chi_{#1}^{2}\\left(#2\\right)}\n", "\\newcommand{\\chiSquaredSamp}[1]{\\chi_{#1}^{2}}\n", "\\newcommand{\\conditionalCovariance}{\\boldsymbol{ \\Sigma}}\n", "\\newcommand{\\coregionalizationMatrix}{\\mathbf{B}}\n", "\\newcommand{\\coregionalizationScalar}{b}\n", "\\newcommand{\\coregionalizationVector}{\\mathbf{ \\coregionalizationScalar}}\n", "\\newcommand{\\covDist}[2]{\\text{cov}_{#2}\\left(#1\\right)}\n", "\\newcommand{\\covSamp}[1]{\\text{cov}\\left(#1\\right)}\n", "\\newcommand{\\covarianceScalar}{c}\n", "\\newcommand{\\covarianceVector}{\\mathbf{ \\covarianceScalar}}\n", "\\newcommand{\\covarianceMatrix}{\\mathbf{C}}\n", 
"\\newcommand{\\covarianceMatrixTwo}{\\boldsymbol{ \\Sigma}}\n", "\\newcommand{\\croupierScalar}{s}\n", "\\newcommand{\\croupierVector}{\\mathbf{ \\croupierScalar}}\n", "\\newcommand{\\croupierMatrix}{\\mathbf{ \\MakeUppercase{\\croupierScalar}}}\n", "\\newcommand{\\dataDim}{p}\n", "\\newcommand{\\dataIndex}{i}\n", "\\newcommand{\\dataIndexTwo}{j}\n", "\\newcommand{\\dataMatrix}{\\mathbf{Y}}\n", "\\newcommand{\\dataScalar}{y}\n", "\\newcommand{\\dataSet}{\\mathcal{D}}\n", "\\newcommand{\\dataStd}{\\sigma}\n", "\\newcommand{\\dataVector}{\\mathbf{ \\dataScalar}}\n", "\\newcommand{\\decayRate}{d}\n", "\\newcommand{\\degreeMatrix}{\\mathbf{ \\MakeUppercase{\\degreeScalar}}}\n", "\\newcommand{\\degreeScalar}{d}\n", "\\newcommand{\\degreeVector}{\\mathbf{ \\degreeScalar}}\n", "% Already defined by latex\n", "%\\newcommand{\\det}[1]{\\left|#1\\right|}\n", "\\newcommand{\\diag}[1]{\\text{diag}\\left(#1\\right)}\n", "\\newcommand{\\diagonalMatrix}{\\mathbf{D}}\n", "\\newcommand{\\diff}[2]{\\frac{\\text{d}#1}{\\text{d}#2}}\n", "\\newcommand{\\diffTwo}[2]{\\frac{\\text{d}^2#1}{\\text{d}#2^2}}\n", "\\newcommand{\\displacement}{x}\n", "\\newcommand{\\displacementVector}{\\textbf{\\displacement}}\n", "\\newcommand{\\distanceMatrix}{\\mathbf{ \\MakeUppercase{\\distanceScalar}}}\n", "\\newcommand{\\distanceScalar}{d}\n", "\\newcommand{\\distanceVector}{\\mathbf{ \\distanceScalar}}\n", "\\newcommand{\\eigenvaltwo}{\\ell}\n", "\\newcommand{\\eigenvaltwoMatrix}{\\mathbf{L}}\n", "\\newcommand{\\eigenvaltwoVector}{\\mathbf{l}}\n", "\\newcommand{\\eigenvalue}{\\lambda}\n", "\\newcommand{\\eigenvalueMatrix}{\\boldsymbol{ \\Lambda}}\n", "\\newcommand{\\eigenvalueVector}{\\boldsymbol{ \\lambda}}\n", "\\newcommand{\\eigenvector}{\\mathbf{ \\eigenvectorScalar}}\n", "\\newcommand{\\eigenvectorMatrix}{\\mathbf{U}}\n", "\\newcommand{\\eigenvectorScalar}{u}\n", "\\newcommand{\\eigenvectwo}{\\mathbf{v}}\n", "\\newcommand{\\eigenvectwoMatrix}{\\mathbf{V}}\n", 
"\\newcommand{\\eigenvectwoScalar}{v}\n", "\\newcommand{\\entropy}[1]{\\mathcal{H}\\left(#1\\right)}\n", "\\newcommand{\\errorFunction}{E}\n", "\\newcommand{\\expDist}[2]{\\left<#1\\right>_{#2}}\n", "\\newcommand{\\expSamp}[1]{\\left<#1\\right>}\n", "\\newcommand{\\expectation}[1]{\\left\\langle #1 \\right\\rangle }\n", "\\newcommand{\\expectationDist}[2]{\\left\\langle #1 \\right\\rangle _{#2}}\n", "\\newcommand{\\expectedDistanceMatrix}{\\mathcal{D}}\n", "\\newcommand{\\eye}{\\mathbf{I}}\n", "\\newcommand{\\fantasyDim}{r}\n", "\\newcommand{\\fantasyMatrix}{\\mathbf{ \\MakeUppercase{\\fantasyScalar}}}\n", "\\newcommand{\\fantasyScalar}{z}\n", "\\newcommand{\\fantasyVector}{\\mathbf{ \\fantasyScalar}}\n", "\\newcommand{\\featureStd}{\\varsigma}\n", "\\newcommand{\\gammaCdf}[3]{\\mathcal{GAMMA CDF}\\left(#1|#2,#3\\right)}\n", "\\newcommand{\\gammaDist}[3]{\\mathcal{G}\\left(#1|#2,#3\\right)}\n", "\\newcommand{\\gammaSamp}[2]{\\mathcal{G}\\left(#1,#2\\right)}\n", "\\newcommand{\\gaussianDist}[3]{\\mathcal{N}\\left(#1|#2,#3\\right)}\n", "\\newcommand{\\gaussianSamp}[2]{\\mathcal{N}\\left(#1,#2\\right)}\n", "\\newcommand{\\given}{|}\n", "\\newcommand{\\half}{\\frac{1}{2}}\n", "\\newcommand{\\heaviside}{H}\n", "\\newcommand{\\hiddenMatrix}{\\mathbf{ \\MakeUppercase{\\hiddenScalar}}}\n", "\\newcommand{\\hiddenScalar}{h}\n", "\\newcommand{\\hiddenVector}{\\mathbf{ \\hiddenScalar}}\n", "\\newcommand{\\identityMatrix}{\\eye}\n", "\\newcommand{\\inducingInputScalar}{z}\n", "\\newcommand{\\inducingInputVector}{\\mathbf{ \\inducingInputScalar}}\n", "\\newcommand{\\inducingInputMatrix}{\\mathbf{Z}}\n", "\\newcommand{\\inducingScalar}{u}\n", "\\newcommand{\\inducingVector}{\\mathbf{ \\inducingScalar}}\n", "\\newcommand{\\inducingMatrix}{\\mathbf{U}}\n", "\\newcommand{\\inlineDiff}[2]{\\text{d}#1/\\text{d}#2}\n", "\\newcommand{\\inputDim}{q}\n", "\\newcommand{\\inputMatrix}{\\mathbf{X}}\n", "\\newcommand{\\inputScalar}{x}\n", "\\newcommand{\\inputSpace}{\\mathcal{X}}\n", 
"\\newcommand{\\inputVals}{\\inputVector}\n", "\\newcommand{\\inputVector}{\\mathbf{ \\inputScalar}}\n", "\\newcommand{\\iterNum}{k}\n", "\\newcommand{\\kernel}{\\kernelScalar}\n", "\\newcommand{\\kernelMatrix}{\\mathbf{K}}\n", "\\newcommand{\\kernelScalar}{k}\n", "\\newcommand{\\kernelVector}{\\mathbf{ \\kernelScalar}}\n", "\\newcommand{\\kff}{\\kernelScalar_{\\mappingFunction \\mappingFunction}}\n", "\\newcommand{\\kfu}{\\kernelVector_{\\mappingFunction \\inducingScalar}}\n", "\\newcommand{\\kuf}{\\kernelVector_{\\inducingScalar \\mappingFunction}}\n", "\\newcommand{\\kuu}{\\kernelVector_{\\inducingScalar \\inducingScalar}}\n", "\\newcommand{\\lagrangeMultiplier}{\\lambda}\n", "\\newcommand{\\lagrangeMultiplierMatrix}{\\boldsymbol{ \\Lambda}}\n", "\\newcommand{\\lagrangian}{L}\n", "\\newcommand{\\laplacianFactor}{\\mathbf{ \\MakeUppercase{\\laplacianFactorScalar}}}\n", "\\newcommand{\\laplacianFactorScalar}{m}\n", "\\newcommand{\\laplacianFactorVector}{\\mathbf{ \\laplacianFactorScalar}}\n", "\\newcommand{\\laplacianMatrix}{\\mathbf{L}}\n", "\\newcommand{\\laplacianScalar}{\\ell}\n", "\\newcommand{\\laplacianVector}{\\mathbf{ \\ell}}\n", "\\newcommand{\\latentDim}{q}\n", "\\newcommand{\\latentDistanceMatrix}{\\boldsymbol{ \\Delta}}\n", "\\newcommand{\\latentDistanceScalar}{\\delta}\n", "\\newcommand{\\latentDistanceVector}{\\boldsymbol{ \\delta}}\n", "\\newcommand{\\latentForce}{f}\n", "\\newcommand{\\latentFunction}{u}\n", "\\newcommand{\\latentFunctionVector}{\\mathbf{ \\latentFunction}}\n", "\\newcommand{\\latentFunctionMatrix}{\\mathbf{ \\MakeUppercase{\\latentFunction}}}\n", "\\newcommand{\\latentIndex}{j}\n", "\\newcommand{\\latentScalar}{z}\n", "\\newcommand{\\latentVector}{\\mathbf{ \\latentScalar}}\n", "\\newcommand{\\latentMatrix}{\\mathbf{Z}}\n", "\\newcommand{\\learnRate}{\\eta}\n", "\\newcommand{\\lengthScale}{\\ell}\n", "\\newcommand{\\rbfWidth}{\\ell}\n", "\\newcommand{\\likelihoodBound}{\\mathcal{L}}\n", "\\newcommand{\\likelihoodFunction}{L}\n", 
"\\newcommand{\\locationScalar}{\\mu}\n", "\\newcommand{\\locationVector}{\\boldsymbol{ \\locationScalar}}\n", "\\newcommand{\\locationMatrix}{\\mathbf{M}}\n", "\\newcommand{\\variance}[1]{\\text{var}\\left( #1 \\right)}\n", "\\newcommand{\\mappingFunction}{f}\n", "\\newcommand{\\mappingFunctionMatrix}{\\mathbf{F}}\n", "\\newcommand{\\mappingFunctionTwo}{g}\n", "\\newcommand{\\mappingFunctionTwoMatrix}{\\mathbf{G}}\n", "\\newcommand{\\mappingFunctionTwoVector}{\\mathbf{ \\mappingFunctionTwo}}\n", "\\newcommand{\\mappingFunctionVector}{\\mathbf{ \\mappingFunction}}\n", "\\newcommand{\\scaleScalar}{s}\n", "\\newcommand{\\mappingScalar}{w}\n", "\\newcommand{\\mappingVector}{\\mathbf{ \\mappingScalar}}\n", "\\newcommand{\\mappingMatrix}{\\mathbf{W}}\n", "\\newcommand{\\mappingScalarTwo}{v}\n", "\\newcommand{\\mappingVectorTwo}{\\mathbf{ \\mappingScalarTwo}}\n", "\\newcommand{\\mappingMatrixTwo}{\\mathbf{V}}\n", "\\newcommand{\\maxIters}{K}\n", "\\newcommand{\\meanMatrix}{\\mathbf{M}}\n", "\\newcommand{\\meanScalar}{\\mu}\n", "\\newcommand{\\meanTwoMatrix}{\\mathbf{M}}\n", "\\newcommand{\\meanTwoScalar}{m}\n", "\\newcommand{\\meanTwoVector}{\\mathbf{ \\meanTwoScalar}}\n", "\\newcommand{\\meanVector}{\\boldsymbol{ \\meanScalar}}\n", "\\newcommand{\\mrnaConcentration}{m}\n", "\\newcommand{\\naturalFrequency}{\\omega}\n", "\\newcommand{\\neighborhood}[1]{\\mathcal{N}\\left( #1 \\right)}\n", "\\newcommand{\\neilurl}{http://inverseprobability.com/}\n", "\\newcommand{\\noiseMatrix}{\\boldsymbol{ E}}\n", "\\newcommand{\\noiseScalar}{\\epsilon}\n", "\\newcommand{\\noiseVector}{\\boldsymbol{ \\epsilon}}\n", "\\newcommand{\\norm}[1]{\\left\\Vert #1 \\right\\Vert}\n", "\\newcommand{\\normalizedLaplacianMatrix}{\\hat{\\mathbf{L}}}\n", "\\newcommand{\\normalizedLaplacianScalar}{\\hat{\\ell}}\n", "\\newcommand{\\normalizedLaplacianVector}{\\hat{\\mathbf{ \\ell}}}\n", "\\newcommand{\\numActive}{m}\n", "\\newcommand{\\numBasisFunc}{m}\n", "\\newcommand{\\numComponents}{m}\n", 
"\\newcommand{\\numComps}{K}\n", "\\newcommand{\\numData}{n}\n", "\\newcommand{\\numFeatures}{K}\n", "\\newcommand{\\numHidden}{h}\n", "\\newcommand{\\numInducing}{m}\n", "\\newcommand{\\numLayers}{\\ell}\n", "\\newcommand{\\numNeighbors}{K}\n", "\\newcommand{\\numSequences}{s}\n", "\\newcommand{\\numSuccess}{s}\n", "\\newcommand{\\numTasks}{m}\n", "\\newcommand{\\numTime}{T}\n", "\\newcommand{\\numTrials}{S}\n", "\\newcommand{\\outputIndex}{j}\n", "\\newcommand{\\paramVector}{\\boldsymbol{ \\theta}}\n", "\\newcommand{\\parameterMatrix}{\\boldsymbol{ \\Theta}}\n", "\\newcommand{\\parameterScalar}{\\theta}\n", "\\newcommand{\\parameterVector}{\\boldsymbol{ \\parameterScalar}}\n", "\\newcommand{\\partDiff}[2]{\\frac{\\partial#1}{\\partial#2}}\n", "\\newcommand{\\precisionScalar}{j}\n", "\\newcommand{\\precisionVector}{\\mathbf{ \\precisionScalar}}\n", "\\newcommand{\\precisionMatrix}{\\mathbf{J}}\n", "\\newcommand{\\pseudotargetScalar}{\\widetilde{y}}\n", "\\newcommand{\\pseudotargetVector}{\\mathbf{ \\pseudotargetScalar}}\n", "\\newcommand{\\pseudotargetMatrix}{\\mathbf{ \\widetilde{Y}}}\n", "\\newcommand{\\rank}[1]{\\text{rank}\\left(#1\\right)}\n", "\\newcommand{\\rayleighDist}[2]{\\mathcal{R}\\left(#1|#2\\right)}\n", "\\newcommand{\\rayleighSamp}[1]{\\mathcal{R}\\left(#1\\right)}\n", "\\newcommand{\\responsibility}{r}\n", "\\newcommand{\\rotationScalar}{r}\n", "\\newcommand{\\rotationVector}{\\mathbf{ \\rotationScalar}}\n", "\\newcommand{\\rotationMatrix}{\\mathbf{R}}\n", "\\newcommand{\\sampleCovScalar}{s}\n", "\\newcommand{\\sampleCovVector}{\\mathbf{ \\sampleCovScalar}}\n", "\\newcommand{\\sampleCovMatrix}{\\mathbf{s}}\n", "\\newcommand{\\scalarProduct}[2]{\\left\\langle{#1},{#2}\\right\\rangle}\n", "\\newcommand{\\sign}[1]{\\text{sign}\\left(#1\\right)}\n", "\\newcommand{\\sigmoid}[1]{\\sigma\\left(#1\\right)}\n", "\\newcommand{\\singularvalue}{\\ell}\n", "\\newcommand{\\singularvalueMatrix}{\\mathbf{L}}\n", 
"\\newcommand{\\singularvalueVector}{\\mathbf{l}}\n", "\\newcommand{\\sorth}{\\mathbf{u}}\n", "\\newcommand{\\spar}{\\lambda}\n", "\\newcommand{\\trace}[1]{\\text{tr}\\left(#1\\right)}\n", "\\newcommand{\\BasalRate}{B}\n", "\\newcommand{\\DampingCoefficient}{C}\n", "\\newcommand{\\DecayRate}{D}\n", "\\newcommand{\\Displacement}{X}\n", "\\newcommand{\\LatentForce}{F}\n", "\\newcommand{\\Mass}{M}\n", "\\newcommand{\\Sensitivity}{S}\n", "\\newcommand{\\basalRate}{b}\n", "\\newcommand{\\dampingCoefficient}{c}\n", "\\newcommand{\\mass}{m}\n", "\\newcommand{\\sensitivity}{s}\n", "\\newcommand{\\springScalar}{\\kappa}\n", "\\newcommand{\\springVector}{\\boldsymbol{ \\kappa}}\n", "\\newcommand{\\springMatrix}{\\boldsymbol{ \\mathcal{K}}}\n", "\\newcommand{\\tfConcentration}{p}\n", "\\newcommand{\\tfDecayRate}{\\delta}\n", "\\newcommand{\\tfMrnaConcentration}{f}\n", "\\newcommand{\\tfVector}{\\mathbf{ \\tfConcentration}}\n", "\\newcommand{\\velocity}{v}\n", "\\newcommand{\\sufficientStatsScalar}{g}\n", "\\newcommand{\\sufficientStatsVector}{\\mathbf{ \\sufficientStatsScalar}}\n", "\\newcommand{\\sufficientStatsMatrix}{\\mathbf{G}}\n", "\\newcommand{\\switchScalar}{s}\n", "\\newcommand{\\switchVector}{\\mathbf{ \\switchScalar}}\n", "\\newcommand{\\switchMatrix}{\\mathbf{S}}\n", "\\newcommand{\\tr}[1]{\\text{tr}\\left(#1\\right)}\n", "\\newcommand{\\loneNorm}[1]{\\left\\Vert #1 \\right\\Vert_1}\n", "\\newcommand{\\ltwoNorm}[1]{\\left\\Vert #1 \\right\\Vert_2}\n", "\\newcommand{\\onenorm}[1]{\\left\\vert#1\\right\\vert_1}\n", "\\newcommand{\\twonorm}[1]{\\left\\Vert #1 \\right\\Vert}\n", "\\newcommand{\\vScalar}{v}\n", "\\newcommand{\\vVector}{\\mathbf{v}}\n", "\\newcommand{\\vMatrix}{\\mathbf{V}}\n", "\\newcommand{\\varianceDist}[2]{\\text{var}_{#2}\\left( #1 \\right)}\n", "% Already defined by latex\n", "%\\newcommand{\\vec}{#1:}\n", "\\newcommand{\\vecb}[1]{\\left(#1\\right):}\n", "\\newcommand{\\weightScalar}{w}\n", "\\newcommand{\\weightVector}{\\mathbf{ 
\\weightScalar}}\n", "\\newcommand{\\weightMatrix}{\\mathbf{W}}\n", "\\newcommand{\\weightedAdjacencyMatrix}{\\mathbf{A}}\n", "\\newcommand{\\weightedAdjacencyScalar}{a}\n", "\\newcommand{\\weightedAdjacencyVector}{\\mathbf{ \\weightedAdjacencyScalar}}\n", "\\newcommand{\\onesVector}{\\mathbf{1}}\n", "\\newcommand{\\zerosVector}{\\mathbf{0}}$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What is Machine Learning?\n", "\n", "What is machine learning? At its most basic level machine learning is a\n", "combination of\n", "\n", "$$ \\text{data} + \\text{model} \\xrightarrow{\\text{compute}} \\text{prediction}$$\n", "\n", "where *data* is our observations. They can be actively or passively\n", "acquired (meta-data). The *model* contains our assumptions, based on\n", "previous experience. That experience can be other data, it can come from\n", "transfer learning, or it can merely be our beliefs about the\n", "regularities of the universe. In humans our models include our inductive\n", "biases. The *prediction* is an action to be taken or a categorization or\n", "a quality score. The reason that machine learning has become a mainstay\n", "of artificial intelligence is the importance of predictions in\n", "artificial intelligence. The data and the model are combined through\n", "computation.\n", "\n", "In practice we normally perform machine learning using two functions. To\n", "combine data with a model we typically make use of:\n", "\n", "**a prediction function** a function which is used to make the\n", "predictions. It includes our beliefs about the regularities of the\n", "universe, our assumptions about how the world works, e.g. smoothness,\n", "spatial similarities, temporal similarities.\n", "\n", "**an objective function** a function which defines the cost of\n", "misprediction. 
Typically it includes knowledge about the world's\n", "generating processes (probabilistic objectives) or the costs we pay for\n", "mispredictions (empirical risk minimization).\n", "\n", "The combination of data and model through the prediction function and\n", "the objective function leads to a *learning algorithm*. The class of\n", "prediction functions and objective functions we can make use of is\n", "restricted by the algorithms they lead to. If the prediction function or\n", "the objective function is too complex, then it can be difficult to find\n", "an appropriate learning algorithm. Much of the academic field of machine\n", "learning is the quest for new learning algorithms that allow us to bring\n", "different types of models and data together.\n", "\n", "A useful reference for the state of the art in machine learning is the UK\n", "Royal Society Report, [Machine Learning: Power and Promise of Computers\n", "that Learn by\n", "Example](https://royalsociety.org/~/media/policy/projects/machine-learning/publications/machine-learning-report.pdf).\n", "\n", "You can also check my blog post on [\"What is Machine\n", "Learning?\"](http://inverseprobability.com/2017/07/17/what-is-machine-learning)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Probabilities\n", "\n", "We are now going to do some simple review of probabilities and use this\n", "review to explore some aspects of our data.\n", "\n", "A probability distribution expresses uncertainty about the outcome of an\n", "event. We often encode this uncertainty in a variable. So if we are\n", "considering the outcome of an event, $Y$, to be a coin toss, then we\n", "might consider $Y=1$ to be heads and $Y=0$ to be tails. We represent the\n", "probability of a given outcome with the notation: \n", "$$\n", "P(Y=1) = 0.5\n", "$$ \n", "The first rule of probability is that the probability must normalize.\n", "The sum of the probabilities of all possible outcomes must equal 1. 
So if the\n", "probability of heads ($Y=1$) is 0.5, then the probability of tails (the\n", "only other possible outcome) is given by \n", "$$\n", "P(Y=0) = 1-P(Y=1) = 0.5\n", "$$\n", "\n", "Probabilities are often defined as the limit of the ratio of the\n", "number of positive outcomes (e.g. *heads*) to the number of trials.\n", "If the number of positive outcomes for event $y$ is denoted by $n_y$ and\n", "the number of trials is denoted by $N$ then this gives the ratio \n", "$$\n", "P(Y=y) = \\lim_{N\\rightarrow \\infty}\\frac{n_y}{N}.\n", "$$ \n", "In practice we never get to observe an event an infinite number of times, so\n", "rather than considering this limit we often use the following estimate \n", "$$\n", "P(Y=y) \\approx \\frac{n_y}{N}.\n", "$$ " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Movie Body Count Data\n", "\n", "To explore probabilities, we'll load in a data set. \n", "\n", "There is a crisis in the movie industry: deaths are\n", "occurring on a massive scale. In every feature film the body count is tolling up.\n", "But what is the cause of all these deaths? Let's try and investigate.\n", "\n", "For our first example of data science, we take inspiration from work by [researchers at NJIT](http://www.theswarmlab.com/r-vs-python-round-2/). The researchers were comparing the qualities of Python with R (my brief thoughts on the subject are available in a Google+ post here: https://plus.google.com/116220678599902155344/posts/5iKyqcrNN68). They put together a database of results from the \"Internet Movie Database\" and the [Movie Body Count](http://www.moviebodycounts.com/) website which will allow us to do some preliminary investigation.\n", "\n", "We will make use of data that has already been 'scraped' from the [Movie Body Count](http://www.moviebodycounts.com/) website. The code and the data are available at [a github repository](https://github.com/sjmgarnier/R-vs-Python/tree/master/Deadliest%20movies%20scrape/code). 
Git is a version control\n", "system and github is a website that hosts code that can be accessed through git.\n", "By sharing the code publicly through github, the authors are licensing the code\n", "publicly and allowing you to access and edit it. As well as accessing the code\n", "via github you can also [download the zip file](https://github.com/sjmgarnier/R-vs-Python/archive/master.zip)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For ease of use we've packaged this data set in the ```pods``` library\n", "\n", "### ```pods```\n", "\n", "The ```pods``` library is a library for supporting open data science (python open data science). It allows you to load in various data sets and provides tools for helping teach in the notebook.\n", "\n", "To install pods you can use pip:\n", "\n", "```pip install pods```\n", "\n", "The code is also available on github: \n", "\n", "Once ```pods``` is installed, it can be imported in the usual manner." ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [], "source": [ "import pods" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>Film</th><th>Year</th><th>Body_Count</th><th>MPAA_Rating</th><th>Genre</th><th>Director</th><th>Actors</th><th>Length_Minutes</th><th>IMDB_Rating</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>24 Hour Party People</td><td>2002</td><td>7</td><td>R</td><td>[Biography, Comedy, Drama, Music]</td><td>[Michael Winterbottom]</td><td>[Steve Coogan, John Thomson, Paul Popplewell, ...</td><td>117</td><td>7.4</td></tr>\n",
"    <tr><th>1</th><td>3:10 to Yuma</td><td>2007</td><td>45</td><td>R</td><td>[Adventure, Crime, Drama, Western]</td><td>[James Mangold]</td><td>[Russell Crowe, Christian Bale, Logan Lerman, ...</td><td>122</td><td>7.8</td></tr>\n",
"    <tr><th>2</th><td>300</td><td>2006</td><td>0</td><td>R</td><td>[Action, Fantasy, History, War]</td><td>[Zack Snyder]</td><td>[Gerard Butler, Lena Headey, Dominic West, Dav...</td><td>117</td><td>7.8</td></tr>\n",
"    <tr><th>3</th><td>8MM</td><td>1999</td><td>7</td><td>R</td><td>[Crime, Mystery, Thriller]</td><td>[Joel Schumacher]</td><td>[Nicolas Cage, Joaquin Phoenix, James Gandolfi...</td><td>123</td><td>6.4</td></tr>\n",
"    <tr><th>4</th><td>The Abominable Dr. Phibes</td><td>1971</td><td>10</td><td>PG-13</td><td>[Fantasy, Horror]</td><td>[Robert Fuest]</td><td>[Vincent Price, Joseph Cotten, Hugh Griffith, ...</td><td>94</td><td>7.2</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>\n", "
" ], "text/plain": [ " Film Year Body_Count MPAA_Rating \\\n", "0 24 Hour Party People 2002 7 R \n", "1 3:10 to Yuma 2007 45 R \n", "2 300 2006 0 R \n", "3 8MM 1999 7 R \n", "4 The Abominable Dr. Phibes 1971 10 PG-13 \n", "\n", " Genre Director \\\n", "0 [Biography, Comedy, Drama, Music] [Michael Winterbottom] \n", "1 [Adventure, Crime, Drama, Western] [James Mangold] \n", "2 [Action, Fantasy, History, War] [Zack Snyder] \n", "3 [Crime, Mystery, Thriller] [Joel Schumacher] \n", "4 [Fantasy, Horror] [Robert Fuest] \n", "\n", " Actors Length_Minutes \\\n", "0 [Steve Coogan, John Thomson, Paul Popplewell, ... 117 \n", "1 [Russell Crowe, Christian Bale, Logan Lerman, ... 122 \n", "2 [Gerard Butler, Lena Headey, Dominic West, Dav... 117 \n", "3 [Nicolas Cage, Joaquin Phoenix, James Gandolfi... 123 \n", "4 [Vincent Price, Joseph Cotten, Hugh Griffith, ... 94 \n", "\n", " IMDB_Rating \n", "0 7.4 \n", "1 7.8 \n", "2 7.8 \n", "3 6.4 \n", "4 7.2 " ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = pods.datasets.movie_body_count()['Y']\n", "data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once it is loaded in the data can be summarized using the `describe` method in pandas." ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>Year</th><th>Body_Count</th><th>Length_Minutes</th><th>IMDB_Rating</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>count</th><td>421.000000</td><td>421.000000</td><td>421.000000</td><td>421.000000</td></tr>\n",
"    <tr><th>mean</th><td>1996.491686</td><td>53.287411</td><td>115.427553</td><td>6.882898</td></tr>\n",
"    <tr><th>std</th><td>10.913210</td><td>82.068035</td><td>21.652287</td><td>1.110788</td></tr>\n",
"    <tr><th>min</th><td>1949.000000</td><td>0.000000</td><td>79.000000</td><td>2.000000</td></tr>\n",
"    <tr><th>25%</th><td>1991.000000</td><td>11.000000</td><td>100.000000</td><td>6.200000</td></tr>\n",
"    <tr><th>50%</th><td>2000.000000</td><td>28.000000</td><td>111.000000</td><td>6.900000</td></tr>\n",
"    <tr><th>75%</th><td>2005.000000</td><td>61.000000</td><td>127.000000</td><td>7.700000</td></tr>\n",
"    <tr><th>max</th><td>2009.000000</td><td>836.000000</td><td>201.000000</td><td>9.300000</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ " Year Body_Count Length_Minutes IMDB_Rating\n", "count 421.000000 421.000000 421.000000 421.000000\n", "mean 1996.491686 53.287411 115.427553 6.882898\n", "std 10.913210 82.068035 21.652287 1.110788\n", "min 1949.000000 0.000000 79.000000 2.000000\n", "25% 1991.000000 11.000000 100.000000 6.200000\n", "50% 2000.000000 28.000000 111.000000 6.900000\n", "75% 2005.000000 61.000000 127.000000 7.700000\n", "max 2009.000000 836.000000 201.000000 9.300000" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In Jupyter and the Jupyter notebook it is possible to see a list of all possible\n", "functions and attributes by typing the name of the object followed by `.` and\n", "pressing the Tab key. For example, in the above case if we type `data.` and press Tab it shows the columns\n", "available (these are attributes in pandas dataframes) such as `Body_Count`, and\n", "also functions, such as `.describe()`.\n", "\n", "For functions we can also see the\n", "documentation about the function by following the name with a question mark.\n", "This will open a box with documentation at the bottom which can be closed with\n", "the x button." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data.describe?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The film deaths data is stored in an object known as a 'data frame'. Data frames\n", "come from the statistical family of programming languages based on `S`, the most\n", "widely used of which is\n", "[`R`](http://en.wikipedia.org/wiki/R_(programming_language)). The data frame\n", "gives us a convenient object for manipulating data. The `describe` method\n", "summarizes which columns there are in the data frame and gives us counts, means,\n", "standard deviations and percentiles for the values in those columns. 
To access a\n", "column directly we can write" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 2002\n", "1 2007\n", "2 2006\n", "3 1999\n", "4 1971\n", "5 1988\n", "6 1988\n", "7 1990\n", "8 2005\n", "9 1988\n", "10 2002\n", "11 1979\n", "12 2007\n", "13 2006\n", "14 1980\n", "15 2007\n", "16 1985\n", "17 1981\n", "18 2000\n", "19 1993\n", "20 1998\n", "21 1979\n", "22 2006\n", "23 2008\n", "24 1998\n", "25 1992\n", "26 1976\n", "27 2005\n", "28 2007\n", "29 2002\n", " ... \n", "391 1995\n", "392 2005\n", "393 2008\n", "394 2005\n", "395 2000\n", "396 1983\n", "397 1985\n", "398 2006\n", "399 2007\n", "400 2004\n", "401 2007\n", "402 2008\n", "403 2005\n", "404 2001\n", "405 2000\n", "406 2002\n", "407 1968\n", "408 1969\n", "409 2000\n", "410 2003\n", "411 2006\n", "412 2002\n", "413 2005\n", "414 1974\n", "415 2000\n", "416 2007\n", "417 1967\n", "418 2007\n", "419 2001\n", "420 1964\n", "Name: Year, Length: 421, dtype: int64\n" ] } ], "source": [ "print(data['Year'])\n", "#print(data['Body_Count'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This prints the release year of each film; uncommenting the second line prints the number of deaths per film instead. We can plot the data as follows." 
] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [], "source": [ "# this ensures the plot appears in the web browser\n", "%matplotlib inline \n", "import matplotlib.pyplot as plt # this imports the plotting library in python}" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[]" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXoAAAD8CAYAAAB5Pm/hAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvhp/UCwAAIABJREFUeJztnX+MHOd537/P/SBPtyp/SDrJCkWaNCL7KFZ1Ld6RFyZKfFblWrozpSIO68ZVhFjAAVu3daKkCoX29IeEwNE1JztGAgaCZYNCXTuMYsCK4MSQdecaaCqFR1mWZcmOKLW2SNAWbctWkLS2mTz9Y97XO/vuO7szu7M7M+9+P8BgZt55Z+Z9Z3a/7/M+748RVQUhhJBwGSk6AYQQQvoLhZ4QQgKHQk8IIYFDoSeEkMCh0BNCSOBQ6AkhJHAo9IQQEjgUekIICRwKPSGEBM5Y0QkAgMsuu0x3795ddDIIIaRSnDp16ruqOtUpXimEfvfu3djY2Cg6GYQQUilE5Jtp4tF1QwghgUOhJ4SQwKHQE0JI4FDoCSEkcCj0hBASOBR6QshwsrICrK83h62vR+GBQaEnhAwns7PAkSMNsV9fj/ZnZ4tNVx8oRT96QggZOPPzwIkTkbjX68CxY9H+/HzRKcsdWvSEkOFlfj4S+fvui9YBijxAoSeEDDPr65Elv7wcrV2ffSBQ6Akhw4n1yZ84Adx7b8ONE6DYU+gJIcPJyZPNPnnrsz95sth09QFR1aLTgJmZGeWkZoQQkg0ROaWqM53ipbLoReQ3ReRrIvKciHxKRCZEZI+IPCUip0XkT0Rkk4m72eyfNsd395YVQgghvdBR6EVkB4D/CGBGVf8pgFEA7wVwP4APq+rPAngNwB3mlDsAvGbCP2ziEUIIKYi0PvoxABeJyBiASQDnALwDwCPm+HEAt5rtW8w+zPEbRETySS4hhJCsdBR6VT0L4PcBfAuRwP8QwCkAP1DVCybaGQA7zPYOAK+Ycy+Y+Jfmm2xCCCFpSeO62Y7ISt8D4GcA1AC8q9cbi8iSiGyIyMb58+d7vRwhhJAE0rhu/gWA/62q51X1JwA+A+DnAWwzrhwAuArAWbN9FsBOADDHtwL4nntRVX1QVWdUdWZqquMnDwkhhHRJGqH/FoA5EZk0vvYbADwPYB3Ae0yc2wF81mw/avZhjq9pGfpwEkLIkJLGR/8UokbVpwF81ZzzIIDfAXCniJxG5IN/yJzyEIBLTfidAI72Id2EEEJSwgFThBBSUXIdMEUIIaS6UOgJISRwKPSEEBI4FHpCCAkcCj0hhAQOhZ4QQgKHQk8IIYFDoSeEkMCh0BNCSOBQ6AkhJHAo9IQQEjgUekIICRwKPSGEBA6FnhBCAodCTwghgUOhJ4SQwKHQE0JI4FDoCSEkcCj0hBASOBR6QggJHAo9IYQEDoWeEEICh0JPCCGBQ6EnhJDAodATQkjgUOgJISRwKPSEEBI4FHpCCAkcCj0hhAQOhZ4QQgKHQk8IIY
FDoSeEkMCh0BNCSOBQ6AkhJHAo9IQQEjgUekIICRwKPSGEBA6FnhBC2rGyAqyvN4etr0fhFSGV0IvINhF5RES+LiIviMjPicglIvK4iLxo1ttNXBGRj4rIaRF5VkSu628WCCGkj8zOAkeONMR+fT3an50tNl0ZSGvR/wGAv1TVaQBvBfACgKMAnlDVqwE8YfYB4CYAV5tlCcCxXFNMCCGDZH4eOHEiEvd77onWJ05E4RWho9CLyFYAvwjgIQBQ1R+r6g8A3ALguIl2HMCtZvsWAA9rxJMAtonIlbmnnBBCBsX8PFCvA/fdF60rJPJAOot+D4DzAD4hIl8WkY+JSA3AFap6zsT5NoArzPYOAK/Ezj9jwgghpJqsrwPHjgHLy9Ha9dmXnDRCPwbgOgDHVPVtAP4ODTcNAEBVFYBmubGILInIhohsnD9/PsuphBAyOKxP/sQJ4N57G26cCol9GqE/A+CMqj5l9h9BJPzfsS4Zs37VHD8LYGfs/KtMWBOq+qCqzqjqzNTUVLfpJ4SQ/nLyZLNP3vrsT54sNl0ZGOsUQVW/LSKviMhbVPUbAG4A8LxZbgfwe2b9WXPKowD+vYh8GsBBAD+MuXgIIaRa3HVXa9j8fKX89B2F3vAfAHxSRDYBeBnAryOqDZwQkTsAfBPAERP3cwBuBnAawN+buIQQQgoildCr6jMAZjyHbvDEVQAf6DFdhBAyeFZWov7xcWt9fT1y0/gs+4rAkbGEEGIJYHCUj7SuG0IICZ/44Kh6PepKWbHBUT5o0RNChpOkOWxOnqz04CgfFHpCyHCS5KYZG6v04CgfFHpCyHDim8Pm7ruBD32o0oOjfFDoCSHDizuHzYULlR8c5YONsYSQ4cWdw8bX8FqxwVE+aNETQoaTAOawSQuFnhAynAQwh01aJBrIWiwzMzO6sbFRdDIIIaRSiMgpVfXNWtAELXpCCAkcCj0hhAQOhZ4QQgKHQk8IIYFDoSeEkMCh0BNCSOBQ6AkhJHAo9IQQEjgUekIICRwKPSGEBA6FnhBCAodCTwghgUOhJ4SQwKHQE0LCJ+lD4CsrxaRnwFDoCSHhk/Qh8NnZYtM1IPgpQUJI+MQ/BF6vJ382MFBo0RNChgP3Q+BDIvIAhZ4QMiy4HwIP8NuwSVDoCSHhM0QfAvdBoSeEhM8QfQjcBz8OTgghFYUfByeEEAKAQk8IIcFDoSeEkMCh0BNCSOBQ6AkhJHAo9IQQEjgUekIICRwKPSGEBE5qoReRURH5sog8Zvb3iMhTInJaRP5ERDaZ8M1m/7Q5vrs/SSeEEJKGLBb9BwG8ENu/H8CHVfVnAbwG4A4TfgeA10z4h008QgghBZFK6EXkKgALAD5m9gXAOwA8YqIcB3Cr2b7F7MMcv8HEJ4QQUgBpLfqPALgLwD+a/UsB/EBVL5j9MwB2mO0dAF4BAHP8hyZ+EyKyJCIbIrJx/vz5LpNPCCGkEx2FXkQWAbyqqqfyvLGqPqiqM6o6MzU1leelCSGExEjzKcGfB3BYRG4GMAFgC4A/ALBNRMaM1X4VgLMm/lkAOwGcEZExAFsBfC/3lBNCCElFR4teVe9W1atUdTeA9wJYU9X3AVgH8B4T7XYAnzXbj5p9mONrWoa5kAkhZEjppR/97wC4U0ROI/LBP2TCHwJwqQm/E8DR3pJICCGkF9K4bn6Kqn4RwBfN9ssADnji/D8Av5JD2gghhOQAR8YSQkjgUOgJISRwKPSEEBI4FHpCCAkcCj0hhAQOhZ4QQgKHQk8IIYFDoSeEkMCh0BNCSOBQ6AkhJHAo9IQQEjgUekIICRwKPSGEBA6FnhDiZ2UFWF9vDltfj8KreJ9BUNK8UOgJIX5mZ4EjRxrCtb4e7c/OVu8+gxLgQT2zrKhq4cv+/fuVEFJC1tZUL7tMdXk5Wq+tVfM+9vr2uu5+P+7V72emqgA2NIXGFi7ySqEnpNwsL0dSsbxc7fsMUIAH9czSCj1dN4SQZNbXgWPHgOXlaO26P6
p0n/l5oF4H7rsvWs/P538PYHDPLAtpSoN+L7ToCSkhg3J3DPo+/bToB+kiUlr0hJBeOXkSOHGiYfnOz0f7J09W7z62UfTECeDee6N1vNE0Lwb1zDIiUaFQLDMzM7qxsVF0MgghobKyEvV8ibtr1tcjAb7rruLS1SMickpVZzrGo9ATQkg1SSv0dN0QQkjgUOgJISRwKPSEEBI4FHpCSLGUdH6YkKDQE0KKxTc/zLvfDYyNNcej+HcNhZ4QUiy2r/mRI8A990Tre+8FPvSh8k0OVlEo9ISQ4nGnJ7jzzlbxjw9EqhIlcE1R6AkhxeObH2ZQc9P0mxJMXUyhJ4QUS9L0BA88kN/kYEVa1T7X1IBrJxR6Qkix+OaHufvuSODzmpumaKu64NoJhZ4QUix33dUqfBcuAI89lt/kYEVb1QVPXcy5bgghw8M990RW9fJyVFMYBHHX1Px8634PcK4bQgiJU5RVXYKpi2nREzJsBDplb1v6aFUXCS16Qoifohsmi6AEVnWR0KInZBix4l6vR26Milu2w0puFr2I7BSRdRF5XkS+JiIfNOGXiMjjIvKiWW834SIiHxWR0yLyrIhc13t2CCG5EspgJJKKNK6bCwB+S1WvATAH4AMicg2AowCeUNWrATxh9gHgJgBXm2UJwLHcU03IsNPrAKCCu/uRwdJR6FX1nKo+bbb/FsALAHYAuAXAcRPtOIBbzfYtAB42Hyl/EsA2Ebky95QTMsz04mcf1IeySWnI1BgrIrsBvA3AUwCuUNVz5tC3AVxhtncAeCV22hkTRgjJi14GAA15w+QwklroReRiAH8G4DdU9fX4MY1adDO16orIkohsiMjG+fPns5xKSBj06n7p1s/uG4k6P59/18oSzNqYmbRprljeUgm9iIwjEvlPqupnTPB3rEvGrF814WcB7IydfpUJa0JVH1TVGVWdmZqa6jb9hFSXXrs55uln74dwVbEbZ9o0Vy1vqtp2ASAAHgbwESf8vwI4araPAlgx2wsA/sKcNwfgrzvdY//+/UrIULK2pnrZZarLy9F6bS3beTa+u99tOvK6nnvdrPkrkrRpLkHeAGxoB31V1VRC/wuI3DLPAnjGLDcDuBRRb5sXAXwBwCXaKBj+CMBLAL4KYKbTPSj0pDDuv7/1D7q2FoUPiuXl6K+4vJz+nH6k2xWupaV87tFN/oombZoLzltuQj+IhUJPUpO3wPXLks16/7JYvHHhyuPZlC1/aRhGi34QC4WepKYfwlzUH7boQiYpPfHn0MuzKVv+0pA2zSXJG4WehEs/hLmIKngZ3Ebx+yYJV7fPpkz5S0vaNJckbxR6EjZ5CnMJquCFkyRcS0vleTYlEdcyQaEn4ZKnMJekCl5KyvZs0qbHVyAsLUWLe728C4kBF0YUehImeYtP2azEvNPTy/XK9mzs/TsV8r7fyNatqlu29L/QGnDhSKEng2cQwlBG8cmTvIWibFZ5HqRx2+XdsJyFAboCKfSkwaDE0ScqtZrq6mr/7x0SeQtFSG0QWfLiKxDiYf108QyocZ9CTxoM0qpz/4irq+FZlIMgb6Go4qAllyy/4zQWve+3mYeLhxY9hb4wBmnVuaISkkU5CGjR+0lbM83io7diHx8N7Iatrqa36Omjp9AXThqrrlc3T5KohGBRDoJehML37lZXVScnw61RpXW/tHPJ+EYD33ZbFHbbbdmeF3vdUOgLJa1V14vQJJ3rWkihiEw/6EUoqtBGkrcQLi1F1rprvbuinoT7v1haUq3XVUVUr78+Wtfr6dNHoafQF0ZW8e62ql8mi7IKvXP60ZWyF7fDIPD9Ficnuy+M1tYil8zWrVGet25VnZhId72kglFE9cYbI3m88cZo371elvzRdUOhHwjdCEperpaiBHfAf7iuyNsC79XtkIY83mfeDfZra6oXXRTl+aKL0l8vye1z8GCzRT83l94VdNNNAy1sKfSkQdY/ZyiNd1XIR96it7raLFJpLdGs6e21AO3UYO
+bIjlJXJeWoloB0Kg9dvvufYWlrTFkadztZ2Ebg0JPGmT5c1bBEs5CFRqB8+qldOBAVCOIi0ytFoV3g7VO46yuRtfrpQBNyl+nKZJ9grtlS5THLVui8+JxOvWZ9xk7Se4v35w/Sfnod2Ebg0JPmkkrHhWbva8tZbfo2/nUuymgFhcbVu3ycsPKXVzsLn1WsKxQxfe7LUCzNNj73p8btrDQallv2RKFd1NTaudOcwsi33sahPssBoWetNKPGR/LavmXPX2qyUJar3fvdpicbLYmx8d763Xjs057KUCTGuxrNf+76mZ0q68DQK3W+lynp1sLwYWF1hrQ6qrq5s3NDb5btvgLp5tuar1PvR6F9wEKPWkm7z+n9Y2W1WKuQo3DZ9HX6933UrJ5vv766K99/fV+Ec3a4Bu/Xj8K0Cy/rzS/Y9/1FhYisY5b2ps3R9u+GoubFivu1j20aVPre9qypdGQ2+maOUGhJw16/XO2O78KPvCyE3+GSX7xtBZhGgs8S4Ove72FhcEUoGl99FnGeNiukzYv1srv5E+3BUf8PU1PR7UltwDduZO9bpIWCn2f6UeXuLTWVejkPZK4l143WXzqad5du+v1m7wnHLNWubXirSvGrQH58LljpqdVR0ebrfxarZE+TmpGoa8snXpFDKPY9/IcsjRMpiFrL5lOYtRr7aJMDfvWTeb2t19Y6FxjOXw4Oqdej/br9Wh/fLy54LCNwgM0gCj0JF/cH6+vn3PZfOAu/RKUbv/Y7dKTl0XoK0wmJ1st1H64F9IWgklpTNOOkPad2r728QbV8XFt8dED/jaNw4ebC4Rdu1T37o3i2+Xw4Uah2kubSAYo9CQ/QrHee50XpR396NGUh0XoE0Jrkebpkum1wb5bF1ba3+biYqsbCojEOk67GlDcxWOf4eho89r66OPMzUXH3faGHH53FHqSH1XowZKGeINevJtcrwVWtz1BkixPX7/wPAqjeFrybjBM22CfpRaTtsBL+/yz5NltIPdZ9CINcbeunMVFfyNy0qCuHqHQkwahCHUerK01+2l7EWXV9LWEtJan73oTE639vfP4ElKnWkgeU2ektdST2iXS1pS6iZe2TePQIf2pa0a14bMfHVW99tpo+9prGwVy0nNwp2nIAQo9aRCK6yUP0vzhsjyvLBZ4txaqTxx76Wrou4evzaXdQKYk4kKa9Gys1dut+HfKS9p4Bw+q10c/NtacFhHVbdtaB6KNjjaHHTqk+pa3tD6HpIZg+uiHnH5Y4Hn6fauKtbbTVKGzPK8scbu1UNNYy2lrJz4BT/sFpgMHknviuOlZXEyu7XSyrOv11n7qvTTkTk9Hg5zcmtLIiP7U725F3h0IZV00e/Y0r31W/t69rc/MHqvVorBarbmA6QEKfZXplwU+7IObsvq/szyvNHF79Tl3mg6g3bU6CeHaWnLDafwenaZt6FRQ+D7VV6+3Fjy99LrxFWRW0F3rfdeuSNzjlrz7/LdubQj7li2N7fHxZov+8GHVqSl/o2+7Pvc9QKGvOnlb4Gmq6qH77bPUlPK26Hvx5Y+PRyLiimPaWSTTpM9a1XFRX1iILGHXSp+ba/0Ck7Xo3Ty71ntSj59u5/dJwuf+8s0DZK1tu9ivScXTbH307rJvX+v0ELbbZrzBf2QkKgDccSh03RBV7W9f6okJf/U9z94dVSVLjerAgdaq/uRk68RYaeP53BhWaHwDdrqpnbTrcunOfOkT5cnJ5kbIdj7+iYlmS3ZiIrkPf9reOZ3wFVq2kTVuvU9PR9sTE1GciQn9aS+aePpqtcYxu1x8cbSOFxx2fIItROw7si6ePHs5GSj0VSdPiz6peuubkS8U330vQpHlXF//bJHWXjJp2wd87qXRUdU3vrFZVHbubG00TCqo3d/S9HRrobN5c3Rd1+L1ifLcXEMQbQHkK8hsY2X8NzY52WrlJ6XR537Ztas5b77ZJm36XJ+4O8DJunPceD4XT9ISLxjt78AWmnaxhYR7za
mp1neVEQp9lemXj953n05dDavKoJ6havoPTVgr3gpEWivY9sGOW9FpR7cmFRw+6/3gwWZXhM/1YnuquN9UtZare02R5hkjJyYiIXbdUG4hYa+XRoTdwUjxuDY/PmG2Ye6gp+3bWy16G3fTpsa2645ZXY0Kv1qt+Zq2MddXc+gRCn2VGVS/d1d4BiX0g7K2827naDfRluurTUpPmm+bWn9y/L1YK9Fa0dYV0OkDF762gVqt4cKwi28mx3q9dYDZyEgkwq475vLLm68X92O7XRB9Ym17r1gOHmxY3K4F7gq4W3DYBt+4m8amz72vdcG46Y0XZPbYG97QvBZp7V1jCyg3jVbs41Z+Dv83Cj1pT1pXQhbSinAv1nbWc/PsaeS7t50R0fXV+p6Db1RuUq+UeIFgGwxd8Tl0qFWY0xZ4rksiSQhHRlrnYZ+YaM6H/Vyh22hpLVc33bVac3oOHYruE3+uIq1C74qltZgXF1v779uaR7wwcPvvz801es7Y69dqDRePLVTjhaJbQ3C7V9o8u9Z7vOeOfd7T083vKcuEcQYK/bBS5FD7LCLci7Wd9ty8LXrfNa0V7LoEbMOpZXGx2YqL+9Td6r/PZXHoUKs4+kTUva/P/bK42HAtuIsr9FbQ7bl2HvZ4TWJysiGKSYsrkK7rx3VDJfV08Vn5rqsryVXiulriaY5vizS7yZJqEu659r779jXXyLZvT34uPc43RKEfVtKKbRlmcuzF2u6m/3heYu8b7BMPm5trbSj1TWOwttbad31xsXVelImJ1kJicrJ1/pWpqdaeOBMTrX24rci7Qjgy0ny9kZHW3iJ797YOMpqYaBbKeF9z3xL3UQNRA+uBA83P0LWAOy0+X76bv3jBtXVrQ5g3bYrC4v53n6i7vvx2BZG79hUobj/8pBpZGyj0w0w/LNksdBLhtBNM9TIj4iALMjdsYaF1Slzb+Oq7ltsIWa+3WuC+75jWas2WtRU5V/RcV8vISCRqbmEyPd3s17YuDLdbpxWp+Hrbtmi9f3/zOmnx+bvj6Wl3rk9w04qwe4+JiWY3mfXZj49H8eI+907L9u3Nbh43H26egeaaQxfdmyn0VaFfDZN5+qazkKaQsf2XXTeB2wvBNkq63QD73f8/abKrnTv9DZubN7e6wGq1hoW4aVMkAD53jk/AXVeEr+0k6SPU9tnaZWqqteCwYyjiYXv3tvYV37u3IVzWurVdHOPxDh+Owvfta05PO5eFTwh9Vrl9fu65PhdKfHHzYhdfI6tbaHVKdy+LrxCLz35ZJaEH8C4A3wBwGsDRTvEzC31IszH2o2Gy2y8U9UravCws+P/Y7lwkvrwsLKRrW+jlN5I0zH/XrtZCZny8tVFtaanRl9v9M7uF29xc673TjG9oNxVB3Cq3Vnha14YbFhf7sbFGvlxxtPF91r9viTeA2gIk/rUmGy9ubecltr72BDfMTV+7xXXPZFlcF49rDHSgMKEHMArgJQBvArAJwFcAXNPunMxC30//axHk2TCZ9oMN/SCLuLq9PlZXk59Dp8E17Rp7u30Ovr7xST1nkkTYFUPXJ2tdKm4NwfX5+56hz/1lJwIDmv3nVoTjYTaeK3BugTA11Xw8PugoblXH42W1tvfta3W5iDQsa2vRX3xxq7XfybXSqZF4ZCS54dVXk8hzcdN++HB1et0A+DkAn4/t3w3g7nbndOW6KdoPnTd5NUxWpbazttY6mZRqelFP87x6/Y34Phq9ttZ5kNnSUiRmrqgvLHSeQCspLIn4c7A1JbfWkDQQyorgzp3R/uKiv1unK/7j481uKTtBl1t4xwsFV3Djz8EnerZQtL5/u3bFN4s17Y4d2Latdd4e1y3Vz2Viovk5dtG9uUihfw+Aj8X2bwPwh+3O6dpHX5QfOm8G0dWwTFjL1QqPFRNfA22vrqlufyNJo11tjxcgeZCZHaQUbyjdurXVnWOv6Vrv3XZRTZpCeHq6NS/uAKx6vTVvvmkIxsejc+LxbEN6Ut
98n1/cbRy2DcRW9EZHG+fbNMZ9551qDbVas/slqbumr2Dcvbt78c7SeJs0PUQGSi/0AJYAbADY2OXOX5GGKgqcj3746Mv+LJIsT9dH7/uocpaPYXT7G2nn/04zyMznVvFZx77+42lnFU377n15sULj+tN9PXF8z9/XWBz3ycdF2O3/b33+8XRbYY+LnnXl+Hz+SfPQ+FwtPtdNvEF0ZKQ5f/ECwW049bUhtFt8Da/uvXvsVBC266aqAudjUNMBlAlfrxbf5FRJvuk8hTBt+myvmzQNwb57j4629oWfmGgVvbRpTPscfHmZmmqdHGzv3qhA6JS3pIF2tt97mnlo3PT4BpONjLR+uHvXrtaJwGyBcOhQtB8Xal8h4Y5kBZqvZ2uavnjxgigu3r7ahe983yjaOF38d4sU+jEALwPYE2uM3dfunKHudUP6Qz9+I2mv2W5OnE5hRf2Oe8nb2lr0Cb1+F94+JicbIm+xg8TcsFqtOcwOBotja19x7IfA4wVRu9qFe/74eLR0uncXpBV6ieLmi4jcDOAjpgfOx1X1d9vFn5mZ0Y2NjdzTQQghPbOyAszOAvPzjbCDB4GpKeCxxxphDzwAfOELwOc+N7CkicgpVZ3pGK8fQp8VCj0hhGQnrdCPDCIxhBBCioNCTwghgUOhJ4SQwKHQE0JI4FDoCSEkcErR60ZEzgP4Zo6XvAzAd3O8XpEwL+UklLyEkg9gOPPyRlWd6hSpFEKfNyKykabLURVgXspJKHkJJR8A89IOum4IISRwKPSEEBI4oQr9g0UnIEeYl3ISSl5CyQfAvCQSpI+eEEJIg1AtekIIIYbKCL2IfFxEXhWR52JhbxWR/yUiXxWRPxeRLSZ8t4j8XxF5xix/HDtnv4l/WkQ+KiJS5ryYY//MHPuaOT5RhrxkfCfvi72PZ0TkH0Xkn5chH13kZVxEjpvwF0Tk7tg57xKRb5i8HB10PrrIyyYR+YQJ/4qIvD12TtG/r50isi4iz5vf/gdN+CUi8riIvGjW2024mHSeFpFnReS62LVuN/FfFJHbB5mPLvMybd7Xj0Tkt51rZf+NpZnLuAwLgF8EcB2A52JhJwH8ktl+P4D7zPbueDznOn8NYA6AAPgLADeVPC9jAJ4F8FazfymA0TLkJUs+nPOuBfBShd/JrwL4tNmeBPB/zG9uFMBLAN6ExrcYril5Xj4A4BNm+3IApwCMlOG9ALgSwHVm+58A+BsA1wBYAXDUhB8FcL/ZvtmkU0y6nzLhlyD6RsYlALab7e0lz8vlAGYB/C6A345dp6vfWGUselX9EoDvO8FvBvAls/04gF9udw0RuRLAFlV9UqOn9jCAW/NOaycy5uWdAJ5V1a+Yc7+nqv9Qhrz08E7+DYBPA5V9JwqgJiJjAC4C8GMArwM4AOC0qr6sqj9GlMdb+p12l4x5uQbAmjnvVQA/ADBThveiqudU9Wmz/bcAXgCwA9EzPW6iHY+l6xYAD2vEkwC2mXz8SwCPq+r3VfU1RPl/1wCzkjkvqvqqqp4E8BPnUl39xioj9Al8DY1M/gqAnbFje0TkyyLyP0TkehO2A8CZWJwzJqwMJOXlzQBURD67W6QpAAADGklEQVQvIk+LyF0mvKx5afdOLP8awKfMdlnzASTn5REAfwfgHIBvAfh9Vf0+onS/Eju/Cnn5CoDDIjImInsA7DfHSvVeRGQ3gLcBeArAFap6zhz6NoArzHbS8y/Ve0mZlyS6ykvVhf79AP6diJxCVB36sQk/B2CXqr4NwJ0A/rvEfN4lJSkvYwB+AcD7zPpficgNxSQxFUn5AACIyEEAf6+qz/lOLhlJeTkA4B8A/AyiT2b+loi8qZgkpiYpLx9HJBYbiL4K91eI8lYaRORiAH8G4DdU9fX4MVPbqEzXwaLyMtaPiw4KVf06ItcGROTNABZM+I8A/MhsnxKRlxBZxmcBXBW7xFUmrHCS8oLoT/glVf2uOfY5RP7X/4YS5qVNPizvRcOaB6r5Tn
4VwF+q6k8AvCoi/xPADCJLK16DKX1eVPUCgN+08UTkrxD5j19DCd6LiIwjEsZPqupnTPB3RORKVT1nXDOvmvCz8D//swDe7oR/sZ/p9pExL0kk5bEtlbboReRysx4B8F8A/LHZnxKRUbP9JgBXA3jZVJFeF5E504Pg1wB8tpDEOyTlBcDnAVwrIpPGJ/xLAJ4va17a5MOGHYHxzwOR7xIlzAfQNi/fAvAOc6yGqOHv64gaPK8WkT0isglRofbooNPto81/ZdLkASJyI4ALqlqK35e570MAXlDVB2KHHgVge87cHkvXowB+zfS+mQPwQ5OPzwN4p4hsN71a3mnCBkYXeUmiu9/YIFuee2y1/hQil8xPEFm5dwD4ICLr428A/B4aA8B+GZFP8hkATwN4d+w6MwCeQ9Ry/Yf2nLLmxcT/tyY/zwFYKUteusjH2wE86blOpd4JgIsB/Kl5J88D+E+x69xs4r8E4D9X4L+yG8A3EDUOfgHRbIileC+IXJWKqNfZM2a5GVHPsycAvGjSfImJLwD+yKT3qwBmYtd6P4DTZvn1At5J1ry8wby71xE1kJ9B1Dje1W+MI2MJISRwKu26IYQQ0hkKPSGEBA6FnhBCAodCTwghgUOhJ4SQwKHQE0JI4FDoCSEkcCj0hBASOP8f7dXzhaQslp4AAAAASUVORK5CYII=\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plt.plot(data['Year'], data['Body_Count'], 'rx')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You may be curious what the arguments we give to plt.plot are for, now is the perfect time to look at the documentation" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.plot?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We immediately note that some films have a lot of deaths, which prevent us seeing the detail of the main body of films. First lets identify the films with the most deaths." ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>Film</th><th>Year</th><th>Body_Count</th><th>MPAA_Rating</th><th>Genre</th><th>Director</th><th>Actors</th><th>Length_Minutes</th><th>IMDB_Rating</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>60</th><td>Dip huet gaai tau</td><td>1990</td><td>214</td><td>NaN</td><td>[Crime, Drama, Thriller]</td><td>[John Woo]</td><td>[Tony Leung Chiu Wai, Jacky Cheung, Waise Lee,...</td><td>136</td><td>7.7</td></tr>\n",
"    <tr><th>117</th><td>Equilibrium</td><td>2002</td><td>236</td><td>R</td><td>[Action, Drama, Sci-Fi, Thriller]</td><td>[Kurt Wimmer]</td><td>[Christian Bale, Dominic Purcell, Sean Bean, C...</td><td>107</td><td>7.6</td></tr>\n",
"    <tr><th>154</th><td>Grindhouse</td><td>2007</td><td>310</td><td>R</td><td>[Action, Horror, Thriller]</td><td>[Robert Rodriguez, Eli Roth, Quentin Tarantino...</td><td>[Kurt Russell, Zoë Bell, Rosario Dawson, Vanes...</td><td>191</td><td>7.7</td></tr>\n",
"    <tr><th>159</th><td>Lat sau san taam</td><td>1992</td><td>307</td><td>R</td><td>[Action, Crime, Drama, Thriller]</td><td>[John Woo]</td><td>[Yun-Fat Chow, Tony Leung Chiu Wai, Teresa Mo,...</td><td>128</td><td>8.0</td></tr>\n",
"    <tr><th>193</th><td>Kingdom of Heaven</td><td>2005</td><td>610</td><td>R</td><td>[Action, Adventure, Drama, History, War]</td><td>[Ridley Scott]</td><td>[Martin Hancock, Michael Sheen, Nathalie Cox, ...</td><td>144</td><td>7.2</td></tr>\n",
"    <tr><th>206</th><td>The Last Samurai</td><td>2003</td><td>558</td><td>R</td><td>[Action, Drama, History, War]</td><td>[Edward Zwick]</td><td>[Ken Watanabe, Tom Cruise, William Atherton, C...</td><td>154</td><td>7.7</td></tr>\n",
"    <tr><th>222</th><td>The Lord of the Rings: The Two Towers</td><td>2002</td><td>468</td><td>PG-13</td><td>[Action, Adventure, Fantasy]</td><td>[Peter Jackson]</td><td>[Bruce Allpress, Sean Astin, John Bach, Sala B...</td><td>179</td><td>8.8</td></tr>\n",
"    <tr><th>223</th><td>The Lord of the Rings: The Return of the King</td><td>2003</td><td>836</td><td>PG-13</td><td>[Action, Adventure, Fantasy]</td><td>[Peter Jackson]</td><td>[Noel Appleby, Alexandra Astin, Sean Astin, Da...</td><td>201</td><td>8.9</td></tr>\n",
"    <tr><th>291</th><td>Rambo</td><td>2008</td><td>247</td><td>R</td><td>[Action, Thriller, War]</td><td>[Sylvester Stallone]</td><td>[Sylvester Stallone, Julie Benz, Matthew Marsd...</td><td>92</td><td>7.1</td></tr>\n",
"    <tr><th>317</th><td>Saving Private Ryan</td><td>1998</td><td>255</td><td>R</td><td>[Action, Drama, War]</td><td>[Steven Spielberg]</td><td>[Tom Hanks, Tom Sizemore, Edward Burns, Barry ...</td><td>169</td><td>8.6</td></tr>\n",
"    <tr><th>349</th><td>Starship Troopers</td><td>1997</td><td>256</td><td>R</td><td>[Action, Sci-Fi]</td><td>[Paul Verhoeven]</td><td>[Casper Van Dien, Dina Meyer, Denise Richards,...</td><td>129</td><td>7.2</td></tr>\n",
"    <tr><th>375</th><td>Titanic</td><td>1997</td><td>307</td><td>PG-13</td><td>[Drama, Romance]</td><td>[James Cameron]</td><td>[Leonardo DiCaprio, Kate Winslet, Billy Zane, ...</td><td>194</td><td>7.7</td></tr>\n",
"    <tr><th>382</th><td>Troy</td><td>2004</td><td>572</td><td>R</td><td>[Adventure, Drama]</td><td>[Wolfgang Petersen]</td><td>[Julian Glover, Brian Cox, Nathan Jones, Adoni...</td><td>163</td><td>7.2</td></tr>\n",
"    <tr><th>406</th><td>We Were Soldiers</td><td>2002</td><td>305</td><td>R</td><td>[Action, Drama, History, War]</td><td>[Randall Wallace]</td><td>[Mel Gibson, Madeleine Stowe, Greg Kinnear, Sa...</td><td>138</td><td>7.1</td></tr>\n",
"  </tbody>\n",
"</table>\n", "</div>
" ], "text/plain": [ " Film Year Body_Count \\\n", "60 Dip huet gaai tau 1990 214 \n", "117 Equilibrium 2002 236 \n", "154 Grindhouse 2007 310 \n", "159 Lat sau san taam 1992 307 \n", "193 Kingdom of Heaven 2005 610 \n", "206 The Last Samurai 2003 558 \n", "222 The Lord of the Rings: The Two Towers 2002 468 \n", "223 The Lord of the Rings: The Return of the King 2003 836 \n", "291 Rambo 2008 247 \n", "317 Saving Private Ryan 1998 255 \n", "349 Starship Troopers 1997 256 \n", "375 Titanic 1997 307 \n", "382 Troy 2004 572 \n", "406 We Were Soldiers 2002 305 \n", "\n", " MPAA_Rating Genre \\\n", "60 NaN [Crime, Drama, Thriller] \n", "117 R [Action, Drama, Sci-Fi, Thriller] \n", "154 R [Action, Horror, Thriller] \n", "159 R [Action, Crime, Drama, Thriller] \n", "193 R [Action, Adventure, Drama, History, War] \n", "206 R [Action, Drama, History, War] \n", "222 PG-13 [Action, Adventure, Fantasy] \n", "223 PG-13 [Action, Adventure, Fantasy] \n", "291 R [Action, Thriller, War] \n", "317 R [Action, Drama, War] \n", "349 R [Action, Sci-Fi] \n", "375 PG-13 [Drama, Romance] \n", "382 R [Adventure, Drama] \n", "406 R [Action, Drama, History, War] \n", "\n", " Director \\\n", "60 [John Woo] \n", "117 [Kurt Wimmer] \n", "154 [Robert Rodriguez, Eli Roth, Quentin Tarantino... \n", "159 [John Woo] \n", "193 [Ridley Scott] \n", "206 [Edward Zwick] \n", "222 [Peter Jackson] \n", "223 [Peter Jackson] \n", "291 [Sylvester Stallone] \n", "317 [Steven Spielberg] \n", "349 [Paul Verhoeven] \n", "375 [James Cameron] \n", "382 [Wolfgang Petersen] \n", "406 [Randall Wallace] \n", "\n", " Actors Length_Minutes \\\n", "60 [Tony Leung Chiu Wai, Jacky Cheung, Waise Lee,... 136 \n", "117 [Christian Bale, Dominic Purcell, Sean Bean, C... 107 \n", "154 [Kurt Russell, Zoë Bell, Rosario Dawson, Vanes... 191 \n", "159 [Yun-Fat Chow, Tony Leung Chiu Wai, Teresa Mo,... 128 \n", "193 [Martin Hancock, Michael Sheen, Nathalie Cox, ... 144 \n", "206 [Ken Watanabe, Tom Cruise, William Atherton, C... 
154 \n", "222 [Bruce Allpress, Sean Astin, John Bach, Sala B... 179 \n", "223 [Noel Appleby, Alexandra Astin, Sean Astin, Da... 201 \n", "291 [Sylvester Stallone, Julie Benz, Matthew Marsd... 92 \n", "317 [Tom Hanks, Tom Sizemore, Edward Burns, Barry ... 169 \n", "349 [Casper Van Dien, Dina Meyer, Denise Richards,... 129 \n", "375 [Leonardo DiCaprio, Kate Winslet, Billy Zane, ... 194 \n", "382 [Julian Glover, Brian Cox, Nathan Jones, Adoni... 163 \n", "406 [Mel Gibson, Madeleine Stowe, Greg Kinnear, Sa... 138 \n", "\n", " IMDB_Rating \n", "60 7.7 \n", "117 7.6 \n", "154 7.7 \n", "159 8.0 \n", "193 7.2 \n", "206 7.7 \n", "222 8.8 \n", "223 8.9 \n", "291 7.1 \n", "317 8.6 \n", "349 7.2 \n", "375 7.7 \n", "382 7.2 \n", "406 7.1 " ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data[data['Body_Count']>200]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we are using the command `data['Body_Count']>200` to index the films in the pandas data frame which have over 200 deaths. The result of this command on its own is a data series of `True` and `False` values. However, when it is passed to the\n", "`data` data frame it returns a new data frame which contains only those\n", "rows for which the data series is `True`. We can also sort the result. To sort\n", "the result by the values in the `Body_Count` column in *descending* order we use\n", "the `sort_values` method, as in the following command." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data[data['Body_Count']>200].sort_values(by='Body_Count', ascending=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now see that 'The Lord of the Rings: The Return of the King' is a large outlier with a very large number of kills.
We can try and determine how much of an outlier by histograming the data.\n", "\n", "### Plotting the Data" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Text(0.5,1,'Histogram of Film Kill Count')" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXoAAAEICAYAAABRSj9aAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvhp/UCwAAFidJREFUeJzt3X+0XWV95/H3p6CoxCEiNoWADdpoF0pFzWJwbKc3ohWxNroWywVDBSqdOFNs1aGrBeeHzji2dJbo4C+WURyxMkbEH1C0tZjxDnUUlSg1/ByiBCHERAUCQeoY+M4fZwcO1yTn3nPvzbn3ue/XWmeds5+999nP/mbnc/Z5zj7npqqQJLXrl0bdAUnS7DLoJalxBr0kNc6gl6TGGfSS1DiDXpIaZ9AvUEluSDI26n6MUpLXJLkjyY4kz5/Cer+V5Ja+6U1JXjo7vXzMdp/e9XW/bno8yR92j89I8tXZ7oPmJ4O+QbsLnolBUFXPqarxAc+zLEkl2X+Wujpq7wLeWFWLquo7E2d2+/5AF647ktwLUFX/UFXPno0OTfy3S3JyknuS/HZV/aDr60NDPO/jk7w9ya3dPm1K8tEky2ay/7vZ7liSO2dzGxrMoNfIzIEXkF8FbhiwzPO6cF1UVYv3Rad2SXI68AHglVX1v6f5dJcBvwf8K+Ag4HnAeuD4aT6v5gGDfoHqP3NMcmySa5Pcl2Rrknd3i13d3d/bndG+KMkvJfkPSW5Psi3Jx5Mc1Pe8p3XzfpLkP07YztuTXJbkE0nuA87otv31JPcm2ZLk/Uke3/d8leSPujPR+5O8I8kzk3yt6++l/ctP2Mfd9jXJAUl2APsB/5jke1Os3R7PUrt9/HS3j/cn2ZDkWUnO7fpwR5LfmcQ23gCcD7y8qr7WtQ31Dqur/8uAVVX1raraWVXbq+oDVXVRt8xhSa5IcneSjUn+dd/6H0vyX/e0/92/8Z8m+W6S7Uk+leQJSQ4E/hY4rO9d0WFT6btmhkEvgAuAC6rqnwHPBC7t2v9ld7+4O6P9OnBGd1sJPANYBLwfIMlRwAeBU4FD6Z05Lp2wrVX0zi4XA5cADwFvAQ4BXkTvDPOPJqzzcuCFwHHAnwFrgN8HjgCeC5yyh/3abV+r6mdVtahb5nlV9cw9l2YorwL+GngK8B3gS/T+ry0F/gvwoQHr/9tuueOr6toZ6M9LgW9W1R17WWYtcCdwGHAS8BdJXjKFbbwWOAE4EvgN4IyqegB4BXBX37uiu4baA02LQd+uz3dnyfd2Y8sf3MuyPwd+LckhVbWjqq7Zy7KnAu+uqu9X1Q7gXODk7izzJOBvquqrVfX/gP8ETPwxpa9X1eer6uGqerCq1lfVNd1Z5iZ6IfjbE9b5b1V1X1XdAFwP/H23/e30zhj39EHq3vo6Wd/uq+N7J7nOP1TVl6pqJ/Bp4GnAeVX1c3qBuizJ3oaBXgZcA2yYQj/35qnAlj3NTHIE8GLgz6vqn6rqOuAjwGlT2MZ7q+quqrob+BvgmOl0WDPLoG/Xq6tq8a4bv3iW3O9M4FnAzUm+leR397LsYcDtfdO3A/sDS7p5j5w1VtVPgZ9MWP8xZ5XdsMaVSX7YDef8Bb2z+35b+x
4/uJvpReze3vo6WS/oq+OfTHKdif37cd8HqA9293vqM/TO6J8FfCRJptDXPfkJvXdYe3IYcHdV3d/Xdju/+G5sb37Y9/in7H3/tI8Z9KKqbq2qU4BfBv4KuKwbX93dT5veRe9DzF2eDuykF25bgMN3zUjyRHpnk4/Z3ITpC4GbgeXd0NFbgZkIt0F9ncu20hvC+i32/k5ssr4MHJvk8D3Mvws4OMmT+9qeDmzuHj8APKlv3q9MYdv+PO4cYNCLJL+f5GlV9TBwb9f8MPCj7v4ZfYt/EnhLkiOTLKJ3Bv6pbpjiMuBVSf5F9wHp2xkc2k8G7gN2JPl1emezM2VvfZ3TurHs44ETkrxnms/1ZeAq4HNJXphk/yRPTvJvkry+G7v/GvCX3Yeov0HvXd4nuqe4DjgxycFJfgV48xQ2vxV4av8H9tr3DHpB70O0G7orUS4ATu7Gz38KvBP4P90Y9XHAR+l90Hg1cBvwT8AfA3Rj6H9Mbxx6C7AD2Ab8bC/b/lN6l/zdD3wY+NQM7tce+zofVNUPgJcAJyX5y2k+3UnAF+nVdzu9zzpW0Dvbh94H2svond1/Dnhb9wIBvRr+I7AJ+Hum8G9UVTfTe8H9fncMedXNCMQ/PKLZ0p1F30tvWOa2UfdHWqg8o9eMSvKqJE/qxvjfRe/KkU2j7ZW0sBn0mmmr6L39vwtYTm8YyLeN0gg5dCNJjfOMXpIaN+oflQLgkEMOqWXLlg217gMPPMCBBx44sx1qjDUazBoNZo0G29c1Wr9+/Y+r6mmDlpsTQb9s2TKuvXa4n/QYHx9nbGxsZjvUGGs0mDUazBoNtq9rlOT2wUs5dCNJzTPoJalxBr0kNc6gl6TGGfSS1DiDXpIaZ9BLUuMMeklqnEEvSY2bE9+MnY4Nm7dzxjlfGHr9Tee9cgZ7I0lzj2f0ktQ4g16SGmfQS1LjDHpJapxBL0mNM+glqXEGvSQ1zqCXpMYZ9JLUuIFBn+SIJF9JcmOSG5K8qWt/e5LNSa7rbif2rXNuko1Jbkny8tncAUnS3k3mJxB2AmdX1beTPBlYn+Sqbt57qupd/QsnOQo4GXgOcBjw5STPqqqHZrLjkqTJGXhGX1Vbqurb3eP7gZuApXtZZRWwtqp+VlW3ARuBY2eis5KkqUtVTX7hZBlwNfBc4N8BZwD3AdfSO+u/J8n7gWuq6hPdOhcBf1tVl014rtXAaoAlS5a8cO3atUPtwLa7t7P1waFWBeDopQcNv/I8sWPHDhYtWjTqbsxp1mgwazTYvq7RypUr11fVikHLTfrXK5MsAj4DvLmq7ktyIfAOoLr784HXT/b5qmoNsAZgxYoVNTY2NtlVH+N9l1zO+RuG/xHOTacOt935ZHx8nGHru1BYo8Gs0WBztUaTuuomyePohfwlVfVZgKraWlUPVdXDwId5dHhmM3BE3+qHd22SpBGYzFU3AS4Cbqqqd/e1H9q32GuA67vHVwAnJzkgyZHAcuCbM9dlSdJUTGbM48XA64ANSa7r2t4KnJLkGHpDN5uANwBU1Q1JLgVupHfFzllecSNJozMw6Kvqq0B2M+uLe1nnncA7p9EvSdIM8ZuxktQ4g16SGmfQS1LjDHpJapxBL0mNM+glqXEGvSQ1zqCXpMYZ9JLUOINekhpn0EtS4wx6SWqcQS9JjTPoJalxBr0kNc6gl6TGGfSS1DiDXpIaZ9BLUuMMeklqnEEvSY0z6CWpcQa9JDXOoJekxhn0ktQ4g16SGmfQS1LjDHpJapxBL0mNM+glqXEDgz7JEUm+kuTGJDckeVPXfnCSq5Lc2t0/pWtPkvcm2Zjku0leMNs7IUnas8mc0e8Ezq6qo4DjgLOSHAWcA6yrquXAum4a4BXA8u62GrhwxnstSZq0gUFfVVuq6tvd4/uBm4ClwCrg4m6xi4FXd49XAR+vnmuAxUkOnfGeS5ImJVU1+YWTZcDVwHOBH1TV4q49wD1VtTjJlcB5VfXVbt464M+r6toJz7Wa3h
k/S5YseeHatWuH2oFtd29n64NDrQrA0UsPGn7leWLHjh0sWrRo1N2Y06zRYNZosH1do5UrV66vqhWDltt/sk+YZBHwGeDNVXVfL9t7qqqSTP4Vo7fOGmANwIoVK2psbGwqqz/ifZdczvkbJr0bv2DTqcNtdz4ZHx9n2PouFNZoMGs02Fyt0aSuuknyOHohf0lVfbZr3rprSKa739a1bwaO6Fv98K5NkjQCk7nqJsBFwE1V9e6+WVcAp3ePTwcu72s/rbv65jhge1VtmcE+S5KmYDJjHi8GXgdsSHJd1/ZW4Dzg0iRnArcDr+3mfRE4EdgI/BT4gxntsSRpSgYGffehavYw+/jdLF/AWdPslyRphvjNWElqnEEvSY0z6CWpcQa9JDXOoJekxhn0ktQ4g16SGmfQS1LjDHpJapxBL0mNM+glqXEGvSQ1zqCXpMYZ9JLUOINekhpn0EtS4wx6SWqcQS9JjTPoJalxBr0kNc6gl6TGGfSS1DiDXpIaZ9BLUuMMeklqnEEvSY0z6CWpcQa9JDXOoJekxg0M+iQfTbItyfV9bW9PsjnJdd3txL555ybZmOSWJC+frY5LkiZnMmf0HwNO2E37e6rqmO72RYAkRwEnA8/p1vlgkv1mqrOSpKkbGPRVdTVw9ySfbxWwtqp+VlW3ARuBY6fRP0nSNO0/jXXfmOQ04Frg7Kq6B1gKXNO3zJ1d2y9IshpYDbBkyRLGx8eH6sSSJ8LZR+8cal1g6O3OJzt27FgQ+zkd1mgwazTYXK3RsEF/IfAOoLr784HXT+UJqmoNsAZgxYoVNTY2NlRH3nfJ5Zy/YfjXq02nDrfd+WR8fJxh67tQWKPBrNFgc7VGQ111U1Vbq+qhqnoY+DCPDs9sBo7oW/Twrk2SNCJDBX2SQ/smXwPsuiLnCuDkJAckORJYDnxzel2UJE3HwDGPJJ8ExoBDktwJvA0YS3IMvaGbTcAbAKrqhiSXAjcCO4Gzquqh2em6JGkyBgZ9VZ2ym+aL9rL8O4F3TqdTkqSZ4zdjJalxBr0kNc6gl6TGGfSS1DiDXpIaZ9BLUuMMeklqnEEvSY0z6CWpcQa9JDXOoJekxhn0ktQ4g16SGmfQS1LjDHpJapxBL0mNM+glqXEGvSQ1zqCXpMYZ9JLUOINekhpn0EtS4wx6SWqcQS9JjTPoJalxBr0kNc6gl6TGGfSS1DiDXpIaZ9BLUuMGBn2SjybZluT6vraDk1yV5Nbu/ilde5K8N8nGJN9N8oLZ7LwkabDJnNF/DDhhQts5wLqqWg6s66YBXgEs726rgQtnppuSpGENDPqquhq4e0LzKuDi7vHFwKv72j9ePdcAi5McOlOdlSRN3f5DrrekqrZ0j38ILOkeLwXu6Fvuzq5tCxMkWU3vrJ8lS5YwPj4+XEeeCGcfvXOodYGhtzuf7NixY0Hs53RYo8Gs0WBztUbDBv0jqqqS1BDrrQHWAKxYsaLGxsaG2v77Lrmc8zcMvxubTh1uu/PJ+Pg4w9Z3obBGg1mjweZqjYa96mbrriGZ7n5b174ZOKJvucO7NknSiAwb9FcAp3ePTwcu72s/rbv65jhge98QjyRpBAaOeST5JDAGHJLkTuBtwHnApUnOBG4HXtst/kXgRGAj8FPgD2ahz5KkKRgY9FV1yh5mHb+bZQs4a7qdkiTNHL8ZK0mNM+glqXEGvSQ1zqCXpMYZ9JLUOINekhpn0EtS4wx6SWqcQS9JjTPoJalxBr0kNc6gl6TGGfSS1Lhp/4Wp+W7ZOV8Yet1N571yBnsiSbPDM3pJapxBL0mNM+glqXEGvSQ1zqCXpMYZ9JLUOINekhpn0EtS4wx6SWqcQS9JjTPoJalxBr0kNc6gl6TGGfSS1DiDXpIaZ9BLUuOm9YdHkmwC7gceAnZW1YokBwOfApYBm4DXVtU90+umJGlYM3FGv7KqjqmqFd30OcC6qloOrOumJUkjMhtDN6uAi7vHFwOvnoVtSJImKVU1/MrJbcA9QA
Efqqo1Se6tqsXd/AD37JqesO5qYDXAkiVLXrh27dqh+rDt7u1sfXDYPZieo5ceNJoNT9GOHTtYtGjRqLsxp1mjwazRYPu6RitXrlzfN5qyR9P94+C/WVWbk/wycFWSm/tnVlUl2e0rSVWtAdYArFixosbGxobqwPsuuZzzN4zmb5xvOnVsJNudqvHxcYat70JhjQazRoPN1RpNa+imqjZ399uAzwHHAluTHArQ3W+bbiclScMbOuiTHJjkybseA78DXA9cAZzeLXY6cPl0OylJGt50xjyWAJ/rDcOzP/A/q+rvknwLuDTJmcDtwGun301J0rCGDvqq+j7wvN20/wQ4fjqdkiTNHL8ZK0mNM+glqXGjuS6xEcvO+cLQ624675Uz2BNJ2jPP6CWpcQa9JDXOoJekxhn0ktQ4g16SGmfQS1LjDHpJapxBL0mNM+glqXEGvSQ1zqCXpMYZ9JLUOINekhpn0EtS4wx6SWqcQS9JjTPoJalxBr0kNc4/JTgi/hlCSfuKZ/SS1DiDXpIaZ9BLUuMMeklqnB/GzkNT/SD37KN3ckbfOn6YKy0sntFLUuMMeklqnEEvSY2btTH6JCcAFwD7AR+pqvNma1uamlF9WcsviUmjMStBn2Q/4APAy4A7gW8luaKqbpyN7WnfmU5Ya9/xRVX9ZuuM/lhgY1V9HyDJWmAVYNBrQZmPgbunPk+8emtPFuILxa6aTbZG/fZFvVJVM/+kyUnACVX1h93064B/XlVv7FtmNbC6m3w2cMuQmzsE+PE0ursQWKPBrNFg1miwfV2jX62qpw1aaGTX0VfVGmDNdJ8nybVVtWIGutQsazSYNRrMGg02V2s0W1fdbAaO6Js+vGuTJO1jsxX03wKWJzkyyeOBk4ErZmlbkqS9mJWhm6rameSNwJfoXV750aq6YTa2xQwM/ywA1mgwazSYNRpsTtZoVj6MlSTNHX4zVpIaZ9BLUuPmddAnOSHJLUk2Jjln1P0ZlSRHJPlKkhuT3JDkTV37wUmuSnJrd/+Urj1J3tvV7btJXjDaPdg3kuyX5DtJruymj0zyja4On+ouHCDJAd30xm7+slH2e19JsjjJZUluTnJTkhd5DD1Wkrd0/8euT/LJJE+YD8fRvA36vp9ZeAVwFHBKkqNG26uR2QmcXVVHAccBZ3W1OAdYV1XLgXXdNPRqtry7rQYu3PddHok3ATf1Tf8V8J6q+jXgHuDMrv1M4J6u/T3dcgvBBcDfVdWvA8+jVyuPoU6SpcCfACuq6rn0LjQ5mflwHFXVvLwBLwK+1Dd9LnDuqPs1F27A5fR+Z+gW4NCu7VDglu7xh4BT+pZ/ZLlWb/S+y7EOeAlwJRB632Dcf+LxRO9qsRd1j/fvlsuo92GW63MQcNvE/fQYekwtlgJ3AAd3x8WVwMvnw3E0b8/oebTou9zZtS1o3dvD5wPfAJZU1ZZu1g+BJd3jhVi7/w78GfBwN/1U4N6q2tlN99fgkfp087d3y7fsSOBHwP/ohrc+kuRAPIYeUVWbgXcBPwC20Dsu1jMPjqP5HPSaIMki4DPAm6vqvv551TutWJDX0ib5XWBbVa0fdV/msP2BFwAXVtXzgQd4dJgGWNjHEED3+cQqei+KhwEHAieMtFOTNJ+D3p9Z6JPkcfRC/pKq+mzXvDXJod38Q4FtXftCq92Lgd9LsglYS2/45gJgcZJdXxrsr8Ej9enmHwT8ZF92eATuBO6sqm9005fRC36PoUe9FLitqn5UVT8HPkvv2Jrzx9F8Dnp/ZqGTJMBFwE1V9e6+WVcAp3ePT6c3dr+r/bTuyonjgO19b8+bU1XnVtXhVbWM3nHyv6rqVOArwEndYhPrs6tuJ3XLN30mW1U/BO5I8uyu6Xh6PyvuMfSoHwDHJXlS939uV43m/nE06g84pvnhyInA/wW+B/z7UfdnhHX4TXpvqb8LXNfdTqQ3HrgOuBX4MnBwt3zoXbH0PWADvasIRr4f+6hWY8CV3eNnAN8ENg
KfBg7o2p/QTW/s5j9j1P3eR7U5Bri2O44+DzzFY+gXavSfgZuB64G/Bg6YD8eRP4EgSY2bz0M3kqRJMOglqXEGvSQ1zqCXpMYZ9JLUOINekhpn0EtS4/4/7POM/KRRGZkAAAAASUVORK5CYII=\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "data['Body_Count'].hist(bins=20) # histogram the data with 20 bins.\n", "plt.title('Histogram of Film Kill Count')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise\n", "\n", "Read on the internet about the following python\n", "libraries: `numpy`, `matplotlib`, `scipy` and `pandas`. What functionality does\n", "each provide python. What is the `pylab` library and how does it relate to the\n", "other libraries?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We could try and remove these outliers, but another approach would be plot the logarithm of the counts against the year." ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Text(0.5,0,'year')" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYgAAAEWCAYAAAB8LwAVAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvhp/UCwAAIABJREFUeJztnXucHVWV738rb9OSFiUyih0CE0xCBhXIo8GL2ldxSLoTwUeID3zg3NYevcNccPIBnaiD4yjR5s7MRRBGcfTjaIgPGL0XRbEb8T2d8BJEHmEcHqKgQNDxhc66f+wquk6dXefsqtr1Ouf3/Xzqc/rsU7Vr76rqvWo99tqiqiCEEELizKm6AYQQQuoJBQQhhBArFBCEEEKsUEAQQgixQgFBCCHECgUEIYQQKxQQJDciskxEfikic4Pv14jIn1XdrqyIyI9E5EVVtyMJEXm1iHyl6naQ3ocCgjgTDJy/DoRBuD1dVe9W1Seq6h8KOOc1IvIbEfmFiDwqIntF5GwRWeip/n8Wkb/1UVdZqOq/qOqL89YjIioiKxJ+WyMi+0XkmbHyr4nI+/OemzQDCgiSls2BMAi3H5dwzreq6gEAngbgLADbAFwpIlLCufsSVb0FwAcBfDS8ziLyRgCHAHi3z3OJyDyf9RF/UECQ3IjI8uBttO0fXUReLyLfEpH/LSKPiMhdInJ8UH6PiDwgIq9zOY+q/qeqXgNgC4DjAIwG55gTaBX7ROTnIrJbRJ4cacNnROQnwRvxtSKyJigfB/BqANsDbeiLkdM9R0RuCo65TEQWBcccJCL/N+jLQyLyDRGx/h+JyD8EfQw1nxMivz1BRD4uIg+LyK0isl1E7o38HvbnFyLyAxE5JXZNvxn5riLyZhG5I2jXhyKD+goR+XrQj5+JyGVB+bXB4TcGfT/V0oX3AzgAwJ+LyMEAzgNwuqr+JqjjSBG5OrgOPxSRl0XatEVEbgj6freI7Ij8tiJo8xtE5G4ANJfVFAoIUgYbANwE4CkAPgVgF4B1AFYAeA2AC0Tkia6VqerdAPYACAfc/wn
gZADPB/B0AA8D+FDkkC8BOALAUwFcB+BfgnouCf7eGWhDmyPHbAVwEoDDADwLwOuD8rMA3AtgKYCDAbwdQFK+mhkAzwHw5KDfnwkFDYB3AVgO4HAAJwbXIcq+oH+DAP4GwCdF5GmJFwUYg7mmzwra/qdB+XtgBuADATwDwP8J+v684PdnB32/LF6hqj4G4A1BHZ8E8ElV/TYABPfrqwA+AXNdXw3gEhFZGRz+y6DsSQA2AzhDRMZip3gegFUIBD2pHxQQJC1XBG+pj4jIFY7H/LuqfizwUVwGYAjAuar6W1X9CoDfwQiLNPwYZuAFgDcDeIeq3quqv4Uxgbw81GhU9VJV/UXkt2eLyGCX+v9RVX+sqg8B+CLMQA8Aj8GYug5V1cdU9RuakNBMVT+pqj9X1d+r6iSAhQDCAXQrgL9T1YdV9V4A/xg79jPB+f8rGLzvALC+Q3vfr6qPBMJzOtbeQwE8XVV/o6rfTKzB3ofrAXwUwGoYYRjyEgC3q+ongv7tBXAFgJcHx02p6i1B+2+EeSl4fqz6d6nqr1T112naRMqDAoKk5WRVfVKwnex4zE8jf/8aAFQ1XuasQQQcAuCh4O9DAVweCi4AtwL4A4CDRWSuiLw/MNc8CuBHwTEHdan/J5G/fxVp3wcA3AngK4G57OykCkTkbYH5aH/QrsHIeZ8O4J7I7vfEjn1tYKIJ+/QnXdqc1N7tAATAv4nILSJyeoc6krgFwI9U9VeRskMBPDfysvAIgFNhhCdE5DgxAQYPish+AH9maf89ILWGAoI0DhEZAnAsgG8ERfcA2BgRXE9S1UWqeh+AV8G87b4IZoBeHlYTfKZKZxxoImep6uEwvpAzReSFljaeADM4bwVwoKo+CcD+yHnvhzH5hAxFjj0UwD8BeCuApwTH3hw5Nk17f6Kq/0NVnw7gTQAulITIpZTcA+BrsWv+RFV9a/D7LgCfAzCkqoMAPhJvf5LmReoDBQRpDCKyWESeD+BfAfwbgCuDnz4M4L3BwAoRWSoiLwl+OwDAbwH8HMBiAH8Xq/anMH4A1zaMBU5WgRnw/wDgvyy7HgDg9wAeBDBPRN4JYEnk990AzhGRA0XkEBhhEDIAI7geDM75BhgNIjUi8goRCQXRw0G9YXtT9T3GFwCsEZFXicj8YFsf8UEcAOAhVf2NiAzDRJ6RhkEBQZrABSLyC5gB7e9h3kxPUtVwoPsHmAHrK8F+34VxjAPGifofAO4D8IPgtygfBXBkCp/KEQCuhnHCfgfAhao6bdnvKgBfBnB7cP7foNWkci6Ms/vfg/o+CyPIoKo/ADAZ1P9TAEcB+JZD22ysA/A9EfklzDU6Q1XvCn57N4CPB33fmqZSVd0P4wh/DYw29BMA74PxswDABID3Bffj7TACkTQMoZZHSPWIyASAbaoad+QSUhnUIAipABF5mog8V8wcjpUw4bOXV90uQqJwBiMh1bAAwMUw8ywegXHqXlhpiwiJQRMTIYQQKzQxEUIIsdJoE9NBBx2ky5cvr7oZhBDSKPbu3fszVV3abb/aCAgRWQ3gDJjZll9T1Yu6HbN8+XLs2bOn8LYRQkgvISL/4bJfoSYmEblUTLbOm2PlJ4nIbSJyZ5iqQFVvVdU3w8w8fW6R7SKEENKdon0Q/wyTEfNxxKw69iEAGwEcCeCVInJk8NsWAP8PszNkCSGEVEShAkJVr8VsQrWQ9QDuVNW7VPV3MOF9Lwn2/4KqboRJE2xFRMZFZI+I7HnwwQeLajohhPQ9VfggDkFryoF7AWwQkRcAeCnMVP1EDSLI4X8JAKxdu5YxuoQQUhC1cVIHK4VdU3EzCCGEBFQxD+I+RFIbw6Q8vq+CdhBCSHp27gSmY/kZp6dNeY9RhYCYAXCEiBwmIgtg0gB/IU0FIrJZRC7Zv39/IQ0khJBE1q0Dtm6dFRLT0+b7unXVtqsAig5z/TRMyuKVInKviLxRVX8Pk/v+KpiVv3ar6i1p6lXVL6rq+OBgt1UjCSH
EMyMjwO7dRii8853mc/duU95jFOqDUNVXJpRfCYayEkKaysgIMDEBvOc9wI4dPSkcAOZiIoSQ9ExPAxddZITDRRe1+yR6hEYKCPogCCGVEfocdu8Gzj131tzUg0KikQKCPghCSGXMzLT6HEKfxMxMte0qgEavB7F27Vplsj5CCEmHiOxV1bXd9mukBkEIIaR4Gikg6IMghJDiaaSAoA+CEEKKp5ECghBCak8PpOSggCCEkCLogZQcFBCEEJIGm2bwpjeZLc5LX9rolByNFBB0UhNCKsOmGezaBVx2Wbu2sG3bbEqOiYlGCQeA8yAIISQ9oQCYmDCpNnbvNuUuZTUQEq7zIGqzYBAhhDSGpGR90TKg1aw0MtI4MxMFBCGEpCWerC8c8KNlP/1pckoOCghCCOlBosn6Qs3g5JMBEeDyy1u1hW3bWo8Nf2sIjXRSE0JIKdgilnbtMtFJUc1g2zbg1FN7LoFfIzUIEdkMYPOKFSuqbgohpJcJI5ZCbWF6Gvj852cd0CEXX9x+bMO0BRuN1CCYaoMQUgp9tLyojUYKCEIIKY1oxFID5zLkgQKCEEI60SfLi9qggCCEkCT6aHlRGxQQhBCSRB8tL2qDqTYIIaTP6OklR5msjxBCiqeRAoJhroQQUjyNFBCEEEKKhwKCEEKIFQoIQgghViggCCGEWKGAIIT4xZYBdXralDftPL1yjoxQQBBC/GJbs3nrVlPetPP0yjmyoqqN3Y499lglhNSQqSnVgw5S3bHDfE5NNfc8vXKOCAD2qMMYW/kgn2ejgCCkxuzYYYaYHTuaf55eOUeAq4BopImJM6kJqTllZUAt4zy9co4suEiRum7UIAipIaG5JDSTxL836Ty9co4Y6GUNghBSY8rKgPqBDwDnnNN6nuFh4OyzW/fLExFURl9qnDGW2VwJIc0kulZDuF70KacAqsAVV8yWNXGZ0J07TRRTtM3T00ZobN+eu/qezuZKCCHW9aIvv9wIh6avIV2T0FcKCEJIc7GtF+1zDemqJrHZhF8Fgo4CghDSXGzRPz4jgqp8k/cp6LLi4smu68YoJkL6GFv0z+Cg6pIlfiOCSp7EVsZ5wSgmQohX6pYzyBb9c+qpwLZtfiOCqniTjzrXzz131txU8vwIRjERQtywRQ011QmchrCfExPGZFVGf2sSxUQBQUg/kXfgqWKwrJIeFYoMcyWEtJPX6VoHx2mZ1HgSWxlQQBDSNPL4AvKGT9Y1Z1BRbN/efm1GRtq1Ldd7Ujc/TjdcPNl12wBsBnDJihUrfDn1CWkOPnL3ZMkcajvv4sWqk5Pt+513nnu9cc47r70veessGtd7UkHeJRtgum9CCqTqQSxPCGTWY219npxUHRgoJqy04kE0Na7Xtaqw2QgUEIQUSR0GMV9agO95AuPj+YVnDQbRTLjekxLXfrBBAUFIlCLe+KscxHxqAT40n+iA50sIVTyIpoYaRL02CgjiTFFv/FUMYnXQXmztiQ54eQfBGgyiqaAPon4bBQRJhe9Bp6pBrGr/R/y8SQNeVuHpexAt43q5nqMm944CghAbvt74a/ImWDlJA974eHbh6XsQdb1XtvOOj5vNV1tsVCA0KCAIiePzjb8mb4KFtSdPfXUUni733tbuIpL/uZy3JkuOVj7I59koIHqEMgbbOg5aPvHdvzz11U14hrhoj0X4U1wo2VxJAUHyU9Y/um0wGhjwOwGrroOWT3rFx1IEafpiEyTRsqJMUSUGPFBAkPyU+dYd/weenOztN/6i8D3INC3U1Eaa59hFg7A9m3lNUdQgKCAaSZkPbnww6qU32DKgBmHHVXtM44MIhUR0cmC8bHKytj4bCgjij25vkT7MN0mDUS+8wZaBb5/B5KTJs9SrGpyrmaiT6cg2OfC000zZaafV2mdDAUH8kDX6I4t63e0trVcGpyLwHXXk2weUF9+D6Pi40Q7i2kJcGCQR/78YH1edmFAVUT3hBPM5MeHWPgoICohGktd260qd3mC
b4MwuIqQ1q3mkLGzPYp5MslNTxnQ0OGj6PDioumiRW31JAlVE9cQTzbB64onme7w+177RxEQBUXvSDkQ+zUFVDdQV/LOmxvcbfx7ziCtFmCHzBjJMTak+4Qmmz094gnt9SeapDRtaNYjhYTeT1fi46thYqQKaAoKUS684NFWb0Rffg+XkZOvg5vLmm6W9eQVvt0CGNJlkQy0EmNVWs957m5ANNZRuTu+wrEgBHYMCguQnT/RHXQdWV5rgHPcV9VWGBpGnfd2O75ZJ1maKmpw0JqUlS8xx0cE8y71PMtPZUo7Y+lG0gI5BAUHy4zrwN8Fun4YmahC2wdKVjRuNQzVa38SEKc/Cxo32AXnjRv8J/GyBDC7a1cKF7QsdLVmiOjqaTSPp9L/SbeJdWQI6AgUE8UMTBkufNEEbSjNYujA6aoaCcFCfnDTfR0eztS98G47WF0b1+A5kSFrNLospyhYYYTMTpfFL2HwLca0izxyKjDROQAA4GcA/AbgMwItdjqGAKIlecD67Uvf2qdrf0CcmVOfPzybYxsfNQBs1twwMuId82oibTELh4FPwps0km2U+j22QHxtzS6uRJCjjQqiMhIAxaiEgAFwK4AEAN8fKTwJwG4A7AZwd++1AAB91qZ8CogR8axBNeEOvO7ZrmHfeQtxh62PZ0BNOMPWdcEK1eb3yaFc284/NPGWbQ2HzS9iEy/r1pjxKaJIriLoIiOcBOCYqIADMBbAPwOEAFgC4EcCRkd8nARzjUj8FRMEUNZj3m9mqCIoQ3FlCPpMo2en6OGlNUS7YtKH4HIqoBhCnW6K/vO3LQC0EhGkHlscExHEArop8PyfYBMB5AF7Upb5xAHsA7Fm2bFkxV48Y8rz1dTu2CVFCdcfn4ke2Ac+Wb8jleUgyrZQlJOLkeY7DGddRDWJw0LzxRwVq0mAeF+Sjo+3mpCVL2s1YBb801VlAvBzARyLfTwNwAYC/ALAXwIcBvNmlbmoQNaaT9kENIj8uTte0g6DNZNIthNR2/zpFMTWNJP/M6Gj7HIo4tusVHu8rvDYjjRMQWeqmgKg5NkFAH0R+bNcwj6MzjbO334S7TbtavDh5kI+ycqUxR0WZmFAdGvI3QS8jdRYQVhNTyjo3A7hkxYoV/q8c8Uv8ragJUUJFk/caFLEOdJxOgryfzIPhtY72eXTUTLKLawZx57PN1AaobtnSap6amKAPIvJ9HoC7ABwWcVKvyVI3NYia029vm64UqUX5GrzLEEJpzx0XoGVHRmUx58Ud3Fu2mPszMGDqGxgw3+OaRsEvTbUQEAA+DeB+AI8BuBfAG4PyTQBuD6KZ3pG1fgqIGkNTUmeKEJ5FC+TxcbuDNc98CRuuz06eZ8x3GplO9UXDfdevb0/xsWiRKS+RWgiIojcKiBrTS6akovri01RThkDOu4ZCGlyFXVah6NvhnlRffI2IZcuM3yEaFbV4cbuAKPj/p6cFBH0QpFSKGBh9v+1XZW4pUiN0FaDd5hkkXQeXvqQRvPH6QuEQ90EsWDCrUYTmpnh9BWtrPS0gwo0aBCkFWyRLp4lRLvU12fxWtJ8jLM+iQaSd4OfSlzRCMVpfUkqUuXPNbGzAfNoWKuqUbdYDFBCE+GRqym1ilAtp3nLrZqrzqUEkCUrbIJ+Ustu26qBrWo2sA3+3/nRyZo+PmyioeBRTUj+6zbXICAUEIT4JB6kC/lm7njevk7TstmSpMzpIu6aj6JR/qtuAnsd05GqKWrSofeAfGDBJFbvNWp+a8vtSEqOnBQR9EKRUQp9DQeq+0/mzDFC2AdS25KWrICkq/NTVZOXTcZ0nisn1utrMRAsWJCf6i89a92nWjNHTAiLcqEEUTN3MG1Xh6jAs8nplMXHYTDWu6xvkwffbeRSfmoELtns6NtY+US4paCGueSbNoYjPLxkbKzRijAKC5KfpzlRflBkzbyOPrdx2rE8/QhJZtZ6yNIM8pHm7dzET2a6DzefisR8UEMQPZQwmvYTvgdH
H27jtrbuMSKQsC/QkDYJ1e1mJD/w2zWB0tN3fMH9++0p9ecx+GfEqIAA8F8BA8PdrAJwP4FCXY4vcKCBKop9y7/jAZ+hkXu3FZc3mPINsmvO6krSyW8mDaCLhQkBR05EtEmnOHLUu5SpSrblS/QuIm4L1Gp4N4HoAbwHwdZdji9woIGIU8VBRg0hHHnNQHpIif3yusdzt7b6T7yPNindFmeq64SqYJiZmBcOOHebTtuZ2mBI8qkEsXGi0iHiqDVsIb4EJ/HwLiOuCz3dG8ild53JsERujmBLwrYbXTa2vOz7MQT7J8yae9t67rJpmi/XP428o43m3LS86f76Z7BZPAR6PRAr3j/sgQh9DKGSSJvfl0cK64FtAfD1Iy307gD8CMAfA912OLXKjBmHB58DDKKZ0+HZmV43rs+Syn2195snJzs+SSwoN39llbdpQ3CEdvvHHw1LjbRkaMtlbo8JgyxZTHvdfJF2bgsy7vgXEHwE4E8AJwfdlAF7rcmyRGwVEAvQZ1JsmCV5fYaVheTRJXRqh0ymFhu/nPS6YJifb1+uOCwNb+0IfxPz5Zr/58/VxH0RU4ISLD8WvTVM0iLpuFBAW6DMgvnDVDFyFXXxthKQ1qqtyuNv6HGoQ0dxJoV+hm49laMgcA7T+PWdO67GDg6obNrRem9Cf0RAfxEsB3AFgP4BHAfwCwKMuxxa5UUDEaIrpgtQf347iNBqEa+hsGT6I0Ikczb46f76bwz1czzoUDOGxY2Ot+4X+mXgK8BrMg5gDN3YC2KKqg6q6RFUPUNUljsd6R0Q2i8gl+/fvr6oJ9WRmBti9GxgZMd9HRsz3mRl/59i5E5iebi2bnjblxD95rvemTcD557eWnX++Ke+G67O0bh2wdetsG6engbExYN681v127QI2bAC+9CVgxw7zec456Z7NffuAiy4yx190kanT5/Nu6/MJJ5i/TzsN+MY3zOfcucDUVPf6Lr4YeOELW8v++I+BM89sLbvsMuBZz2q9Nq97HXD11a37jYwA27dn61tWXKQIgG+57Ff2Rg2iAqillEue621bE7mTaScLNuezbY1lW+K6pPUNbH2Oz1Yu67lL6p9LRFYYDhv6IsLPeJqOcH2I+H2KaxoegQ8TE4xp6aUA/gHAZQBeGSl7qcsJitwoICqCfo58pE33nTb6J4qr3T9ru8N4/W4O1rS5heLPWJp1oH2Sxh8SZ3CwPRx27lxzveLXJj5fIgybLQhfAuJjHbZLXU5Q5EYBUQHhIBG3BdcxAqeupFmhLm30j43omsh5SBos48tqhkLIJS9UJ7JEJ5U5eS7aPtv8kuHhZM3AZb5EgXgREI/vBDzXpazsjQKiAkLTQEFpiPuCtNcwjxYQP3ZsLN8AmhR+6hqi6TK3Ic2Kcknty2KKSuNwj9+/MFQ1et758828h7h5av16ezSWS9bXmqbaaJs1bSsre6OAqICoPbiKtRF6Bdc3RlcNwja4hTbw+Bts3hQOtnkC3XwQNqHRbQZx1jZmFS6uml3S/0BS/6L3zpbqJCpgwvpsAqduYa4AjgNwFoB7YCbKhdu7AdzocoIiNwqICuh1E1NZk9jib4xJ//iuPgjbm/OcOeYNNsrEhInJzzKAugoD1+VBk96wwz7Gj9+4sXsbw3Wgo8/n2Jh5a49fL1tYqosvYOVK087oOSYmTHncdBTX4EZH26/32JiZXxG9NnPnqq5e3X5tXK6BA74ExPMBvAvA/cFnuJ0J4AiXExS5UUBURC87qcuI0kqjhbm2J220TRb7/thYe1QUYGzt8Ta7rDIXajgnnjj7mXeSWFRL2rFjdh5CPHJo0SL7fIRFi2ajihYssN+XsTHze/wcw8Nu5rc02l/82niKQPNtYjrUZb+yNjBZX3WUMYBWTdEC0HWFOtX8E9F8zj4OJ37Fbe95om3CgTCcaTwx0dqfLD6IgYHWt/YwxDauGdiy2obtCTfbgBwKxujkuVBgxDWpeIZX1wi0884z2lX02mz
ZUlsfxFIAHwBwJYCpcHM5tsiNGkQFNCmPUB6KzGdV1DVMcmb7nH3s4jtxzSIbmlaOOsrUd9RRrYLS1aFtO3c0cmtqytRpy6oaH7wHB7trEKG2Nm+e2W/ePNMX25rUWf09o6Om7sMOa/2MLzaUEd8C4isA3gjg1sDsdCmA81yOLXKjgCCFUHcTWqd02t00iPXrs9v3bQvluPpDbM7ZRYuMHyL+th9G8Lg4tG2Ddzzsdnh4dq5BvN3xUFXX+QhRLQJQXbWq/bqOjrr5PpLWvZ47t1WDWLCgnmtSA9gbfN4UKZtxObbIjQKCeKcJJjRbGwcG2s0ZNgdyngWDkhbKsZlhbEI2XhYOfNH6wgEx6+S0sI2hqSpqMoqfJ+4zcJ3MF2pq8fpsEWMuPgPb/QzrDDPAhp819UF8N/i8CsAogKMB7HM5tsiNAoJ4pykmNJc37KRV3FzWULANWosWGZOQ64xfm5kuWrZqlbY5YgHVpUuT70E309/Gje2CctkytTqVFyzofr1sprFVq8yAHR/Qh4ayz1eJ38/1600/omaxukUxPb4TMAZgEMCfAJgGsBcmeR8FBGkGTRn405DFRm87Non4oBWaflxCnF00iKmpZCe1S3s6aXXd1nQYHc2+sl6n5VPzzFqPX9cCJ6R6FRB13SggiDNVmY6KEkx5/CRZB1rXY119EOGgF3VSRwVRFJvDN+mNf2ys3XSU5IPweb1cZq0nLfka1+rS5q5KiW8N4pkAvgbg5uD7swD8tcuxRW4UEH1G3sG2CudzEYIpT51pjs3jKHYZGMfH7QvlhINl3ITjMvku7heJOpNdfCdJdNO4kjLn2rLaxssWLGg3WS1c2D5Pw6PG61tAfB3AegDXR8pudjm2yI0Cos/wMdgWGb6aRJKpJr6P6z9/HkGZZ91s34vYdEpH7irMbdc27sweHjYDcNZsqS5tCWdwRxkdNf6Kbm2ZO7ddQHjUFmz4FhAzwWdUQNzgcmwRGyfK9TFlmVZ843MuQhmU4bOxDarRkFtXYd7NGZ4nPYwPbS2ei2lgoH2uRYHrT9vwLSC+BOCPwwR9AF4O4Esuxxa5UYPoU7JoAVUOyq4OWzJLVg0i6dq61OdqGksjKG2TF+NzKGwTGgsW0L4FxOEArgbwKwD3AfhmHdJvUED0CGn+GbIOrFVFMXUSTFWYu8pYLyHvdXVNRWK7tgsX2p3Z8WR4rj4NVx9LJzNdFg0ib0bbLvhK1ndmbHsHgB3hd5cTFLlRQPQIrm/3TTDNxEkaTFzmIvg4j4tvIc/5XetL83Y+NGR35K5a1f34pUtN6ovosXPmmEE4ysREu93fdTJemj67+EMWLmxPJpgnj5MDvgTEu4LtUwDuAPBBAJMAbgfwSZcTFLlRQPQQWU0AZWgBvilC0KWp07d5y6U+W/uSZnVPTrotyGM7V2i+iU+Kc1mfIvw9yxwR1z7bopiSMsvalnKtkwbx+E7AtQAOiHw/AMC1LscWuVFA9BhVmFyqoA5zI3xf66yDalKbp6ZaJ7bZopOS3qhtNv6k82SZ5+Ha57w+DZv/optj3xHfAuI2AAsj3xcCuM3l2CI3Cogewvdbbb/i6+3XFdcFjTq1z1YWmlmiE9vi9vxOb+7RTKuuwiCN3T+rxusqIGz+i4MOmk1EmCXnUwTfAuIdAG6EWUnu3QBuAHCOy7FFbhQQPUITfQt1JKvZI8+17jSXwaV9SWW2BZXiWVqTzhGalcK0F+GM6+h5bDOV58xp93OEK8VluYa2/WxOdBGzelyU4WHjT7EJ3jxrlAd4T7UB4BgAZwTb0a7HFblRQPQIveJbqJI8zuI81zrPkqhJPoixsXZ/QxiJ1E2DCNdRiM+kjifmCxcRipe5ZGTNE3UXrjkRFX5hptZ4Btq44zra3zw5n7QAAVHHjQLCIxykq6eMGdJF4ZJlNT7QJq0VvXKlfd/4rGSbIEqa0Ww7jy2SzMPb+ePYJuiFuaLi5rPhYW1JWjg8nHw/66hB1HGjgPA
IzTzV09R74Nu0lUbbyHttbIIt59v540TbHU3xEWpDgPkMtaWJ9xrBAAAVu0lEQVRQOAwNJdeZxqTXAQoIkh46iquninuQx5k6Pu6edTRN3+L75s1f5XIOVw3C9XpNThozUdScNDBgyqLnGBhoX386Ke15HaOY6rpRQBRAv4Sa1pmy74HrG7stvfboaLut3DbrOSRN34q8DrY+DwzYM8HGB2tb6vFO1yvaj1WrjD8k6ksJI67iPohOa2PkhAKCpIcaRPVUdQ9cIoySwkBdE83l0SB8XwebFrB+ffuEtTQzrl2u4ejo7KS/UKuwRTHZoqc80tMCgtlcC6Cp9u9eoqp70CnbqetEsm6J5mxv3Ul9s+ViyrM+gq09Q0NmLYooSYOyS59DomXr17dHSi1cOJv6I/RL2FaKSwo62LjRi6mtpwVEuFGD8EjVUTCk2oSCtuUtXVNRuGgaadaSsPk1Qtt9FuFpE7xhSKurWcdFUMbLhoftJqvh4e6px5NeFjwl8aOAIIS4MT7ebvZYtMg+ILtkHR0YyJ9ozuakTjNju1t9U1Pu62G7CEBbjqXBwVkhEXV6u5rPkvbzYH6jgCCEuBGadKKOU5tJJ2mJz6S02XkTzcXfsm2pJ9LUaTMJdQsttb3J2/o8OtqeUjx01kfDZtOaEZMc9Tkd+BQQhBB3soZ3JpmJ8k7msr0lx+scG3NzNE9OmvJuGoRtclqa5HrxNg8NqR5/fGubDz20XRi51kcNggKCkNLx8Xbusz7bW3boI4nWGXfwhuYtm93flmI7bXoLF6Jv9scfb/4+/njzW/x72mtAHwQFBOljqnJSp83ImqW+0VH3N/GktBy2VBsu6TJGR92imDZsSE6Ql6XPS5eqHnxwq5ayZk17pNRBB5mJclGe+lQj1KJMTppjGcVEAUH6kKrCXH2ft5MG4HIO2/E2bSHcx1e6jDyaT1IKjIEBIyiAWeETry+cRR0KifC7bf6Fp2eBAoKQJuLBvpyastapTrPMqi2KybVOF19Fkm9hbCx7qo3R0XbtY/VqM8wuWdIqBOKEQiG6n+1Z8HSvKCAIaSq9nO7EZ6oNm6Zh80GEb/LdNJKwrJsG0UnjirY59Gsce2zrZ1I4bSgclixJvgaetD0KCEKaSBUaRFmk6ZvLvq7pMpKimGxaiqsvxtY+WxTTli2tZVu22Gdru2oQaa9jAhQQhDSNqnwQZZCmb0Vdh26pMVzPa0tNYkslMm9ee1TUnDmqy5a11rdmjab2QXAeBAUE6TN6Od1Jmr4VcR1c3vhdU4pHne5hapKFC9v9FeEiQPGQW1vZmjWtx65ebeq1taVEDWIeCGkKO3cC69YBIyOzZdPTwMwMsH17de2qA3W6Nra2rFtn2hItGxlp/d6JXbvay1z7Nz0NbN0K7N49e85TTjFD9RVXzJZt3Qps29Z6bFIbRczxgPlctAg4+ujWfU45xZS97W3mPN/8JjAxAdx9d2vZ6Chw1lmtx37oQ6Zv8bYA7X2JfveNixSp60YNos/oZROMar7+1ena5G1L2jDXbuRZEKlTfa4J9446yux31FGzbc6afoNRTBQQpAO97MRVzde/Ol2bvG1xMQk1oX+2hIC2NCQl940CgvQu3Rx0Tbfl53FA1ilENmtbXNencL3PeZ4H26zupEWE4udYtcq09cQTWz/jPohwrkaWvmWEAoL0Ji5vWnUyt6SFGoTdCWxbn8I1L1Ge58E2Q9o2l8E2eIcCYvFi0+bFi2cFRLQtAwP2NCIFPsMUEKT3yBIqWYfB0hX6IGb3D4VCuD5FuF5FvM4iljuNkzUz7dSUEQrRY8NU4S7CwLVvGWicgABwOICPAvis6zEUEH2GTd2fnDTlNupkbnEhj1mhTma1vKafcFnN6P3r5FTuZpoJj+3mVO7Ul7hTOeuxtrYMDhoNIkoY5pqlzQ7UQkAAuBTAAwBujpWfBOA2AHcCODv2GwUEsdPrGkS/0el+ut6/+H62t/GoBhJqJNHEgd3aGKb
vCLWA6IS4bsS1j4mJdu1o/ny1ph6fP7/72tUZqYuAeB6AY6ICAsBcAPsCjWEBgBsBHBn5nQKCJNPrPoh+o1O0Ulbfgs2EY/NppPFBDAyYY+O5nlyOjTukFy5sb0s4ezqMdtqypV2QuAo1B2ohIEw7sDwmII4DcFXk+zkAzol87yggAIwD2ANgz7L4lHXSH/R6FFO/Eb+fPqKTbGanLOaa0KwZT6uRZNa0HRtlbMzkhbK1Jbr8aZ42O1BnAfFyAB+JfD8NwAUAngLgw4F2cY5L3dQg+pA6m47S+kiaRlkpMPKce+NGY6KJ1jkxYU/W57ONaa5NvL7xcbsGkSY9ekoaJyCy1E0B0WfU3XSUZFJIux5zXfF9/bP4lLrtG9rvfS0lmtfc5bJf6IPoxwWD0pqY0mwUEH1GE0xHWUMim4JPDS7t/XR9k5+YaL0Hw8P502q4HJtV0xgaMlFL0WNXrzblWdrsQJ0FxDwAdwE4LOKkXpOyzs0ALlmxYoWXi0WIV7Isd9kkqgwfdj13VfegIbPgayEgAHwawP0AHgNwL4A3BuWbANwe+BvekbV+ahAZacKbeFPxrUGUtRyoqwM4bhd3TZHtox+u2ovLPXBtz8qV7bOmJybaF/057zy3xYZ8X9eM1EJAFL1RQGSk7rb8plKED6Iqu79tv3ioZVKZ73QXaY53vQd5fRpxoZHnvLZMta7XNSMUEKQzdY4GaipFRTH5vleu9dmibVzefl3rS9OPNFFMrvfAtT22jKy29mVdrjTvdc1ATwsI+iA80bRUFP2M73vlWl9V+5WFa3uicxR81Ffx9eppARFufa1B5LXpNlGD6FffSRUahO2NeNUq1dHR1v0mJ93nGLi+Obvezzz+lPFxM2nNxWdgm6OQVTOwnXdsrD3KyrV9GaGA6HXy2HSb6oNoarvzUJUPIinNtW095WhuojQ+iDyrxKXpi22/MMtqN59BKBzicxTC72n7l3TexYtb94un9PA8p4YCoh/I+mbZ5DfxJmo+eagyislmUx8ba40QGh3N/hbvw86ex5/i4jNYudIIg+h+W7bYo5hc+pd0Xtf9qEE4NJo+iFnqZtMtg37sc1XYrrXvOQZ572cee34ZvgDX8+ZpX0p6WkCEGzWIPnubVu3PPleF7Vr7nueR935m1SCmpvIdm6d9eco8QQHR69AeP2vTtcWaFzFRKwt5z1tVu23XemBA23wQeYRE3mc4jw9i4cJ238nChe1O+KVLVefObd8vHs1ku0+Tk+2+Bdt5589vz7s0MGDPH+VpAh0FRK/TZD9CVpL+CbM6ScsQqGUNgr6xXev16+1RTFnneZQlPG37jY21D8C2QXnuXLVOlBsebj+vTaDa0n3Hz7FgQbuAsAmrqE8j5/NAAUH6hzJMBWW0r6jjiR1Xs47LRLmk+vKcN027U9LTAoJOatJG3SdqleWIJelwdQyXMVEuzT3O+Tz0tIAIN2oQRFWpQZBsUIOggCA9Th5HJX0Q/Yvtutomti1apFYfRFxI5HkO00wY9PQ8UECQ/sDH+sVFkiZ5XFJKiKyL3dSJugVVdFovOsrgoEkxEiWcONetPtt9DifaRQmd41E2bLAvGLRxI6OYXDcKCFJ70rzx9bK2ULe++dY8XfdLSh9uCx92ic7LCAUEIXWhZPtybalb33z7rlz3s/k0bBMQC7xePS0gGMVEGkeJESq1pm59qyo9ty0qypbChKk2qEGQHocahKFufaMGQQFBSKXQB2GoW9/og3ASEHNACCmOmRlg925gZMR8Hxkx32dm8u3bNOrWN9f2+N5vagqYmAAuvNB8v/BCYNkyYOlS4MwzTdmZZwKjo8CaNZVfr/4SEDt3AtPTrWXT06acEMD/M7J9++w/ecjIiCl32XdmBli3rrXsTW8ym682loGtbx/4ADBvXmvZ+ecDmza51ZnnXl16KfCZz7SWvf3twAUXtJbt2wfs2uXWHhd++MNZ4RDylrcAl13WWvbAA0ZoRLn+euCaa/y
1xQUXNaOuW2oTU93UXFI/6vaM2NqzZImJz69LG7NiW7UuTXbYPPcqydQTT6TnOoktT1tsx/rOnBsD9EEkUDdHGakfdXtGbO2pWxuzknd9iTzXweYsznOt87TFdqzvtTci9LSAQN4w17qF2pH6UbdnpMTVxkon7wp1ea6DLdy0TivP+V69L6CnBUS4UYMghVC3Z4QaRDLUIDJBAWGjbvZlUj/q9ozQB5EMfRCZoYCwUbeEYaR+1O0Z6eUEfmkSGdrIc69WrmzPyDo83J4gz/Va52lLGav3xXAVEGL2bSZr167VPXv2VN0MQghpFCKyV1XXdtuvv+ZBEEIIcYYCghBCiBUKCEJIdnopO8GmTWYmd5TVq80WxTbb23bs5s3Ahg2tZbZZ8K5lVVxXF0dFXTcm6yOkYuoW9ZUHW1RVUiI9m3PdFpG1eHH3qChbVFqaZUgzAEYxEUJKoVfmZKja5x64zkdwTdmdp8wTPS0gwAWDCKkXvTKrW9U+e9l1RrProj95yjzQ0wIi3KhBEFIDqEEkH0sNggKCkL6FPojkY3vAB8GJcoSQ7OzcadariK71MD1t1rGwrXlRZzZtAl70otmFe4DZCKZbb50tO/984OqrgSuv7Hzs5s1mXYfvfW+2LIxMuvji9GUer6vrRDkKCEII6TM4k5oQQkguKCAIIYRYoYAghBBihQKCEEKIFQoIQgghViggCCH1oowEgL2UZLBAKCAIIfVi3Tpg69bZAXx62nxft65Z5+gB5lXdAEIIaWFkBNi92wzYExPARReZ79HJeE04Rw/QSA1CRDaLyCX79++vuimEkCIYGTED93veYz6LGLjLOEfDaaSAUNUvqur44OBg1U0hhBTB9LR5q9+xw3zG/QVNOUfDaaSAIIT0MKE/YPdu4NxzZ01BPgfwMs7RA1BAEELqxcxMqz8g9BfMzDTrHD0Ak/URQkifwWR9hBBCckEBQQghxAoFBCGEECsUEIQQQqxQQBBCCLHS6CgmEXkQwH94rPIgAD/zWF+VsC/1o1f6AbAvdcW1L4eq6tJuOzVaQPhGRPa4hH41AfalfvRKPwD2pa747gtNTIQQQqxQQBBCCLFCAdHKJVU3wCPsS/3olX4A7Etd8doX+iAIIYRYoQZBCCHECgUEIYQQKz0vIETkUhF5QERujpQ9W0S+IyLfF5EvisiSoHy5iPxaRG4Itg9Hjjk22P9OEflHEZG69iP47VnBb7cEvy+qQz/S9kVEXh25HzeIyH+JyHMa2pf5IvLxoPxWETkncsxJInJb0Jezy+5Hhr4sEJGPBeU3isgLIsdU/b8yJCLTIvKD4Pk/Iyh/soh8VUTuCD4PDMolaOedInKTiBwTqet1wf53iMjryuxHxr6sCu7Xb0XkbbG60j9jqtrTG4DnATgGwM2RshkAzw/+Ph3Ae4K/l0f3i9XzbwCGAQiALwHYWON+zANwE4BnB9+fAmBuHfqRti+x444CsK8u9yTDfXkVgF3B34sB/Ch45uYC2AfgcAALANwI4Mia9+UtAD4W/P1UAHsBzKnDfQHwNADHBH8fAOB2AEcC2Ang7KD8bADnBX9vCtopQbu/F5Q/GcBdweeBwd8H1rwvTwWwDsB7AbwtUk+mZ6znNQhVvRbAQ7HiZwK4Nvj7qwBe1qkOEXkagCWq+l01V/sTAE723dZOpOzHiwHcpKo3Bsf+XFX/UId+BO3Jek9eCWAXUI97AqTuiwIYEJF5AJ4A4HcAHgWwHsCdqnqXqv4Opo8vKbrtcVL25UgAU8FxDwB4BMDaOtwXVb1fVa8L/v4FgFsBHAJzTT8e7PbxSLteAuATavgugCcF/fhTAF9V1YdU9WGY/p9UYldS90VVH1DVGQCPxarK9Iz1vIBI4BbMXpxXABiK/HaYiFwvIl8XkROCskMA3BvZ596grGqS+vFMACoiV4nIdSKyPSivaz+Azvck5FQAnw7+bmJfPgvgPwHcD+BuAB9U1Ydg2n1P5Pgm9OVGAFtEZJ6
IHAbg2OC3Wt0XEVkO4GgA3wNwsKreH/z0EwAHB38nXf9a3RfHviSRqS/9KiBOB/DnIrIXRm37XVB+P4Blqno0gDMBfEoidv0aktSPeQD+G4BXB5+niMgLq2miM0l9AQCIyAYAv1LVm20H14ykvqwH8AcATwdwGICzROTwaproTFJfLoUZZPYA+HsA34bpW20QkScC+ByAv1TVR6O/BdpNY2L8q+rLvCIqrTuq+kMYMwxE5JkARoPy3wL4bfD3XhHZB/M2fh+AZ0SqeEZQVilJ/YD5x71WVX8W/HYljG35k6hhP4COfQnZhlntAajpPQE69uVVAL6sqo8BeEBEvgVgLcybXVRjqn1fVPX3AP5XuJ+IfBvGPv4wanBfRGQ+zID6L6r6+aD4pyLyNFW9PzAhPRCU3wf79b8PwAti5dcU2W4bKfuSRFIfO9KXGoSIPDX4nAPgrwF8OPi+VETmBn8fDuAIAHcFqtyjIjIcRGS8FsC/VtL4CEn9AHAVgKNEZHFg734+gB/UtR9Ax76EZVsR+B8AY5tF8/pyN4D/Hvw2AOMQ/SGMI/gIETlMRBbACMMvlN1uGx3+VxYHfYCInAjg96pai2csOO9HAdyqqudHfvoCgDAS6XWRdn0BwGuDaKZhAPuDflwF4MUicmAQJfTioKw0MvQliWzPWJke+So2mLfO+2GcNvcCeCOAM2Dedm4H8H7Mzih/GYzN9QYA1wHYHKlnLYCbYSIBLgiPqWM/gv1fE/TlZgA769KPjH15AYDvWuppVF8APBHAZ4L78gMAfxWpZ1Ow/z4A72jA/8pyALfBOE2vhkkfXYv7AmNWVZhIvhuCbRNMNN/XANwRtPnJwf4C4ENBe78PYG2krtMB3Blsb6jgnqTtyx8F9+5RmMCBe2GCBjI9Y0y1QQghxEpfmpgIIYR0hwKCEEKIFQoIQgghViggCCGEWKGAIIQQYoUCghBCiBUKCEIqJpycSUjdoIAgJAUicq6I/GXk+3tF5AwR+SsRmQnWE/ibyO9XiMjeIJf/eKT8lyIyKSI3Ajiu5G4Q4gQFBCHpuBQmfUSYfmIbTDbNI2CS8T0HwLEi8rxg/9NV9ViY2cV/ISJPCcoHYNYdeLaqfrPMDhDiSl8m6yMkK6r6IxH5uYgcDZNi+XqYBVpeHPwNmJQaR8Cso/AXInJKUD4UlP8cJvPp58psOyFpoYAgJD0fAfB6mLw3lwJ4IYD3qerF0Z3ELMP5IgDHqeqvROQaAIuCn3+jqrVKj01IHJqYCEnP5TAri62Dye55FYDTg5z9EJFDgiyogwAeDoTDKpjsrYQ0BmoQhKREVX8nItMAHgm0gK+IyGoA3zHZmfFLmGy6XwbwZhG5FSbz6XerajMhWWA2V0JSEjinrwPwClW9o+r2EFIUNDERkgIRORJmbYCvUTiQXocaBCGEECvUIAghhFihgCCEEGKFAoIQQogVCghCCCFWKCAIIYRY+f8MkFysHY3BvAAAAABJRU5ErkJggg==\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plt.plot(data['Year'], data['Body_Count'], 'rx')\n", "ax = plt.gca() # obtain a handle to the current axis\n", "ax.set_yscale('log') # use a logarithmic death scale\n", "# give the plot some titles and labels\n", "plt.title('Film Deaths against Year')\n", "plt.ylabel('deaths')\n", "plt.xlabel('year')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note a few things. We are interacting with our data. In particular, we are\n", "replotting the data according to what we have learned so far. We are using the\n", "programming language as a *scripting* language to give the computer one command\n", "or another, and then the next command we enter is dependent on the result of the\n", "previous. This is a very different paradigm to classical software engineering.\n", "In classical software engineering we normally write many lines of code (entire\n", "object classes or functions) before compiling the code and running it. Our\n", "approach is more similar to the approach we take whilst debugging. Historically,\n", "researchers interacted with data using a *console*, a command line window that\n", "allowed command entry. The notebook format we are using is slightly different.\n", "Each of the code entry boxes acts like a separate console window. We can move up\n", "and down the notebook and run each part in a different order. The *state* of the\n", "program is always as we left it after running the previous part." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Back to Probability\n", "\n", "Let's use the sum rule to compute the approximate probability that a\n", "film from the movie body count website has over 40 deaths." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "deaths = (data.Body_Count>40).sum() # number of positive outcomes (in sum True counts as 1, False counts as 0)\n", "total_films = data.Body_Count.count()\n", "\n", "prob_death = float(deaths)/float(total_films)\n", "print(\"Probability of deaths being greater than 40 is:\", prob_death)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Question \n", "\n", "We now have an estimate of the probability a film has greater than 40\n", "deaths. The estimate seems quite high. What could be wrong with the\n", "estimate? Do you think any film you go to in the cinema has this\n", "probability of having greater than 40 deaths?\n", "\n", "Why did we have to use `float` around our counts of deaths and total\n", "films? What would the answer have been if we hadn't used the `float`\n", "command? If we were using Python 3 would we have this problem?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Write your answer to Question here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conditioning\n", "\n", "When predicting whether a coin turns up heads or tails, we might think\n", "that this event is *independent* of the year or time of day. If we\n", "include an observation such as time, then in probability this is known\n", "as *conditioning*. We use this notation, $P(Y=y|T=t)$, to condition the\n", "outcome on a second variable (in this case time). Or, often, for a\n", "shorthand we use $P(y|t)$ to represent this distribution (the $Y=$ and\n", "$T=$ being implicit). Because we don't believe a coin toss depends on\n", "time then we might write that $$\n", "P(y|t) =\n", "P(y).\n", "$$ However, we might believe that the number of deaths is dependent on\n", "the year. For this we can try estimating $P(Y>40 | T=2000)$ and compare\n", "the result, for example to $P(Y>40 | T=2002)$ using our empirical estimate\n", "of the probability." 
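, "\n", "\n", "As a sketch of what that empirical estimate means (the count symbols here are notation introduced only for illustration), it is a ratio of counts,\n", "$$\n", "P(Y>40 | T=t) \\approx \\frac{n_{Y>40, T=t}}{n_{T=t}},\n", "$$\n", "where $n_{T=t}$ is the number of films in the data made in year $t$ and $n_{Y>40, T=t}$ is the number of those films with more than 40 deaths. With only a handful of films in any one year, this ratio can be a noisy estimate."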
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for year in [2000, 2002]:\n", "    deaths = (data.Body_Count[data.Year==year]>40).sum()\n", "    total_films = (data.Year==year).sum()\n", "\n", "    prob_death = float(deaths)/float(total_films)\n", "    print(\"Probability of deaths being greater than 40 in year\", year, \"is:\", prob_death)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise\n", "\n", "Compute the probability for the number of deaths being over 40 for each\n", "year we have in our `data` data frame. Store the result in a\n", "`numpy` array and plot the probabilities against the years using the\n", "`plot` command from `matplotlib`. Do you think the estimate we have\n", "created of $P(y|t)$ is a good estimate? Write your code and your written\n", "answers in the box below.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Write your answer to Exercise here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Exercise Answer Text\n", "\n", "Write your answer to the question in this box." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Notes for Exercise\n", "\n", "Make sure the plot is included in *this* notebook file (the `IPython`\n", "magic command `%matplotlib inline` we ran above will do that for you; it\n", "only needs to be run once per file).\n", "\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Terminology | Mathematical notation | Description\n", "------------|-----------------------|----------------------------------\n", "joint | $P(X=x, Y=y)$ | prob. that X=x *and* Y=y\n", "marginal | $P(X=x)$ | prob. that X=x *regardless of* Y\n", "conditional| $P(X=x\\vert Y=y)$ | prob. that X=x *given that* Y=y\n", "\n", "
\n", "The different basic probability distributions.\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### A Pictorial Definition of Probability" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Inspired by lectures from Christopher Bishop" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Definition of probability distributions.\n", "\n", "Terminology | Definition | Probability Notation\n", "------------------|------------------------------------------|------------------------------\n", "Joint Probability | $\\lim_{N\\rightarrow\\infty}\\frac{n_{X=3,Y=4}}{N}$ | $P\\left(X=3,Y=4\\right)$ \n", "Marginal Probability | $\\lim_{N\\rightarrow\\infty}\\frac{n_{X=5}}{N}$ | $P\\left(X=5\\right)$\n", "Conditional Probability | $\\lim_{N\\rightarrow\\infty}\\frac{n_{X=3,Y=4}}{n_{Y=4}}$ | $P\\left(X=3\\vert Y=4\\right)$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Notational Details\n", "\n", "Typically we should write out $P\\left(X=x,Y=y\\right)$, but in practice\n", "we often shorten this to $P\\left(x,y\\right)$. This looks very much like\n", "we might write a multivariate function, *e.g.* $$\n", " f\\left(x,y\\right)=\\frac{x}{y},\n", " $$ but for a multivariate function, in general, $$\n", "f\\left(x,y\\right)\\neq f\\left(y,x\\right).\n", "$$ However, $$\n", "P\\left(x,y\\right)=P\\left(y,x\\right)\n", "$$ because $$\n", "P\\left(X=x,Y=y\\right)=P\\left(Y=y,X=x\\right).\n", "$$ Sometimes I think of this as akin to the way in Python we can write\n", "'keyword arguments' in functions. If we use keyword arguments, the\n", "ordering of arguments doesn't matter.\n", "\n", "We've now introduced conditioning and independence to the notion of\n", "probability and computed some conditional probabilities on a practical\n", "example. The scatter plot of deaths vs year that we created above can be\n", "seen as a sample from a *joint* probability distribution. 
We represent a joint\n", "probability using the notation $P(Y=y, T=t)$ or $P(y, t)$ for short.\n", "Computing a joint probability is equivalent to answering the\n", "simultaneous questions: what's the probability that the number of deaths\n", "was over 40 and the year was 2002? Or any other question that may occur\n", "to us. Again we can easily use pandas to ask such questions." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "year = 2002\n", "deaths = (data.Body_Count[data.Year==year]>40).sum()\n", "total_films = data.Body_Count.count() # this is total number of films\n", "prob_death = float(deaths)/float(total_films)\n", "print(\"Probability of deaths being greater than 40 and year being\", year, \"is:\", prob_death)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The Product Rule\n", "\n", "This number is the joint probability, $P(Y, T)$, which is much *smaller*\n", "than the conditional probability. The number can never be bigger than\n", "the conditional probability because it is computed using the *product\n", "rule*. $$\n", "P(Y=y, X=x) = P(Y=y|X=x)P(X=x)\n", "$$ and $P(X=x)$ is a probability, which is less than or equal to 1,\n", "ensuring the joint probability is never larger than the\n", "conditional probability.\n", "\n", "The product rule is a *fundamental* rule of probability, and you must\n", "remember it! It gives the relationship between the two questions: 1)\n", "What's the probability that a film was made in 2002 and has over 40\n", "deaths? and 2) What's the probability that a film has over 40 deaths\n", "given that it was made in 2002?\n", "\n", "In our shorter notation we can write the product rule as $$\n", "P(y, x) = P(y|x)P(x)\n", "$$ We can see the relation working in practice for our data above by\n", "computing the different values for $t=2002$." 
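, "\n", "\n", "To see the arithmetic with made-up counts (these numbers are purely illustrative and are not taken from the data set): suppose there are $N=500$ films, $20$ of them were made in the year of interest, and $5$ of those $20$ have over 40 deaths. Then\n", "$$\n", "P(x) = \\frac{20}{500} = 0.04, \\quad P(y|x) = \\frac{5}{20} = 0.25, \\quad P(y, x) = \\frac{5}{500} = 0.01 = 0.25 \\times 0.04.\n", "$$"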
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# P(x): fraction of all films that were made in 2002\n", "p_x = float((data.Year==2002).sum())/float(data.Body_Count.count())\n", "# P(y|x): fraction of the 2002 films that have over 40 deaths\n", "p_y_given_x = float((data.Body_Count[data.Year==2002]>40).sum())/float((data.Year==2002).sum())\n", "# P(y,x): fraction of all films that were made in 2002 and have over 40 deaths\n", "p_y_and_x = float((data.Body_Count[data.Year==2002]>40).sum())/float(data.Body_Count.count())\n", "\n", "print(\"P(x) is\", p_x)\n", "print(\"P(y|x) is\", p_y_given_x)\n", "print(\"P(y,x) is\", p_y_and_x)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The Sum Rule\n", "\n", "The other *fundamental rule* of probability is the *sum rule*. This tells\n", "us how to get a *marginal* distribution from the joint distribution.\n", "Simply put, it says that we need to sum over the values of the variable we'd like to\n", "remove. \n", "$$\n", "P(Y=y) = \\sum_{x} P(Y=y, X=x)\n", "$$ \n", "Or in our shortened notation \n", "$$\n", "P(y) = \\sum_{x} P(y, x)\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise\n", "\n", "Write code that computes $P(y)$ by adding $P(y, x)$ for all values of\n", "$x$.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Write your answer to Exercise here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Bayes’ Rule\n", "\n", "Bayes' rule is a very simple rule; it's hardly worth the name of a rule\n", "at all. It follows directly from the product rule of probability.\n", "Because $P(y, x) = P(y|x)P(x)$ and by symmetry\n", "$P(y,x)=P(x,y)=P(x|y)P(y)$ then by equating these two equations and\n", "dividing through by $P(y)$ we have $$\n", "P(x|y) =\n", "\\frac{P(y|x)P(x)}{P(y)}\n", "$$ which is known as Bayes' rule (or Bayes's rule, it depends how you\n", "choose to pronounce it). 
It's not difficult to derive, and its\n", "importance is more to do with the semantic operation that it enables.\n", "Each of these probability distributions represents the answer to a\n", "question we have about the world. Bayes' rule (via the product rule)\n", "tells us how to *invert* the probability, moving from $P(y|x)$ to $P(x|y)$.\n", "\n", "### Probabilities for Extracting Information from Data\n", "\n", "What use is all this probability in data science? Let's think about how\n", "we might use the probabilities to do some decision making. Let's load up\n", "a little more information about the movies." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise\n", "\n", "Now we see we have several additional features including the quality\n", "rating (`IMDB_Rating`). Let's assume we want to predict the rating given\n", "the other information in the database. How would we go about doing it?\n", "\n", "Using what you've learnt about joint, conditional and marginal\n", "probabilities, as well as the sum and product rules, how would you\n", "formulate the question you want to answer in terms of probabilities?\n", "Should you be using a joint or a conditional distribution? If it's\n", "conditional, what should the distribution be over, and what should it be\n", "conditioned on?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Write your answer to Exercise here\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Probabilistic Modelling\n", "\n", "This Bayesian approach is designed to deal with uncertainty arising from\n", "fitting our prediction function to the data we have, a reduced data set.\n", "\n", "The Bayesian approach can be derived from a broader understanding of\n", "what our objective is. If we accept that we can jointly represent all\n", "things that happen in the world with a probability distribution, then we\n", "can interrogate that probability to make predictions. 
So, if we are\n", "interested in predictions, $\\dataScalar_*$, at future input\n", "locations of interest, $\\inputVector_*$, given previously observed training data,\n", "$\\dataVector$, and corresponding inputs, $\\inputMatrix$, then we are\n", "really interrogating the following probability density, $$\n", "p(\\dataScalar_*|\\dataVector, \\inputMatrix, \\inputVector_*).\n", "$$ There is nothing controversial here: as long as you accept that you\n", "have a good joint model of the world around you that relates test data\n", "to training data,\n", "$p(\\dataScalar_*, \\dataVector, \\inputMatrix, \\inputVector_*)$, then this\n", "conditional distribution can be recovered through standard rules of\n", "probability\n", "($\\text{data} + \\text{model} \\rightarrow \\text{prediction}$).\n", "\n", "We can construct this predictive density through the use of the following\n", "decomposition: \n", "$$\n", "p(\\dataScalar_*|\\dataVector, \\inputMatrix, \\inputVector_*) = \\int p(\\dataScalar_*|\\inputVector_*, \\parameterVector) p(\\parameterVector | \\dataVector, \\inputMatrix) \\text{d} \\parameterVector\n", "$$\n", "\n", "where, for convenience, we are assuming *all* the parameters of the\n", "model are now represented by $\\parameterVector$ (which contains\n", "$\\mappingMatrix$ and $\\mappingMatrixTwo$) and\n", "$p(\\parameterVector | \\dataVector, \\inputMatrix)$ is recognised as the\n", "posterior density of the parameters given data and\n", "$p(\\dataScalar_*|\\inputVector_*, \\parameterVector)$ is the *likelihood*\n", "of an individual test data point given the parameters.\n", "\n", "The likelihood of the data is normally assumed to be independent across\n", "data points given the parameters, \n", "$$\n", "p(\\dataVector|\\inputMatrix, \\parameterVector) = \\prod_{i=1}^\\numData p(\\dataScalar_i|\\inputVector_i, \\parameterVector),$$\n", "\n", "and if that is so, it is easy to extend our predictions across all\n", "future, potential, locations, \n", "$$\n", 
"p(\\dataVector_*|\\dataVector, \\inputMatrix, 
\\inputMatrix_*) = \\int p(\\dataVector_*|\\inputMatrix_*, \\parameterVector) p(\\parameterVector | \\dataVector, \\inputMatrix) \\text{d} \\parameterVector.\n", "$$\n", "\n", "The likelihood is also where the *prediction function* is incorporated.\n", "For example in the regression case, we consider an objective based\n", "around the Gaussian density, \n", "$$\n", "p(\\dataScalar_i | \\mappingFunction(\\inputVector_i)) = \\frac{1}{\\sqrt{2\\pi \\dataStd^2}} \\exp\\left(-\\frac{\\left(\\dataScalar_i - \\mappingFunction(\\inputVector_i)\\right)^2}{2\\dataStd^2}\\right)\n", "$$\n", "\n", "In short, that is the classical approach to probabilistic inference, and\n", "all approaches to Bayesian neural networks fall within this path. For a\n", "deep probabilistic model, we can simply take this one stage further and\n", "place a probability distribution over the input locations, \n", "$$\n", "p(\\dataVector_*|\\dataVector) = \\int p(\\dataVector_*|\\inputMatrix_*, \\parameterVector) p(\\parameterVector | \\dataVector, \\inputMatrix) p(\\inputMatrix) p(\\inputMatrix_*) \\text{d} \\parameterVector \\text{d} \\inputMatrix \\text{d}\\inputMatrix_*\n", "$$ and we have *unsupervised learning* (from where we can get deep\n", "generative models)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Graphical Models\n", "\n", "One way of representing a joint distribution is to consider conditional\n", "dependencies between data. Conditional dependencies allow us to\n", "factorize the distribution. For example, a Markov chain is a\n", "factorization of a distribution into components that represent the\n", "conditional relationships between points that are neighboring, often in\n", "time or space. 
It can be decomposed in the following form.\n", "$$p(\\dataVector) = p(\\dataScalar_\\numData | \\dataScalar_{\\numData-1}) p(\\dataScalar_{\\numData-1}|\\dataScalar_{\\numData-2}) \\dots p(\\dataScalar_{2} | \\dataScalar_{1}) p(\\dataScalar_{1})$$\n", "\n", "\n", "\n", "By specifying conditional independencies we can reduce the\n", "parameterization required for our data. Instead of directly specifying\n", "the parameters of the joint distribution, we can specify the\n", "parameters of each conditional independently. This can also give an\n", "advantage in terms of interpretability. Understanding a conditional\n", "independence structure gives a structured understanding of data. If\n", "developed correctly, according to causal methodology, it can even inform\n", "how we should intervene in the system to drive a desired result\n", "[@Pearl:causality95].\n", "\n", "However, a challenge arises when the data becomes more complex. Consider\n", "the graphical model shown below, used to predict the perioperative risk\n", "of *C. difficile* infection following colon surgery\n", "[@Steele:predictive12]." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To capture the complexity in the interrelationship between the data, the\n", "graph becomes more complex, and less interpretable.\n", "\n", "Machine learning problems normally involve a prediction function and an\n", "objective function. So far in the course we've mainly focussed on the\n", "case where the prediction function was over the real numbers, so the\n", "codomain of the functions, $\\mappingFunction(\\inputMatrix)$ was the real\n", "numbers or sometimes real vectors. The classification problem consists\n", "of predicting whether or not a particular example is a member of a\n", "particular class. So we may want to know if a particular image\n", "represents a digit 6 or if a particular user will click on a given\n", "advert. 
These are classification problems, and they require us to map to\n", "*yes* or *no* answers. That makes them naturally discrete mappings.\n", "\n", "In classification we are given an input vector, $\\inputVector$, and an\n", "associated label, $\\dataScalar$, which either takes the value $-1$ to\n", "represent *no* or $1$ to represent *yes*. Examples include:\n", "\n", "- Classifying handwritten digits from binary images (automatic zip\n", "  code reading)\n", "- Detecting faces in images (e.g. digital cameras)\n", "- Identifying who a detected face belongs to (e.g. Picasa, Facebook, DeepFace,\n", "  GaussianFace)\n", "- Classifying type of cancer given gene expression data\n", "- Categorization of document types (different types of news article on\n", "  the internet)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our focus has been on models where the objective function is inspired by\n", "a probabilistic analysis of the problem. In particular we've argued that\n", "we answer questions about the data set by placing probability\n", "distributions over the various quantities of interest. For the case of\n", "binary classification this will normally involve introducing probability\n", "distributions for discrete variables. Such probability distributions\n", "are in some senses easier than those for continuous variables; in\n", "particular, we can represent a probability distribution over\n", "$\\dataScalar$, where $\\dataScalar$ is binary, with one value. If we\n", "specify the probability that $\\dataScalar=1$ with a number that is\n", "between 0 and 1, i.e. 
let's say that $P(\\dataScalar=1) = \\pi$ (here we\n", "don't mean $\\pi$ the number, we are setting $\\pi$ to be a variable) then\n", "we can specify the probability distribution through a table.\n", "\n", "| $\\dataScalar$ | 0 | 1 | \n", "|:----------------:|:---------:|:-----:| \n", "| $P(\\dataScalar)$ | $(1-\\pi)$ | $\\pi$ |\n", "\n", "Mathematically we can use a trick to implement this same table. 
We can\n", "use the value $\\dataScalar$ as a mathematical switch and write that $$\n", " P(\\dataScalar) = \\pi^{\\dataScalar} (1-\\pi)^{(1-\\dataScalar)}\n", " $$ where our probability distribution is now written as a function of\n", "$\\dataScalar$. This probability distribution is known as the [Bernoulli\n", "distribution](http://en.wikipedia.org/wiki/Bernoulli_distribution). The\n", "Bernoulli distribution is a clever trick for mathematically switching\n", "between two probabilities. If we were to write it as code it would be\n", "better described as" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def bernoulli(y_i, pi):\n", "    # Return P(y_i) for a binary outcome y_i, where P(y_i=1) = pi.\n", "    if y_i == 1:\n", "        return pi\n", "    else:\n", "        return 1-pi" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we insert $\\dataScalar=1$ then the function is equal to $\\pi$, and if\n", "we insert $\\dataScalar=0$ then the function is equal to $1-\\pi$. So the\n", "function recreates the table for the distribution given above.\n", "\n", "The probability distribution is named for [Jacob\n", "Bernoulli](http://en.wikipedia.org/wiki/Jacob_Bernoulli), the Swiss\n", "mathematician. In his book *Ars Conjectandi* he considered the\n", "distribution and the result of a number of 'trials' under the Bernoulli\n", "distribution to form the *binomial* distribution. Below is the page\n", "where he considers Pascal's triangle in forming combinations of the\n", "Bernoulli distribution to realise the binomial distribution for the\n", "outcome of positive trials." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pods\n", "pods.notebook.display_google_book(id='CF4UAAAAQAAJ', page='PA87')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Thomas Bayes also described the Bernoulli distribution, only he didn't\n", "refer to Jacob Bernoulli's work, so he didn't call it by that name. He\n", "described the distribution in terms of a table (think of a *billiard\n", "table*) and two balls. Bayes suggests that each ball can be rolled\n", "across the table such that it comes to rest at a position that is\n", "*uniformly distributed* between the sides of the table.\n", "\n", "Let's assume that the first ball is rolled, and that it comes to rest\n", "at a position that is $\\pi$ times the width of the table from the left\n", "hand side.\n", "\n", "Now, we roll the second ball. We are interested in whether the second\n", "ball ends up on the left side (+ve result) or the right side (-ve result)\n", "of the first ball. We use the Bernoulli distribution to determine this.\n", "\n", "For this reason in Bayes's distribution there is considered to be\n", "*aleatoric* uncertainty about the distribution parameter." 
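, "\n", "\n", "As a minimal sketch of Bayes's construction (not code from the original\n", "notes), the two balls can be simulated with NumPy. The function name\n", "`bayes_billiard`, the seed, and the choice to roll the second ball many\n", "times rather than once are our own assumptions, made so that the fraction\n", "of positive results can be compared with $\\pi$.\n", "\n", "
```python
import numpy as np


def bayes_billiard(num_rolls, rng):
    # First ball: its resting position, as a fraction of the table width
    # measured from the left-hand side, plays the role of pi.
    pi = rng.uniform(0.0, 1.0)
    # Later balls: each comes to rest uniformly at random and counts as a
    # positive result (1) if it stops to the left of the first ball.
    positions = rng.uniform(0.0, 1.0, size=num_rolls)
    y = (positions < pi).astype(int)
    return pi, y


rng = np.random.default_rng(0)  # the seed is an arbitrary choice
pi, y = bayes_billiard(10000, rng)
print('first ball position (pi):', pi)
print('fraction of positive results:', y.mean())
```
"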
] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [], "source": [ "import pods\n", "from ipywidgets import IntSlider" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "62556f773c1d4b36b6b29265d504a8de", "version_major": 2, "version_minor": 0 }, "text/plain": [ "interactive(children=(IntSlider(value=0, description='counter', max=9), Output()), _dom_classes=('widget-inter…" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "pods.notebook.display_plots('bayes-billiard{counter:0>3}.svg', \n", " directory='.', \n", " counter=IntSlider(0,0,9,1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Maximum Likelihood in the Bernoulli Distribution\n", "\n", "Maximum likelihood in the Bernoulli distribution is straightforward.\n", "Let's assume we have data, $\\dataVector$ which consists of a vector of\n", "binary values of length $n$. If we assume each value was sampled\n", "independently from the Bernoulli distribution, conditioned on the\n", "parameter $\\pi$ then our joint probability density has the form $$\n", "p(\\dataVector|\\pi) = \\prod_{i=1}^{\\numData} \\pi^{\\dataScalar_i} (1-\\pi)^{1-\\dataScalar_i}.\n", "$$ As normal in maximum likelihood we consider the negative log\n", "likelihood as our objective, \n", "$$\\begin{align*}\n", " \\errorFunction(\\pi)& = -\\log p(\\dataVector|\\pi)\\\\ \n", " & = -\\sum_{i=1}^{\\numData} \\dataScalar_i \\log \\pi - \\sum_{i=1}^{\\numData} (1-\\dataScalar_i) \\log(1-\\pi),\n", "\\end{align*}$$\n", "\n", "and we can derive the gradient with respect to the parameter $\\pi$.\n", "$$\n", "\\frac{\\text{d}\\errorFunction(\\pi)}{\\text{d}\\pi} = -\\frac{\\sum_{i=1}^{\\numData} \\dataScalar_i}{\\pi} + \\frac{\\sum_{i=1}^{\\numData} (1-\\dataScalar_i)}{1-\\pi},\n", "$$\n", "and as normal we look for a stationary point for the log likelihood by\n", "setting this derivative to 
zero,\n", "$$\n", "0 = -\\frac{\\sum_{i=1}^{\\numData} \\dataScalar_i}{\\pi} + \\frac{\\sum_{i=1}^{\\numData} (1-\\dataScalar_i)}{1-\\pi},\n", "$$\n", "rearranging we form\n", "$$\n", "(1-\\pi)\\sum_{i=1}^{\\numData} \\dataScalar_i = \\pi\\sum_{i=1}^{\\numData} (1-\\dataScalar_i),\n", "$$\n", "which implies\n", "$$\n", "\\sum_{i=1}^{\\numData} \\dataScalar_i = \\pi\\left(\\sum_{i=1}^{\\numData} (1-\\dataScalar_i) + \\sum_{i=1}^{\\numData} \\dataScalar_i\\right),\n", "$$\n", "and now we recognise that\n", "$\\sum_{i=1}^{\\numData} (1-\\dataScalar_i) + \\sum_{i=1}^{\\numData} \\dataScalar_i = \\numData$\n", "so we have\n", "$$\n", "\\pi = \\frac{\\sum_{i=1}^{\\numData} \\dataScalar_i}{\\numData}\n", "$$\n", "so in other words we estimate the probability associated with the\n", "Bernoulli by setting it to the number of observed positives, divided by\n", "the total length of $\\dataVector$. This makes intuitive sense. If I\n", "asked you to estimate the probability of a coin being heads, and you\n", "tossed the coin 100 times, and recovered 47 heads, then the estimate of\n", "the probability of heads should be $\\frac{47}{100}$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise\n", "\n", "Show that the maximum likelihood solution we have found is a *minimum*\n", "for our objective." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Write your answer to Exercise here" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Use this box for any code you need for the exercise\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Bayes Theorem Reminder\n", "$$\n", "\\text{posterior} =\n", "\\frac{\\text{likelihood}\\times\\text{prior}}{\\text{marginal likelihood}}\n", "$$\n", "\n", "Four components:\n", "\n", "1. Prior distribution\n", "2. Likelihood\n", "3. Posterior distribution\n", "4. 
Marginal likelihood" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Naive Bayes Classifiers\n", "\n", "In probabilistic machine learning we place probability distributions (or\n", "densities) over all the variables of interest. Our first classification\n", "algorithm will do just that. We will consider how to form a\n", "classification by making assumptions about the *joint* density of our\n", "observations. We need to make assumptions to reduce the number of\n", "parameters we need to optimise.\n", "\n", "In the ideal world, given label data $\\dataVector$ and the inputs\n", "$\\inputMatrix$ we should be able to specify the joint density of all\n", "potential values of $\\dataVector$ and $\\inputMatrix$,\n", "$p(\\dataVector, \\inputMatrix)$. If $\\inputMatrix$ and $\\dataVector$ are\n", "our training data, and we can somehow extend our density to incorporate\n", "future test data (by augmenting $\\dataVector$ with a new observation\n", "$\\dataScalar^*$ and $\\inputMatrix$ with the corresponding inputs,\n", "$\\inputVector^*$), then we can answer any given question about a future\n", "test point $\\dataScalar^*$ given its covariates $\\inputVector^*$ by\n", "conditioning on the training variables to recover $$\n", "p(\\dataScalar^*|\\inputMatrix, \\dataVector, \\inputVector^*).\n", "$$\n", "\n", "We can compute this distribution using the product and sum rules.\n", "However, to specify this density we must give the probability associated\n", "with all possible combinations of $\\dataVector$ and $\\inputMatrix$.\n", "There are $2^{\\numData}$ possible combinations for the vector\n", "$\\dataVector$ and the probability for each of these combinations must be\n", "jointly specified along with the joint density of the matrix\n", "$\\inputMatrix$. We must also be able to *extend* the density for any\n", "chosen test location $\\inputVector^*$.\n", "\n", "In naive Bayes we make certain simplifying assumptions that allow us to\n", "perform all of the 
above in practice." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Data Conditional Independence\n", "\n", "If we are given model parameters $\\paramVector$ we assume that,\n", "conditioned on all these parameters, all data points in the model\n", "are independent. In other words we have, $$\n", " p(\\dataScalar^*, \\inputVector^*, \\dataVector, \\inputMatrix|\\paramVector) = p(\\dataScalar^*, \\inputVector^*|\\paramVector)\\prod_{i=1}^{\\numData} p(\\dataScalar_i, \\inputVector_i | \\paramVector).\n", " $$ This is a conditional independence assumption because we are not\n", "assuming our data are purely independent. If we were to assume that,\n", "then there would be nothing to learn about our test data given our\n", "training data. We are assuming that they are independent *given* our\n", "parameters, $\\paramVector$. We made similar assumptions for regression,\n", "where our parameter set included $\\mappingVector$ and $\\dataStd^2$.\n", "Given those parameters we assumed that the density over\n", "$\\dataVector, \\dataScalar^*$ was *independent*. Here we are going a\n", "little further with that assumption because we are assuming the *joint*\n", "density of $\\dataVector$ and $\\inputMatrix$ is independent across the\n", "data given the parameters.\n", "\n", "Computing the posterior distribution in this case becomes easier. The\n", "resulting model is known as the 'Bayes classifier'." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Feature Conditional Independence\n", "\n", "The assumption that is particular to naive Bayes is to now consider that\n", "the *features* are also conditionally independent, but not only given\n", "the parameters. We assume that the features are independent given the\n", "parameters *and* the label. 
So for each data point we have\n", "$$p(\\inputVector_i | \\dataScalar_i, \\paramVector) = \\prod_{j=1}^{\\dataDim} p(\\inputScalar_{i,j}|\\dataScalar_i,\\paramVector)$$\n", "where $\\dataDim$ is the dimensionality of our inputs." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Marginal Density for $\\dataScalar_i$\n", "\n", "We now have nearly all of the components we need to specify the full\n", "joint density. However, the feature conditional independence doesn't yet\n", "give us the joint density $p(\\dataScalar_i, \\inputVector_i)$, which\n", "is required to substitute into our data conditional independence to give\n", "us the full density. To recover the joint density given the conditional\n", "distribution of each feature,\n", "$p(\\inputScalar_{i,j}|\\dataScalar_i, \\paramVector)$, we need to make use\n", "of the product rule and combine it with a marginal density for\n", "$\\dataScalar_i$,\n", "\n", "$$p(\\inputScalar_{i,j},\\dataScalar_i| \\paramVector) = p(\\inputScalar_{i,j}|\\dataScalar_i, \\paramVector)p(\\dataScalar_i).$$\n", "Because $\\dataScalar_i$ is binary the *Bernoulli* density makes a\n", "suitable choice for our prior over $\\dataScalar_i$,\n", "$$p(\\dataScalar_i|\\pi) = \\pi^{\\dataScalar_i} (1-\\pi)^{1-\\dataScalar_i}$$\n", "where $\\pi$ now has the interpretation as being the *prior* probability\n", "that the classification should be positive." 
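, "\n", "\n", "The product rule above can be checked numerically for a single feature.\n", "This is a hedged sketch rather than part of the original notes: the\n", "values of `pi` and `theta` are made up for illustration, and the feature\n", "is assumed to be binary and Bernoulli distributed given the label. The\n", "resulting table should sum to one, and summing over the feature should\n", "recover the prior.\n", "\n", "
```python
import numpy as np

pi = 0.3                      # illustrative prior P(y=1)
theta = np.array([0.2, 0.7])  # illustrative P(x=1 | y=0) and P(x=1 | y=1)


def bernoulli(value, prob):
    # prob^value * (1 - prob)^(1 - value), written as a switch.
    return prob if value == 1 else 1.0 - prob


# joint[x, y] = p(x | y) p(y), one entry per combination of x and y.
joint = np.zeros((2, 2))
for y in (0, 1):
    for x in (0, 1):
        joint[x, y] = bernoulli(x, theta[y]) * bernoulli(y, pi)

print(joint)
print('sum over all entries:', joint.sum())
print('marginal P(y=1):', joint[:, 1].sum())
```
"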
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Joint Density for Naive Bayes\n", "\n", "This allows us to write down the full joint density of the training\n", "data, $$\n", " p(\\dataVector, \\inputMatrix|\\paramVector, \\pi) = \\prod_{i=1}^{\\numData} p(\\dataScalar_i|\\pi) \\prod_{j=1}^{\\dataDim} p(\\inputScalar_{i,j}|\\dataScalar_i, \\paramVector)\n", " $$\n", "\n", "which can now be fit by maximum likelihood. As normal we form our\n", "objective as the negative log likelihood, $$\n", "\\errorFunction(\\paramVector, \\pi) = -\\log p(\\dataVector, \\inputMatrix|\\paramVector, \\pi) = -\\sum_{i=1}^{\\numData} \\sum_{j=1}^{\\dataDim} \\log p(\\inputScalar_{i, j}|\\dataScalar_i, \\paramVector) - \\sum_{i=1}^{\\numData} \\log p(\\dataScalar_i|\\pi),\n", "$$ which we note *decomposes* into two objective functions, one which is\n", "dependent on $\\pi$ alone and one which is dependent on $\\paramVector$\n", "alone so we have, $$\n", "\\errorFunction(\\pi, \\paramVector) = \\errorFunction(\\paramVector) + \\errorFunction(\\pi).\n", "$$ Since the two objective functions are separately dependent on the\n", "parameters $\\pi$ and $\\paramVector$ we can minimize them independently.\n", "Firstly, minimizing the negative log likelihood of the Bernoulli over\n", "the labels we have, $$\n", "\\errorFunction(\\pi) = -\\sum_{i=1}^{\\numData}\\log p(\\dataScalar_i|\\pi) = -\\sum_{i=1}^{\\numData} \\dataScalar_i \\log \\pi - \\sum_{i=1}^{\\numData} (1-\\dataScalar_i) \\log (1-\\pi)\n", "$$ which we already minimized above recovering $$\n", "\\pi = \\frac{\\sum_{i=1}^{\\numData} \\dataScalar_i}{\\numData}.\n", "$$\n", "\n", "We now need to minimize the objective associated with the conditional\n", "distributions for the features, $$\n", "\\errorFunction(\\paramVector) = -\\sum_{i=1}^{\\numData} \\sum_{j=1}^{\\dataDim} \\log p(\\inputScalar_{i, j} |\\dataScalar_i, \\paramVector),\n", "$$ which necessarily implies making some assumptions about the form of\n", "the conditional distributions. 
The right assumption will depend on the\n", "nature of our input data. For example, if we have an input which is real\n", "valued, we could use a Gaussian density and we could allow the mean and\n", "variance of the Gaussian to be different according to whether the class\n", "was positive or negative and according to which feature we were\n", "measuring. That would give us the form, $$\n", "p(\\inputScalar_{i, j} | \\dataScalar_i,\\paramVector) = \\frac{1}{\\sqrt{2\\pi \\dataStd_{\\dataScalar_i,j}^2}} \\exp \\left(-\\frac{(\\inputScalar_{i,j} - \\mu_{\\dataScalar_i, j})^2}{2\\dataStd_{\\dataScalar_i,j}^2}\\right),\n", "$$ where $\\dataStd_{1, j}^2$ is the variance of the density for the\n", "$j$th feature and the class $\\dataScalar_i=1$ and $\\dataStd_{0, j}^2$ is\n", "the variance if the class is 0. The means can vary similarly. Our\n", "parameters, $\\paramVector$, would consist of all the means and all the\n", "variances for the different dimensions." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Classification with the Movie Body Count Data\n", "\n", "First we will load in the movie body count data. Our aim will be to\n", "predict whether a movie is rated R or not given the attributes in the\n", "data. 
We will predict on the basis of year, body count and movie genre.\n", "The genres in the CSV file are stored as a list in the following form:\n", "\n", "```Biography|Action|Sci-Fi```\n", "\n", "First we have to do a little work to extract this form and turn it into\n", "a vector of binary values. Let's first load in and remind ourselves of\n", "the data." ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [], "source": [ "import pods" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [], "source": [ "data = pods.datasets.movie_body_count()['Y']" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Film Year Body_Count MPAA_Rating \\\n", "0 24 Hour Party People 2002 7 R \n", "1 3:10 to Yuma 2007 45 R \n", "2 300 2006 0 R \n", "3 8MM 1999 7 R \n", "4 The Abominable Dr. Phibes 1971 10 PG-13 \n", "\n", " Genre Director \\\n", "0 [Biography, Comedy, Drama, Music] [Michael Winterbottom] \n", "1 [Adventure, Crime, Drama, Western] [James Mangold] \n", "2 [Action, Fantasy, History, War] [Zack Snyder] \n", "3 [Crime, Mystery, Thriller] [Joel Schumacher] \n", "4 [Fantasy, Horror] [Robert Fuest] \n", "\n", " Actors Length_Minutes \\\n", "0 [Steve Coogan, John Thomson, Paul Popplewell, ... 117 \n", "1 [Russell Crowe, Christian Bale, Logan Lerman, ... 122 \n", "2 [Gerard Butler, Lena Headey, Dominic West, Dav... 117 \n", "3 [Nicolas Cage, Joaquin Phoenix, James Gandolfi... 123 \n", "4 [Vincent Price, Joseph Cotten, Hugh Griffith, ... 94 \n", "\n", " IMDB_Rating \n", "0 7.4 \n", "1 7.8 \n", "2 7.8 \n", "3 6.4 \n", "4 7.2 " ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we will convert this data into a form which we can use as inputs\n", "`X`, and labels `y`." 
] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [], "source": [ "X = data[['Year', 'Body_Count']].copy()\n", "y = data['MPAA_Rating']=='R' # set label to be positive for R rated films.\n", "\n", "# Create series of movie genres with the relevant index\n", "s = data['Genre'].apply(pd.Series, 1).stack() \n", "s.index = s.index.droplevel(-1) # to line up with df's index\n", "\n", "# Extract from the series the unique list of genres.\n", "genres = s.unique()\n", "\n", "# For each genre extract the indices where it is present and add a column to X\n", "for genre in genres:\n", " index = s[s==genre].index.tolist()\n", " X.loc[:, genre] = 0.0 \n", " X.loc[index, genre] = 1.0 " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This has given us a new data frame `X` which contains the different\n", "genres in different columns." ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Year Body_Count Biography Comedy Drama \\\n", "count 421.000000 421.000000 421.000000 421.000000 421.000000 \n", "mean 1996.491686 53.287411 0.026128 0.152019 0.384798 \n", "std 10.913210 82.068035 0.159706 0.359466 0.487126 \n", "min 1949.000000 0.000000 0.000000 0.000000 0.000000 \n", "25% 1991.000000 11.000000 0.000000 0.000000 0.000000 \n", "50% 2000.000000 28.000000 0.000000 0.000000 0.000000 \n", "75% 2005.000000 61.000000 0.000000 0.000000 1.000000 \n", "max 2009.000000 836.000000 1.000000 1.000000 1.000000 \n", "\n", " Music Adventure Crime Western Action ... \\\n", "count 421.000000 421.000000 421.000000 421.000000 421.000000 ... \n", "mean 0.011876 0.275534 0.394299 0.028504 0.629454 ... \n", "std 0.108459 0.447315 0.489281 0.166604 0.483526 ... \n", "min 0.000000 0.000000 0.000000 0.000000 0.000000 ... \n", "25% 0.000000 0.000000 0.000000 0.000000 0.000000 ... \n", "50% 0.000000 0.000000 0.000000 0.000000 1.000000 ... \n", "75% 0.000000 1.000000 1.000000 0.000000 1.000000 ... \n", "max 1.000000 1.000000 1.000000 1.000000 1.000000 ... 
\n", "\n", " Mystery Thriller Horror Sci-Fi Animation Romance \\\n", "count 421.000000 421.000000 421.000000 421.00000 421.000000 421.000000 \n", "mean 0.099762 0.612827 0.111639 0.19715 0.009501 0.064133 \n", "std 0.300040 0.487683 0.315296 0.39832 0.097125 0.245281 \n", "min 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 \n", "25% 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 \n", "50% 0.000000 1.000000 0.000000 0.00000 0.000000 0.000000 \n", "75% 0.000000 1.000000 0.000000 0.00000 0.000000 0.000000 \n", "max 1.000000 1.000000 1.000000 1.00000 1.000000 1.000000 \n", "\n", " Sport Family Musical Film-Noir \n", "count 421.000000 421.000000 421.000000 421.000000 \n", "mean 0.007126 0.021378 0.004751 0.002375 \n", "std 0.084214 0.144812 0.068842 0.048737 \n", "min 0.000000 0.000000 0.000000 0.000000 \n", "25% 0.000000 0.000000 0.000000 0.000000 \n", "50% 0.000000 0.000000 0.000000 0.000000 \n", "75% 0.000000 0.000000 0.000000 0.000000 \n", "max 1.000000 1.000000 1.000000 1.000000 \n", "\n", "[8 rows x 23 columns]" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now specify the naive Bayes model. For the genres we want to\n", "model the data as Bernoulli distributed, and for the year and body count\n", "we want to model the data as Gaussian distributed. We set up two data\n", "frames to contain the parameters for the rows and the columns below." 
] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [], "source": [ "# Assume each feature is either binary or real valued.\n", "# Genre indicators are treated as binary; year and body count as real.\n", "binary_columns = genres\n", "real_columns = ['Year', 'Body_Count']\n", "Bernoulli = pd.DataFrame(data=np.zeros((2,len(binary_columns))), columns=binary_columns, index=['theta_0', 'theta_1'])\n", "Gaussian = pd.DataFrame(data=np.zeros((4,len(real_columns))), columns=real_columns, index=['mu_0', 'sigma2_0', 'mu_1', 'sigma2_1'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have the data in a form ready for analysis, let's split it\n", "into training and test sets." ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [], "source": [ "num_train = 200\n", "indices = np.random.permutation(X.shape[0])\n", "train_indices = indices[:num_train]\n", "test_indices = indices[num_train:]\n", "X_train = X.loc[train_indices]\n", "y_train = y.loc[train_indices]\n", "X_test = X.loc[test_indices]\n", "y_test = y.loc[test_indices]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And we can now train the model. For each feature we can make the fit\n", "independently. For binary data the fit is given by counting the number\n", "of positives within each class, which gives us the maximum likelihood\n", "solution for the Bernoulli. For real valued data it is given by\n", "computing the empirical mean and variance within each class, which gives\n", "us the maximum likelihood solution for the Gaussian." 
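, "\n", "\n", "The counting and moment computations described above can be sketched on\n", "a tiny made-up data set before applying them to the movie data. All\n", "arrays and variable names here are hypothetical. The point is only that\n", "each class-conditional parameter is estimated from the training rows of\n", "that class, so the denominators are the class counts in the training set.\n", "\n", "
```python
import numpy as np

# Made-up toy training data: binary labels, one binary and one real feature.
labels = np.array([1, 1, 1, 0, 0])
binary_feature = np.array([1.0, 0.0, 1.0, 0.0, 0.0])
real_feature = np.array([2.0, 4.0, 6.0, 1.0, 3.0])

positive = labels == 1

# Bernoulli maximum likelihood: fraction of ones within each class.
theta_1 = binary_feature[positive].sum() / positive.sum()
theta_0 = binary_feature[~positive].sum() / (~positive).sum()

# Gaussian maximum likelihood: per-class mean and variance (ddof=0).
mu_1 = real_feature[positive].mean()
sigma2_1 = real_feature[positive].var(ddof=0)
mu_0 = real_feature[~positive].mean()
sigma2_0 = real_feature[~positive].var(ddof=0)

# Prior probability of the positive class.
prior = labels.sum() / len(labels)

print(theta_0, theta_1, mu_0, sigma2_0, mu_1, sigma2_1, prior)
```
"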
] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [], "source": [ "for column in X_train:\n", "    # Training rows of this feature, split by class label.\n", "    negatives = X_train.loc[~y_train, column]\n", "    positives = X_train.loc[y_train, column]\n", "    if column in Gaussian:\n", "        Gaussian.loc['mu_0', column] = negatives.mean()\n", "        Gaussian.loc['mu_1', column] = positives.mean()\n", "        Gaussian.loc['sigma2_0', column] = negatives.var(ddof=0)\n", "        Gaussian.loc['sigma2_1', column] = positives.var(ddof=0)\n", "    if column in Bernoulli:\n", "        # Divide by the class counts in the training set, not the full data.\n", "        Bernoulli.loc['theta_0', column] = negatives.sum()/(~y_train).sum()\n", "        Bernoulli.loc['theta_1', column] = positives.sum()/y_train.sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can examine the distributions we've fitted by looking at the entries\n", "in these data frames." ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Biography Comedy Drama Music Adventure Crime \\\n", "theta_0 0.010256 0.061538 0.123077 0.005128 0.215385 0.158974 \n", "theta_1 0.026549 0.097345 0.252212 0.013274 0.070796 0.221239 \n", "\n", " Western Action Fantasy History ... Mystery \\\n", "theta_0 0.025641 0.358974 0.046154 0.010256 ... 0.015385 \n", "theta_1 0.013274 0.261062 0.030973 0.039823 ... 0.048673 \n", "\n", " Thriller Horror Sci-Fi Animation Romance Sport \\\n", "theta_0 0.266667 0.035897 0.128205 0.010256 0.041026 0.005128 \n", "theta_1 0.318584 0.057522 0.066372 0.000000 0.044248 0.004425 \n", "\n", " Family Musical Film-Noir \n", "theta_0 0.015385 0.005128 0.0 \n", "theta_1 0.000000 0.004425 0.0 \n", "\n", "[2 rows x 21 columns]" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Bernoulli" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Year Body_Count\n", "mu_0 1992.659574 56.808511\n", "sigma2_0 148.735174 4737.431417\n", "mu_1 2000.415094 54.292453\n", "sigma2_1 46.393734 5740.433339" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Gaussian" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The final model parameter is the prior probability of the positive\n", "class, $\\pi$, which is computed by maximum likelihood." ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [], "source": [ "prior = float(y_train.sum())/len(y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Making Predictions\n", "\n", "Naive Bayes has given us the class conditional densities:\n", "$p(\\inputVector_i | \\dataScalar_i, \\paramVector)$. To make predictions\n", "with these densities we need to form the distribution given by $$\n", "P(\\dataScalar^*| \\dataVector, \\inputMatrix, \\inputVector^*, \\paramVector)\n", "$$ This can be computed by using the product rule. 
We know that $$\n", "P(\\dataScalar^*| \\dataVector, \\inputMatrix, \\inputVector^*, \\paramVector)p(\\dataVector, \\inputMatrix, \\inputVector^*|\\paramVector) = p(\\dataScalar^*, \\dataVector, \\inputMatrix, \\inputVector^*| \\paramVector)\n", "$$ implying that $$\n", "P(\\dataScalar^*| \\dataVector, \\inputMatrix, \\inputVector^*, \\paramVector) = \\frac{p(\\dataScalar^*, \\dataVector, \\inputMatrix, \\inputVector^*| \\paramVector)}{p(\\dataVector, \\inputMatrix, \\inputVector^*|\\paramVector)}\n", "$$ and we've already defined\n", "$p(\\dataScalar^*, \\dataVector, \\inputMatrix, \\inputVector^*| \\paramVector)$\n", "using our conditional independence assumptions above, $$\n", "p(\\dataScalar^*, \\dataVector, \\inputMatrix, \\inputVector^*| \\paramVector) = \\prod_{j=1}^{\\dataDim} p(\\inputScalar^*_{j}|\\dataScalar^*, \\paramVector)p(\\dataScalar^*|\\pi)\\prod_{i=1}^{\\numData} \\prod_{j=1}^{\\dataDim} p(\\inputScalar_{i,j}|\\dataScalar_i, \\paramVector)p(\\dataScalar_i|\\pi)\n", "$$ The other required density is $$\n", "p(\\dataVector, \\inputMatrix, \\inputVector^*|\\paramVector)\n", "$$ which can be found from\n", "$p(\\dataScalar^*, \\dataVector, \\inputMatrix, \\inputVector^*| \\paramVector)$\n", "using the *sum rule* of probability, $$\n", "p(\\dataVector, \\inputMatrix, \\inputVector^*|\\paramVector) = \\sum_{\\dataScalar^*=0}^1 p(\\dataScalar^*, \\dataVector, \\inputMatrix, \\inputVector^*| \\paramVector).\n", "$$ Because of our independence assumptions that is simply equal to $$\n", "p(\\dataVector, \\inputMatrix, \\inputVector^*| \\paramVector) = \\sum_{\\dataScalar^*=0}^1 \\prod_{j=1}^{\\dataDim} p(\\inputScalar^*_{j}|\\dataScalar^*, \\paramVector)p(\\dataScalar^*|\\pi)\\prod_{i=1}^{\\numData} \\prod_{j=1}^{\\dataDim} p(\\inputScalar_{i,j}|\\dataScalar_i, \\paramVector)p(\\dataScalar_i|\\pi).\n", "$$ Substituting both forms in to recover our distribution over the test\n", "label conditioned on the training data we have, $$\n", "P(\\dataScalar^*| 
\\dataVector, \\inputMatrix, \\inputVector^*, \\paramVector) = \\frac{\\prod_{j=1}^{\\dataDim} p(\\inputScalar^*_{j}|\\dataScalar^*, \\paramVector)p(\\dataScalar^*|\\pi)\\prod_{i=1}^{\\numData} \\prod_{j=1}^{\\dataDim} p(\\inputScalar_{i,j}|\\dataScalar_i, \\paramVector)p(\\dataScalar_i|\\pi)}{\\sum_{\\dataScalar^*=0}^1 \\prod_{j=1}^{\\dataDim} p(\\inputScalar^*_{j}|\\dataScalar^*, \\paramVector)p(\\dataScalar^*|\\pi)\\prod_{i=1}^{\\numData} \\prod_{j=1}^{\\dataDim} p(\\inputScalar_{i,j}|\\dataScalar_i, \\paramVector)p(\\dataScalar_i|\\pi)}\n", "$$ and we notice that all the terms associated with the training data\n", "cancel: the test prediction is *conditionally independent* of\n", "the training data *given* the parameters. This is a result of our\n", "conditional independence assumptions over the data points. $$\n", "P(\\dataScalar^*| \\inputVector^*, \\paramVector) = \\frac{\\prod_{j=1}^{\\dataDim} p(\\inputScalar^*_{j}|\\dataScalar^*,\n", "\\paramVector)p(\\dataScalar^*|\\pi)}{\\sum_{\\dataScalar^*=0}^1 \\prod_{j=1}^{\\dataDim} p(\\inputScalar^*_{j}|\\dataScalar^*, \\paramVector)p(\\dataScalar^*|\\pi)}\n", "$$ This formula is also fairly straightforward to implement. First we\n", "implement the log probabilities for the Gaussian density." ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [], "source": [ "def log_gaussian(x, mu, sigma2):\n", "    # log of the univariate Gaussian density with mean mu and variance sigma2\n", "    return -0.5*np.log(2*np.pi*sigma2) - ((x-mu)**2)/(2*sigma2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now for any test point we compute the joint distribution of the Gaussian\n", "features by *summing* their log probabilities. Working in log space can\n", "be a considerable advantage over computing the probabilities directly:\n", "as the number of features we include goes up, because the individual\n", "terms are typically less than 1, the joint probability will become smaller\n", "and smaller, and may be difficult to represent accurately (or even\n", "underflow). 
Working in log space can ameliorate this problem. We can\n", "also compute the log probability for the Bernoulli distribution." ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [], "source": [ "def log_bernoulli(x, theta):\n", "    return x*np.log(theta) + (1-x)*np.log(1-theta)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Laplace Smoothing\n", "\n", "Before we proceed, let's just pause and think for a moment about what will\n", "happen if `theta` here is either zero or one. This will result in\n", "$\\log 0 = -\\infty$ and cause numerical problems. This definitely can\n", "happen in practice. If some of the features are rare or very common\n", "across the data set then the maximum likelihood solution could find\n", "values of zero or one respectively. Such values are problematic because\n", "they cause posterior probabilities of class membership of either one or\n", "zero. In practice we deal with this using *Laplace smoothing*, which\n", "actually has an interpretation as a Bayesian fit of the Bernoulli\n", "distribution. 
Laplace used an example of the sun rising each day, and a\n", "wish to predict the sunrise the following day, to describe his idea of\n", "smoothing, which can be found at the bottom of the following page from\n", "Laplace's 'Essai Philosophique ...'" ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import pods\n", "pods.notebook.display_google_book(id='1YQPAAAAQAAJ', page='PA16')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Laplace suggests that when computing the probability of an event where\n", "a success or failure is rare (he uses an example of the sun rising\n", "across the last 5,000 years or 1,826,213 days), even though only\n", "successes have been observed (in the sun rising case), the probability for\n", "tomorrow shouldn't be given as $$\n", "\\frac{1,826,213}{1,826,213} = 1\n", "$$ but rather by adding one to the numerator and two to the denominator,\n", "$$\n", "\\frac{1,826,213 + 1}{1,826,213 + 2} = 0.99999945.\n", "$$ This technique is sometimes called a 'pseudocount technique' because\n", "it has an interpretation of assuming some observations before you start:\n", "it's as if instead of observing $\\sum_{i}\\dataScalar_i$ successes you\n", "have an additional success, $\\sum_{i}\\dataScalar_i + 1$, and instead of\n", "having observed $\\numData$ events you've observed $\\numData + 2$. So we can\n", "think of Laplace's idea as saying (before we start) that we have 'two\n", "observations worth of belief, that the odds are 50/50', because before\n", "we start (i.e. when $\\numData=0$) our estimate is 0.5, yet because the\n", "effective number of prior observations is only 2, this estimate is quickly overwhelmed by data.\n", "Laplace used ideas like this a lot, and it is known as his 'principle of\n", "insufficient reason'. His idea was that in the absence of knowledge\n", "(i.e. 
before we start) we should assume that all possible outcomes are\n", "equally likely. This idea has a modern counterpart, known as the\n", "[principle of maximum\n", "entropy](http://en.wikipedia.org/wiki/Principle_of_maximum_entropy). A\n", "lot of the theory of this approach was developed by [Ed\n", "Jaynes](http://en.wikipedia.org/wiki/Edwin_Thompson_Jaynes), who\n", "according to his erstwhile collaborator and friend, John Skilling,\n", "learnt French as an undergraduate by reading the works of Laplace.\n", "Although John also related that Jaynes's spoken French was not up to the\n", "standard of his scientific French. For me Ed Jaynes's work very much\n", "carries on the tradition of Laplace into the modern era, in particular\n", "his focus on Bayesian approaches. I'm very proud to have met those that\n", "knew and worked with him. It turns out that Laplace's idea also has a\n", "Bayesian interpretation (as Laplace understood), it comes from assuming\n", "a particular prior density for the parameter $\\pi$, but we won't explore\n", "that interpretation for the moment, and merely choose to estimate the\n", "probability as, $$\n", "\\pi = \\frac{\\sum_{i=1}^{\\numData} \\dataScalar_i + 1}{\\numData + 2}\n", "$$ to prevent problems with certainty causing numerical issues and\n", "misclassifications. Let's refit the Bernoulli features now." ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [], "source": [ "# fit the Bernoulli with Laplace smoothing.\n", "for column in X_train:\n", " if column in Bernoulli:\n", " Bernoulli[column]['theta_0'] = (X_train[column][~y].sum() + 1)/((~y).sum() + 2)\n", " Bernoulli[column]['theta_1'] = (X_train[column][y].sum() + 1)/((y).sum() + 2)" ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>Biography</th><th>Comedy</th><th>Drama</th><th>Music</th><th>Adventure</th><th>Crime</th><th>Western</th><th>Action</th><th>Fantasy</th><th>History</th><th>...</th><th>Mystery</th><th>Thriller</th><th>Horror</th><th>Sci-Fi</th><th>Animation</th><th>Romance</th><th>Sport</th><th>Family</th><th>Musical</th><th>Film-Noir</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>theta_0</th><td>0.015228</td><td>0.065990</td><td>0.126904</td><td>0.010152</td><td>0.218274</td><td>0.162437</td><td>0.030457</td><td>0.360406</td><td>0.050761</td><td>0.015228</td><td>...</td><td>0.020305</td><td>0.269036</td><td>0.040609</td><td>0.131980</td><td>0.015228</td><td>0.045685</td><td>0.010152</td><td>0.020305</td><td>0.010152</td><td>0.005076</td></tr>\n",
"    <tr><th>theta_1</th><td>0.030702</td><td>0.100877</td><td>0.254386</td><td>0.017544</td><td>0.074561</td><td>0.223684</td><td>0.017544</td><td>0.263158</td><td>0.035088</td><td>0.043860</td><td>...</td><td>0.052632</td><td>0.320175</td><td>0.061404</td><td>0.070175</td><td>0.004386</td><td>0.048246</td><td>0.008772</td><td>0.004386</td><td>0.008772</td><td>0.004386</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>2 rows × 21 columns</p>\n",
"</div>
" ], "text/plain": [ " Biography Comedy Drama Music Adventure Crime \\\n", "theta_0 0.015228 0.065990 0.126904 0.010152 0.218274 0.162437 \n", "theta_1 0.030702 0.100877 0.254386 0.017544 0.074561 0.223684 \n", "\n", " Western Action Fantasy History ... Mystery \\\n", "theta_0 0.030457 0.360406 0.050761 0.015228 ... 0.020305 \n", "theta_1 0.017544 0.263158 0.035088 0.043860 ... 0.052632 \n", "\n", " Thriller Horror Sci-Fi Animation Romance Sport \\\n", "theta_0 0.269036 0.040609 0.131980 0.015228 0.045685 0.010152 \n", "theta_1 0.320175 0.061404 0.070175 0.004386 0.048246 0.008772 \n", "\n", " Family Musical Film-Noir \n", "theta_0 0.020305 0.010152 0.005076 \n", "theta_1 0.004386 0.008772 0.004386 \n", "\n", "[2 rows x 21 columns]" ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Bernoulli" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That places us in a position to write the prediction function." ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [], "source": [ "def predict(X_test, Gaussian, Bernoulli, prior):\n", " log_positive = pd.Series(data = np.zeros(X_test.shape[0]), index=X_test.index)\n", " log_negative = pd.Series(data = np.zeros(X_test.shape[0]), index=X_test.index)\n", " for column in X_test.columns:\n", " if column in Gaussian:\n", " log_positive += log_gaussian(X_test[column], Gaussian[column]['mu_1'], Gaussian[column]['sigma2_1'])\n", " log_negative += log_gaussian(X_test[column], Gaussian[column]['mu_0'], Gaussian[column]['sigma2_0'])\n", " elif column in Bernoulli:\n", " log_positive += log_bernoulli(X_test[column], Bernoulli[column]['theta_1'])\n", " log_negative += log_bernoulli(X_test[column], Bernoulli[column]['theta_0'])\n", " \n", " return np.exp(log_positive + np.log(prior))/(np.exp(log_positive + np.log(prior)) + np.exp(log_negative + np.log(1-prior)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we are in a position to make the 
predictions for the test data." ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [], "source": [ "p_y = predict(X_test, Gaussian, Bernoulli, prior)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can test the quality of the predictions in the following way.\n", "Firstly, we can threshold our probabilities at 0.5, allocating points\n", "with greater than 50% probability of membership of the positive class to\n", "the positive class. We can then compare to the true values, and see how\n", "many of these values we got correct. This is our total number correct." ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Total correct 176 out of 221 which is 79.6%\n" ] } ], "source": [ "correct = y_test.eq(p_y>0.5)\n", "total_correct = correct.sum()\n", "# report the accuracy as a percentage rather than a fraction labelled '%'\n", "print(\"Total correct\", total_correct, \"out of\", len(y_test), \"which is\", \"{:.1%}\".format(total_correct/len(y_test)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also now compute the [confusion\n", "matrix](http://en.wikipedia.org/wiki/Confusion_matrix). A confusion\n", "matrix tells us where we are making mistakes. Along the diagonal it\n", "stores the *true positives*, the points from the positive class that we\n", "classified correctly, and the *true negatives*, the points from the\n", "negative class that we classified correctly. The off-diagonal terms\n", "contain the false positives and the false negatives. Along the rows of\n", "the matrix we place the actual class, and along the columns we place our\n", "predicted class." 
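, "\n", "\n", "As a worked check of this layout, take the counts that the table further down reports for this test set: 72 true negatives, 29 false positives, 16 false negatives and 104 true positives. The diagonal recovers the accuracy computed above. The terms *precision* and *recall* are not used elsewhere in these notes; they are the standard summaries of the positive (R-rated) column and row, added here only as an illustration. $$\n", "\\text{accuracy} = \\frac{104 + 72}{104 + 72 + 29 + 16} = \\frac{176}{221} \\approx 0.796\n", "$$ $$\n", "\\text{precision} = \\frac{104}{104 + 29} \\approx 0.782, \\qquad \\text{recall} = \\frac{104}{104 + 16} \\approx 0.867.\n", "$$ A recall below one means that some R-rated films (here 16 of 120) fall under the 0.5 threshold and are predicted as not R-rated."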
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "(y_test & ~(p_y>0.5)).sum()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import confusion_matrix\n", "confusion_matrix(y_test, p_y>0.5, labels=None, sample_weight=None)" ] }, { "cell_type": "code", "execution_count": 76, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
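<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>predicted not R-rated</th><th>predicted R-rated</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>actual not R-rated</th><td>72.0</td><td>29.0</td></tr>\n",
"    <tr><th>actual R-rated</th><td>16.0</td><td>104.0</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>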
" ], "text/plain": [ " predicted not R-rated predicted R-rated\n", "actual not R-rated 72.0 29.0\n", "actual R-rated 16.0 104.0" ] }, "execution_count": 76, "metadata": {}, "output_type": "execute_result" } ], "source": [ "confusion_matrix = pd.DataFrame(data=np.zeros((2,2)), \n", " columns=['predicted not R-rated', 'predicted R-rated'],\n", " index =['actual not R-rated','actual R-rated'])\n", "confusion_matrix['predicted R-rated']['actual R-rated'] = (y_test & (p_y>0.5)).sum()\n", "confusion_matrix['predicted R-rated']['actual not R-rated'] = (~y_test & (p_y>0.5)).sum()\n", "confusion_matrix['predicted not R-rated']['actual R-rated'] = (y_test & ~(p_y>0.5)).sum()\n", "confusion_matrix['predicted not R-rated']['actual not R-rated'] = (~y_test & ~(p_y>0.5)).sum()\n", "confusion_matrix" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise\n", "\n", "How can you improve your classification, are all the features equally\n", "valid? Are some features more helpful than others? What happens if you\n", "remove features that appear to be less helpful. How might you select\n", "such features?\n", "\n", "### Write your answer to Exercise here" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Use this box for any code you need for the exercise\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise\n", "\n", "We have decided to classify positive if probability of R rating is\n", "greater than 0.5. This has led us to accidentally classify some films as\n", "'safe for children' when the aren't in actuallity. Imagine you wish to\n", "ensure that the film is safe for children. 
With your test set, how low do\n", "you have to set the threshold to avoid all the false negatives (i.e.\n", "films where you said it wasn't R-rated, but in actuality it was)?\n", "\n", "### Write your answer to Exercise here" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Use this box for any code you need for the exercise\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise\n", "\n", "Write down the negative log likelihood of the Gaussian density over a\n", "vector of variables $\\inputVector$. Assume independence between each\n", "variable. Minimize this objective to obtain the maximum likelihood\n", "solution, which has the form $$\n", "\\mu = \\frac{\\sum_{i=1}^{\\numData} \\inputScalar_i}{\\numData}\n", "$$ $$\n", "\\dataStd^2 = \\frac{\\sum_{i=1}^{\\numData} (\\inputScalar_i - \\mu)^2}{\\numData}\n", "$$\n", "\n", "### Write your answer to Exercise here" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Use this box for any code you need for the exercise\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If the input data was *binary* then we could also make use of the\n", "Bernoulli distribution for the features. 
For that case we would have the\n", "form, $$\n", "p(\\inputScalar_{i, j} | \\dataScalar_i,\\paramVector) = \\theta_{\\dataScalar_i, j}^{\\inputScalar_{i, j}}(1-\\theta_{\\dataScalar_i, j})^{(1-\\inputScalar_{i,j})},\n", "$$ where $\\theta_{1, j}$ is the probability that the $j$th feature is on\n", "if $\\dataScalar_i$ is 1.\n", "\n", "In either case, maximum likelihood fitting would proceed in the same\n", "way. The objective has the form, $$\n", "\\errorFunction(\\paramVector) = -\\sum_{j=1}^{\\dataDim} \\sum_{i=1}^{\\numData} \\log p(\\inputScalar_{i,j} |\\dataScalar_i, \\paramVector),\n", "$$ and if, as above, the parameters of the distributions are specific to\n", "each feature (we had means and variances for each continuous\n", "feature, and a probability for each binary feature) then we can use the\n", "fact that these parameters separate into disjoint subsets across the\n", "features to write, $$\n", "\\begin{align*}\n", "\\errorFunction(\\paramVector) &= -\\sum_{j=1}^{\\dataDim} \\sum_{i=1}^{\\numData} \\log\n", "p(\\inputScalar_{i,j} |\\dataScalar_i, \\paramVector_j)\\\\\n", "&= \\sum_{j=1}^{\\dataDim}\n", "\\errorFunction(\\paramVector_j),\n", "\\end{align*}\n", "$$ which means we can minimize our objective on each feature\n", "independently.\n", "\n", "These characteristics mean that naive Bayes scales very well with big\n", "data. To fit the model we consider each feature in turn: we select the\n", "positive class and fit parameters for that class, then we select the\n", "negative class and fit parameters for that class. We have code below." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Naive Bayes Summary\n", "\n", "Naive Bayes is making very simple assumptions about the data, in\n", "particular it is modeling the full *joint* probability of the data set,\n", "$p(\\dataVector, \\inputMatrix | \\paramVector, \\pi)$ by very strong\n", "assumptions about factorizations that are unlikely to be true in\n", "practice. 
The data conditional independence assumption is common, and\n", "relies on a rich parameter vector to absorb all the information in the\n", "training data. The additional assumption of naive Bayes is that features\n", "are conditionally independent given the class label $\\dataScalar_i$ (and\n", "the parameter vector, $\\paramVector$). This is quite a strong assumption.\n", "However, it causes the objective function to decompose into parts which\n", "can be independently fitted to the different features, meaning it\n", "is very easy to fit the model to large data. It is also clear how we\n", "should handle *streaming* data and *missing* data. This means that the\n", "model can be run 'live', adapting parameters as information\n", "arrives. Indeed, the model is even capable of dealing with new\n", "*features* that might arrive at run time. Such is the strength of\n", "modeling the joint probability density. However, the factorization\n", "assumption that allows us to do this efficiently is very strong and may\n", "lead to poor decision boundaries in practice." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Other Reading\n", "\n", "Chapter 5 of Rogers and Girolami (2011) up to pg 179 (Section 5.1, and 5.2 up to 5.2.2).\n", "\n", "

### References\n", "\n", "Pearl, J., 1995. From Bayesian networks to causal networks, in: Gammerman, A. (Ed.), Probabilistic Reasoning and Bayesian Belief Networks. Alfred Waller, pp. 1–31.\n", "\n", "Rogers, S., Girolami, M., 2011. A first course in machine learning. CRC Press.\n", "\n", "Steele, S., Bilchik, A., Eberhardt, J., Kalina, P., Nissan, A., Johnson, E., Avital, I., Stojadinovic, A., 2012. Using machine-learned Bayesian belief networks to predict perioperative risk of clostridium difficile infection following colon surgery. Interact J Med Res 1, e6. https://doi.org/10.2196/ijmr.2131
\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.3" } }, "nbformat": 4, "nbformat_minor": 2 }