{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Introduction to Machine Learning Systems\n", "========================================\n", "\n", "### [Eric Meissner](https://www.linkedin.com/in/meissnereric/)\n", "\n", "### [Andrei Paleyes](https://www.linkedin.com/in/andreipaleyes/)\n", "\n", "### [Neil D. Lawrence](http://inverseprobability.com)\n", "\n", "### 2020-07-24" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Abstract**: This notebook introduces some of the challenges of\n", "building machine learning data systems. It will introduce you to\n", "concepts around joining of databases together. The storage and\n", "manipulation of data is at the core of machine learning systems and data\n", "science. The goal of this notebook is to introduce the reader to these\n", "concepts, not to authoritatively answer any questions about the state of\n", "Nigerian health facilities or Covid19, but it may give you ideas about\n", "how to try and do that in your own country." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$\n", "\\newcommand{\\tk}[1]{}\n", "%\\newcommand{\\tk}[1]{\\textbf{TK}: #1}\n", "\\newcommand{\\Amatrix}{\\mathbf{A}}\n", "\\newcommand{\\KL}[2]{\\text{KL}\\left( #1\\,\\|\\,#2 \\right)}\n", "\\newcommand{\\Kaast}{\\kernelMatrix_{\\mathbf{ \\ast}\\mathbf{ \\ast}}}\n", "\\newcommand{\\Kastu}{\\kernelMatrix_{\\mathbf{ \\ast} \\inducingVector}}\n", "\\newcommand{\\Kff}{\\kernelMatrix_{\\mappingFunctionVector \\mappingFunctionVector}}\n", "\\newcommand{\\Kfu}{\\kernelMatrix_{\\mappingFunctionVector \\inducingVector}}\n", "\\newcommand{\\Kuast}{\\kernelMatrix_{\\inducingVector \\bf\\ast}}\n", "\\newcommand{\\Kuf}{\\kernelMatrix_{\\inducingVector \\mappingFunctionVector}}\n", "\\newcommand{\\Kuu}{\\kernelMatrix_{\\inducingVector \\inducingVector}}\n", "\\newcommand{\\Kuui}{\\Kuu^{-1}}\n", "\\newcommand{\\Qaast}{\\mathbf{Q}_{\\bf \\ast \\ast}}\n", "\\newcommand{\\Qastf}{\\mathbf{Q}_{\\ast \\mappingFunction}}\n", "\\newcommand{\\Qfast}{\\mathbf{Q}_{\\mappingFunctionVector \\bf \\ast}}\n", "\\newcommand{\\Qff}{\\mathbf{Q}_{\\mappingFunctionVector \\mappingFunctionVector}}\n", "\\newcommand{\\aMatrix}{\\mathbf{A}}\n", "\\newcommand{\\aScalar}{a}\n", "\\newcommand{\\aVector}{\\mathbf{a}}\n", "\\newcommand{\\acceleration}{a}\n", "\\newcommand{\\bMatrix}{\\mathbf{B}}\n", "\\newcommand{\\bScalar}{b}\n", "\\newcommand{\\bVector}{\\mathbf{b}}\n", "\\newcommand{\\basisFunc}{\\phi}\n", "\\newcommand{\\basisFuncVector}{\\boldsymbol{ \\basisFunc}}\n", "\\newcommand{\\basisFunction}{\\phi}\n", "\\newcommand{\\basisLocation}{\\mu}\n", "\\newcommand{\\basisMatrix}{\\boldsymbol{ \\Phi}}\n", "\\newcommand{\\basisScalar}{\\basisFunction}\n", "\\newcommand{\\basisVector}{\\boldsymbol{ \\basisFunction}}\n", "\\newcommand{\\activationFunction}{\\phi}\n", "\\newcommand{\\activationMatrix}{\\boldsymbol{ \\Phi}}\n", "\\newcommand{\\activationScalar}{\\basisFunction}\n", "\\newcommand{\\activationVector}{\\boldsymbol{ \\basisFunction}}\n", "\\newcommand{\\bigO}{\\mathcal{O}}\n", "\\newcommand{\\binomProb}{\\pi}\n", "\\newcommand{\\cMatrix}{\\mathbf{C}}\n", "\\newcommand{\\cbasisMatrix}{\\hat{\\boldsymbol{ \\Phi}}}\n", "\\newcommand{\\cdataMatrix}{\\hat{\\dataMatrix}}\n", "\\newcommand{\\cdataScalar}{\\hat{\\dataScalar}}\n", "\\newcommand{\\cdataVector}{\\hat{\\dataVector}}\n", "\\newcommand{\\centeredKernelMatrix}{\\mathbf{ \\MakeUppercase{\\centeredKernelScalar}}}\n", "\\newcommand{\\centeredKernelScalar}{b}\n", "\\newcommand{\\centeredKernelVector}{\\centeredKernelScalar}\n", "\\newcommand{\\centeringMatrix}{\\mathbf{H}}\n", "\\newcommand{\\chiSquaredDist}[2]{\\chi_{#1}^{2}\\left(#2\\right)}\n", "\\newcommand{\\chiSquaredSamp}[1]{\\chi_{#1}^{2}}\n", "\\newcommand{\\conditionalCovariance}{\\boldsymbol{ \\Sigma}}\n", "\\newcommand{\\coregionalizationMatrix}{\\mathbf{B}}\n", "\\newcommand{\\coregionalizationScalar}{b}\n", "\\newcommand{\\coregionalizationVector}{\\mathbf{ \\coregionalizationScalar}}\n", "\\newcommand{\\covDist}[2]{\\text{cov}_{#2}\\left(#1\\right)}\n", "\\newcommand{\\covSamp}[1]{\\text{cov}\\left(#1\\right)}\n", "\\newcommand{\\covarianceScalar}{c}\n", "\\newcommand{\\covarianceVector}{\\mathbf{ \\covarianceScalar}}\n", "\\newcommand{\\covarianceMatrix}{\\mathbf{C}}\n", "\\newcommand{\\covarianceMatrixTwo}{\\boldsymbol{ \\Sigma}}\n", "\\newcommand{\\croupierScalar}{s}\n", "\\newcommand{\\croupierVector}{\\mathbf{ \\croupierScalar}}\n", "\\newcommand{\\croupierMatrix}{\\mathbf{ \\MakeUppercase{\\croupierScalar}}}\n", "\\newcommand{\\dataDim}{p}\n", "\\newcommand{\\dataIndex}{i}\n", "\\newcommand{\\dataIndexTwo}{j}\n", "\\newcommand{\\dataMatrix}{\\mathbf{Y}}\n", "\\newcommand{\\dataScalar}{y}\n", "\\newcommand{\\dataSet}{\\mathcal{D}}\n", "\\newcommand{\\dataStd}{\\sigma}\n", "\\newcommand{\\dataVector}{\\mathbf{ \\dataScalar}}\n", "\\newcommand{\\decayRate}{d}\n", "\\newcommand{\\degreeMatrix}{\\mathbf{ \\MakeUppercase{\\degreeScalar}}}\n", "\\newcommand{\\degreeScalar}{d}\n", "\\newcommand{\\degreeVector}{\\mathbf{ \\degreeScalar}}\n", "% Already defined by latex\n", "%\\newcommand{\\det}[1]{\\left|#1\\right|}\n", "\\newcommand{\\diag}[1]{\\text{diag}\\left(#1\\right)}\n", "\\newcommand{\\diagonalMatrix}{\\mathbf{D}}\n", "\\newcommand{\\diff}[2]{\\frac{\\text{d}#1}{\\text{d}#2}}\n", "\\newcommand{\\diffTwo}[2]{\\frac{\\text{d}^2#1}{\\text{d}#2^2}}\n", "\\newcommand{\\displacement}{x}\n", "\\newcommand{\\displacementVector}{\\textbf{\\displacement}}\n", "\\newcommand{\\distanceMatrix}{\\mathbf{ \\MakeUppercase{\\distanceScalar}}}\n", "\\newcommand{\\distanceScalar}{d}\n", "\\newcommand{\\distanceVector}{\\mathbf{ \\distanceScalar}}\n", "\\newcommand{\\eigenvaltwo}{\\ell}\n", "\\newcommand{\\eigenvaltwoMatrix}{\\mathbf{L}}\n", "\\newcommand{\\eigenvaltwoVector}{\\mathbf{l}}\n", "\\newcommand{\\eigenvalue}{\\lambda}\n", "\\newcommand{\\eigenvalueMatrix}{\\boldsymbol{ \\Lambda}}\n", "\\newcommand{\\eigenvalueVector}{\\boldsymbol{ \\lambda}}\n", "\\newcommand{\\eigenvector}{\\mathbf{ \\eigenvectorScalar}}\n", "\\newcommand{\\eigenvectorMatrix}{\\mathbf{U}}\n", "\\newcommand{\\eigenvectorScalar}{u}\n", "\\newcommand{\\eigenvectwo}{\\mathbf{v}}\n", "\\newcommand{\\eigenvectwoMatrix}{\\mathbf{V}}\n", "\\newcommand{\\eigenvectwoScalar}{v}\n", "\\newcommand{\\entropy}[1]{\\mathcal{H}\\left(#1\\right)}\n", "\\newcommand{\\errorFunction}{E}\n", "\\newcommand{\\expDist}[2]{\\left<#1\\right>_{#2}}\n", "\\newcommand{\\expSamp}[1]{\\left<#1\\right>}\n", "\\newcommand{\\expectation}[1]{\\left\\langle #1 \\right\\rangle }\n", "\\newcommand{\\expectationDist}[2]{\\left\\langle #1 \\right\\rangle _{#2}}\n", "\\newcommand{\\expectedDistanceMatrix}{\\mathcal{D}}\n", "\\newcommand{\\eye}{\\mathbf{I}}\n", "\\newcommand{\\fantasyDim}{r}\n", "\\newcommand{\\fantasyMatrix}{\\mathbf{ \\MakeUppercase{\\fantasyScalar}}}\n", "\\newcommand{\\fantasyScalar}{z}\n", "\\newcommand{\\fantasyVector}{\\mathbf{ \\fantasyScalar}}\n", "\\newcommand{\\featureStd}{\\varsigma}\n", "\\newcommand{\\gammaCdf}[3]{\\mathcal{GAMMA CDF}\\left(#1|#2,#3\\right)}\n", "\\newcommand{\\gammaDist}[3]{\\mathcal{G}\\left(#1|#2,#3\\right)}\n", "\\newcommand{\\gammaSamp}[2]{\\mathcal{G}\\left(#1,#2\\right)}\n", "\\newcommand{\\gaussianDist}[3]{\\mathcal{N}\\left(#1|#2,#3\\right)}\n", "\\newcommand{\\gaussianSamp}[2]{\\mathcal{N}\\left(#1,#2\\right)}\n", "\\newcommand{\\given}{|}\n", "\\newcommand{\\half}{\\frac{1}{2}}\n", "\\newcommand{\\heaviside}{H}\n", "\\newcommand{\\hiddenMatrix}{\\mathbf{ \\MakeUppercase{\\hiddenScalar}}}\n", "\\newcommand{\\hiddenScalar}{h}\n", "\\newcommand{\\hiddenVector}{\\mathbf{ \\hiddenScalar}}\n", "\\newcommand{\\identityMatrix}{\\eye}\n", "\\newcommand{\\inducingInputScalar}{z}\n", "\\newcommand{\\inducingInputVector}{\\mathbf{ \\inducingInputScalar}}\n", "\\newcommand{\\inducingInputMatrix}{\\mathbf{Z}}\n", "\\newcommand{\\inducingScalar}{u}\n", "\\newcommand{\\inducingVector}{\\mathbf{ \\inducingScalar}}\n", "\\newcommand{\\inducingMatrix}{\\mathbf{U}}\n", "\\newcommand{\\inlineDiff}[2]{\\text{d}#1/\\text{d}#2}\n", "\\newcommand{\\inputDim}{q}\n", "\\newcommand{\\inputMatrix}{\\mathbf{X}}\n", "\\newcommand{\\inputScalar}{x}\n", "\\newcommand{\\inputSpace}{\\mathcal{X}}\n", "\\newcommand{\\inputVals}{\\inputVector}\n", "\\newcommand{\\inputVector}{\\mathbf{ \\inputScalar}}\n", "\\newcommand{\\iterNum}{k}\n", "\\newcommand{\\kernel}{\\kernelScalar}\n", "\\newcommand{\\kernelMatrix}{\\mathbf{K}}\n", "\\newcommand{\\kernelScalar}{k}\n", "\\newcommand{\\kernelVector}{\\mathbf{ \\kernelScalar}}\n", "\\newcommand{\\kff}{\\kernelScalar_{\\mappingFunction \\mappingFunction}}\n", "\\newcommand{\\kfu}{\\kernelVector_{\\mappingFunction \\inducingScalar}}\n", "\\newcommand{\\kuf}{\\kernelVector_{\\inducingScalar \\mappingFunction}}\n", "\\newcommand{\\kuu}{\\kernelVector_{\\inducingScalar \\inducingScalar}}\n", "\\newcommand{\\lagrangeMultiplier}{\\lambda}\n", "\\newcommand{\\lagrangeMultiplierMatrix}{\\boldsymbol{ \\Lambda}}\n", "\\newcommand{\\lagrangian}{L}\n", "\\newcommand{\\laplacianFactor}{\\mathbf{ \\MakeUppercase{\\laplacianFactorScalar}}}\n", "\\newcommand{\\laplacianFactorScalar}{m}\n", "\\newcommand{\\laplacianFactorVector}{\\mathbf{ \\laplacianFactorScalar}}\n", "\\newcommand{\\laplacianMatrix}{\\mathbf{L}}\n", "\\newcommand{\\laplacianScalar}{\\ell}\n", "\\newcommand{\\laplacianVector}{\\mathbf{ \\ell}}\n", "\\newcommand{\\latentDim}{q}\n", "\\newcommand{\\latentDistanceMatrix}{\\boldsymbol{ \\Delta}}\n", "\\newcommand{\\latentDistanceScalar}{\\delta}\n", "\\newcommand{\\latentDistanceVector}{\\boldsymbol{ \\delta}}\n", "\\newcommand{\\latentForce}{f}\n", "\\newcommand{\\latentFunction}{u}\n", "\\newcommand{\\latentFunctionVector}{\\mathbf{ \\latentFunction}}\n", "\\newcommand{\\latentFunctionMatrix}{\\mathbf{ \\MakeUppercase{\\latentFunction}}}\n", "\\newcommand{\\latentIndex}{j}\n", "\\newcommand{\\latentScalar}{z}\n", "\\newcommand{\\latentVector}{\\mathbf{ \\latentScalar}}\n", "\\newcommand{\\latentMatrix}{\\mathbf{Z}}\n", "\\newcommand{\\learnRate}{\\eta}\n", "\\newcommand{\\lengthScale}{\\ell}\n", "\\newcommand{\\rbfWidth}{\\ell}\n", "\\newcommand{\\likelihoodBound}{\\mathcal{L}}\n", "\\newcommand{\\likelihoodFunction}{L}\n", "\\newcommand{\\locationScalar}{\\mu}\n", "\\newcommand{\\locationVector}{\\boldsymbol{ \\locationScalar}}\n", "\\newcommand{\\locationMatrix}{\\mathbf{M}}\n", "\\newcommand{\\variance}[1]{\\text{var}\\left( #1 \\right)}\n", "\\newcommand{\\mappingFunction}{f}\n", "\\newcommand{\\mappingFunctionMatrix}{\\mathbf{F}}\n", "\\newcommand{\\mappingFunctionTwo}{g}\n", "\\newcommand{\\mappingFunctionTwoMatrix}{\\mathbf{G}}\n", "\\newcommand{\\mappingFunctionTwoVector}{\\mathbf{ \\mappingFunctionTwo}}\n", "\\newcommand{\\mappingFunctionVector}{\\mathbf{ \\mappingFunction}}\n", "\\newcommand{\\scaleScalar}{s}\n", "\\newcommand{\\mappingScalar}{w}\n", "\\newcommand{\\mappingVector}{\\mathbf{ \\mappingScalar}}\n", "\\newcommand{\\mappingMatrix}{\\mathbf{W}}\n", "\\newcommand{\\mappingScalarTwo}{v}\n", "\\newcommand{\\mappingVectorTwo}{\\mathbf{ \\mappingScalarTwo}}\n", "\\newcommand{\\mappingMatrixTwo}{\\mathbf{V}}\n", "\\newcommand{\\maxIters}{K}\n", "\\newcommand{\\meanMatrix}{\\mathbf{M}}\n", "\\newcommand{\\meanScalar}{\\mu}\n", "\\newcommand{\\meanTwoMatrix}{\\mathbf{M}}\n", "\\newcommand{\\meanTwoScalar}{m}\n", "\\newcommand{\\meanTwoVector}{\\mathbf{ \\meanTwoScalar}}\n", "\\newcommand{\\meanVector}{\\boldsymbol{ \\meanScalar}}\n", "\\newcommand{\\mrnaConcentration}{m}\n", "\\newcommand{\\naturalFrequency}{\\omega}\n", "\\newcommand{\\neighborhood}[1]{\\mathcal{N}\\left( #1 \\right)}\n", "\\newcommand{\\neilurl}{http://inverseprobability.com/}\n", "\\newcommand{\\noiseMatrix}{\\boldsymbol{ E}}\n", "\\newcommand{\\noiseScalar}{\\epsilon}\n", "\\newcommand{\\noiseVector}{\\boldsymbol{ \\epsilon}}\n", "\\newcommand{\\norm}[1]{\\left\\Vert #1 \\right\\Vert}\n", "\\newcommand{\\normalizedLaplacianMatrix}{\\hat{\\mathbf{L}}}\n", "\\newcommand{\\normalizedLaplacianScalar}{\\hat{\\ell}}\n", "\\newcommand{\\normalizedLaplacianVector}{\\hat{\\mathbf{ \\ell}}}\n", "\\newcommand{\\numActive}{m}\n", "\\newcommand{\\numBasisFunc}{m}\n", "\\newcommand{\\numComponents}{m}\n", "\\newcommand{\\numComps}{K}\n", "\\newcommand{\\numData}{n}\n", "\\newcommand{\\numFeatures}{K}\n", "\\newcommand{\\numHidden}{h}\n", "\\newcommand{\\numInducing}{m}\n", "\\newcommand{\\numLayers}{\\ell}\n", "\\newcommand{\\numNeighbors}{K}\n", "\\newcommand{\\numSequences}{s}\n", "\\newcommand{\\numSuccess}{s}\n", "\\newcommand{\\numTasks}{m}\n", "\\newcommand{\\numTime}{T}\n", "\\newcommand{\\numTrials}{S}\n", "\\newcommand{\\outputIndex}{j}\n", "\\newcommand{\\paramVector}{\\boldsymbol{ \\theta}}\n", "\\newcommand{\\parameterMatrix}{\\boldsymbol{ \\Theta}}\n", "\\newcommand{\\parameterScalar}{\\theta}\n", "\\newcommand{\\parameterVector}{\\boldsymbol{ \\parameterScalar}}\n", "\\newcommand{\\partDiff}[2]{\\frac{\\partial#1}{\\partial#2}}\n", "\\newcommand{\\precisionScalar}{j}\n", "\\newcommand{\\precisionVector}{\\mathbf{ \\precisionScalar}}\n", "\\newcommand{\\precisionMatrix}{\\mathbf{J}}\n", "\\newcommand{\\pseudotargetScalar}{\\widetilde{y}}\n", "\\newcommand{\\pseudotargetVector}{\\mathbf{ \\pseudotargetScalar}}\n", "\\newcommand{\\pseudotargetMatrix}{\\mathbf{ \\widetilde{Y}}}\n", "\\newcommand{\\rank}[1]{\\text{rank}\\left(#1\\right)}\n", "\\newcommand{\\rayleighDist}[2]{\\mathcal{R}\\left(#1|#2\\right)}\n", "\\newcommand{\\rayleighSamp}[1]{\\mathcal{R}\\left(#1\\right)}\n", "\\newcommand{\\responsibility}{r}\n", "\\newcommand{\\rotationScalar}{r}\n", "\\newcommand{\\rotationVector}{\\mathbf{ \\rotationScalar}}\n", "\\newcommand{\\rotationMatrix}{\\mathbf{R}}\n", "\\newcommand{\\sampleCovScalar}{s}\n", "\\newcommand{\\sampleCovVector}{\\mathbf{ \\sampleCovScalar}}\n", "\\newcommand{\\sampleCovMatrix}{\\mathbf{s}}\n", "\\newcommand{\\scalarProduct}[2]{\\left\\langle{#1},{#2}\\right\\rangle}\n", "\\newcommand{\\sign}[1]{\\text{sign}\\left(#1\\right)}\n", "\\newcommand{\\sigmoid}[1]{\\sigma\\left(#1\\right)}\n", "\\newcommand{\\singularvalue}{\\ell}\n", "\\newcommand{\\singularvalueMatrix}{\\mathbf{L}}\n", "\\newcommand{\\singularvalueVector}{\\mathbf{l}}\n", "\\newcommand{\\sorth}{\\mathbf{u}}\n", "\\newcommand{\\spar}{\\lambda}\n", "\\newcommand{\\trace}[1]{\\text{tr}\\left(#1\\right)}\n", "\\newcommand{\\BasalRate}{B}\n", "\\newcommand{\\DampingCoefficient}{C}\n", "\\newcommand{\\DecayRate}{D}\n", "\\newcommand{\\Displacement}{X}\n", "\\newcommand{\\LatentForce}{F}\n", "\\newcommand{\\Mass}{M}\n", "\\newcommand{\\Sensitivity}{S}\n", "\\newcommand{\\basalRate}{b}\n", "\\newcommand{\\dampingCoefficient}{c}\n", "\\newcommand{\\mass}{m}\n", "\\newcommand{\\sensitivity}{s}\n", "\\newcommand{\\springScalar}{\\kappa}\n", "\\newcommand{\\springVector}{\\boldsymbol{ \\kappa}}\n", "\\newcommand{\\springMatrix}{\\boldsymbol{ \\mathcal{K}}}\n", "\\newcommand{\\tfConcentration}{p}\n", "\\newcommand{\\tfDecayRate}{\\delta}\n", "\\newcommand{\\tfMrnaConcentration}{f}\n", "\\newcommand{\\tfVector}{\\mathbf{ \\tfConcentration}}\n", "\\newcommand{\\velocity}{v}\n", "\\newcommand{\\sufficientStatsScalar}{g}\n", "\\newcommand{\\sufficientStatsVector}{\\mathbf{ \\sufficientStatsScalar}}\n", "\\newcommand{\\sufficientStatsMatrix}{\\mathbf{G}}\n", "\\newcommand{\\switchScalar}{s}\n", "\\newcommand{\\switchVector}{\\mathbf{ \\switchScalar}}\n", "\\newcommand{\\switchMatrix}{\\mathbf{S}}\n", "\\newcommand{\\tr}[1]{\\text{tr}\\left(#1\\right)}\n", "\\newcommand{\\loneNorm}[1]{\\left\\Vert #1 \\right\\Vert_1}\n", "\\newcommand{\\ltwoNorm}[1]{\\left\\Vert #1 \\right\\Vert_2}\n", "\\newcommand{\\onenorm}[1]{\\left\\vert#1\\right\\vert_1}\n", "\\newcommand{\\twonorm}[1]{\\left\\Vert #1 \\right\\Vert}\n", "\\newcommand{\\vScalar}{v}\n", "\\newcommand{\\vVector}{\\mathbf{v}}\n", "\\newcommand{\\vMatrix}{\\mathbf{V}}\n", "\\newcommand{\\varianceDist}[2]{\\text{var}_{#2}\\left( #1 \\right)}\n", "% Already defined by latex\n", "%\\newcommand{\\vec}{#1:}\n", "\\newcommand{\\vecb}[1]{\\left(#1\\right):}\n", "\\newcommand{\\weightScalar}{w}\n", "\\newcommand{\\weightVector}{\\mathbf{ \\weightScalar}}\n", "\\newcommand{\\weightMatrix}{\\mathbf{W}}\n", "\\newcommand{\\weightedAdjacencyMatrix}{\\mathbf{A}}\n", "\\newcommand{\\weightedAdjacencyScalar}{a}\n", "\\newcommand{\\weightedAdjacencyVector}{\\mathbf{ \\weightedAdjacencyScalar}}\n", "\\newcommand{\\onesVector}{\\mathbf{1}}\n", "\\newcommand{\\zerosVector}{\\mathbf{0}}\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n", "\n", "\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Question\n", "--------\n", "\n", "In this notebook, we explore the question of health facility\n", "distribution in Nigeria, spatially, and in relation to population\n", "density.\n", "\n", "We answer and visualize the question “How does the number of health\n", "facilities per capita vary across Nigeria?”\n", "\n", "Rather than focussing purely on using tools like `pandas` to manipulate\n", "the data, our focus will be on introducing some concepts from databases.\n", "\n", "Machine learning can be summarized as $$\n", "\\text{model} + \\text{data} \\xrightarrow{\\text{compute}} \\text{prediction}\n", "$$ and many machine learning courses focus a lot on the model part. But\n", "to build a machine learning system in practice, a lot of work has to be\n", "put into the data part. This notebook gives some pointers on that work\n", "and how to think about your machine learning systems design." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Datasets\n", "--------\n", "\n", "In this notebook , we download 3 datasets: \\* Nigeria NMIS health\n", "facility data \\* Population data for Administrative Zone 1 (states)\n", "areas in Nigeria \\* Map boundaries for Nigerian states (for plotting and\n", "binning) \\* Covid cases across Nigeria (as of May 20, 2020)\n", "\n", "But joining these data sets together is just an example. As another\n", "example, you could think of [SafeBoda](https://safeboda.com/ng/), a\n", "ride-hailing app that’s available in Lagos and Kampala. As well as\n", "looking at the health examples, try to imagine how SafeBoda may have had\n", "to design their systems to be scalable and reliable for storing and\n", "sharing data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Imports, Installs, and Downloads\n", "--------------------------------\n", "\n", "First, we’re going to download some particular python libraries for\n", "dealing with geospatial data. We’re dowloading\n", "[`geopandas`](https://geopandas.org) which will help us deal with ‘shape\n", "files’ that give the geographical lay out of Nigeria. We’ll also\n", "download [`pygeos`](https://pygeos.readthedocs.io/en/latest/), a library\n", "for dealing with points rapidly in python. And finally, to get a small\n", "database set up running quickly, we’re installing\n", "[`csv-to-sqlite`](https://pypi.org/project/csv-to-sqlite/) which allows\n", "us to convert CSV data to a simple database." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%pip install geopandas pygeos" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Databases and Joins\n", "-------------------\n", "\n", "The main idea we will be working with today is called the ‘join’. A join\n", "does exactly what it sounds like, it combines two database tables.\n", "\n", "You have already started to look at data structures, in particular you\n", "have been learning about `pandas` which is a great way of storing and\n", "structuring your data set to make it easier to plot and manipulate your\n", "data.\n", "\n", "Pandas is great for the data scientist to analyze data because it makes\n", "many operations easier. But it is not so good for building the machine\n", "learning system. In a machine learning system, you may have to handle a\n", "lot of data. Even if you start with building a system where you only\n", "have a few customers, perhaps you build an online taxi system (like\n", "SafeBoda) for Kampala. Maybe you will have 50 customers. Then maybe your\n", "system can be handled with some python scripts and pandas." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Scaling ML Systems\n", "------------------\n", "\n", "But what if you are succesful? What if everyone in Kampala wants to use\n", "your system? There are 1.5 million people in Kampala and maybe 100,000\n", "Boda Boda drivers.\n", "\n", "What if you are even more succesful? What if everyone in Lagos wants to\n", "use your system? There are around 20 million people in Lagos … and maybe\n", "as many Okada drivers as people in Kampala!\n", "\n", "We want to build safe and reliable machine learning systems. Building\n", "them from pandas and python is about as safe and reliable as [taking six\n", "children to school on a boda\n", "boda](https://www.monitor.co.ug/News/National/Boda-accidents-kill-10-city-UN-report-Kampala/688334-4324032-15oru2dz/index.html).\n", "\n", "To build a reliable system, we need to turn to *databases*. In this\n", "notebook [we’ll be focussing on SQL\n", "databases](https://en.wikipedia.org/wiki/Join_(SQL)) and how you bring\n", "together different streams of data in a Machine Learning System.\n", "\n", "In a machine learning system, you will need to bring different data sets\n", "together. In database terminology this is known as a ‘join’. You have\n", "two different data sets, and you want to join them together. Just like\n", "you can join two pieces of metal using a welder, or two pieces of wood\n", "with screws.\n", "\n", "But instead of using a welder or screws to join data, we join it using\n", "particular columns of the data. We can join data together using people’s\n", "names. One database may contain where people live, another database may\n", "contain where they go to school. If we join these two databases we can\n", "have a database which shows where people live and where they got to\n", "school.\n", "\n", "In the notebook, we will join together some data about where the health\n", "centres are in Nigeria and where the have been cases of Covid19. There\n", "are other challenges in the ML System Design that are not going to be\n", "covered here. They include: how to update the data bases, and how to\n", "control access to the data bases from different users (boda boda\n", "drivers, riders, administrators etc)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Hospital Data\n", "-------------\n", "\n", "The first and primary dataset we use is the NMIS health facility\n", "dataset, which contains data on the location, type, and staffing of\n", "health facilities across Nigeria." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import urllib.request\n", "import pandas as pd" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "urllib.request.urlretrieve('https://energydata.info/dataset/f85d1796-e7f2-4630-be84-79420174e3bd/resource/6e640a13-cab4-457b-b9e6-0336051bac27/download/healthmopupandbaselinenmisfacility.csv', 'healthmopupandbaselinenmisfacility.csv')\n", "hospital_data = pd.read_csv('healthmopupandbaselinenmisfacility.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It’s always a good idea to inspect your data once it’s downloaded to\n", "check it contains what you expect. In `pandas` you can do this with the\n", "`.head()` method. That allows us to see the first few entries of the\n", "`pandas` data structure." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hospital_data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also check in `pandas` what the different columns of the data\n", "frame are to see what it contains." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hospital_data.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can immiediately see that there are facility names, dates, and some\n", "characteristics of each health center such as number of doctors etc. As\n", "well as all that, we have two fields, `latitude` and `longitude` that\n", "likely give us the hospital locaiton. Let’s plot them to have a look." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.plot(hospital_data.longitude, hospital_data.latitude,'ro', alpha=0.01)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There we have the location of these different hospitals. We set alpha in\n", "the plot to 0.01 to make the dots transparent, so we can see the\n", "locations of each health center." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Administrative Zone Geo Data\n", "----------------------------\n", "\n", "A very common operation is the need to map from locations in a country\n", "to the administrative regions. If we were building a ride sharing app,\n", "we might also want to map riders to locations in the city, so that we\n", "could know how many riders we had in different city areas.\n", "\n", "Administrative regions have various names like cities, counties,\n", "districts or states. These conversions for the administrative regions\n", "are important for getting the right information to the right people.\n", "\n", "Of course, if we had a knowlegdeable Nigerian, we could ask her about\n", "what the right location for each of these health facilities is, which\n", "state is it in? But given that we have the latitude and longitude, we\n", "should be able to find out automatically what the different states are.\n", "\n", "This is where “geo” data becomes important. We need to download a\n", "dataset that stores the location of the different states in Nigeria.\n", "These files are known as ‘outline’ files. Because the draw the different\n", "states of different countries in outline.\n", "\n", "There are special databases for storing this type of information, the\n", "database we are using is in the `gdb` or GeoDataBase format. It comes in\n", "a zip file. Let’s download the outline files for the Nigerian states.\n", "They have been made available by the [Humanitarian Data\n", "Exchange](https://data.humdata.org/), you can also find other states\n", "data from the same site." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import zipfile" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "admin_zones_url = 'https://data.humdata.org/dataset/81ac1d38-f603-4a98-804d-325c658599a3/resource/0bc2f7bb-9ff6-40db-a569-1989b8ffd3bc/download/nga_admbnda_osgof_eha_itos.gdb.zip'\n", "_, msg = urllib.request.urlretrieve(admin_zones_url, 'nga_admbnda_osgof_eha_itos.gdb.zip')\n", "with zipfile.ZipFile('/content/nga_admbnda_osgof_eha_itos.gdb.zip', 'r') as zip_ref:\n", " zip_ref.extractall('/content/nga_admbnda_osgof_eha_itos.gdb')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we have this data of the outlines of the different states in\n", "Nigeria.\n", "\n", "The next thing we need to know is how these health facilities map onto\n", "different states in Nigeria. Without “binning” facilities somehow, it’s\n", "difficult to effectively see how they are distributed across the\n", "country.\n", "\n", "We do this by finding a “geo” dataset that contains the spatial outlay\n", "of Nigerian states by latitude/longitude coordinates. The dataset we use\n", "is of the “gdb” (GeoDataBase) type and comes as a zip file. We don’t\n", "need to worry much about this datatype for this notebook, only noting\n", "that geopandas knows how to load in the dataset, and that it contains\n", "different “layers” for the same map. In this case, each layer is a\n", "different degree of granularity of the political boundaries, with layer\n", "0 being the whole country, 1 is by state, or 2 is by local government.\n", "We’ll go with a state level view for simplicity, but as an excercise you\n", "can change it to layer 2 to view the distribution by local government.\n", "\n", "Once we have these `MultiPolygon` objects that define the boundaries of\n", "different states, we can perform a spatial join (sjoin) from the\n", "coordinates of individual health facilities (which we already converted\n", "to the appropriate `Point` type when moving the health data to a\n", "GeoDataFrame.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Joining a GeoDataFrame\n", "----------------------\n", "\n", "The first database join we’re going to do is a special one, it’s a\n", "‘spatial join’. We’re going to join together the locations of the\n", "hospitals with their states.\n", "\n", "This join is unusual because it requires some mathematics to get right.\n", "The outline files give us the borders of the different states in\n", "latitude and longitude, the health facilities have given locations in\n", "the country.\n", "\n", "A spatial join involves finding out which state each health facility\n", "belongs to. Fortunately, the mathematics you need is already programmed\n", "for you in GeoPandas. That means all we need to do is convert our\n", "`pandas` dataframe of health facilities into a `GeoDataFrame` which\n", "allows us to do the spatial join." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import geopandas as gpd" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hosp_gdf = gpd.GeoDataFrame(\n", " hospital_data, geometry=gpd.points_from_xy(hospital_data.longitude, hospital_data.latitude))\n", "hosp_gdf.crs = \"EPSG:4326\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are some technial details here: the `crs` refers to the coordinate\n", "system in use by a particular GeoDataFrame. `EPSG:4326` is the standard\n", "coordinate system of latitude/longitude." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Your First Join: Converting GPS Coordinates to States\n", "-----------------------------------------------------\n", "\n", "Now we have the data in the `GeoPandas` format, we can start converting\n", "into states. We will use the [`fiona`](https://pypi.org/project/Fiona/)\n", "library for reading the right layers from the files. Before we do the\n", "join, lets plot the location of health centers and states on the same\n", "map." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import fiona" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "states_file = \"/content/nga_admbnda_osgof_eha_itos.gdb/nga_admbnda_osgof_eha_itos.gdb/nga_admbnda_osgof_eha_itos.gdb/nga_admbnda_osgof_eha_itos.gdb/\"\n", "\n", "# geopandas included map, filtered to just Nigeria\n", "world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))\n", "world.crs = \"EPSG:4326\"\n", "nigeria = world[(world['name'] == 'Nigeria')]\n", "base = nigeria.plot(color='white', edgecolor='black', alpha=0, figsize=(11, 11))\n", "\n", "layers = fiona.listlayers(states_file)\n", "zones_gdf = gpd.read_file(states_file, layer=1)\n", "zones_gdf.crs = \"EPSG:4326\"\n", "zones_gdf = zones_gdf.set_index('admin1Name_en')\n", "zones_gdf.plot(ax=base, color='white', edgecolor='black')\n", "\n", "# We can now plot our ``GeoDataFrame``.\n", "hosp_gdf.plot(ax=base, color='b', alpha=0.02, )\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Performing the Spatial Join\n", "---------------------------\n", "\n", "We’ve now plotted the different health center locations across the\n", "states. You can clearly see that each of the dots falls within a\n", "different state. For helping the visualisation, we’ve made the dots\n", "somewhat transparent (we set the `alpha` in the plot). This means that\n", "we can see the regions where there are more health centers, you should\n", "be able to spot where the major cities in Nigeria are given the\n", "increased number of health centers in those regions.\n", "\n", "Of course, we can now see by eye, which of the states each of the health\n", "centers belongs to. But we want the computer to do our join for us.\n", "`GeoPandas` provides us with the spatial join. Here we’re going to do a\n", "[`left` or `outer`\n", "join](https://en.wikipedia.org/wiki/Join_(SQL)#Left_outer_join)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from geopandas.tools import sjoin" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have two GeoPandas data frames, `hosp_gdf` and `zones_gdf`. Let’s\n", "have a look at the columns the contain." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hosp_gdf.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that this is the GeoDataFrame containing the information\n", "about the hospital. Now let’s have a look at the `zones_gdf` data frame." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "zones_gdf.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can see that this data frame has a different set of columns. It has\n", "all the different administrative regions. But there is one column name\n", "that overlaps. We can find it by looking for the intersection between\n", "the two sets." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "set(hosp_gdf.columns).intersection(set(zones_gdf.columns))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we’ve converted the lists of columns into python ‘sets’, and then\n", "looked for the intersection. The *join* will occur on the intersection\n", "between these columns. It will try and match the geometry of the\n", "hospitals (their location) to the geometry of the states (their\n", "outlines). This match is done in one line in GeoPandas.\n", "\n", "We’re having to use GeoPandas because this join is a special one based\n", "on geographical locations, if the join was on customer name or some\n", "other discrete variable, we could do the join in pandas or directly in\n", "SQL." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hosp_state_joined = sjoin(hosp_gdf, zones_gdf, how='left')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The intersection of the two data frames indicates how the two data\n", "frames will be joined (if there’s no intersection, they can’t be\n", "joined). It’s like indicating the two holes that would need to be bolted\n", "together on two pieces of metal. If the holes don’t match, the join\n", "can’t be done. There has to be an intersection.\n", "\n", "But what will the result look like? Well the join should be the ‘union’\n", "of the two data frames. We can have a look at what the union should be\n", "by (again) converting the columns to sets." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "set(hosp_gdf.columns).union(set(zones_gdf.columns))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That gives a list of all the columns (notice that ‘geometry’ only\n", "appears once).\n", "\n", "Let’s check that’s what the join command did, by looking at the columns\n", "of our new data frame, `hosp_state_joined`. Notice also that there’s a\n", "new column: `index_right`. The two original data bases had separate\n", "indices. The `index_right` column represents the index from the\n", "`zones_gdf`, which is the Nigerian state." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "set(hosp_state_joined.columns)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Great! They are all there! We have completed our join. We had two\n", "separate data frames with information about states and information about\n", "hospitals. But by performing an ‘outer’ or a ‘left’ join, we now have a\n", "single data frame with all the information in the same place! Let’s have\n", "a look at the first frew entries in the new data frame." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hosp_state_joined.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "SQL Database\n", "------------\n", "\n", "Our first join was a special one, because it involved spatial data. That\n", "meant using the special `gdb` format and the `GeoPandas` tool for\n", "manipulating that data. But we’ve now saved our updated data in a new\n", "file.\n", "\n", "To do this, we use the command line utility that comes standard for\n", "SQLite database creation. SQLite is a simple database that’s useful for\n", "playing with database commands on your local machine. For a real system,\n", "you would need to set up a server to run the database. The server is a\n", "separate machine with the job of answering database queries. SQLite\n", "pretends to be a proper database, but doesn’t require us to go to the\n", "extra work of setting up a server. Popular SQL server software includes\n", "[`MySQL`](https://www.mysql.com/) which is free or [Microsoft’s SQL\n", "Server](https://www.microsoft.com/en-gb/sql-server/sql-server-2019).\n", "\n", "A typical machine learning installation might have you running a\n", "database from a cloud service (such as AWS, Azure or Google Cloud\n", "Platform). That cloud service would host the database for you and you\n", "would pay according to the number of queries made.\n", "\n", "Many start-up companies were formed on the back of a `MySQL` server\n", "hosted on top of AWS. You can [read how to do that\n", "here](https://aws.amazon.com/getting-started/hands-on/create-mysql-db/).\n", "\n", "If you were designing your own ride hailing app, or any other major\n", "commercial software you would want to investigate whether you would need\n", "to set up a central SQL server in one of these frameworks.\n", "\n", "Today though, we’ll just stick to SQLite which gives you a sense of the\n", "database without the time and expense of setting it up on the cloud. As\n", "well as showing you the SQL commands (which is often what’s used in a\n", "production ML system) we’ll also give the equivalent `pandas` commands,\n", "which would often be what you would use when you’re doing data analysis\n", "in `python` and `Jupyter`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create the SQLite Database\n", "--------------------------\n", "\n", "The beautiful thing about SQLite is that it allows us to play with SQL\n", "without going to the work of setting up a proper SQL server. Creating a\n", "data base in SQLite is as simple as writing a new file. To create the\n", "database, we’ll first write our joined data to a CSV file, then we’ll\n", "use a little utility to convert our hospital database into a SQLite\n", "database." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hosp_state_joined.to_csv('facilities.csv')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%pip install csv-to-sqlite" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!csv-to-sqlite -f facilities.csv -t full -o db.sqlite" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Rather than being installed on a separate server, SQLite simply stores\n", "the database locally in a file called `db.sqlite`.\n", "\n", "In the database there can be several ‘tables’. Each table can be thought\n", "of as like a separate dataframe. The table name we’ve just saved is\n", "‘hospitals\\_zones\\_joined’." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Accessing the SQL Database\n", "--------------------------\n", "\n", "Now that we have a SQL database, we can create a connection to it and\n", "query it using SQL commands. Let’s try to simply select the data we\n", "wrote to it, to make sure its the same.\n", "\n", "Start by making a connection to the database. This will often be done\n", "via remote connections, but for this example we’ll connect locally to\n", "the database using the filepath directly." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sqlite3" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def create_connection(db_file):\n", " \"\"\" create a database connection to the SQLite database\n", " specified by the db_file\n", " :param db_file: database file\n", " :return: Connection object or None\n", " \"\"\"\n", " conn = None\n", " try:\n", " conn = sqlite3.connect(db_file)\n", " except Error as e:\n", " print(e)\n", "\n", " return conn" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "conn = create_connection(\"db.sqlite\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have a connection, we can write a command and pass it to the\n", "database.\n", "\n", "To access a data base, the first thing that is made is a connection.\n", "Then SQL is used to extract the information required. A typical SQL\n", "command is `SELECT`. It allows us to extract rows from a given table. It\n", "operates a bit like the `.head()` method in `pandas`, it will return the\n", "first `N` rows (by default the `.head()` command returns the first 5\n", "rows, but you can set `n` to whatever you like. Here we’ve included a\n", "default value of 5 to make it match the `pandas` command.\n", "\n", "The python library, `sqlite3`, allows us to access the SQL database\n", "directly from python. We do this using an `execute` command on the\n", "connection.\n", "\n", "Typically, its good software engineering practice to ‘wrap’ the database\n", "command in some python code. This allows the commands to be maintained.\n", "Below we wrap the SQL command\n", "\n", " SELECT * FROM [table_name] LIMIT : N\n", "\n", "in python code. This SQL command selects the first `N` entries from a\n", "given database called `table_name`.\n", "\n", "We can pass the `table_name` and number of rows, `N` to the python\n", "command." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def select_top(conn, table, n):\n", " \"\"\"\n", " Query n first rows of the table\n", " :param conn: the Connection object\n", " :param table: The table to query\n", " :param n: Number of rows to query\n", " \"\"\"\n", " cur = conn.cursor()\n", " cur.execute(f\"SELECT * FROM [{table}] LIMIT :limitNum\", {\"limitNum\": n})\n", "\n", " rows = cur.fetchall()\n", " return rows" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let’s have a go at calling the command to extract the first three\n", "facilities from our health center database. Let’s try creating a\n", "function that does the same thing the pandas .head() method does so we\n", "can inspect our database." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def head(conn, table, n=5):\n", " rows = select_top(conn, table, n)\n", " for r in rows:\n", " print(r)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "head(conn, 'facilities')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Great! We now have the data base in SQLite, and some python functions\n", "that operate on the data base by wrapping SQL commands.\n", "\n", "We will return to the SQL command style after download and add the other\n", "datasets to the database using a combination of `pandas` and the\n", "`csv-to-sqlite` utility.\n", "\n", "Our next task will be to introduce data on COVID19 so that we can join\n", "that to our other data sets." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Covid Data\n", "----------\n", "\n", "Now we have the health data, we’re going to combine it with [data about\n", "COVID-19 cases in Nigeria over\n", "time](https://github.com/dsfsi/covid19africa). This data is kindly\n", "provided by Africa open COVID-19 data working group, which Elaine\n", "Nsoesie has been working with. The data is taken from Twitter, and only\n", "goes up until May 2020.\n", "\n", "They provide their data in github. We can access the cases we’re\n", "interested in from the following URL.\n", "\n", "For convenience, we’ll load the data into pandas first, but our next\n", "step will be to create a new SQLite table containing the data. Then\n", "we’ll join that table to our existing tables." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "covid_data_url = 'https://raw.githubusercontent.com/dsfsi/covid19africa/master/data/line_lists/line-list-nigeria.csv'\n", "covid_data_csv = 'cases.csv'\n", "urllib.request.urlretrieve(covid_data_url, covid_data_csv)\n", "covid_data = pd.read_csv(covid_data_csv)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As normal, we should inspect our data to check that it contains what we\n", "expect." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "covid_data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And we can get an idea of all the information in the data from looking\n", "at the columns." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "covid_data.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we convert this CSV file we’ve downloaded into a new table in the\n", "database file. We can do this, again, with the csv-to-sqlite script." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!csv-to-sqlite -f cases.csv -t full -o db.sqlite" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Population Data\n", "---------------\n", "\n", "Now we have information about COVID cases, and we have information about\n", "how many health centers and how many doctors and nurses there are in\n", "each health center. But unless we understand how many people there are\n", "in each state, then we cannot make decisions about where they may be\n", "problems with the disease.\n", "\n", "If we were running our ride hailing service, we would also need\n", "information about how many people there were in different areas, so we\n", "could understand what the *demand* for the boda boda rides might be.\n", "\n", "To access the number of people we can get population statistics from the\n", "[Humanitarian Data Exchange](https://data.humdata.org/).\n", "\n", "We also want to have population data for each state in Nigeria, so that\n", "we can see attributes like whether there are zones of high health\n", "facility density but low population density." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pop_url = 'https://data.humdata.org/dataset/a7c3de5e-ff27-4746-99cd-05f2ad9b1066/resource/d9fc551a-b5e4-4bed-9d0d-b047b6961817/download/nga_pop_adm1_2016.csv'\n", "_, msg = urllib.request.urlretrieve(pop_url,'nga_pop_adm1_2016.csv')\n", "pop_data = pd.read_csv('nga_pop_adm1_2016.csv')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pop_data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To do joins with this data, we must first make sure that the columns\n", "have the right names. The name should match the same name of the column\n", "in our existing data. So we reset the column names, and the name of the\n", "index, as follows." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pop_data.columns = ['admin1Name_en', 'admin1Pcode', 'admin0Name_en', 'admin0Pcode', 'population']\n", "pop_data = pop_data.set_index('admin1Name_en')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When doing this for real world data, you should also make sure that the\n", "names used in the rows are the same across the different data bases. For\n", "example, has someone decided to use an abbreviation for ‘Federal Capital\n", "Territory’ and set it as ‘FCT’. The computer won’t understand these are\n", "the same states, and if you do a join with such data you can get\n", "duplicate entries or missing entries. This sort of thing happens a lot\n", "in real world data and takes a lot of time to sort out. Fortunately, in\n", "this case, the data is well curated and we don’t have these problems." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Save to database file\n", "---------------------\n", "\n", "The next step is to add this new CSV file as an additional table in our\n", "SQLite database. This is done using the script as before." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pop_data.to_csv('pop_data.csv')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!csv-to-sqlite -f pop_data.csv -t full -o db.sqlite" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Computing per capita hospitals and COVID\n", "----------------------------------------\n", "\n", "The Minister of Health in Abuja may be interested in which states are\n", "most vulnerable to COVID19. We now have all the information in our SQL\n", "data bases to compute what our health center provision is per capita,\n", "and what the COVID19 situation is.\n", "\n", "To do this, we will use the `JOIN` operation from SQL and introduce a\n", "new operation called `GROUPBY`.\n", "\n", "#### Joining in Pandas\n", "\n", "As before, these operations can be done in pandas or GeoPandas. Before\n", "we create the SQL commands, we’ll show how you can do that in pandas.\n", "\n", "In pandas, the equivalent of a database table is a dataframe. So the\n", "JOIN operation takes two dataframes and joins them based on the key. The\n", "key is that special shared column between the two tables. The place\n", "where the ‘holes align’ so the two databases can be joined together.\n", "\n", "In GeoPandas we used an outer join. In an outer join you keep all rows\n", "from both tables, even if there is no match on the key. In an inner\n", "join, you only keep the rows if the two tables have a matching key.\n", "\n", "This is sometimes where problems can creep in. If in one table Abuja’s\n", "state is encoded as ‘FCT’ or ‘FCT-Abuja’, and in another table it’s\n", "encoded as ‘Federal Capital Territory’, they won’t match and that data\n", "wouldn’t appear in the joined table.\n", "\n", "In simple terms, a JOIN operation takes two tables (or dataframes) and\n", "combines them based on some key, in this case the index of the Pandas\n", "data frame which is the state name." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pop_joined = zones_gdf.join(pop_data['population'], how='inner')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "GroupBy in Pandas\n", "-----------------\n", "\n", "Our COVID19 data is in the form of individual cases. But we are\n", "interested in total case counts for each state. There is a special data\n", "base operation known as `GROUP BY` for collecting information about the\n", "individual states. The type of information you might want could be a\n", "sum, the maximum value, an average, the minimum value. We can use a\n", "GroupBy operation in `pandas` and SQL to summarize the counts of covid\n", "cases in each state.\n", "\n", "A `GROUPBY` operation groups rows with the same key (in this case\n", "‘province/state’) into separate objects, that we can operate on further\n", "such as to count the rows in each group, or to sum or take the mean over\n", "the values in some column (imagine each case row had the age of the\n", "patient, and you were interested in the mean age of patients.)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "covid_cases_by_state = covid_data.groupby(['province/state']).count()['case_id']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `.groupby()` method on the dataframe has now given us a new data\n", "series that contains the total number of covid cases in each state. We\n", "can examine it to check we have something sensible." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "covid_cases_by_state" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we have this new data series, it can be added to the pandas data\n", "frame as a new column." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pop_joined['covid_cases_by_state'] = covid_cases_by_state" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The spatial join we did on the original data frame to obtain\n", "hosp\\_state\\_joined introduced a new column, index\\_right which contains\n", "the state of each of the hospitals. Let’s have a quick look at it below." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hosp_state_joined['index_right']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To count the hospitals in each of the states, we first create a grouped\n", "series where we’ve grouped on these states." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "grouped = hosp_state_joined.groupby('index_right')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This python operation now goes through each of the groups and counts how\n", "many hospitals there are in each state. It stores the result in a\n", "dictionary. If you’re new to Python, then to understand this code you\n", "need to understand what a ‘dictionary comprehension’ is. In this case\n", "the dictionary comprehension is being used to create a python dictionary\n", "of states and total hospital counts. That’s then being converted into a\n", "`pandas` Data Series and added to the `pop_joined` dataframe." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "counted_groups = {k: len(v) for k, v in grouped.groups.items()}\n", "pop_joined['hosp_state'] = pd.Series(counted_groups)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For convenience, we can now add a new data series to the data frame that\n", "contains the per capita information about hospitals. that makes it easy\n", "to retrieve later." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pop_joined['hosp_per_capita_10k'] = (pop_joined['hosp_state'] * 10000 )/ pop_joined['population']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "SQL-style\n", "---------\n", "\n", "That’s the `pandas` approach to doing it. But `pandas` itself is\n", "inspired by database language, in particular relational databases such\n", "as SQL. To do these types of joins at scale, e.g. for our ride hailing\n", "app, we need to see how to do these joins in a database.\n", "\n", "As before, we’ll wrap the underlying SQL commands with a convenient\n", "python command.\n", "\n", "What you see below gives the full SQL command. There is a [`SELECT`\n", "command](https://www.w3schools.com/sql/sql_select.asp), which extracts\n", "`FROM` a particular table. It then completes an\n", "[`INNER JOIN`](https://www.w3schools.com/sql/sql_join_inner.asp) using\n", "particular columns (`provice/state` and `index_right`)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def join_counts(conn):\n", " \"\"\"\n", " Calculate counts of cases and facilities per state, join results\n", " \"\"\"\n", " cur = conn.cursor()\n", " cur.execute(\"\"\"\n", " SELECT ct.[province/state] as [state], ct.[case_count], ft.[facility_count]\n", " FROM\n", " (SELECT [province/state], COUNT(*) as [case_count] FROM [cases] GROUP BY [province/state]) ct\n", " INNER JOIN \n", " (SELECT [index_right], COUNT(*) as [facility_count] FROM [facilities] GROUP BY [index_right]) ft\n", " ON\n", " ct.[province/state] = ft.[index_right]\n", " \"\"\")\n", "\n", " rows = cur.fetchall()\n", " return rows" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we’ve created our python wrapper, we can connect to the data base\n", "and run our SQL command on the database using the wrapper." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "conn = create_connection(\"db.sqlite\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "state_cases_hosps = join_counts(conn)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for row in state_cases_hosps:\n", " print(\"State {} \\t\\t Covid Cases {} \\t\\t Health Facilities {}\".format(row[0], row[1], row[2]))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "base = nigeria.plot(color='white', edgecolor='black', alpha=0, figsize=(11, 11))\n", "pop_joined.plot(ax=base, column='population', edgecolor='black', legend=True)\n", "base.set_title(\"Population of Nigerian States\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "base = nigeria.plot(color='white', edgecolor='black', alpha=0, figsize=(11, 11))\n", "pop_joined.plot(ax=base, column='hosp_per_capita_10k', edgecolor='black', legend=True)\n", "base.set_title(\"Hospitals Per Capita (10k) of Nigerian States\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "::: {.cell .markdown}\n", "\n", "Exercise\n", "--------\n", "\n", "1. Add a new column the dataframe for covid cases per 10,000\n", " population, in the same way we computed health facilities per 10k\n", " capita.\n", "\n", "2. Add a new column for covid cases per health facility.\n", "\n", "Do this in both the SQL and the Pandas styles to get a feel for how they\n", "differ.\n", "\n", "{:::" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# pop_joined['cases_per_capita_10k'] = ???\n", "# pop_joined['cases_per_facility'] = ???" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "base = nigeria.plot(color='white', edgecolor='black', alpha=0, figsize=(11, 11))\n", "pop_joined.plot(ax=base, column='cases_per_capita_10k', edgecolor='black', legend=True)\n", "base.set_title(\"Covid Cases Per Capita (10k) of Nigerian States\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "base = nigeria.plot(color='white', edgecolor='black', alpha=0, figsize=(11, 11))\n", "pop_joined.plot(ax=base, column='covid_cases_by_state', edgecolor='black', legend=True)\n", "base.set_title(\"Covid Cases by State\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "base = nigeria.plot(color='white', edgecolor='black', alpha=0, figsize=(11, 11))\n", "pop_joined.plot(ax=base, column='cases_per_facility', edgecolor='black', legend=True)\n", "base.set_title(\"Covid Cases per Health Facility\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Thanks!\n", "-------\n", "\n", "For more information on these subjects and more you might want to check\n", "the following resources.\n", "\n", "- twitter: [@lawrennd](https://twitter.com/lawrennd)\n", "- podcast: [The Talking Machines](http://thetalkingmachines.com)\n", "- newspaper: [Guardian Profile\n", " Page](http://www.theguardian.com/profile/neil-lawrence)\n", "- blog:\n", " [http://inverseprobability.com](http://inverseprobability.com/blog.html)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "References\n", "----------" ] } ], "nbformat": 4, "nbformat_minor": 5, "metadata": {} }