{ "metadata": { "name": "", "signature": "sha256:fa3f990ba6edb35b53a1be5472262565e9091a9bd21fcc246e0e5b7830fa5d4a" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Merging Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this notebook, you will have an opportunity to explore several different ways of *merging* data using pandas and *pasndasql*.\n", "\n", "As you will see, the pandas `DataFrame()` is a very flexible data structure that supports many operations *in memory* that have traditionally been performed by database management systems." ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Introducing the Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data we will use for this activity comes from the Department for Communities and Local Government [Open Data Communities](http://opendatacommunities.org/) website. Two sorts of data have been downloaded - information about the [average weekly social rent of new PRP (Private Registered Providers) general needs lettings for 2012/13](http://opendatacommunities.org/data/households/social-lettings/general-needs/rents/prp/number-bedrooms) and data relating to housebuilding, in particular the [permanent dwellings started from 2009/10 to 2012/13](http://opendatacommunities.org/data/house-building/starts/tenure)." ] }, { "cell_type": "code", "collapsed": false, "input": [ "!ls data/housingdata" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The house building data files all have a similar form:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "!head data/housingdata/house-building-starts-tenure-2009-2010.csv" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "!head data/housingdata/house-building-starts-tenure-2010-2011.csv" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The lettings data uses a scheme for identifying the different local authorities:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "!head data/housingdata/households-social-lettings-general-needs-rents-prp-number-bedrooms-2012-2013.csv" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I have also pulled down a file from the Ordnance Survey that contains a list of geographical areas within the [Yorkshire and the Humber](http://data.ordnancesurvey.co.uk/doc/7000000000041430) region, some of which are local councils and some of which aren't... Note that the data that identifies each authority appears to resemble that used in the DCLG data files but does not match exactly." ] }, { "cell_type": "code", "collapsed": false, "input": [ "!head -n 5 data/housingdata/yorksAndHumberside.csv" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Loading the House Building Data In" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can load the data in from the CSV files using the pandas `read_csv()` function. For the housing data, we need to skip the first five lines (I counted!) of the file before accepting the header." ] }, { "cell_type": "code", "collapsed": false, "input": [ "import pandas as pd" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "#Read in some of the data\n", "bldg_2009_10=pd.read_csv('data/housingdata/house-building-starts-tenure-2009-2010.csv',skiprows=5)\n", "bldg_2010_11=pd.read_csv('data/housingdata/house-building-starts-tenure-2010-2011.csv',skiprows=5)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "#Preview the data we have loaded\n", "bldg_2009_10[:5]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "#YOUR TURN\n", "#Import the remaning house building files into separate dataframes\n", "#As an additional exercise, you may want to try to develop a script to automate the loading of the data from the separate files" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Merging the House Building Data Over Years - Concatenation/Vertical Joins" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Suppose we want to work with a single dataframe that contains annulaised house building starts data over the period 2009-2013. The pandas `concat()` function cill concatenate rows from a list dataframes where each dataframe shares the same column headings.\n", "\n", "Let's create a couple of samples from the tables just to try this function out." ] }, { "cell_type": "code", "collapsed": false, "input": [ "#Just use a sample of the data rows for now as we develop the code\n", "sample1=bldg_2009_10[:3]\n", "sample2=bldg_2010_11[:3]\n", "sample2" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can visualise the shape of the sample dataframe using a [wrapper around ipythonblocks](http://blog.ouseful.info/2014/03/26/visualising-pandas-dataframes-with-ipythonblocks-proof-of-concept/). " ] }, { "cell_type": "code", "collapsed": false, "input": [ "import pBlocks_demo as pb" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "pb.pBlockGrid3(sample1)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This shows us that there are 6 columns and 3 rows in the dataset. The columns are coloured according to the datatype of each column." ] }, { "cell_type": "code", "collapsed": false, "input": [ "#Try out the .concat() function - pass in a list of dataframes to be concatenated\n", "pd.concat( [sample1, sample2] )" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That should have worked okay... We can visualise what has happened by colouring the rows that have been joined from each dataframe." ] }, { "cell_type": "code", "collapsed": false, "input": [ "#The pBlockGrid is still all experimental - we should pass a list here for example...\n", "df_concat,b=pb.pBlockGrid5(sample1, sample2)\n", "b" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This shows us that the concatenated table has 6 columns and 6 rows, with 3 rows from the first table and 3 rows from the second." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What happens when we try to merge to two complete dataframes?" ] }, { "cell_type": "code", "collapsed": false, "input": [ "bldg_2009_11=pd.concat([bldg_2009_10,bldg_2010_11])\n", "#Check to see if the dataframes appear to have been concatenated together by inspecting row counts\n", "print(len(bldg_2009_10),len(bldg_2010_11),len(bldg_2009_11))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What happens if the dataframes have the same column names, but they appear in a different order?" ] }, { "cell_type": "code", "collapsed": false, "input": [ "#Create a sample dataframe containing the same columns as the original but in a different order\n", "sample3=bldg_2009_10[['Reference area','All','Housing-Associations','http://opendatacommunities.org/def/ontology/geography/refArea','Local-Authority','Private-Enterprise']][:3]\n", "sample3" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "#Let's look at the shape and datatype structure of sample1\n", "pb.pBlockGrid3(sample1)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "#And sample2? We see the columns are differently ordered in terms of datatype\n", "pb.pBlockGrid3(sample3)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "#Concatenate some sample dataframes with the same columns, but differently ordered\n", "concat_difforder = pd.concat([sample1,sample3])\n", "concat_difforder" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "#What is the datatype structure of the resulting concatenated datatable?\n", "pb.pBlockGrid3(concat_difforder)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "pandas is capable of automatically aligning the columns from such data frames.\n", "\n", "What happens if we try to concatenate dataframes in which the dataframes only partially share columns?" ] }, { "cell_type": "code", "collapsed": false, "input": [ "#Create a sample dataframe that contains only a subset of the columns from an original dataframe\n", "sample4=bldg_2009_10[['Reference area','All','Housing-Associations']][:3]\n", "sample4" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "#Concatenate two data frames with non-identical columns\n", "concat_diffcolumns = pd.concat([sample1,sample4])\n", "concat_diffcolumns " ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can colour any cells that have missing values to show the 'holes' in the resulting dataframe:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "#Here I am going to \"colour\" the missing cells white.\n", "#The colour field accepts (red, green, blue) values in the range 0..255 each\n", "pb.pBlockGrid4(concat_diffcolumns,color_NA=(255,255,255))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `concat()` function aligns columns where it can. By default, the columns in the combined dataframe are the superset of distinctly named columns in the concatenated data frame. Missing values are given a `NaN` value.\n", "\n", "This form of concatenation is an *outer join* in the sense that we are producing a set of columns in the output that represent the union of columns contained in the concatenated datasets.\n", "\n", "The `concat()` functions uses the outer style join by default, but we can also force it to adopt an *inner join* behaviour in which the columns in the output data frame correspond to the *intersection* of columns from the input data frames compared to the *union* of the *outer join*." ] }, { "cell_type": "code", "collapsed": false, "input": [ "#Explicitly use an INNER join (\"inner\") on the concatenation; \"outer\" is the default value\n", "concat_inner = pd.concat([sample1,sample4],join='inner')\n", "concat_inner" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "#pBlockGrid TO DO\n", "pb.pBlockGrid5(sample1,sample4,join='inner')[1]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Exercise:\n", "What problems, if any, can you see in interpreting the data in any of the concatenated data sets produced above, and how might they be resolved? " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Answer: although the data items represent reports from different years, we have lost that information. The year the reports refer to are not encoded in the actual rows of data. If we add an additional column to each dataset as it is loaded in that contains the year the report relates to, we can carry that information in to the concatenated data set." ] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Adding in the Lost MetaData" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So how might we add in an additional data column that identifies the period the data relates to before we concatenate the separate data frames?" ] }, { "cell_type": "code", "collapsed": false, "input": [ "#YOUR ATTEMPT HERE" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's how I did it: \n", "\n", "\n", "\n", "\n", "\n", "