{ "metadata": { "name": "", "signature": "sha256:fa3f990ba6edb35b53a1be5472262565e9091a9bd21fcc246e0e5b7830fa5d4a" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Merging Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this notebook, you will have an opportunity to explore several different ways of *merging* data using pandas and *pasndasql*.\n", "\n", "As you will see, the pandas `DataFrame()` is a very flexible data structure that supports many operations *in memory* that have traditionally been performed by database management systems." ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Introducing the Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data we will use for this activity comes from the Department for Communities and Local Government [Open Data Communities](http://opendatacommunities.org/) website. Two sorts of data have been downloaded - information about the [average weekly social rent of new PRP (Private Registered Providers) general needs lettings for 2012/13](http://opendatacommunities.org/data/households/social-lettings/general-needs/rents/prp/number-bedrooms) and data relating to housebuilding, in particular the [permanent dwellings started from 2009/10 to 2012/13](http://opendatacommunities.org/data/house-building/starts/tenure)." ] }, { "cell_type": "code", "collapsed": false, "input": [ "!ls data/housingdata" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The house building data files all have a similar form:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "!head data/housingdata/house-building-starts-tenure-2009-2010.csv" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "!head data/housingdata/house-building-starts-tenure-2010-2011.csv" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The lettings data uses a scheme for identifying the different local authorities:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "!head data/housingdata/households-social-lettings-general-needs-rents-prp-number-bedrooms-2012-2013.csv" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I have also pulled down a file from the Ordnance Survey that contains a list of geographical areas within the [Yorkshire and the Humber](http://data.ordnancesurvey.co.uk/doc/7000000000041430) region, some of which are local councils and some of which aren't... Note that the data that identifies each authority appears to resemble that used in the DCLG data files but does not match exactly." ] }, { "cell_type": "code", "collapsed": false, "input": [ "!head -n 5 data/housingdata/yorksAndHumberside.csv" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Loading the House Building Data In" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can load the data in from the CSV files using the pandas `read_csv()` function. For the housing data, we need to skip the first five lines (I counted!) of the file before accepting the header." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "import pandas as pd" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "#Read in some of the data\n", "bldg_2009_10=pd.read_csv('data/housingdata/house-building-starts-tenure-2009-2010.csv',skiprows=5)\n", "bldg_2010_11=pd.read_csv('data/housingdata/house-building-starts-tenure-2010-2011.csv',skiprows=5)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "#Preview the data we have loaded\n", "bldg_2009_10[:5]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "#YOUR TURN\n", "#Import the remaning house building files into separate dataframes\n", "#As an additional exercise, you may want to try to develop a script to automate the loading of the data from the separate files" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Merging the House Building Data Over Years - Concatenation/Vertical Joins" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Suppose we want to work with a single dataframe that contains annulaised house building starts data over the period 2009-2013. The pandas `concat()` function cill concatenate rows from a list dataframes where each dataframe shares the same column headings.\n", "\n", "Let's create a couple of samples from the tables just to try this function out." ] }, { "cell_type": "code", "collapsed": false, "input": [ "#Just use a sample of the data rows for now as we develop the code\n", "sample1=bldg_2009_10[:3]\n", "sample2=bldg_2010_11[:3]\n", "sample2" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can visualise the shape of the sample dataframe using a [wrapper around ipythonblocks](http://blog.ouseful.info/2014/03/26/visualising-pandas-dataframes-with-ipythonblocks-proof-of-concept/). " ] }, { "cell_type": "code", "collapsed": false, "input": [ "import pBlocks_demo as pb" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "pb.pBlockGrid3(sample1)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This shows us that there are 6 columns and 3 rows in the dataset. The columns are coloured according to the datatype of each column." ] }, { "cell_type": "code", "collapsed": false, "input": [ "#Try out the .concat() function - pass in a list of dataframes to be concatenated\n", "pd.concat( [sample1, sample2] )" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That should have worked okay... We can visualise what has happened by colouring the rows that have been joined from each dataframe." ] }, { "cell_type": "code", "collapsed": false, "input": [ "#The pBlockGrid is still all experimental - we should pass a list here for example...\n", "df_concat,b=pb.pBlockGrid5(sample1, sample2)\n", "b" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This shows us that the concatenated table has 6 columns and 6 rows, with 3 rows from the first table and 3 rows from the second." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What happens when we try to merge to two complete dataframes?" 
] }, { "cell_type": "code", "collapsed": false, "input": [ "bldg_2009_11=pd.concat([bldg_2009_10,bldg_2010_11])\n", "#Check to see if the dataframes appear to have been concatenated together by inspecting row counts\n", "print(len(bldg_2009_10),len(bldg_2010_11),len(bldg_2009_11))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What happens if the dataframes have the same column names, but they appear in a different order?" ] }, { "cell_type": "code", "collapsed": false, "input": [ "#Create a sample dataframe containing the same columns as the original but in a different order\n", "sample3=bldg_2009_10[['Reference area','All','Housing-Associations','http://opendatacommunities.org/def/ontology/geography/refArea','Local-Authority','Private-Enterprise']][:3]\n", "sample3" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "#Let's look at the shape and datatype structure of sample1\n", "pb.pBlockGrid3(sample1)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "#And sample2? We see the columns are differently ordered in terms of datatype\n", "pb.pBlockGrid3(sample3)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "#Concatenate some sample dataframes with the same columns, but differently ordered\n", "concat_difforder = pd.concat([sample1,sample3])\n", "concat_difforder" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "#What is the datatype structure of the resulting concatenated datatable?\n", "pb.pBlockGrid3(concat_difforder)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "pandas is capable of automatically aligning the columns from such data frames.\n", "\n", "What happens if we try to concatenate dataframes in which the dataframes only partially share columns?" ] }, { "cell_type": "code", "collapsed": false, "input": [ "#Create a sample dataframe that contains only a subset of the columns from an original dataframe\n", "sample4=bldg_2009_10[['Reference area','All','Housing-Associations']][:3]\n", "sample4" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "#Concatenate two data frames with non-identical columns\n", "concat_diffcolumns = pd.concat([sample1,sample4])\n", "concat_diffcolumns " ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can colour any cells that have missing values to show the 'holes' in the resulting dataframe:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "#Here I am going to \"colour\" the missing cells white.\n", "#The colour field accepts (red, green, blue) values in the range 0..255 each\n", "pb.pBlockGrid4(concat_diffcolumns,color_NA=(255,255,255))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `concat()` function aligns columns where it can. By default, the columns in the combined dataframe are the superset of distinctly named columns in the concatenated data frame. 
Missing values are given a `NaN` value.\n", "\n", "This form of concatenation is an *outer join* in the sense that we are producing a set of columns in the output that represent the union of columns contained in the concatenated datasets.\n", "\n", "The `concat()` function uses the outer-style join by default, but we can also force it to adopt an *inner join* behaviour in which the columns in the output data frame correspond to the *intersection* of the columns from the input data frames, rather than the *union* used by the *outer join*." ] }, { "cell_type": "code", "collapsed": false, "input": [ "#Explicitly use an INNER join (\"inner\") on the concatenation; \"outer\" is the default value\n", "concat_inner = pd.concat([sample1,sample4],join='inner')\n", "concat_inner" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "#pBlockGrid TO DO\n", "pb.pBlockGrid5(sample1,sample4,join='inner')[1]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Exercise:\n", "What problems, if any, can you see in interpreting the data in any of the concatenated data sets produced above, and how might they be resolved? " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Answer: although the data items represent reports from different years, we have lost that information. The year each report refers to is not encoded in the actual rows of data. If we add an additional column to each dataset as it is loaded in that contains the year the report relates to, we can carry that information into the concatenated data set." ] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Adding in the Lost Metadata" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So how might we add in an additional data column that identifies the period the data relates to, before we concatenate the separate data frames?" ] }, { "cell_type": "code", "collapsed": false, "input": [ "#YOUR ATTEMPT HERE" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's how I did it: \n", "\n", "\n", "\n", "\n", "\n", "
`bldg_2009_10['Period']=\"2009-10\"\n", "bldg_2009_10[:3]`
" ] }, { "cell_type": "code", "collapsed": false, "input": [ "#Now add a Period column to each annual building dataframe that contains an appropriate value" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "#Create a single dataframe containing all the housebuilding data distinguishable by period" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Horizontally Joining Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By inspection of the building start data and the lettings data, we see that data elements are keyed on common columns - geographical reference area codes, and names." ] }, { "cell_type": "code", "collapsed": false, "input": [ "bldg_2012_13 = pd.read_csv('data/housingdata/house-building-starts-tenure-2012-2013.csv',skiprows=5)\n", "bldgSample=bldg_2012_13[:3]\n", "bldgSample" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "lettings_2012_13=pd.read_csv('data/housingdata/households-social-lettings-general-needs-rents-prp-number-bedrooms-2012-2013.csv',skiprows=5)\n", "lettingsSample=lettings_2012_13[:3]\n", "lettingsSample" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is straightforward to merge the tables horizontally using the pandas `.merge()` function. The first two arguments specify the datatables to be merged. Where the columns that act as the focus for merging share the same name, we can specify them in a list assigned to the `on` parameter." ] }, { "cell_type": "code", "collapsed": false, "input": [ "simplemerge=pd.merge(bldgSample,lettingsSample,on=['http://opendatacommunities.org/def/ontology/geography/refArea','Reference area'])\n", "simplemerge" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can visualise the effects of the merge by colouring columns according to whether they are the merge key columns (green), left table (red) or right table (blue)." ] }, { "cell_type": "code", "collapsed": false, "input": [ "pb.pBlockGrid6(bldgSample,lettingsSample,on=['http://opendatacommunities.org/def/ontology/geography/refArea','Reference area'])[1]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we see that the resulting dataframe has 11 columns, with 2 key columns in green, 4 uniquely contributed columns from the left table in red and 5 uniquely contributed columns from the right table in blue." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that we could have also have merged the dataframes on a single column. In this case, duplicate columns are brought in to the merged result separately, and pandas automatically appends a suffix to each one so it remains uniquely labelled in the resulting data frame (so for example we get *Reference area_x* and *Reference area_y* in the result)." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "pd.merge(bldgSample,lettingsSample,on=['http://opendatacommunities.org/def/ontology/geography/refArea'])" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "pb.pBlockGrid6(bldgSample,lettingsSample,on=['http://opendatacommunities.org/def/ontology/geography/refArea'])[1]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This time we have only a single key, with 5 uniquely named columns from the left table and 6 from the right." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If the column names are differently labelled, we can specify them explicitly for each datatable." ] }, { "cell_type": "code", "collapsed": false, "input": [ "#Rename one of the merge columns in one table\n", "lettingsSample.columns=['Ref Area Code']+lettingsSample.columns[1:].tolist()\n", "lettingsSample" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can explicitly declare the columns we want to merge on from each table using the *left_on* and *right_on* parameters (I find this confusing, and would have expected *on_left* and *on_right*). For the merge to work, these parameters need to identify the same number of columns in same order." ] }, { "cell_type": "code", "collapsed": false, "input": [ "pd.merge( bldgSample, lettingsSample, \n", " left_on=['http://opendatacommunities.org/def/ontology/geography/refArea','Reference area'],\n", " right_on=['Ref Area Code','Reference area'])" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "#pBlockGrid - TO DO - cope with left/right keys; colour them separately?" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Inner Joins" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The default behaviour of merge is an inner join (`how='inner'`) where the results table is formed from the intersection of the joined key column values. 
Consider the example where one table has additional rows:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "bldgSample_long=bldg_2012_13[:4] #4 rows compared to 3 in the lettings sample\n", "bldgSample_long.columns=['Ref Area Code']+bldgSample_long.columns[1:].tolist()\n", "\n", "pd.merge(bldgSample_long,lettingsSample,on=['Ref Area Code','Reference area'])" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "pb.pBlockGrid6(bldgSample_long,lettingsSample,on=['Ref Area Code','Reference area'])[1]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use *pandasql* to achieve an inner join using the following form of `SELECT` statement:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from pandasql import sqldf\n", "pysqldf = lambda q: sqldf(q, globals())" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "#At the moment, we need to hack our way round dataframe column names that don't work as sqlite3 column names.\n", "bldgSample_sqldf=bldgSample.copy()\n", "bldgSample_sqldf.rename(columns=lambda x: x.replace(' ','_'), inplace=True)\n", "\n", "lettingsSample_sqldf=lettingsSample.copy()\n", "lettingsSample_sqldf.rename(columns=lambda x: x.replace(' ','_'), inplace=True)\n", "\n", "\n", "q='''\n", " SELECT * FROM bldgSample_sqldf INNER JOIN lettingsSample_sqldf\n", " ON bldgSample_sqldf.Reference_Area = lettingsSample_sqldf.Reference_Area ;\n", " '''\n", "pysqldf(q)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Where the column names being used to match row items are the same across dataframes, we can use the following alternative syntax:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "q='''\n", " SELECT * FROM bldgSample_sqldf INNER JOIN lettingsSample_sqldf\n", " USING(Reference_Area) ;\n", " '''\n", "pysqldf(q)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Left Join" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In a *left join* we keep all the rows from the left table, and only the matching rows from the right.\n", "\n", "Let's generate a long sample from the lettings data but include some different reference areas compared to the building start data, by taking data from the top and the bottom of the original data frame." ] }, { "cell_type": "code", "collapsed": false, "input": [ "lettingsSample_long=pd.concat( [ lettings_2012_13[:2], lettings_2012_13[-2:] ])\n", "lettingsSample_long" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "#What happens if you try to inner join bldgSample and lettingsSample_long ?" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now try a left join, by setting `how='left'`. What happens to the columns from the right hand table for the unmatched rows from the left table?"
] }, { "cell_type": "code", "collapsed": false, "input": [ "pd.merge( bldgSample, lettingsSample_long,\n", " on=['http://opendatacommunities.org/def/ontology/geography/refArea','Reference area'],\n", " how='left')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "pb.pBlockGrid6( bldgSample, lettingsSample_long,\n", " on=['http://opendatacommunities.org/def/ontology/geography/refArea','Reference area'],\n", " how='left')[1]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we see the two key columns, 4 unique columns from the left table and 5 unique columns from the right. The final row shows only missing values in unique right columns - maybe there was no match on the corresponding key values?" ] }, { "cell_type": "code", "collapsed": false, "input": [ "bldgSample_sqldf=bldgSample.copy()\n", "bldgSample_sqldf.rename(columns=lambda x: x.replace(' ','_'), inplace=True)\n", "\n", "lettingsSample_long_sqldf=lettingsSample_long.copy()\n", "lettingsSample_long_sqldf.rename(columns=lambda x: x.replace(' ','_'), inplace=True)\n", "\n", "q='''\n", " SELECT * FROM bldgSample_sqldf LEFT JOIN lettingsSample_long_sqldf\n", " USING(Reference_Area);\n", " '''\n", "pysqldf(q)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Right Join" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Unsurprisingly, a right join is achieved by setting `how='right'`. What happens to the columns from the left hand table for the unmatched rows from the right column?" ] }, { "cell_type": "code", "collapsed": false, "input": [ "pd.merge( bldgSample, lettingsSample_long,\n", " on=['http://opendatacommunities.org/def/ontology/geography/refArea','Reference area'],\n", " how='right')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "pb.pBlockGrid6( bldgSample, lettingsSample_long,\n", " on=['http://opendatacommunities.org/def/ontology/geography/refArea','Reference area'],\n", " how='right')[1]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Although standard SQL *does* define `RIGHT JOIN` *pandasql* **does not** implement a `RIGHT JOIN`. However, we could simularte it by swapping the order of the datarames/tables passed to the join and using a `LEFT JOIN` instead." ] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Outer Join" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A full outer join can be achieved by setting `how='outer'`. What happens to the unmatched rows from each table?" ] }, { "cell_type": "code", "collapsed": false, "input": [ "pd.merge( bldgSample, lettingsSample_long,\n", " on=['http://opendatacommunities.org/def/ontology/geography/refArea','Reference area'],\n", " how='outer')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "pb.pBlockGrid6( bldgSample, lettingsSample_long,\n", " on=['http://opendatacommunities.org/def/ontology/geography/refArea','Reference area'],\n", " how='outer')[1]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Although standard SQL *does* define `FULL OUTER JOIN` *pandasql* **does not** implement it." 
] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "What Happens With a One-Many Relationship?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's generate a sample data frame that has several rows containing the same reference area so we ca explore what happens when we try to merge a dataframe with with *one* unique reference area per row with a dataframe where there may be multiple rows matching the key values." ] }, { "cell_type": "code", "collapsed": false, "input": [ "bldg_sample_mixed = pd.concat([ bldg_2009_10[:2], bldg_2012_13[:2] ])\n", "bldg_sample_mixed" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What happens for the various joins (*inner*, *left*, *right*, *outer*) when applied to `bldg_sample_mixed` and `lettingsSample_long`?" ] }, { "cell_type": "code", "collapsed": false, "input": [ "#Your experiments here... inner join on bldg_sample_mixed and lettingsSample_long" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "##I'm not sure the pBlocks stuff handles this sensible yet? What would we want the blocks to display?" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "#Your experiments here... left join on bldg_sample_mixed and lettingsSample_long" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "#Your experiments here... right join on bldg_sample_mixed and lettingsSample_long" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "#Your experiments here... outer join on bldg_sample_mixed and lettingsSample_long" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Exercises" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Exercise:\n", "\n", "Find some data of your own that relates to the local authority areas used above and try merging it with the housing start or lettings data.\n", "\n", "As you do so, reflect on the challenges, issues and possible problems that arise when merging data from different sources.
" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Merging Data Tables Where One Key Column Represents a Unique Part of Another" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Where a common identifier scheme is used to identify the same element or entity that is represented in several datasets, it is easy enough to merge the datasets using the column that contains the common identifiers. In the above examples, we were able to merge data about housing build starts and letting prices across UK administrative areas using the the reference area names and/or codes.\n", "\n", "In some cases, however, the identifiers used in one dataset may only partially match the identifiers in another. Sometimes, it is possible for us to recreate the identifiers used in one scheme from the identifiers used in another. For example, given a reference area code such as *E06000001* I can generate the identifier `http://statistics.data.gov.uk/id/statistical-geography/E06000001`. The identifier takes the pattern `http://statistics.data.gov.uk/id/statistical-geography/AREACODE` so given an `AREACODE` we can recreate the identifier.\n", "\n", "At other times, the partial match may be more pernicious. For example, is \"Open Uni\" the same as \"Open University\"? More involved processes are required to cope with such considereations, which we will ignore for now." ] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Well Behaved Partial Matches" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data file `housingdata/yorksAndHumberside.csv` contains a list of administrative areas in the Yorkshire and Humberside adminstrative area. There are three columns in the dataset, taking the form `http://data.ordnancesurvey.co.uk/id/7000000000022028, NorthYorkshire, E10000023`." ] }, { "cell_type": "code", "collapsed": false, "input": [ "pd.read_csv('data/housingdata/yorksAndHumberside.csv')[15:20]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These contrast with the identifiers used in the building start and lettings price datasets, which take the form `http://statistics.data.gov.uk/id/statistical-geography/E06000002,E06000002 Middlesbrough`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Exercise:\n", "\n", "If the Yorkshire and Humber area dataset used the same identifier scheme as the housing start data, how could you generate a dats set that just listed the housing starts for the 2012-13 period in that area?\n", "\n", "In what ways can you generate a common administrative area identifier scheme for the housing start data and the Yorkshire and Humber area data?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "
I can see several ways of generating common identifiers: we could strip the `http://statistics.data.gov.uk/id/statistical-geography/` prefix from the identifiers in the housing data to leave just the bare area code; we could split the area code out of the *Reference area* label (the code appears before the name, as in *E06000002 Middlesbrough*); or we could prepend that prefix to the area code column in the Yorkshire and Humber data to reconstruct the full identifier.\n", "\n", "
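The cell below sketches a couple of these approaches. It assumes that the Ordnance Survey file has a header row and that its final column holds the area code, and that the *refArea* identifiers in the housing data always end with the area code - check the previews above if either assumption looks wrong." ] }, { "cell_type": "code", "collapsed": false, "input": [ "#A sketch of deriving a common area code column in each dataframe\n", "#Assumption: the final column of the OS file holds the area code (eg E10000023)\n", "yorks=pd.read_csv('data/housingdata/yorksAndHumberside.csv')\n", "yorks['AreaCode']=yorks[yorks.columns[-1]].str.strip()\n", "\n", "#Assumption: the refArea identifier always ends with the area code after the final '/'\n", "bldg_2012_13['AreaCode']=bldg_2012_13['http://opendatacommunities.org/def/ontology/geography/refArea'].apply(lambda u: str(u).split('/')[-1])\n", "\n", "#We could equally go the other way and rebuild the full identifier from the OS area code\n", "yorks['refArea']='http://statistics.data.gov.uk/id/statistical-geography/'+yorks['AreaCode']\n", "\n", "bldg_2012_13[['Reference area','AreaCode']][:3]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The new *AreaCode* (or *refArea*) columns can then be used as the `on` key for the sorts of merge described earlier in the notebook.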
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Exercise\n", "\n", "Generate a data table that contains the 2012-13 housing data for just those areas whose area codes are in the Yorkshire and Humber region.\n", "\n", "Which area in that region has the largest total number of housing starts? Which are has the largest number of Housing Association starts?" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Summary" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Data tables can be joined *vertically* in *pandas* using the `.concat()` function and horizontally using `.merge()`. Inner, out, left and right joins are all supported by `.merge()`, and inner and outer joins by `.concat()`.\n", "\n", "In *pandasql*, we can use the `JOIN` operator, suitably modified, to provide a similar range of merge operations." ] } ], "metadata": {} } ] }