{ "metadata": { "name": "", "signature": "sha256:7f0b4427556f9d9c59904b2d9364da550ac1ffad2eb2bcd098951a4d185133df" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "UNdata Informal API" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The UNdata website offers an [official API](http://data.un.org/Host.aspx?Content=API) but it doesn't look overly welcoming to someone not versed in the XML protocol it supports. So here's a hacked solution based on scraping a websearch that let's you search the site for datasets, and then download the one you want as a zipped CSV file that gets automatically parsed into a *pandas* dataframe.\n", "\n", "The UN data search form lets you download data directly from the results page:\n", "\n", "![UN Data search form](https://farm4.staticflickr.com/3899/14739472598_1226c17a04_c.jpg)\n", "\n", "So let's write a simple scraper to grab the results and see if you can download a selected ata file automatically..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we *View Source* on the results page we can look for the individual results items - and see what we neeed to parse out.\n", "\n", "![UN data search results - view source](https://farm4.staticflickr.com/3914/14738974149_486ff75e3a_c.jpg)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We also need to have a look at what form the HTTP request for a data download looks like to make sure we get what we need when we do scrape the results...\n", "\n", "![UN Data results page - with data download request details](https://farm6.staticflickr.com/5557/14922406411_c125576dd4_c.jpg)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "#Load in some libraries to handle the web page requests and the web page parsing...\n", "import requests\n", "\n", "#You may need to install BeautifulSoup\n", "#!pip3 install beautifulsoup4\n", "from bs4 import BeautifulSoup\n", "\n", "#Note - I'm in Python3\n", "from urllib.parse import parse_qs\n", "\n", "#The scraper will be limited to just the first results page...\n", "def searchUNdata(q):\n", " ''' Run a search on the UN data website and scrape the results '''\n", " \n", " params={'q':q}\n", " url='http://data.un.org/Search.aspx'\n", "\n", " response = requests.get(url,params=params)\n", "\n", " soup=BeautifulSoup(response.content)\n", "\n", " results={}\n", "\n", " #Get the list of results\n", " searchresults=soup.findAll('div',{'class':'Result'})\n", " \n", " #For each result, parse out the name of the dataset, the datamart ID and the data filter ID\n", " for result in searchresults:\n", " h2=result.find('h2')\n", " #We can find everything we need in the tag...\n", " a=h2.find('a')\n", " p=parse_qs(a.attrs['href'])\n", " results[a.text]=(p['d'][0],p['f'][0])\n", "\n", " return results" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 2 }, { "cell_type": "code", "collapsed": false, "input": [ "#A couple of helper functions to let us display the results\n", "\n", "results=searchUNdata('carbon dioxide')\n", "\n", "def printResults(results):\n", " ''' Nicely print the search results '''\n", " \n", " for result in results.keys():\n", " print(result)\n", "\n", "\n", "def unDataSearch(q):\n", " ''' Simple function to take a search phrase, run the search on the UN data site, and print and return the results. '''\n", " \n", " results=searchUNdata(q)\n", " printResults(results)\n", " return results\n", "\n", "printResults(results)\n", "\n", "#q='carbon dioxide'\n", "#unDataSearch(q)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Carbon dioxide emissions (CO2), metric tons of CO2 per capita (CDIAC)\n", "Carbon dioxide emissions (CO2), thousand metric tons of CO2 (UNFCCC)\n", "Carbon dioxide emissions (CO2), kg CO2 per $1 GDP (PPP) (CDIAC)\n", "Carbon dioxide (CO2) Emissions without Land Use, Land-Use Change and Forestry (LULUCF), in Gigagrams (Gg)\n", "Carbon dioxide emissions (CO2), thousand metric tons of CO2 (CDIAC)\n", "Carbon dioxide emissions (CO2), metric tons of CO2 per capita (UNFCCC)\n", "Carbon dioxide emissions (CO2), kg CO2 per $1 GDP (PPP) (UNFCCC)\n", "Trade of goods , US$, HS 1992, 28 Inorganic chemicals, precious metal compound, isotope\n" ] } ], "prompt_number": 3 }, { "cell_type": "code", "collapsed": false, "input": [ "#Just in case - a helper routine for working with the search results data\n", "def search(d, substr):\n", " ''' Partial string match search within dict key names '''\n", " #via http://stackoverflow.com/a/10796050/454773\n", " \n", " result = []\n", " for key in d:\n", " if substr.lower() in key.lower():\n", " result.append((key, d[key])) \n", "\n", " return result" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 4 }, { "cell_type": "code", "collapsed": false, "input": [ "search(results, 'per capita')" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 5, "text": [ "[('Carbon dioxide emissions (CO2), metric tons of CO2 per capita (CDIAC)',\n", " ('MDG', 'seriesRowID:751')),\n", " ('Carbon dioxide emissions (CO2), metric tons of CO2 per capita (UNFCCC)',\n", " ('MDG', 'seriesRowID:752'))]" ] } ], "prompt_number": 5 }, { "cell_type": "code", "collapsed": false, "input": [ "#Note - I'm in Python3\n", "from io import BytesIO\n", "\n", "import zipfile\n", "import pandas as pd\n", "\n", "def getUNdata(undataSearchResults,dataset):\n", " ''' Download a named dataset from the UN Data website and load it into a pandas dataframe '''\n", "\n", " datamartID,seriesRowID=undataSearchResults[dataset]\n", " \n", " url='http://data.un.org/Handlers/DownloadHandler.ashx?DataFilter='+seriesRowID+'&DataMartId='+datamartID+'&Format=csv'\n", "\n", " r = requests.get(url)\n", " \n", " \n", " s=BytesIO(r.content)\n", " z = zipfile.ZipFile(s)\n", " \n", " #Show the files in the zip file\n", " #z.namelist()\n", " \n", " #Let's assume we just get one file per zip...\n", " #Drop any all blank columns\n", " df=pd.read_csv( BytesIO( z.read( z.namelist()[0] ) )).dropna(axis=1,how='all')\n", " return df" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 6 }, { "cell_type": "code", "collapsed": false, "input": [ "results=unDataSearch('carbon dioxide')" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Carbon dioxide emissions (CO2), metric tons of CO2 per capita (CDIAC)\n", "Carbon dioxide emissions (CO2), thousand metric tons of CO2 (UNFCCC)\n", "Carbon dioxide emissions (CO2), kg CO2 per $1 GDP (PPP) (CDIAC)\n", "Carbon dioxide (CO2) Emissions without Land Use, Land-Use Change and Forestry (LULUCF), in Gigagrams (Gg)\n", "Carbon dioxide emissions (CO2), thousand metric tons of CO2 (CDIAC)\n", "Carbon dioxide emissions (CO2), metric tons of CO2 per capita (UNFCCC)\n", "Carbon dioxide emissions (CO2), kg CO2 per $1 GDP (PPP) (UNFCCC)\n", "Trade of goods , US$, HS 1992, 28 Inorganic chemicals, precious metal compound, isotope\n" ] } ], "prompt_number": 7 }, { "cell_type": "code", "collapsed": false, "input": [ "dd=getUNdata(results,'Carbon dioxide emissions (CO2), metric tons of CO2 per capita (UNFCCC)')\n", "\n", "#Preview the last few rows\n", "dd[-5:]" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Country or AreaYearValueValue FootnotesValue Footnotes.1
922 United States 1991 19.357277 1 1
923 United States 1990 19.746756 1 1
924 NaN NaN NaNNaNNaN
925 footnoteSeqID Footnote NaNNaNNaN
926 1 For Denmark, France, United Kingdom and United... NaNNaNNaN
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 13, "text": [ " Country or Area Year \\\n", "922 United States 1991 \n", "923 United States 1990 \n", "924 NaN NaN \n", "925 footnoteSeqID Footnote \n", "926 1 For Denmark, France, United Kingdom and United... \n", "\n", " Value Value Footnotes Value Footnotes.1 \n", "922 19.357277 1 1 \n", "923 19.746756 1 1 \n", "924 NaN NaN NaN \n", "925 NaN NaN NaN \n", "926 NaN NaN NaN " ] } ], "prompt_number": 13 }, { "cell_type": "code", "collapsed": false, "input": [ "#One thing to note is that footnotes may appear at the bottom of a dataframe\n", "#We can spot the all empty row and drop rows from that\n", "#We can also drop the footnote related columns\n", "def dropFootnotes(df):\n", " return df[:pd.isnull(dd).all(1).nonzero()[0][0]].drop(['Value Footnotes','Value Footnotes.1'], 1)\n", "\n", "dropFootnotes(dd)[-5:]" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Country or AreaYearValue
919 United States 1994 19.903438
920 United States 1993 19.788616
921 United States 1992 19.568633
922 United States 1991 19.357277
923 United States 1990 19.746756
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 14, "text": [ " Country or Area Year Value\n", "919 United States 1994 19.903438\n", "920 United States 1993 19.788616\n", "921 United States 1992 19.568633\n", "922 United States 1991 19.357277\n", "923 United States 1990 19.746756" ] } ], "prompt_number": 14 }, { "cell_type": "code", "collapsed": false, "input": [ "#Create a function that automatically drops the footnotes and any empty rows\n", "def getUNdata2(undataSearchResults, dataset, footnotes=False):\n", " df=getUNdata(undataSearchResults, dataset)\n", " if footnotes:\n", " return df\n", " return dropFootnotes(df)\n", " " ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 18 }, { "cell_type": "code", "collapsed": false, "input": [ "getUNdata2(results,'Carbon dioxide emissions (CO2), metric tons of CO2 per capita (UNFCCC)')[-5:]" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Country or AreaYearValue
919 United States 1994 19.903438
920 United States 1993 19.788616
921 United States 1992 19.568633
922 United States 1991 19.357277
923 United States 1990 19.746756
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 24, "text": [ " Country or Area Year Value\n", "919 United States 1994 19.903438\n", "920 United States 1993 19.788616\n", "921 United States 1992 19.568633\n", "922 United States 1991 19.357277\n", "923 United States 1990 19.746756" ] } ], "prompt_number": 24 }, { "cell_type": "code", "collapsed": false, "input": [ "getUNdata2(results,'Carbon dioxide emissions (CO2), metric tons of CO2 per capita (UNFCCC)',footnotes=True)[-5:]" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Country or AreaYearValueValue FootnotesValue Footnotes.1
922 United States 1991 19.357277 1 1
923 United States 1990 19.746756 1 1
924 NaN NaN NaNNaNNaN
925 footnoteSeqID Footnote NaNNaNNaN
926 1 For Denmark, France, United Kingdom and United... NaNNaNNaN
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 23, "text": [ " Country or Area Year \\\n", "922 United States 1991 \n", "923 United States 1990 \n", "924 NaN NaN \n", "925 footnoteSeqID Footnote \n", "926 1 For Denmark, France, United Kingdom and United... \n", "\n", " Value Value Footnotes Value Footnotes.1 \n", "922 19.357277 1 1 \n", "923 19.746756 1 1 \n", "924 NaN NaN NaN \n", "925 NaN NaN NaN \n", "926 NaN NaN NaN " ] } ], "prompt_number": 23 }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Summary" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook demonstrates a simple, informal scraper based API to the UN data website. Searches can be run on the UN data website to obtain a list of named datasets, and then a specified named dataset can be automatically downloaded into a *pandas* dataframe." ] } ], "metadata": {} } ] }