{
 "metadata": {
  "name": "",
  "signature": "sha256:7f0b4427556f9d9c59904b2d9364da550ac1ffad2eb2bcd098951a4d185133df"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": [
      "UNdata Informal API"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The UNdata website offers an [official API](http://data.un.org/Host.aspx?Content=API) but it doesn't look overly welcoming to someone not versed in the XML protocol it supports. So here's a hacked solution based on scraping a websearch that let's you search the site for datasets, and then download the one you want as a zipped CSV file that gets automatically parsed into a *pandas* dataframe.\n",
      "\n",
      "The UN data search form lets you download data directly from the results page:\n",
      "\n",
      "![UN Data search form](https://farm4.staticflickr.com/3899/14739472598_1226c17a04_c.jpg)\n",
      "\n",
      "So let's write a simple scraper to grab the results and see if you can download a selected ata file automatically..."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "If we *View Source* on the results page we can look for the individual results items - and see what we neeed to parse out.\n",
      "\n",
      "![UN data search results - view source](https://farm4.staticflickr.com/3914/14738974149_486ff75e3a_c.jpg)"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We also need to have a look at what form the HTTP request for a data download looks like to make sure we get what we need when we do scrape the results...\n",
      "\n",
      "![UN Data results page - with data download request details](https://farm6.staticflickr.com/5557/14922406411_c125576dd4_c.jpg)"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#Load in some libraries to handle the web page requests and the web page parsing...\n",
      "import requests\n",
      "\n",
      "#You may need to install BeautifulSoup\n",
      "#!pip3 install beautifulsoup4\n",
      "from bs4 import BeautifulSoup\n",
      "\n",
      "#Note - I'm in Python3\n",
      "from urllib.parse import parse_qs\n",
      "\n",
      "#The scraper will be limited to just the first results page...\n",
      "def searchUNdata(q):\n",
      "    ''' Run a search on the UN data website and scrape the results '''\n",
      "    \n",
      "    params={'q':q}\n",
      "    url='http://data.un.org/Search.aspx'\n",
      "\n",
      "    response = requests.get(url,params=params)\n",
      "\n",
      "    soup=BeautifulSoup(response.content)\n",
      "\n",
      "    results={}\n",
      "\n",
      "    #Get the list of results\n",
      "    searchresults=soup.findAll('div',{'class':'Result'})\n",
      "    \n",
      "    #For each result, parse out the name of the dataset, the datamart ID and the data filter ID\n",
      "    for result in searchresults:\n",
      "        h2=result.find('h2')\n",
      "        #We can find everything we need in the <a> tag...\n",
      "        a=h2.find('a')\n",
      "        p=parse_qs(a.attrs['href'])\n",
      "        results[a.text]=(p['d'][0],p['f'][0])\n",
      "\n",
      "    return results"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 2
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#A couple of helper functions to let us display the results\n",
      "\n",
      "results=searchUNdata('carbon dioxide')\n",
      "\n",
      "def printResults(results):\n",
      "    ''' Nicely print the search results '''\n",
      "    \n",
      "    for result in results.keys():\n",
      "        print(result)\n",
      "\n",
      "\n",
      "def unDataSearch(q):\n",
      "    ''' Simple function to take a search phrase, run the search on the UN data site, and print and return the results. '''\n",
      "    \n",
      "    results=searchUNdata(q)\n",
      "    printResults(results)\n",
      "    return results\n",
      "\n",
      "printResults(results)\n",
      "\n",
      "#q='carbon dioxide'\n",
      "#unDataSearch(q)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Carbon dioxide emissions (CO2), metric tons of CO2 per capita (CDIAC)\n",
        "Carbon dioxide emissions (CO2), thousand metric tons of CO2 (UNFCCC)\n",
        "Carbon dioxide emissions (CO2), kg CO2 per $1 GDP (PPP) (CDIAC)\n",
        "Carbon dioxide (CO2) Emissions without Land Use, Land-Use Change and Forestry (LULUCF), in Gigagrams (Gg)\n",
        "Carbon dioxide emissions (CO2), thousand metric tons of CO2 (CDIAC)\n",
        "Carbon dioxide emissions (CO2), metric tons of CO2 per capita (UNFCCC)\n",
        "Carbon dioxide emissions (CO2), kg CO2 per $1 GDP (PPP) (UNFCCC)\n",
        "Trade of goods , US$, HS 1992, 28 Inorganic chemicals, precious metal compound, isotope\n"
       ]
      }
     ],
     "prompt_number": 3
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#Just in case - a helper routine for working with the search results data\n",
      "def search(d, substr):\n",
      "    ''' Partial string match search within dict key names '''\n",
      "    #via http://stackoverflow.com/a/10796050/454773\n",
      "    \n",
      "    result = []\n",
      "    for key in d:\n",
      "        if substr.lower() in key.lower():\n",
      "            result.append((key, d[key])) \n",
      "\n",
      "    return result"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 4
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "search(results, 'per capita')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 5,
       "text": [
        "[('Carbon dioxide emissions (CO2), metric tons of CO2 per capita (CDIAC)',\n",
        "  ('MDG', 'seriesRowID:751')),\n",
        " ('Carbon dioxide emissions (CO2), metric tons of CO2 per capita (UNFCCC)',\n",
        "  ('MDG', 'seriesRowID:752'))]"
       ]
      }
     ],
     "prompt_number": 5
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#Note - I'm in Python3\n",
      "from io import BytesIO\n",
      "\n",
      "import zipfile\n",
      "import pandas as pd\n",
      "\n",
      "def getUNdata(undataSearchResults,dataset):\n",
      "    ''' Download a named dataset from the UN Data website and load it into a pandas dataframe '''\n",
      "\n",
      "    datamartID,seriesRowID=undataSearchResults[dataset]\n",
      "    \n",
      "    url='http://data.un.org/Handlers/DownloadHandler.ashx?DataFilter='+seriesRowID+'&DataMartId='+datamartID+'&Format=csv'\n",
      "\n",
      "    r = requests.get(url)\n",
      "    \n",
      "    \n",
      "    s=BytesIO(r.content)\n",
      "    z = zipfile.ZipFile(s)\n",
      "    \n",
      "    #Show the files in the zip file\n",
      "    #z.namelist()\n",
      "    \n",
      "    #Let's assume we just get one file per zip...\n",
      "    #Drop any all blank columns\n",
      "    df=pd.read_csv( BytesIO( z.read( z.namelist()[0] ) )).dropna(axis=1,how='all')\n",
      "    return df"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 6
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "results=unDataSearch('carbon dioxide')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Carbon dioxide emissions (CO2), metric tons of CO2 per capita (CDIAC)\n",
        "Carbon dioxide emissions (CO2), thousand metric tons of CO2 (UNFCCC)\n",
        "Carbon dioxide emissions (CO2), kg CO2 per $1 GDP (PPP) (CDIAC)\n",
        "Carbon dioxide (CO2) Emissions without Land Use, Land-Use Change and Forestry (LULUCF), in Gigagrams (Gg)\n",
        "Carbon dioxide emissions (CO2), thousand metric tons of CO2 (CDIAC)\n",
        "Carbon dioxide emissions (CO2), metric tons of CO2 per capita (UNFCCC)\n",
        "Carbon dioxide emissions (CO2), kg CO2 per $1 GDP (PPP) (UNFCCC)\n",
        "Trade of goods , US$, HS 1992, 28 Inorganic chemicals, precious metal compound, isotope\n"
       ]
      }
     ],
     "prompt_number": 7
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "dd=getUNdata(results,'Carbon dioxide emissions (CO2), metric tons of CO2 per capita (UNFCCC)')\n",
      "\n",
      "#Preview the last few rows\n",
      "dd[-5:]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>Country or Area</th>\n",
        "      <th>Year</th>\n",
        "      <th>Value</th>\n",
        "      <th>Value Footnotes</th>\n",
        "      <th>Value Footnotes.1</th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>922</th>\n",
        "      <td> United States</td>\n",
        "      <td>                                              1991</td>\n",
        "      <td> 19.357277</td>\n",
        "      <td>  1</td>\n",
        "      <td>  1</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>923</th>\n",
        "      <td> United States</td>\n",
        "      <td>                                              1990</td>\n",
        "      <td> 19.746756</td>\n",
        "      <td>  1</td>\n",
        "      <td>  1</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>924</th>\n",
        "      <td>           NaN</td>\n",
        "      <td>                                               NaN</td>\n",
        "      <td>       NaN</td>\n",
        "      <td>NaN</td>\n",
        "      <td>NaN</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>925</th>\n",
        "      <td> footnoteSeqID</td>\n",
        "      <td>                                          Footnote</td>\n",
        "      <td>       NaN</td>\n",
        "      <td>NaN</td>\n",
        "      <td>NaN</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>926</th>\n",
        "      <td>             1</td>\n",
        "      <td> For Denmark, France, United Kingdom and United...</td>\n",
        "      <td>       NaN</td>\n",
        "      <td>NaN</td>\n",
        "      <td>NaN</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 13,
       "text": [
        "    Country or Area                                               Year  \\\n",
        "922   United States                                               1991   \n",
        "923   United States                                               1990   \n",
        "924             NaN                                                NaN   \n",
        "925   footnoteSeqID                                           Footnote   \n",
        "926               1  For Denmark, France, United Kingdom and United...   \n",
        "\n",
        "         Value  Value Footnotes  Value Footnotes.1  \n",
        "922  19.357277                1                  1  \n",
        "923  19.746756                1                  1  \n",
        "924        NaN              NaN                NaN  \n",
        "925        NaN              NaN                NaN  \n",
        "926        NaN              NaN                NaN  "
       ]
      }
     ],
     "prompt_number": 13
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#One thing to note is that footnotes may appear at the bottom of a dataframe\n",
      "#We can spot the all empty row and drop rows from that\n",
      "#We can also drop the footnote related columns\n",
      "def dropFootnotes(df):\n",
      "    return df[:pd.isnull(dd).all(1).nonzero()[0][0]].drop(['Value Footnotes','Value Footnotes.1'], 1)\n",
      "\n",
      "dropFootnotes(dd)[-5:]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>Country or Area</th>\n",
        "      <th>Year</th>\n",
        "      <th>Value</th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>919</th>\n",
        "      <td> United States</td>\n",
        "      <td> 1994</td>\n",
        "      <td> 19.903438</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>920</th>\n",
        "      <td> United States</td>\n",
        "      <td> 1993</td>\n",
        "      <td> 19.788616</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>921</th>\n",
        "      <td> United States</td>\n",
        "      <td> 1992</td>\n",
        "      <td> 19.568633</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>922</th>\n",
        "      <td> United States</td>\n",
        "      <td> 1991</td>\n",
        "      <td> 19.357277</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>923</th>\n",
        "      <td> United States</td>\n",
        "      <td> 1990</td>\n",
        "      <td> 19.746756</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 14,
       "text": [
        "    Country or Area  Year      Value\n",
        "919   United States  1994  19.903438\n",
        "920   United States  1993  19.788616\n",
        "921   United States  1992  19.568633\n",
        "922   United States  1991  19.357277\n",
        "923   United States  1990  19.746756"
       ]
      }
     ],
     "prompt_number": 14
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#Create a function that automatically drops the footnotes and any empty rows\n",
      "def getUNdata2(undataSearchResults, dataset, footnotes=False):\n",
      "    df=getUNdata(undataSearchResults, dataset)\n",
      "    if footnotes:\n",
      "        return df\n",
      "    return dropFootnotes(df)\n",
      "    "
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 18
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "getUNdata2(results,'Carbon dioxide emissions (CO2), metric tons of CO2 per capita (UNFCCC)')[-5:]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>Country or Area</th>\n",
        "      <th>Year</th>\n",
        "      <th>Value</th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>919</th>\n",
        "      <td> United States</td>\n",
        "      <td> 1994</td>\n",
        "      <td> 19.903438</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>920</th>\n",
        "      <td> United States</td>\n",
        "      <td> 1993</td>\n",
        "      <td> 19.788616</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>921</th>\n",
        "      <td> United States</td>\n",
        "      <td> 1992</td>\n",
        "      <td> 19.568633</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>922</th>\n",
        "      <td> United States</td>\n",
        "      <td> 1991</td>\n",
        "      <td> 19.357277</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>923</th>\n",
        "      <td> United States</td>\n",
        "      <td> 1990</td>\n",
        "      <td> 19.746756</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 24,
       "text": [
        "    Country or Area  Year      Value\n",
        "919   United States  1994  19.903438\n",
        "920   United States  1993  19.788616\n",
        "921   United States  1992  19.568633\n",
        "922   United States  1991  19.357277\n",
        "923   United States  1990  19.746756"
       ]
      }
     ],
     "prompt_number": 24
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "getUNdata2(results,'Carbon dioxide emissions (CO2), metric tons of CO2 per capita (UNFCCC)',footnotes=True)[-5:]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>Country or Area</th>\n",
        "      <th>Year</th>\n",
        "      <th>Value</th>\n",
        "      <th>Value Footnotes</th>\n",
        "      <th>Value Footnotes.1</th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>922</th>\n",
        "      <td> United States</td>\n",
        "      <td>                                              1991</td>\n",
        "      <td> 19.357277</td>\n",
        "      <td>  1</td>\n",
        "      <td>  1</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>923</th>\n",
        "      <td> United States</td>\n",
        "      <td>                                              1990</td>\n",
        "      <td> 19.746756</td>\n",
        "      <td>  1</td>\n",
        "      <td>  1</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>924</th>\n",
        "      <td>           NaN</td>\n",
        "      <td>                                               NaN</td>\n",
        "      <td>       NaN</td>\n",
        "      <td>NaN</td>\n",
        "      <td>NaN</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>925</th>\n",
        "      <td> footnoteSeqID</td>\n",
        "      <td>                                          Footnote</td>\n",
        "      <td>       NaN</td>\n",
        "      <td>NaN</td>\n",
        "      <td>NaN</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>926</th>\n",
        "      <td>             1</td>\n",
        "      <td> For Denmark, France, United Kingdom and United...</td>\n",
        "      <td>       NaN</td>\n",
        "      <td>NaN</td>\n",
        "      <td>NaN</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 23,
       "text": [
        "    Country or Area                                               Year  \\\n",
        "922   United States                                               1991   \n",
        "923   United States                                               1990   \n",
        "924             NaN                                                NaN   \n",
        "925   footnoteSeqID                                           Footnote   \n",
        "926               1  For Denmark, France, United Kingdom and United...   \n",
        "\n",
        "         Value  Value Footnotes  Value Footnotes.1  \n",
        "922  19.357277                1                  1  \n",
        "923  19.746756                1                  1  \n",
        "924        NaN              NaN                NaN  \n",
        "925        NaN              NaN                NaN  \n",
        "926        NaN              NaN                NaN  "
       ]
      }
     ],
     "prompt_number": 23
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Summary"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "This notebook demonstrates a simple, informal scraper based API to the UN data website. Searches can be run on the UN data website to obtain a list of named datasets, and then a specified named dataset can be automatically downloaded into a *pandas* dataframe."
     ]
    }
   ],
   "metadata": {}
  }
 ]
}