{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "code", "collapsed": false, "input": [ "import cs109style\n", "cs109style.customize_mpl()\n", "cs109style.customize_css()\n", "\n", "# special IPython command to prepare the notebook for matplotlib\n", "%matplotlib inline \n", "\n", "from collections import defaultdict\n", "\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import requests\n", "from pattern import web\n", "\n" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fetching population data from Wikipedia\n", "\n", "In this example we will fetch data about countries and their population from Wikipedia.\n", "\n", "http://en.wikipedia.org/wiki/List_of_countries_by_past_and_future_population has several tables for individual countries, subcontinents as well as different years. We will combine the data for all countries and all years in a single panda dataframe and visualize the change in population for different countries.\n", "\n", "###We will go through the following steps:\n", "* fetching html with embedded data\n", "* parsing html to extract the data\n", "* collecting the data in a panda dataframe\n", "* displaying the data\n", "\n", "To give you some starting points for your homework, we will also show the different sub-steps that can be taken to reach the presented solution." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fetching the Wikipedia site" ] }, { "cell_type": "code", "collapsed": false, "input": [ "url = 'http://en.wikipedia.org/wiki/List_of_countries_by_past_and_future_population'\n", "website_html = requests.get(url).text\n", "#print website_html" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Parsing html data" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def get_population_html_tables(html):\n", " \"\"\"Parse html and return html tables of wikipedia population data.\"\"\"\n", "\n", " dom = web.Element(html)\n", "\n", " ### 0. step: look at html source!\n", " \n", " #### 1. step: get all tables\n", "\n", " #### 2. step: get all tables we care about\n", "\n", " return tbls\n", "\n", "tables = get_population_html_tables(website_html)\n", "print \"table length: %d\" %len(tables)\n", "for t in tables:\n", " print t.attributes\n" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "def table_type(tbl):\n", " ### Extract the table type\n", "\n", "# group the tables by type\n", "tables_by_type = defaultdict(list) # defaultdicts have a default value that is inserted when a new key is accessed\n", "for tbl in tables:\n", " tables_by_type[table_type(tbl)].append(tbl)\n", "\n", "print tables_by_type" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Extracting data and filling it into a dictionary" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def get_countries_population(tables):\n", " \"\"\"Extract population data for countries from all tables and store it in dictionary.\"\"\"\n", " \n", " result = defaultdict(dict)\n", "\n", " # 1. step: try to extract data for a single table\n", "\n", " # 2. step: iterate over all tables, extract headings and actual data and combine data into single dict\n", " \n", " return result\n", "\n", "\n", "result = get_countries_population(tables_by_type['Country or territory'])\n", "print result" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating a dataframe from a dictionary" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# create dataframe\n", "\n", "df = pd.DataFrame.from_dict(result, orient='index')\n", "# sort based on year\n", "df.sort(axis=1,inplace=True)\n", "print df\n" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Some data accessing functions for a panda dataframe" ] }, { "cell_type": "code", "collapsed": false, "input": [ "subtable = df.iloc[0:2, 0:2]\n", "print \"subtable\"\n", "print subtable\n", "print \"\"\n", "\n", "column = df[1955]\n", "print \"column\"\n", "print column\n", "print \"\"\n", "\n", "row = df.ix[0] #row 0\n", "print \"row\"\n", "print row\n", "print \"\"\n", "\n", "rows = df.ix[:2] #rows 0,1\n", "print \"rows\"\n", "print rows\n", "print \"\"\n", "\n", "element = df.ix[0,1955] #element\n", "print \"element\"\n", "print element\n", "print \"\"\n", "\n", "# max along column\n", "print \"max\"\n", "print df[1950].max()\n", "print \"\"\n", "\n", "# axes\n", "print \"axes\"\n", "print df.axes\n", "print \"\"\n", "\n", "row = df.ix[0]\n", "print \"row info\"\n", "print row.name\n", "print row.index\n", "print \"\"\n", "\n", "countries = df.index\n", "print \"countries\"\n", "print countries\n", "print \"\"\n", "\n", "print \"Austria\"\n", "print df.ix['Austria']" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Plotting population of 4 countries" ] }, { "cell_type": "code", "collapsed": false, "input": [ "plotCountries = ['Austria', 'Germany', 'United States', 'France']\n", " \n", "for country in plotCountries:\n", " row = df.ix[country]\n", " plt.plot(row.index, row, label=row.name ) \n", " \n", "plt.ylim(ymin=0) # start y axis at 0\n", "\n", "plt.xticks(rotation=70)\n", "plt.legend(loc='best')\n", "plt.xlabel(\"Year\")\n", "plt.ylabel(\"# people (million)\")\n", "plt.title(\"Population of countries\")" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Plot 5 most populous countries from 2010 and 2060" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def plot_populous(df, year):\n", " # sort table depending on data value in year column\n", " df_by_year = df.sort(year, ascending=False)\n", " \n", " plt.figure()\n", " for i in range(5): \n", " row = df_by_year.ix[i]\n", " plt.plot(row.index, row, label=row.name ) \n", " \n", " plt.ylim(ymin=0)\n", " \n", " plt.xticks(rotation=70)\n", " plt.legend(loc='best')\n", " plt.xlabel(\"Year\")\n", " plt.ylabel(\"# people (million)\")\n", " plt.title(\"Most populous countries in %d\" % year)\n", "\n", "plot_populous(df, 2010)\n", "plot_populous(df, 2050)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }