{ "metadata": { "name": "", "signature": "sha256:d871c68cdef575e9c7d3ed4ccebc5c3bd613ebcec6f83e9101a9658a5cb36edb" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [
{ "cell_type": "code", "collapsed": false, "input": [ "# special IPython command to prepare the notebook for matplotlib\n", "%matplotlib inline \n", "\n", "import urllib2 # module to read in HTML\n", "import bs4 # BeautifulSoup: module to parse HTML and XML\n", "import json # module to encode and decode JSON\n", "import datetime as dt # module for manipulating dates and times\n", "import pandas as pd\n", "import numpy as np" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Recall from lab last week (09/19/2014)\n", "\n", "Previously discussed: \n", "\n", "* More pandas, matplotlib for exploratory data analysis\n", "* Brief introduction to numpy and scipy\n", "* Working on the command line\n", "* Overview of git and Github" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Today, we will discuss the following:\n", "\n", "* urllib2 - reads in HTML\n", "* BeautifulSoup - used to parse HTML and XML code\n", "    * Reddit\n", "* JSON examples\n", "    * World Cup\n", "\n", " Download this notebook from Github " ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "# urllib2\n", "\n", "[urllib2](https://docs.python.org/2/library/urllib2.html) is a useful module for getting information about, and retrieving data from, the web. The function `urlopen()` opens a URL (similar to opening a file). The file-like object it returns has some of the same methods as a file object. For example, to read the entire HTML of the webpage into a single string, use the method `read()`; `readlines()` reads the text line by line, and `close()` closes the URL connection. 
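A minimal sketch of the `read()` / `readlines()` / `close()` pattern described above. To stay runnable without a network connection, it opens a temporary local file through a `file://` URL instead of a real website; the file, its contents, and the names `page`, `url`, `html_source` and `lines` are made up for illustration. The `try`/`except` import is an assumption that lets the same code also run on Python 3, where these functions live in `urllib.request`.

```python
import os
import tempfile

try:                      # Python 2, as used in this notebook
    import urllib2
    from urllib import pathname2url
except ImportError:       # Python 3: the same functions live in urllib.request
    import urllib.request as urllib2
    from urllib.request import pathname2url

# Hypothetical page, written to a temporary file so no network is needed.
page = '''<html>
<head><title>Demo</title></head>
<body><p>Hello</p></body>
</html>
'''
tmp = tempfile.NamedTemporaryFile(mode='w', suffix='.html', delete=False)
tmp.write(page)
tmp.close()
url = 'file://' + pathname2url(os.path.abspath(tmp.name))

# read(): the whole document as one string (a bytes object on Python 3)
response = urllib2.urlopen(url)
html_source = response.read()
response.close()

# readlines(): the same document as a list of lines
response = urllib2.urlopen(url)
lines = response.readlines()
response.close()

os.remove(tmp.name)       # clean up the temporary file

print(len(lines))
print(html_source == b''.join(lines))
```

With a real site only `url` changes, for example the `http://www.google.com` address used in the next cell.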
\n" ] }, { "cell_type": "code", "collapsed": false, "input": [ "x = urllib2.urlopen(\"http://www.google.com\")\n", "htmlSource = x.read()\n", "x.close()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "type(htmlSource)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "print htmlSource" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "# BeautifulSoup\n", "\n", "Once you have the HTML source code, you have to parse it and clean it up.\n", "\n", "[BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/bs4/doc/) is a really useful Python module for parsing HTML and XML files. Let's try a few examples. \n", "\n", "For this section, we will be working with the HTML code from [Reddit](http://www.reddit.com). " ] },
{ "cell_type": "code", "collapsed": false, "input": [ "x = urllib2.urlopen(\"http://www.reddit.com\") # open the URL\n", "htmlSource = x.read()\n", "x.close()\n", "print htmlSource" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### prettify()\n", "\n", "Beautiful Soup gives us a `BeautifulSoup` object, which represents the document as a nested data structure. We can use the `prettify()` method to show the different levels of the HTML code. " ] },
{ "cell_type": "code", "collapsed": false, "input": [ "soup = bs4.BeautifulSoup(htmlSource)\n", "print soup.prettify()" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Navigating the tree using tags\n", "\n", "The simplest way to navigate the parse tree is to say the name of the tag you want. If you want the `<head>
` tag, just say `soup.head`:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print soup.head.prettify()" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### .contents and .children\n", "\n", "A tag\u2019s children are available in a list called `.contents`. " ] },
{ "cell_type": "code", "collapsed": false, "input": [ "soup.head.contents" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "code", "collapsed": false, "input": [ "len(soup.head.contents)" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "code", "collapsed": false, "input": [ "# Extract first three elements from the list of contents\n", "soup.head.contents[0:3]" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Instead of getting them as a list, you can iterate over a tag\u2019s children using the `.children` generator:" ] },
{ "cell_type": "code", "collapsed": false, "input": [ "soup.head.children" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "code", "collapsed": false, "input": [ "for child in soup.head.children:\n", "    print(child)" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "code", "collapsed": false, "input": [ "# print the title of reddit\n", "soup.head.title" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "code", "collapsed": false, "input": [ "# print the string in the title\n", "soup.head.title.string" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### .descendants\n", "\n", "The `.descendants` attribute lets you iterate over all of a tag\u2019s children, recursively: its direct children, the children of its direct children, and so on:" ] },
{ "cell_type": "code", "collapsed": false, "input": [ "for child in soup.head.descendants:\n", "    print(child)" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### .strings\n", "\n", "If there\u2019s more than one thing inside a tag, you can still look at just the strings. Use the `.strings` generator:" ] },
{ "cell_type": "code", "collapsed": false, "input": [ "for string in soup.strings:\n", "    print(repr(string))" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### .stripped_strings\n", "\n", "These strings tend to have a lot of extra whitespace, which you can remove by using the `.stripped_strings` generator instead:" ] },
{ "cell_type": "code", "collapsed": false, "input": [ "for string in soup.stripped_strings:\n", "    print(repr(string))" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### .parent\n", "\n", "You can access an element\u2019s parent with the `.parent` attribute. In the example \u201cthree sisters\u201d document, the `<head>` tag is the parent of the `<title>` tag." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [
"|  | match_number | location | datetime | home_team | away_team | winner | home_team_events | away_team_events |\n",
"|---|---|---|---|---|---|---|---|---|\n",
"| 0 | 1 | Arena de Sao Paulo | 2014-06-12T17:00:00.000-03:00 | {u'country': u'Brazil', u'code': u'BRA', u'goa... | {u'country': u'Croatia', u'code': u'CRO', u'go... | Brazil | [{u'type_of_event': u'goal-own', u'player': u'... | [{u'type_of_event': u'substitution-in', u'play... |\n",
"| 1 | 2 | Estadio das Dunas | 2014-06-13T13:00:00.000-03:00 | {u'country': u'Mexico', u'code': u'MEX', u'goa... | {u'country': u'Cameroon', u'code': u'CMR', u'g... | Mexico | [{u'type_of_event': u'yellow-card', u'player':... | [{u'type_of_event': u'substitution-in halftime... |\n",
"| 2 | 3 | Arena Fonte Nova | 2014-06-13T16:00:00.000-03:00 | {u'country': u'Spain', u'code': u'ESP', u'goal... | {u'country': u'Netherlands', u'code': u'NED', ... | Netherlands | [{u'type_of_event': u'goal-penalty', u'player'... | [{u'type_of_event': u'yellow-card', u'player':... |\n",
"| 3 | 4 | Arena Pantanal | 2014-06-13T19:00:00.000-03:00 | {u'country': u'Chile', u'code': u'CHI', u'goal... | {u'country': u'Australia', u'code': u'AUS', u'... | Chile | [{u'type_of_event': u'goal', u'player': u'Alex... | [{u'type_of_event': u'goal', u'player': u'Cahi... |\n",
"| 4 | 5 | Estadio Mineirao | 2014-06-14T13:00:00.000-03:00 | {u'country': u'Colombia', u'code': u'COL', u'g... | {u'country': u'Greece', u'code': u'GRE', u'goa... | Colombia | [{u'type_of_event': u'goal', u'player': u'P. A... | [{u'type_of_event': u'yellow-card', u'player':... |\n"
] },
{ "cell_type": "markdown", "metadata": {}, "source": [
"|  | match_number | location | datetime | home_team | away_team | winner | home_team_events | away_team_events | gameDate | gameTime |\n",
"|---|---|---|---|---|---|---|---|---|---|---|\n",
"| 0 | 1 | Arena de Sao Paulo | 2014-06-12T17:00:00.000-03:00 | {u'country': u'Brazil', u'code': u'BRA', u'goa... | {u'country': u'Croatia', u'code': u'CRO', u'go... | Brazil | [{u'type_of_event': u'goal-own', u'player': u'... | [{u'type_of_event': u'substitution-in', u'play... | 2014-06-12 | 20:00:00 |\n",
"| 1 | 2 | Estadio das Dunas | 2014-06-13T13:00:00.000-03:00 | {u'country': u'Mexico', u'code': u'MEX', u'goa... | {u'country': u'Cameroon', u'code': u'CMR', u'g... | Mexico | [{u'type_of_event': u'yellow-card', u'player':... | [{u'type_of_event': u'substitution-in halftime... | 2014-06-13 | 16:00:00 |\n",
"| 2 | 3 | Arena Fonte Nova | 2014-06-13T16:00:00.000-03:00 | {u'country': u'Spain', u'code': u'ESP', u'goal... | {u'country': u'Netherlands', u'code': u'NED', ... | Netherlands | [{u'type_of_event': u'goal-penalty', u'player'... | [{u'type_of_event': u'yellow-card', u'player':... | 2014-06-13 | 19:00:00 |\n",
"| 3 | 4 | Arena Pantanal | 2014-06-13T19:00:00.000-03:00 | {u'country': u'Chile', u'code': u'CHI', u'goal... | {u'country': u'Australia', u'code': u'AUS', u'... | Chile | [{u'type_of_event': u'goal', u'player': u'Alex... | [{u'type_of_event': u'goal', u'player': u'Cahi... | 2014-06-13 | 22:00:00 |\n",
"| 4 | 5 | Estadio Mineirao | 2014-06-14T13:00:00.000-03:00 | {u'country': u'Colombia', u'code': u'COL', u'g... | {u'country': u'Greece', u'code': u'GRE', u'goa... | Colombia | [{u'type_of_event': u'goal', u'player': u'P. A... | [{u'type_of_event': u'yellow-card', u'player':... | 2014-06-14 | 16:00:00 |\n"
] }
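The `gameDate` and `gameTime` columns in the table above look like the `datetime` column shifted to UTC and split in two (17:00 at an offset of -03:00 becomes 20:00). The code that produced them is not shown here, so what follows is only one possible way to arrive at the same values, sketched with the Python 3.7+ standard library rather than the notebook's Python 2 setup. The helper name `split_utc` and the list `raw` are hypothetical.

```python
from datetime import datetime, timezone

# Two timestamps copied from the datetime column of the table above.
raw = [
    '2014-06-12T17:00:00.000-03:00',
    '2014-06-13T13:00:00.000-03:00',
]


def split_utc(stamp):
    '''Return (date, time) in UTC for an ISO 8601 timestamp with an offset.'''
    local = datetime.fromisoformat(stamp)      # aware datetime, offset -03:00
    utc = local.astimezone(timezone.utc)       # same instant, expressed in UTC
    return utc.date(), utc.time()


for stamp in raw:
    game_date, game_time = split_utc(stamp)
    print(stamp, game_date, game_time)
```

Converting to UTC before splitting matters for late matches: a kickoff at 22:00 local time with a -03:00 offset would land on the next calendar day in UTC.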