{ "metadata": { "name": "", "signature": "sha256:d871c68cdef575e9c7d3ed4ccebc5c3bd613ebcec6f83e9101a9658a5cb36edb" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "code", "collapsed": false, "input": [ "# special IPython command to prepare the notebook for matplotlib\n", "%matplotlib inline \n", "\n", "import urllib2 # module to read in HTML\n", "import bs4 # BeautifulSoup: module to parse HTML and XML\n", "import json # \n", "import datetime as dt # module for manipulating dates and times\n", "import pandas as pd\n", "import numpy as np" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Recall from from lab last week 09/19/2014\n", "\n", "Previously discussed: \n", "\n", "* More pandas, matplotlib for exploratory data analysis\n", "* Brief introduction to numpy and scipy\n", "* Working on the command line\n", "* Overview of git and Github" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Today, we will discuss the following:\n", "\n", "* urllib2 - reads in HTML\n", "* BeautifulSoup - use to parse HTML and XML code\n", " * Reddit\n", "* JSON examples\n", " * World Cup\n", "\n", " Download this notebook from Github " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# urllib2\n", "\n", "[urllib2](https://docs.python.org/2/library/urllib2.html) is a useful module to get information about and retrieving data from the web. The function `urlopen()` opens a URL (similar to opening a file). The file-like object has some of the methods as a file object. For example, to read the entire HTML of the webpage into a single string, use the method `read()`. `readlines()` can read in the text line by line. While `read()` reads in the HTML code and and `close()` closes the URL connection. \n" ] }, { "cell_type": "code", "collapsed": false, "input": [ "x = urllib2.urlopen(\"http://www.google.com\")\n", "htmlSource = x.read()\n", "x.close()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "type(htmlSource)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "print htmlSource" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# BeautifulSoup\n", "\n", "Once you have the HTML source code, you have to parse it and clean it up.\n", "\n", "[BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/bs4/doc/) is a really useful python module for parsing HTML and XML files. Let's try a few examples. \n", "\n", "For this section, we will be working with the HTML code from [Reddit](http://www.reddit.com). " ] }, { "cell_type": "code", "collapsed": false, "input": [ "x = urllib2.urlopen(\"http://www.reddit.com\") # Opens URLS\n", "htmlSource = x.read()\n", "x.close()\n", "print htmlSource" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "### prettify()\n", "\n", "Beautiful Soup gives us a `BeautifulSoup` object, which represents the document as a nested data structure. We can use the `prettify()` function to show the different levels of the HTML code. " ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "soup = bs4.BeautifulSoup(htmlSource)\n", "print soup.prettify()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Navigating the tree using tags\n", "\n", "The simplest way to navigate the parse tree is to say the name of the tag you want. If you want the `` tag, just say `soup.head`:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print soup.head.prettify()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### .contents and .children\n", "\n", "A tag\u2019s children are available in a list called `.contents` which returns a list. " ] }, { "cell_type": "code", "collapsed": false, "input": [ "soup.head.contents" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "len(soup.head.contents)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# Extract first three elements from the list of contents\n", "soup.head.contents[0:3]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Instead of getting them as a list, you can iterate over a tag\u2019s children using the .children generator:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "soup.head.children" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "for child in soup.head.children:\n", " print(child)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# print the title of reddit\n", "soup.head.title" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# print the string in the title\n", "soup.head.title.string" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### .descendants\n", "\n", "Attribute lets you iterate over all of a tag\u2019s children, recursively: its direct children, the children of its direct children, and so on:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "for child in soup.head.descendants:\n", " print child" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### .strings\n", "\n", "If there\u2019s more than one thing inside a tag, you can still look at just the strings. Use the .strings generator" ] }, { "cell_type": "code", "collapsed": false, "input": [ "for string in soup.strings:\n", " print(repr(string))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### .stripped_strings\n", "\n", "These strings tend to have a lot of extra whitespace, which you can remove by using the .stripped_strings generator instead" ] }, { "cell_type": "code", "collapsed": false, "input": [ "for string in soup.stripped_strings:\n", " print(repr(string))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### .parent\n", "\n", "You can access an element\u2019s parent with the `.parent` attribute. In the example \u201cthree sisters\u201d document, the `` tag is the parent of the `` tag:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "soup.title" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "soup.title.string" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "soup.title.string.parent" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Searching the Tree\n", "\n", "Now, let's consider examples of different filters you can use to search this nested tree of HTML. These filters show up again and again, throughout the search API. You can use them to filter based on a tag\u2019s name, on its attributes, on the text of a string, or on some combination of these.\n", "\n", "#### Use `find_all()` to find all tags" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One common task is extracting all the URLs found within a page's tags:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# search for all <a> tags; returns a list\n", "soup.find_all('a')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# your turn\n", "# search for all the paragragh tags" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# your turn\n", "# search for all the table tags" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Other arguments to the `.find_all()` function include `limit` and `text`. What do those do? " ] }, { "cell_type": "code", "collapsed": false, "input": [ "# your turn \n", "# search for all the <a> tags and use the limit argument " ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# your turn \n", "# What does the using the text argument do? " ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Use `.get()` to extract an attribute" ] }, { "cell_type": "code", "collapsed": false, "input": [ "soup.find_all('a')[1].get('href')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Looping through tags" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# your turn\n", "# write a for loop printing all the links from reddit" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# your turn\n", "# write a for loop, but use a list comprehension this time\n", "# show the first 5 elements" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# your turn\n", "# split the first url by \"/\"" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another common task is extracting all the text from a page:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print(soup.get_text())" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# JSON\n", "\n", "#### Working with Web APIs\n", "Web APIs are a more convenient way for programs to interact with websites. Many webistes now have a nice API that gives access to it's data in JSON format.\n" ] }, { "cell_type": "code", "collapsed": false, "input": [ "a = {'a': 1, 'b':2}\n", "s = json.dumps(a)\n", "a2 = json.loads(s)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "a # a dictionary" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "s # s is a string containing a in JSON encoding" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "a2 # reading back the keys are now in unicode" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## World Cup in JSON!\n", "\n", "The [2014 FIFA World Cup](http://en.wikipedia.org/wiki/2014_FIFA_World_Cup) was held this summer in Brazil at several different venues. There was an [API created for the World Cup](http://worldcup.sfg.io) that scraped current match results and output match data as JSON. Possible output includes events such as goals, substitutions, and cards. The [actual matches are listed here](http://worldcup.sfg.io/matches) in JSON. \n", "\n", "* Example from [Fernando Masanori](https://gist.github.com/fmasanori/1288160dad16cc473a53)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "url = \"http://worldcup.sfg.io/matches\"\n", "data = urllib2.urlopen(url).read()\n", "wc = json.loads(data.decode('utf-8'))" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 35 }, { "cell_type": "code", "collapsed": false, "input": [ "\"Number of matches in 2014 World Cup: %i\" % len(wc)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 36, "text": [ "'Number of matches in 2014 World Cup: 64'" ] } ], "prompt_number": 36 }, { "cell_type": "code", "collapsed": false, "input": [ "# Print keys in first match\n", "gameIndex = 60\n", "wc[gameIndex].keys()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 37, "text": [ "[u'status',\n", " u'match_number',\n", " u'home_team',\n", " u'away_team',\n", " u'winner_code',\n", " u'winner',\n", " u'away_team_events',\n", " u'datetime',\n", " u'location',\n", " u'home_team_events']" ] } ], "prompt_number": 37 }, { "cell_type": "code", "collapsed": false, "input": [ "wc[gameIndex]['status']" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 38, "text": [ "u'completed'" ] } ], "prompt_number": 38 }, { "cell_type": "code", "collapsed": false, "input": [ "wc[gameIndex]['match_number']" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 39, "text": [ "61" ] } ], "prompt_number": 39 }, { "cell_type": "code", "collapsed": false, "input": [ "wc[gameIndex]['away_team']" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 40, "text": [ "{u'code': u'GER', u'country': u'Germany', u'goals': 7}" ] } ], "prompt_number": 40 }, { "cell_type": "code", "collapsed": false, "input": [ "wc[gameIndex]['away_team_events']" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 41, "text": [ "[{u'id': 1354,\n", " u'player': u'M\\xdcller',\n", " u'time': u'11',\n", " u'type_of_event': u'goal'},\n", " {u'id': 1355, u'player': u'Klose', u'time': u'23', u'type_of_event': u'goal'},\n", " {u'id': 1356, u'player': u'Kroos', u'time': u'24', u'type_of_event': u'goal'},\n", " {u'id': 1357, u'player': u'Kroos', u'time': u'26', u'type_of_event': u'goal'},\n", " {u'id': 1358,\n", " u'player': u'Khedira',\n", " u'time': u'29',\n", " u'type_of_event': u'goal'},\n", " {u'id': 1363,\n", " u'player': u'Hummels',\n", " u'time': u'46',\n", " u'type_of_event': u'substitution-out halftime'},\n", " {u'id': 1364,\n", " u'player': u'Mertesacker',\n", " u'time': u'46',\n", " u'type_of_event': u'substitution-in halftime'},\n", " {u'id': 1365,\n", " u'player': u'Klose',\n", " u'time': u'58',\n", " u'type_of_event': u'substitution-out'},\n", " {u'id': 1366,\n", " u'player': u'Sch\\xdcrrle',\n", " u'time': u'58',\n", " u'type_of_event': u'substitution-in'},\n", " {u'id': 1370,\n", " u'player': u'Sch\\xdcrrle',\n", " u'time': u'69',\n", " u'type_of_event': u'goal'},\n", " {u'id': 1372,\n", " u'player': u'Draxler',\n", " u'time': u'76',\n", " u'type_of_event': u'substitution-in'},\n", " {u'id': 1371,\n", " u'player': u'Khedira',\n", " u'time': u'76',\n", " u'type_of_event': u'substitution-out'},\n", " {u'id': 1373,\n", " u'player': u'Sch\\xdcrrle',\n", " u'time': u'79',\n", " u'type_of_event': u'goal'}]" ] } ], "prompt_number": 41 }, { "cell_type": "code", "collapsed": false, "input": [ "wc[gameIndex]['home_team']" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 42, "text": [ "{u'code': u'BRA', u'country': u'Brazil', u'goals': 1}" ] } ], "prompt_number": 42 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The [Brazil v Germany (2014 FIFA World Cup)](http://en.wikipedia.org/wiki/Brazil_v_Germany_(2014_FIFA_World_Cup) match on July 8, 2014 where Germany score the most goals in World Cup tournament history. Germany led 5\u20130 at half time, with 4 goals scored in a span of 6 minutes, and subsequently brought the score up to 7\u20130 in the second half. Brazil scored a goal at the last minute, ending the match 7\u20131. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Print the team names and goals scored for each match" ] }, { "cell_type": "code", "collapsed": false, "input": [ "for elem in wc:\n", " print elem['home_team']['country'], elem['home_team']['goals'], elem['away_team']['country'], elem['away_team']['goals']" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Brazil 3 Croatia 1\n", "Mexico 1 Cameroon 0\n", "Spain 1 Netherlands 5\n", "Chile 3 Australia 1\n", "Colombia 3 Greece 0\n", "Ivory Coast 2 Japan 1\n", "Uruguay 1 Costa Rica 3\n", "England 1 Italy 2\n", "Switzerland 2 Ecuador 1\n", "France 3 Honduras 0\n", "Argentina 2 Bosnia and Herzegovina 1\n", "Iran 0 Nigeria 0\n", "Germany 4 Portugal 0\n", "Ghana 1 USA 2\n", "Belgium 2 Algeria 1\n", "Russia 1 Korea Republic 1\n", "Brazil 0 Mexico 0\n", "Cameroon 0 Croatia 4\n", "Spain 0 Chile 2\n", "Australia 2 Netherlands 3\n", "Colombia 2 Ivory Coast 1\n", "Japan 0 Greece 0\n", "Uruguay 2 England 1\n", "Italy 0 Costa Rica 1\n", "Switzerland 2 France 5\n", "Honduras 1 Ecuador 2\n", "Argentina 1 Iran 0\n", "Nigeria 1 Bosnia and Herzegovina 0\n", "Germany 2 Ghana 2\n", "USA 2 Portugal 2\n", "Belgium 1 Russia 0\n", "Korea Republic 2 Algeria 4\n", "Cameroon 1 Brazil 4\n", "Croatia 1 Mexico 3\n", "Australia 0 Spain 3\n", "Netherlands 2 Chile 0\n", "Japan 1 Colombia 4\n", "Greece 2 Ivory Coast 1\n", "Italy 0 Uruguay 1\n", "Costa Rica 0 England 0\n", "Honduras 0 Switzerland 3\n", "Ecuador 0 France 0\n", "Nigeria 2 Argentina 3\n", "Bosnia and Herzegovina 3 Iran 1\n", "USA 0 Germany 1\n", "Portugal 2 Ghana 1\n", "Korea Republic 0 Belgium 1\n", "Algeria 1 Russia 1\n", "Brazil 1 Chile 1\n", "Colombia 2 Uruguay 0\n", "Netherlands 2 Mexico 1\n", "Costa Rica 1 Greece 1\n", "France 2 Nigeria 0\n", "Germany 2 Algeria 1\n", "Argentina 1 Switzerland 0\n", "Belgium 2 USA 1\n", "Brazil 2 Colombia 1\n", "France 0 Germany 1\n", "Netherlands 0 Costa Rica 0\n", "Argentina 1 Belgium 0\n", "Brazil 1 Germany 7\n", "Netherlands 0 Argentina 0\n", "Brazil 0 Netherlands 3\n", "Germany 1 Argentina 0\n" ] } ], "prompt_number": 43 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create a pandas DataFrame from JSON" ] }, { "cell_type": "code", "collapsed": false, "input": [ "data = pd.DataFrame(wc, columns = ['match_number', 'location', 'datetime', 'home_team', 'away_team', 'winner', 'home_team_events', 'away_team_events'])\n", "data.head()" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>match_number</th>\n", " <th>location</th>\n", " <th>datetime</th>\n", " <th>home_team</th>\n", " <th>away_team</th>\n", " <th>winner</th>\n", " <th>home_team_events</th>\n", " <th>away_team_events</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td> 1</td>\n", " <td> Arena de Sao Paulo</td>\n", " <td> 2014-06-12T17:00:00.000-03:00</td>\n", " <td> {u'country': u'Brazil', u'code': u'BRA', u'goa...</td>\n", " <td> {u'country': u'Croatia', u'code': u'CRO', u'go...</td>\n", " <td> Brazil</td>\n", " <td> [{u'type_of_event': u'goal-own', u'player': u'...</td>\n", " <td> [{u'type_of_event': u'substitution-in', u'play...</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td> 2</td>\n", " <td> Estadio das Dunas</td>\n", " <td> 2014-06-13T13:00:00.000-03:00</td>\n", " <td> {u'country': u'Mexico', u'code': u'MEX', u'goa...</td>\n", " <td> {u'country': u'Cameroon', u'code': u'CMR', u'g...</td>\n", " <td> Mexico</td>\n", " <td> [{u'type_of_event': u'yellow-card', u'player':...</td>\n", " <td> [{u'type_of_event': u'substitution-in halftime...</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td> 3</td>\n", " <td> Arena Fonte Nova</td>\n", " <td> 2014-06-13T16:00:00.000-03:00</td>\n", " <td> {u'country': u'Spain', u'code': u'ESP', u'goal...</td>\n", " <td> {u'country': u'Netherlands', u'code': u'NED', ...</td>\n", " <td> Netherlands</td>\n", " <td> [{u'type_of_event': u'goal-penalty', u'player'...</td>\n", " <td> [{u'type_of_event': u'yellow-card', u'player':...</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td> 4</td>\n", " <td> Arena Pantanal</td>\n", " <td> 2014-06-13T19:00:00.000-03:00</td>\n", " <td> {u'country': u'Chile', u'code': u'CHI', u'goal...</td>\n", " <td> {u'country': u'Australia', u'code': u'AUS', u'...</td>\n", " <td> Chile</td>\n", " <td> [{u'type_of_event': u'goal', u'player': u'Alex...</td>\n", " <td> [{u'type_of_event': u'goal', u'player': u'Cahi...</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td> 5</td>\n", " <td> Estadio Mineirao</td>\n", " <td> 2014-06-14T13:00:00.000-03:00</td>\n", " <td> {u'country': u'Colombia', u'code': u'COL', u'g...</td>\n", " <td> {u'country': u'Greece', u'code': u'GRE', u'goa...</td>\n", " <td> Colombia</td>\n", " <td> [{u'type_of_event': u'goal', u'player': u'P. A...</td>\n", " <td> [{u'type_of_event': u'yellow-card', u'player':...</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "metadata": {}, "output_type": "pyout", "prompt_number": 44, "text": [ " match_number location datetime \\\n", "0 1 Arena de Sao Paulo 2014-06-12T17:00:00.000-03:00 \n", "1 2 Estadio das Dunas 2014-06-13T13:00:00.000-03:00 \n", "2 3 Arena Fonte Nova 2014-06-13T16:00:00.000-03:00 \n", "3 4 Arena Pantanal 2014-06-13T19:00:00.000-03:00 \n", "4 5 Estadio Mineirao 2014-06-14T13:00:00.000-03:00 \n", "\n", " home_team \\\n", "0 {u'country': u'Brazil', u'code': u'BRA', u'goa... \n", "1 {u'country': u'Mexico', u'code': u'MEX', u'goa... \n", "2 {u'country': u'Spain', u'code': u'ESP', u'goal... \n", "3 {u'country': u'Chile', u'code': u'CHI', u'goal... \n", "4 {u'country': u'Colombia', u'code': u'COL', u'g... \n", "\n", " away_team winner \\\n", "0 {u'country': u'Croatia', u'code': u'CRO', u'go... Brazil \n", "1 {u'country': u'Cameroon', u'code': u'CMR', u'g... Mexico \n", "2 {u'country': u'Netherlands', u'code': u'NED', ... Netherlands \n", "3 {u'country': u'Australia', u'code': u'AUS', u'... Chile \n", "4 {u'country': u'Greece', u'code': u'GRE', u'goa... Colombia \n", "\n", " home_team_events \\\n", "0 [{u'type_of_event': u'goal-own', u'player': u'... \n", "1 [{u'type_of_event': u'yellow-card', u'player':... \n", "2 [{u'type_of_event': u'goal-penalty', u'player'... \n", "3 [{u'type_of_event': u'goal', u'player': u'Alex... \n", "4 [{u'type_of_event': u'goal', u'player': u'P. A... \n", "\n", " away_team_events \n", "0 [{u'type_of_event': u'substitution-in', u'play... \n", "1 [{u'type_of_event': u'substitution-in halftime... \n", "2 [{u'type_of_event': u'yellow-card', u'player':... \n", "3 [{u'type_of_event': u'goal', u'player': u'Cahi... \n", "4 [{u'type_of_event': u'yellow-card', u'player':... " ] } ], "prompt_number": 44 }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Convert format of a column\n", "\n", "Here we pandas [DatetimeIndex](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DatetimeIndex.html) to convert the `datetime` column to two seperate columns: a date and a time for each match." ] }, { "cell_type": "code", "collapsed": false, "input": [ "data['gameDate'] = pd.DatetimeIndex(data.datetime).date\n", "data['gameTime'] = pd.DatetimeIndex(data.datetime).time" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 45 }, { "cell_type": "code", "collapsed": false, "input": [ "data.head()" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>match_number</th>\n", " <th>location</th>\n", " <th>datetime</th>\n", " <th>home_team</th>\n", " <th>away_team</th>\n", " <th>winner</th>\n", " <th>home_team_events</th>\n", " <th>away_team_events</th>\n", " <th>gameDate</th>\n", " <th>gameTime</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td> 1</td>\n", " <td> Arena de Sao Paulo</td>\n", " <td> 2014-06-12T17:00:00.000-03:00</td>\n", " <td> {u'country': u'Brazil', u'code': u'BRA', u'goa...</td>\n", " <td> {u'country': u'Croatia', u'code': u'CRO', u'go...</td>\n", " <td> Brazil</td>\n", " <td> [{u'type_of_event': u'goal-own', u'player': u'...</td>\n", " <td> [{u'type_of_event': u'substitution-in', u'play...</td>\n", " <td> 2014-06-12</td>\n", " <td> 20:00:00</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td> 2</td>\n", " <td> Estadio das Dunas</td>\n", " <td> 2014-06-13T13:00:00.000-03:00</td>\n", " <td> {u'country': u'Mexico', u'code': u'MEX', u'goa...</td>\n", " <td> {u'country': u'Cameroon', u'code': u'CMR', u'g...</td>\n", " <td> Mexico</td>\n", " <td> [{u'type_of_event': u'yellow-card', u'player':...</td>\n", " <td> [{u'type_of_event': u'substitution-in halftime...</td>\n", " <td> 2014-06-13</td>\n", " <td> 16:00:00</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td> 3</td>\n", " <td> Arena Fonte Nova</td>\n", " <td> 2014-06-13T16:00:00.000-03:00</td>\n", " <td> {u'country': u'Spain', u'code': u'ESP', u'goal...</td>\n", " <td> {u'country': u'Netherlands', u'code': u'NED', ...</td>\n", " <td> Netherlands</td>\n", " <td> [{u'type_of_event': u'goal-penalty', u'player'...</td>\n", " <td> [{u'type_of_event': u'yellow-card', u'player':...</td>\n", " <td> 2014-06-13</td>\n", " <td> 19:00:00</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td> 4</td>\n", " <td> Arena Pantanal</td>\n", " <td> 2014-06-13T19:00:00.000-03:00</td>\n", " <td> {u'country': u'Chile', u'code': u'CHI', u'goal...</td>\n", " <td> {u'country': u'Australia', u'code': u'AUS', u'...</td>\n", " <td> Chile</td>\n", " <td> [{u'type_of_event': u'goal', u'player': u'Alex...</td>\n", " <td> [{u'type_of_event': u'goal', u'player': u'Cahi...</td>\n", " <td> 2014-06-13</td>\n", " <td> 22:00:00</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td> 5</td>\n", " <td> Estadio Mineirao</td>\n", " <td> 2014-06-14T13:00:00.000-03:00</td>\n", " <td> {u'country': u'Colombia', u'code': u'COL', u'g...</td>\n", " <td> {u'country': u'Greece', u'code': u'GRE', u'goa...</td>\n", " <td> Colombia</td>\n", " <td> [{u'type_of_event': u'goal', u'player': u'P. A...</td>\n", " <td> [{u'type_of_event': u'yellow-card', u'player':...</td>\n", " <td> 2014-06-14</td>\n", " <td> 16:00:00</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "metadata": {}, "output_type": "pyout", "prompt_number": 46, "text": [ " match_number location datetime \\\n", "0 1 Arena de Sao Paulo 2014-06-12T17:00:00.000-03:00 \n", "1 2 Estadio das Dunas 2014-06-13T13:00:00.000-03:00 \n", "2 3 Arena Fonte Nova 2014-06-13T16:00:00.000-03:00 \n", "3 4 Arena Pantanal 2014-06-13T19:00:00.000-03:00 \n", "4 5 Estadio Mineirao 2014-06-14T13:00:00.000-03:00 \n", "\n", " home_team \\\n", "0 {u'country': u'Brazil', u'code': u'BRA', u'goa... \n", "1 {u'country': u'Mexico', u'code': u'MEX', u'goa... \n", "2 {u'country': u'Spain', u'code': u'ESP', u'goal... \n", "3 {u'country': u'Chile', u'code': u'CHI', u'goal... \n", "4 {u'country': u'Colombia', u'code': u'COL', u'g... \n", "\n", " away_team winner \\\n", "0 {u'country': u'Croatia', u'code': u'CRO', u'go... Brazil \n", "1 {u'country': u'Cameroon', u'code': u'CMR', u'g... Mexico \n", "2 {u'country': u'Netherlands', u'code': u'NED', ... Netherlands \n", "3 {u'country': u'Australia', u'code': u'AUS', u'... Chile \n", "4 {u'country': u'Greece', u'code': u'GRE', u'goa... Colombia \n", "\n", " home_team_events \\\n", "0 [{u'type_of_event': u'goal-own', u'player': u'... \n", "1 [{u'type_of_event': u'yellow-card', u'player':... \n", "2 [{u'type_of_event': u'goal-penalty', u'player'... \n", "3 [{u'type_of_event': u'goal', u'player': u'Alex... \n", "4 [{u'type_of_event': u'goal', u'player': u'P. A... \n", "\n", " away_team_events gameDate gameTime \n", "0 [{u'type_of_event': u'substitution-in', u'play... 2014-06-12 20:00:00 \n", "1 [{u'type_of_event': u'substitution-in halftime... 2014-06-13 16:00:00 \n", "2 [{u'type_of_event': u'yellow-card', u'player':... 2014-06-13 19:00:00 \n", "3 [{u'type_of_event': u'goal', u'player': u'Cahi... 2014-06-13 22:00:00 \n", "4 [{u'type_of_event': u'yellow-card', u'player':... 2014-06-14 16:00:00 " ] } ], "prompt_number": 46 }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 46 } ], "metadata": {} } ] }