{ "metadata": { "name": "", "signature": "sha256:ff8679825ca5378d17b005803647000fd6aae74bab00fe8322dde6bf52c31947" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "How to get full play-by-play data for the WC2014" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Good football data is hard to come by. Basic stat counts are easily available, but full play data (i.e. a play broken down in its individual components: interceptions and tackles, runs, passes and shots, etc.) is very rare. And that's the most important unit in a team sport like football. So imagine my surprise and great joy when I came across a fantastic dataset of full play-by-play data for all World Cup matches.\n", "\n", "After spending some time in the wonderful world of web scraping, one becomes aware of hints that something worthwhile is going on. Whenever I see a pretty interactive chart on a web-page like the great [Huff Post Data's World Cup page](http://data.huffingtonpost.com/2014/world-cup), my spider sense starts tingling.\n", "\n", "The first thing to do is to make sure that the site is using only html5 and js. Check.\n", "\n", "OK, so how is the website sending the data to the browser? Developer tools in Chrome is your friend: network tab, filter \"json\".\n", "\n", "*Bingo.*\n", "\n", "The website was sending the full dataset to the browser." ] }, { "cell_type": "code", "collapsed": false, "input": [ "#imports\n", "import requests\n", "import json\n", "import mechanize\n", "from bs4 import BeautifulSoup\n", "import time\n", "\n", "#initializes the browser\n", "br = mechanize.Browser()\n", "br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=10)\n", "br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]\n" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 2 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looking at a match page, you see that all links are listed in a handy menu. Looking at the html code, you can see that they are inside a span tag with class set to \"matchup\".\n", "\n", "So the following code gets you all the links:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "starting_link = 'http://data.huffingtonpost.com/2014/world-cup/matches/belgium-vs-usa-731822'\n", "response = mechanize.urlopen(starting_link)\n", "soup = BeautifulSoup(response)\n", "\n", "links_html = soup.find_all(\"span\", class_=\"matchup\")\n", "\n", "links = []\n", "\n", "for link_html in links_html:\n", " a = link_html.find_all('a')\n", " for l in a:\n", " link = l.get('href')\n", " link = link.split('/')[-1]\n", "\n", " links.append(link)\n", " \n", "links[:5]" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 9, "text": [ "['brazil-vs-chile-731815',\n", " 'colombia-vs-uruguay-731816',\n", " 'netherlands-vs-mexico-731817',\n", " 'costa-rica-vs-greece-731818',\n", " 'france-vs-nigeria-731819']" ] } ], "prompt_number": 9 }, { "cell_type": "markdown", "metadata": {}, "source": [ "With this information we can get all the match data trough a simple request, which gives back easily readable json." ] }, { "cell_type": "code", "collapsed": false, "input": [ "def get_match_data(match):\n", " match_id = match.split('-')[-1]\n", " response = mechanize.urlopen('http://data.huffingtonpost.com/2014/world-cup/matches/%s.json' % match_id)\n", "\n", " match_data = json.loads(response.read())\n", " \n", " return match_data\n", "\n", "match_data = get_match_data(links[0]) # test\n", "\n", "match_data.keys()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 10, "text": [ "[u'team_stats', u'events', u'summary']" ] } ], "prompt_number": 10 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Unfortunately, the data includes IDs only. The page has names, though, so there must be some conversion taking place. At this point, I was scared that I to look through all script files and javascript code to see where the conversion took place.\n", "\n", "However, the first (and obvious) step was enough: simply searching for a player's name in the main page source showed that variables HPIN.teams and HPIN.players contained the names and IDs, plus a bunch of other information (like position, birth date and even preferred foot). The script tag that defined the variables has no class or id, so we could only identify it by its position." ] }, { "cell_type": "code", "collapsed": false, "input": [ "def get_match_names(match):\n", " response = mechanize.urlopen('http://data.huffingtonpost.com/2014/world-cup/matches/%s' % links[0]) #example page\n", " soup = BeautifulSoup(response)\n", "\n", " data = {}\n", "\n", " data_script = soup.findAll(\"script\")[1] #gets the second script block. Hopefully all pages follow the same format\n", " data_lines = data_script.text.split('\\n')\n", "\n", " for line in data_lines[1:]:\n", " try:\n", " #format of a variable is HPIN.variable = [list of dictionaries]\n", " #this tries to convert it to \n", " line_data = line.split(' = ')\n", " name = line_data[0].split('.')[1]\n", " value = json.loads(line_data[1][:-1])\n", " data[name] = value\n", " except:\n", " print \"error parsing string: \", line #should only occur on blank lines - yeah, I know, lazy exception handling...\n", " \n", " return data\n", "\n", "names = get_match_names(links[0])\n", "names.keys()" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "error parsing string: \n" ] }, { "metadata": {}, "output_type": "pyout", "prompt_number": 11, "text": [ "[u'statCategories',\n", " u'awayTeam',\n", " u'callbackPath',\n", " u'homeTeam',\n", " u'teams',\n", " u'players',\n", " u'imageCallbackPath',\n", " u'imageCallbackInterval',\n", " u'twitterUrl']" ] } ], "prompt_number": 11 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alright, so now we have all the match links, a function that returns the events and stats from each match, and a function that returns the players and team names. Let's put it all together. First, create a dictionary:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "data = {}" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then, execute a loop that will get the data from all the matches and add it to the dictionary. The `if` statement ensures you don't have to reprocess a match in the case you have to run the cell again (e.g. due a network error)." ] }, { "cell_type": "code", "collapsed": true, "input": [ "for match in links:\n", " if match not in data:\n", " print match\n", " time.sleep(60)\n", "\n", " match_data = get_match_data(match)\n", " match_names = get_match_names(match)\n", " data[match] = {'data': match_data, 'names': match_names}\n", "\n", " print match, \" done\" \n", " else:\n", " print match, \" already processed\"\n" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "brazil-vs-chile-731815\n", "error parsing string: " ] }, { "output_type": "stream", "stream": "stdout", "text": [ " \n", "brazil-vs-chile-731815 done\n", "colombia-vs-uruguay-731816\n", "error parsing string: " ] }, { "output_type": "stream", "stream": "stdout", "text": [ " \n", "colombia-vs-uruguay-731816 done\n", "netherlands-vs-mexico-731817\n", "error parsing string: " ] }, { "output_type": "stream", "stream": "stdout", "text": [ " \n", "netherlands-vs-mexico-731817 done\n", "costa-rica-vs-greece-731818\n", "error parsing string: " ] }, { "output_type": "stream", "stream": "stdout", "text": [ " \n", "costa-rica-vs-greece-731818 done\n", "france-vs-nigeria-731819\n", "error parsing string: " ] }, { "output_type": "stream", "stream": "stdout", "text": [ " \n", "france-vs-nigeria-731819 done\n", "germany-vs-algeria-731820\n", "error parsing string: " ] }, { "output_type": "stream", "stream": "stdout", "text": [ " \n", "germany-vs-algeria-731820 done\n", "argentina-vs-switzerland-731821\n", "error parsing string: " ] }, { "output_type": "stream", "stream": "stdout", "text": [ " \n", "argentina-vs-switzerland-731821 done\n", "belgium-vs-usa-731822\n", "error parsing string: " ] }, { "output_type": "stream", "stream": "stdout", "text": [ " \n", "belgium-vs-usa-731822 done\n", "france-vs-germany-731824\n", "error parsing string: " ] }, { "output_type": "stream", "stream": "stdout", "text": [ " \n", "france-vs-germany-731824 done\n", "brazil-vs-colombia-731823\n", "error parsing string: " ] }, { "output_type": "stream", "stream": "stdout", "text": [ " \n", "brazil-vs-colombia-731823 done\n", "argentina-vs-belgium-731826\n", "error parsing string: " ] }, { "output_type": "stream", "stream": "stdout", "text": [ " \n", "argentina-vs-belgium-731826 done\n", "netherlands-vs-costa-rica-731825\n", "error parsing string: " ] }, { "output_type": "stream", "stream": "stdout", "text": [ " \n", "netherlands-vs-costa-rica-731825 done\n", "brazil-vs-germany-731827\n", "error parsing string: " ] }, { "output_type": "stream", "stream": "stdout", "text": [ " \n", "brazil-vs-germany-731827 done\n", "netherlands-vs-argentina-731828\n", "error parsing string: " ] }, { "output_type": "stream", "stream": "stdout", "text": [ " \n", "netherlands-vs-argentina-731828 done\n", "brazil-vs-netherlands-731829\n", "error parsing string: " ] }, { "output_type": "stream", "stream": "stdout", "text": [ " \n", "brazil-vs-netherlands-731829 done\n", "germany-vs-argentina-731830\n", "error parsing string: " ] }, { "output_type": "stream", "stream": "stdout", "text": [ " \n", "germany-vs-argentina-731830 done\n", "brazil-vs-croatia-731767\n", "error parsing string: " ] }, { "output_type": "stream", "stream": "stdout", "text": [ " \n", "brazil-vs-croatia-731767 done\n", "mexico-vs-cameroon-731768\n", "error parsing string: " ] }, { "output_type": "stream", "stream": "stdout", "text": [ " \n", "mexico-vs-cameroon-731768 done\n", "brazil-vs-mexico-731783\n", "error parsing string: " ] }, { "output_type": "stream", "stream": "stdout", "text": [ " \n", "brazil-vs-mexico-731783 done\n", "cameroon-vs-croatia-731784\n", "error parsing string: " ] }, { "output_type": "stream", "stream": "stdout", "text": [ " \n", "cameroon-vs-croatia-731784 done\n", "croatia-vs-mexico-731800\n", "error parsing string: " ] }, { "output_type": "stream", "stream": "stdout", "text": [ " \n", "croatia-vs-mexico-731800 done\n", "cameroon-vs-brazil-731799\n", "error parsing string: " ] }, { "output_type": "stream", "stream": "stdout", "text": [ " \n", "cameroon-vs-brazil-731799 done\n", "spain-vs-netherlands-731769\n", "error parsing string: " ] }, { "output_type": "stream", "stream": "stdout", "text": [ " \n", "spain-vs-netherlands-731769 done\n", "chile-vs-australia-731770\n", "error parsing string: " ] }, { "output_type": "stream", "stream": "stdout", "text": [ " \n", "chile-vs-australia-731770 done\n", "australia-vs-netherlands-731786\n", "error parsing string: " ] }, { "output_type": "stream", "stream": "stdout", "text": [ " \n", "australia-vs-netherlands-731786 done\n", "spain-vs-chile-731785\n", "error parsing string: " ] }, { "output_type": "stream", "stream": "stdout", "text": [ " \n", "spain-vs-chile-731785 done\n", "netherlands-vs-chile-731802\n", "error parsing string: " ] }, { "output_type": "stream", "stream": "stdout", "text": [ " \n", "netherlands-vs-chile-731802 done\n", "australia-vs-spain-731801\n", "error parsing string: " ] }, { "output_type": "stream", "stream": "stdout", "text": [ " \n", "australia-vs-spain-731801 done\n", "colombia-vs-greece-731771\n", "error parsing string: " ] }, { "output_type": "stream", "stream": "stdout", "text": [ " \n", "colombia-vs-greece-731771 done\n", "ivory-coast-vs-japan-731772\n", "error parsing string: " ] }, { "output_type": "stream", "stream": "stdout", "text": [ " \n", "ivory-coast-vs-japan-731772 done\n", "colombia-vs-ivory-coast-731787\n", "error parsing string: " ] }, { "output_type": "stream", "stream": "stdout", "text": [ " \n", "colombia-vs-ivory-coast-731787 done\n", "japan-vs-greece-731788\n", "error parsing string: " ] }, { "output_type": "stream", "stream": "stdout", "text": [ " \n", "japan-vs-greece-731788 done\n", "japan-vs-colombia-731803\n", "error parsing string: " ] }, { "output_type": "stream", "stream": "stdout", "text": [ " \n", "japan-vs-colombia-731803 done\n", "greece-vs-ivory-coast-731804\n", "error parsing string: " ] }, { "output_type": "stream", "stream": "stdout", "text": [ " \n", "greece-vs-ivory-coast-731804 done\n", "uruguay-vs-costa-rica-731773\n", "error parsing string: " ] }, { "output_type": "stream", "stream": "stdout", "text": [ " \n", "uruguay-vs-costa-rica-731773 done\n", "england-vs-italy-731774\n", "error parsing string: " ] }, { "output_type": "stream", "stream": "stdout", "text": [ " \n", "england-vs-italy-731774 done\n", "uruguay-vs-england-731789\n", "error parsing string: " ] }, { "output_type": "stream", "stream": "stdout", "text": [ " \n", "uruguay-vs-england-731789 done\n", "italy-vs-costa-rica-731790\n", "error parsing string: " ] }, { "output_type": "stream", "stream": "stdout", "text": [ " \n", "italy-vs-costa-rica-731790 done\n", "italy-vs-uruguay-731805\n", "error parsing string: " ] }, { "output_type": "stream", "stream": "stdout", "text": [ " \n", "italy-vs-uruguay-731805 done\n", "costa-rica-vs-england-731806\n", "error parsing string: " ] }, { "output_type": "stream", "stream": "stdout", "text": [ " \n", "costa-rica-vs-england-731806 done\n", "switzerland-vs-ecuador-731775\n", "error parsing string: " ] }, { "output_type": "stream", "stream": "stdout", "text": [ " \n", "switzerland-vs-ecuador-731775 done\n", "france-vs-honduras-731776\n", "error parsing string: " ] }, { "output_type": "stream", "stream": "stdout", "text": [ " \n", "france-vs-honduras-731776 done\n", "switzerland-vs-france-731791\n", "error parsing string: " ] }, { "output_type": "stream", "stream": "stdout", "text": [ " \n", "switzerland-vs-france-731791 done\n", "honduras-vs-ecuador-731792\n", "error parsing string: " ] }, { "output_type": "stream", "stream": "stdout", "text": [ " \n", "honduras-vs-ecuador-731792 done\n", "ecuador-vs-france-731808\n", "error parsing string: " ] }, { "output_type": "stream", "stream": "stdout", "text": [ " \n", "ecuador-vs-france-731808 done\n", "honduras-vs-switzerland-731807\n", "error parsing string: " ] }, { "output_type": "stream", "stream": "stdout", "text": [ " \n", "honduras-vs-switzerland-731807 done\n", "argentina-vs-bosnia-herz-731777\n", "error parsing string: " ] }, { "output_type": "stream", "stream": "stdout", "text": [ " \n", "argentina-vs-bosnia-herz-731777 done\n", "iran-vs-nigeria-731778\n", "error parsing string: " ] }, { "output_type": "stream", "stream": "stdout", "text": [ " \n", "iran-vs-nigeria-731778 done\n", "argentina-vs-iran-731793\n", "error parsing string: " ] }, { "output_type": "stream", "stream": "stdout", "text": [ " \n", "argentina-vs-iran-731793 done\n", "nigeria-vs-bosnia-herz-731794\n", "error parsing string: " ] }, { "output_type": "stream", "stream": "stdout", "text": [ " \n", "nigeria-vs-bosnia-herz-731794 done\n", "nigeria-vs-argentina-731809\n", "error parsing string: " ] }, { "output_type": "stream", "stream": "stdout", "text": [ " \n", "nigeria-vs-argentina-731809 done\n", "bosnia-herz-vs-iran-731810\n", "error parsing string: " ] }, { "output_type": "stream", "stream": "stdout", "text": [ " \n", "bosnia-herz-vs-iran-731810 done\n", "germany-vs-portugal-731779\n", "error parsing string: " ] }, { "output_type": "stream", "stream": "stdout", "text": [ " \n", "germany-vs-portugal-731779 done\n", "ghana-vs-usa-731780\n", "error parsing string: " ] }, { "output_type": "stream", "stream": "stdout", "text": [ " \n", "ghana-vs-usa-731780 done\n", "germany-vs-ghana-731795\n", "error parsing string: " ] }, { "output_type": "stream", "stream": "stdout", "text": [ " \n", "germany-vs-ghana-731795 done\n", "usa-vs-portugal-731796\n", "error parsing string: " ] }, { "output_type": "stream", "stream": "stdout", "text": [ " \n", "usa-vs-portugal-731796 done\n", "portugal-vs-ghana-731812\n", "error parsing string: " ] }, { "output_type": "stream", "stream": "stdout", "text": [ " \n", "portugal-vs-ghana-731812 done\n", "usa-vs-germany-731811\n", "error parsing string: " ] }, { "output_type": "stream", "stream": "stdout", "text": [ " \n", "usa-vs-germany-731811 done\n", "belgium-vs-algeria-731781\n", "error parsing string: " ] }, { "output_type": "stream", "stream": "stdout", "text": [ " \n", "belgium-vs-algeria-731781 done\n", "russia-vs-south-korea-731782\n", "error parsing string: " ] }, { "output_type": "stream", "stream": "stdout", "text": [ " \n", "russia-vs-south-korea-731782 done\n", "belgium-vs-russia-731797\n", "error parsing string: " ] }, { "output_type": "stream", "stream": "stdout", "text": [ " \n", "belgium-vs-russia-731797 done\n", "south-korea-vs-algeria-731798\n", "error parsing string: " ] }, { "output_type": "stream", "stream": "stdout", "text": [ " \n", "south-korea-vs-algeria-731798 done\n", "algeria-vs-russia-731814\n", "error parsing string: " ] }, { "output_type": "stream", "stream": "stdout", "text": [ " \n", "algeria-vs-russia-731814 done\n", "south-korea-vs-belgium-731813\n", "error parsing string: " ] }, { "output_type": "stream", "stream": "stdout", "text": [ " \n", "south-korea-vs-belgium-731813 done\n", "brazil-vs-chile-731815 already processed\n", "colombia-vs-uruguay-731816 already processed\n", "netherlands-vs-mexico-731817 already processed\n", "costa-rica-vs-greece-731818 already processed\n", "france-vs-nigeria-731819 already processed\n", "germany-vs-algeria-731820 already processed\n", "argentina-vs-switzerland-731821 already processed\n", "belgium-vs-usa-731822 already processed\n", "france-vs-germany-731824 already processed\n", "brazil-vs-colombia-731823 already processed\n", "argentina-vs-belgium-731826 already processed\n", "netherlands-vs-costa-rica-731825 already processed\n", "brazil-vs-germany-731827 already processed\n", "netherlands-vs-argentina-731828 already processed\n", "brazil-vs-netherlands-731829 already processed\n", "germany-vs-argentina-731830 already processed\n", "brazil-vs-croatia-731767 already processed\n", "mexico-vs-cameroon-731768 already processed\n", "brazil-vs-mexico-731783 already processed\n", "cameroon-vs-croatia-731784 already processed\n", "croatia-vs-mexico-731800 already processed\n", "cameroon-vs-brazil-731799 already processed\n", "spain-vs-netherlands-731769 already processed\n", "chile-vs-australia-731770 already processed\n", "australia-vs-netherlands-731786 already processed\n", "spain-vs-chile-731785 already processed\n", "netherlands-vs-chile-731802 already processed\n", "australia-vs-spain-731801 already processed\n", "colombia-vs-greece-731771 already processed\n", "ivory-coast-vs-japan-731772 already processed\n", "colombia-vs-ivory-coast-731787 already processed\n", "japan-vs-greece-731788 already processed\n", "japan-vs-colombia-731803 already processed\n", "greece-vs-ivory-coast-731804 already processed\n", "uruguay-vs-costa-rica-731773 already processed\n", "england-vs-italy-731774 already processed\n", "uruguay-vs-england-731789 already processed\n", "italy-vs-costa-rica-731790 already processed\n", "italy-vs-uruguay-731805 already processed\n", "costa-rica-vs-england-731806 already processed\n", "switzerland-vs-ecuador-731775 already processed\n", "france-vs-honduras-731776 already processed\n", "switzerland-vs-france-731791 already processed\n", "honduras-vs-ecuador-731792 already processed\n", "ecuador-vs-france-731808 already processed\n", "honduras-vs-switzerland-731807 already processed\n", "argentina-vs-bosnia-herz-731777 already processed\n", "iran-vs-nigeria-731778 already processed\n", "argentina-vs-iran-731793 already processed\n", "nigeria-vs-bosnia-herz-731794 already processed\n", "nigeria-vs-argentina-731809 already processed\n", "bosnia-herz-vs-iran-731810 already processed\n", "germany-vs-portugal-731779 already processed\n", "ghana-vs-usa-731780 already processed\n", "germany-vs-ghana-731795 already processed\n", "usa-vs-portugal-731796 already processed\n", "portugal-vs-ghana-731812 already processed\n", "usa-vs-germany-731811 already processed\n", "belgium-vs-algeria-731781 already processed\n", "russia-vs-south-korea-731782 already processed\n", "belgium-vs-russia-731797 already processed\n", "south-korea-vs-algeria-731798 already processed\n", "algeria-vs-russia-731814 already processed\n", "south-korea-vs-belgium-731813 already processed\n" ] } ], "prompt_number": 24 }, { "cell_type": "code", "collapsed": false, "input": [ "print len(data.keys()) #make sure you have all 64 games" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "64\n" ] } ], "prompt_number": 26 }, { "cell_type": "code", "collapsed": false, "input": [ "import pickle\n", "\n", "pickle.dump(data, open( \"wc2014.p\", \"wb\"))" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 29 }, { "cell_type": "code", "collapsed": false, "input": [ "data == pickle.load(open(\"wc2014.p\", \"rb\")) #because I'm a bit OCD and want to make sure the data was properly stored" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 32, "text": [ "True" ] } ], "prompt_number": 32 }, { "cell_type": "markdown", "metadata": {}, "source": [ "##That's all folks!\n", "\n", "The boring part is over. Now, it's time to play :)\n", "\n", "Check out my [WC final analysis notebook](http://nbviewer.ipython.org/github/rjtavares/football-crunching/blob/master/notebooks/an%20exploratory%20data%20analysis%20of%20the%20world%20cup%20final.ipynb) for an example of what you can do with the data, and follow my github repository [Football Crunching](https://github.com/rjtavares/football-crunching) for more analysis in the future." ] } ], "metadata": {} } ] }