{ "metadata": { "name": "", "signature": "sha256:a18be62d14ce934a49f181450958d112885d7f28b7da332afbb367d49f880989" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Wikipedia edit stream demo\n", "\n", "A demo of [Snake Charmer](https://github.com/snake-charmer-devs/snake-charmer), originally presented at a [PyData London meetup](http://www.meetup.com/PyData-London-Meetup/) in August 2014, showing:\n", "\n", "* Reading data from a [WebSockets](https://www.websocket.org/) stream in real time, in a background thread\n", "* Pulling this data into [Pandas](http://pandas.pydata.org/) for analysis and visualization\n", "* Training a [Vowpal Wabbit](https://github.com/JohnLangford/vowpal_wabbit) language classifier on this stream, in real time\n", "* Visualizing the classifier's behaviour using [Matplotlib](http://matplotlib.org/) and [D3](http://d3js.org/)\n", "\n", "The demo is designed to be run using Snake Charmer [release a18945d](https://github.com/snake-charmer-devs/snake-charmer/tree/b86184ecd92cd785b443163f6ecef4366548f591) or later.\n", "\n", "Wikipedia streaming is powered by the `wikipedia_updates` function, which is based on the [stream.py](https://github.com/edsu/wikistream/blob/master/stream.py) script provided with [edsu/wikistream](https://github.com/edsu/wikistream)." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "import re\n", "import json\n", "import time\n", "import threading\n", "\n", "from collections import defaultdict\n", "from itertools import chain\n", "from requests import post, get\n", "from math import sqrt\n", "\n", "from wabbit_wappa import VW\n", "\n", "import numpy as np\n", "\n", "import pandas as pd\n", "pd.options.display.mpl_style = 'default'\n", "\n", "import seaborn as sb\n", "import matplotlib.pylab as pylab\n", "import matplotlib.pyplot as plt\n", "pylab.rcParams['figure.figsize'] = (12.0, 8.0)\n", "pylab.rcParams['font.family'] = 'Bitstream Vera Sans'" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 36 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## wikipedia_updates\n", "\n", "This function talks to the `wikistream` server, which provides push notifications of Wikipedia edits via [Socket.io](http://socket.io/).\n", "\n", "Call it with a callback function of your own devising, that takes a single argument. Every time a Wikipedia edit occurs, your callback will be called, with a dict containing the details of the edit, e.g.:\n", "\n", "```\n", "{'flag': '',\n", " 'namespace': 'article',\n", " 'userUrl': 'http://en.wikipedia.org/wiki/User:137.44.83.167',\n", " 'url': 'http://en.wikipedia.org/w/index.php?diff=619400522&oldid=618259633',\n", " 'wikipediaLong': 'English Wikipedia',\n", " 'wikipediaShort': 'en',\n", " 'user': '137.44.83.167',\n", " 'comment': '',\n", " 'newPage': False,\n", " 'pageUrl': 'http://en.wikipedia.org/wiki/Battle_of_Trafalgar',\n", " 'unpatrolled': False,\n", " 'robot': False,\n", " 'anonymous': True,\n", " 'delta': 2,\n", " 'channel': '#en.wikipedia',\n", " 'wikipediaUrl': 'http://en.wikipedia.org',\n", " 'wikipedia': 'English Wikipedia',\n", " 'page': 'Battle of Trafalgar'}\n", "```\n", "\n", "Your callback should return `True` if it wants to keep receiving edits, or `False` to stop." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "def wikipedia_updates(callback):\n", "    # Handshake: the session ID is the first colon-separated field of the response\n", "    endpoint = \"http://wikistream.inkdroid.org/socket.io/1\"\n", "    session_id = post(endpoint).content.decode('utf-8').split(':')[0]\n", "    xhr_endpoint = \"/\".join((endpoint, \"xhr-polling\", session_id))\n", "\n", "    while True:\n", "        # Poll for the next batch of messages, passing the current time as 't'\n", "        t = time.time() * 1000000\n", "        response = get(xhr_endpoint, params={'t': t}).content.decode('utf-8')\n", "\n", "        # Split the response into individual messages on the U+FFFD-delimited length markers\n", "        chunks = re.split(u'\\ufffd[0-9]+\\ufffd', response)\n", "        for chunk in chunks:\n", "            parts = chunk.split(':', 3)\n", "            if len(parts) == 4:\n", "                try:\n", "                    payload = json.loads(parts[3])['args'][0]\n", "                except Exception:\n", "                    raise ValueError('Received non-json data: ' + chunk)\n", "                # Stop streaming as soon as the callback returns a falsy value\n", "                if not callback(payload):\n", "                    return" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 37 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Watching stream activity in real time\n", "\n", "The callback below collects edits into a list called `edits` for later offline analysis, and stops the stream once `max_edits` edits have been received.\n", "\n", "It also adds a `timestamp` field to each edit, since the stream does not provide one.\n", "\n", "Run this cell to start gathering the data in the background.\n", "\n", "### Note\n", "\n", "If at any point you get an error message about non-json data, wait a few seconds, then restart the IPython kernel and retry from the first cell. This seems to be an intermittent problem with the server, or possibly with the requests library." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "max_edits = 500\n", "edits = []\n", "\n", "def simple_callback(edit):\n", "    # Timestamp each edit on arrival, store it, and stop once max_edits is reached\n", "    edit['timestamp'] = time.time()\n", "    edits.append(edit)\n", "    return len(edits) < max_edits\n", "\n", "# Read the stream in a background thread so the notebook stays responsive\n", "thr = threading.Thread(target=wikipedia_updates, args=(simple_callback,))\n", "thr.start()" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 38 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can watch the data as it arrives. Run the following cell repeatedly until it tells you it has finished." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Capture the page name: everything after 'wiki/' in the page URL\n", "reg = re.compile(r'^.*wiki/(.*)')\n", "\n", "c = len(edits)\n", "n = min(5, c)\n", "print('Last %d of %d edits...\\n' % (n, c))\n", "for edit in edits[-n:]:\n", "    wiki_name = edit['wikipediaLong']\n", "    page_name = reg.match(edit['pageUrl']).group(1)\n", "    comment = edit['comment']\n", "    print('\\t'.join((wiki_name, page_name, comment)), '\\n', flush=True)\n", "if c == max_edits:\n", "    print('Done.\\n')\n", "else:\n", "    print('Still streaming.\\n')" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Last 5 of 500 edits...\n", "\n", "Wikidata\tQ1022537\t/* wbsetdescription-add:1|nn */ Auto-description for Norwegian (Nynorsk) \n", "\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Wikidata\tQ1029016\t/* wbsetdescription-add:1|nb */ Auto-description for Norwegian (Bokm\u00e5l) \n", "\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Wikidata\tQ17744235\t/* wbeditentity-create:0| */ \n", "\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "English Wikipedia\tWikipedia:Tutorial/Editing/sandbox\t \n", "\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Wikidata\tQ2068332\t/* wbeditentity-update:0| */ Added: [[eu:Rhynchobatus luebberti]] \n", "\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Done.\n", "\n" ] 
} ], "prompt_number": 43 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Import into pandas\n", "\n", "Let's add all these edits into a `DataFrame` to make them easier to work with." ] }, { "cell_type": "code", "collapsed": false, "input": [ "df = pd.DataFrame(edits)\n", "df['timestamp'] = pd.to_datetime(df['timestamp'], unit='s')\n", "df.head()" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "

 | anonymous | channel | comment | delta | flag | namespace | newPage | page | pageUrl | robot | timestamp | unpatrolled | url | user | userUrl | wikipedia | wikipediaLong | wikipediaShort | wikipediaUrl |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | False | #en.wikipedia | /* top */ | -5 | M | article | False | Android version history | http://en.wikipedia.org/wiki/Android_version_h... | False | 2014-08-31 18:32:09.978374 | False | http://en.wikipedia.org/w/index.php?diff=62360... | JordanKyser22 | http://en.wikipedia.org/wiki/User:JordanKyser22 | English Wikipedia | English Wikipedia | en | http://en.wikipedia.org |
1 | False | #ru.wikipedia | /* Противники */ | 12 |  | article | False | Чёрно-жёлто-белый флаг | http://ru.wikipedia.org/wiki/Чёрно-жёлто-белый... | False | 2014-08-31 18:32:09.978403 | False | http://ru.wikipedia.org/w/index.php?diff=65199... | Камарад Че | http://ru.wikipedia.org/wiki/User:Камарад Че | Russian Wikipedia | Russian Wikipedia | ru | http://ru.wikipedia.org |
2 | False | #wikidata.wikipedia | /* wbcreateclaim-create:1\| */ [[Property:P1459... | 346 | B | article | False | Q17744225 | http://wikidata.org/wiki/Q17744225 | True | 2014-08-31 18:32:09.978426 | False | http://www.wikidata.org/w/index.php?diff=15466... | Reinheitsgebot | http://wikidata.org/wiki/User:Reinheitsgebot | Wikidata | Wikidata | wd | http://wikidata.org |
3 | False | #nl.wikipedia | Linkonderhoud | -232 |  | article | False | Beersel | http://nl.wikipedia.org/wiki/Beersel | False | 2014-08-31 18:32:09.978446 | False | http://nl.wikipedia.org/w/index.php?diff=41990... | Smile4ever | http://nl.wikipedia.org/wiki/User:Smile4ever | Dutch Wikipedia | Dutch Wikipedia | nl | http://nl.wikipedia.org |
4 | False | #wikidata.wikipedia | /* wbcreateclaim-create:1\| */ [[Property:P1412... | 405 |  | article | False | Q3035478 | http://wikidata.org/wiki/Q3035478 | False | 2014-08-31 18:32:09.978467 | False | http://www.wikidata.org/w/index.php?diff=15466... | Jura1 | http://wikidata.org/wiki/User:Jura1 | Wikidata | Wikidata | wd | http://wikidata.org |

wikipediaShort | ar | ca | co | cs | de | el | en | es | eu | fa | ... | ja | ko | nl | no | pl | pt | ru | sv | wd | zh |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
timestamp | | | | | | | | | | | | | | | | | | | | | |
2014-08-31 18:32:09.978374 | NaN | NaN | NaN | NaN | NaN | NaN | -5 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2014-08-31 18:32:09.978403 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | 12 | NaN | NaN | NaN |
2014-08-31 18:32:09.978426 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 346 | NaN |
2014-08-31 18:32:09.978446 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | -232 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2014-08-31 18:32:09.978467 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 405 | NaN |

5 rows × 24 columns

wikipediaShort | ar | ca | co | cs | de | el | en | es | eu | fa | ... | ja | ko | nl | no | pl | pt | ru | sv | wd | zh |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
timestamp | | | | | | | | | | | | | | | | | | | | | |
2014-08-31 18:32:09 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 6 | 0 |
2014-08-31 18:32:10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2014-08-31 18:32:11 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2014-08-31 18:32:12 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 1 | 1 | 1 | ... | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 9 | 0 |
2014-08-31 18:32:13 | 2 | 0 | 2 | 0 | 1 | 0 | 4 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 2 | 11 | 0 |

5 rows × 24 columns