{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook is the data munging part of the visualization of the interconnectedness of my top 30\\* most edited articles on Wikipedia (I go by [Resident Mario](https://en.wikipedia.org/wiki/User:Resident_Mario) on the encyclopedia), as reported by IBM Watson's [Concept Insight](http://www.ibm.com/smarterplanet/us/en/ibmwatson/developercloud/concept-insights.html) API service. The data is scraped from the [Supercount Wikimedia Lab tool](https://tools.wmflabs.org/supercount/) with `requests` and `beautifulsoup`, interwoven using `watsongraph`, and visualized using `d3.js`.\n", "\n", "The techniques here could eventually be easily applied to any editor! A [widget](https://github.com/jdfreder/ipython-d3networkx/blob/master/examples/demo%20simple.ipynb) for visualizing any editor's top articles is forthcoming once `watsongraph` makes it to the `0.3.0` release.\n", "\n", "\\* The cutoff is due to a [technical limitation](https://github.com/ResidentMario/watsongraph/issues/8)." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from watsongraph.conceptmodel import ConceptModel\n", "from watsongraph.node import conceptualize\n", "import json\n", "import requests\n", "import bs4\n", "\n", "def get_top_thirty_articles(username):\n", " \"\"\"\n", " Performs a raw call to the Supercount edit counter, and parses the response to get at a list of links\n", " on that page.\n", " Output looks like this:\n", " [
<li><a href=\"...\">Hawaii hotspot</a> — 603</li>,\n", " <li><a href=\"...\">Mauna Kea</a> — 543</li>,\n", " ...]\n", " \"\"\"\n", " u = username.replace(\" \", \"+\")\n", " # Currently limited to 30 articles because of a batch limitation.\n", " # cf. https://github.com/ResidentMario/watsongraph/issues/8\n", " url = \"https://tools.wmflabs.org/supercount/index.php?user={}&project=en.wikipedia.org&toplimit=30\".format(u)\n", " r = requests.get(url)\n", " # Surprisingly, requests guesses the wrong encoding and tries 'ISO-8859-1' by default.\n", " # This blows up e.g. \"Lōʻihi_Seamount\", which becomes a garbled mess.\n", " # Easy to fix by swapping the encoding, but finding this programmatic misstep took me some time.\n", " # cf. http://docs.python-requests.org/en/latest/user/quickstart/#response-content\n", " r.encoding = 'utf-8'\n", " raw_data = r.text\n", " raw_links = list(bs4.BeautifulSoup(raw_data, \"html.parser\").find_all('li'))\n", " return raw_links\n", "\n", "def parse_articles(list_of_links):\n", " \"\"\"\n", " After running get_top_thirty_articles() we get a list of links.\n", " It looks like this:\n", " [
<li><a href=\"...\">Hawaii hotspot</a> — 603</li>,\n", " <li><a href=\"...\">Mauna Kea</a> — 543</li>,\n", " ...]\n", " This method parses these links into a list of dicts of article names and edit counts that looks like this:\n", " [{'article': 'Types of volcanic eruptions', 'edits': 236}, ...]\n", " \"\"\"\n", " ret = []\n", " for link in list_of_links:\n", " # Skip list items whose markup does not contain \"/:\" (non-article links).\n", " if \"/:\" in str(link):\n", " text = link.get_text()\n", " # At this point we are parsing e.g. 'Hawaii hotspot — 603'\n", " article = text[:text.index(\"—\") - 1]\n", " edits = text[text.index(\"—\") + 2:]\n", " ret.append({'article': article, 'edits': int(edits)})\n", " return ret\n", "\n", "def clean_and_model(data_dict):\n", " \"\"\"\n", " After running get_top_thirty_articles() and parse_articles() we are left with the information we need,\n", " encoded in the following format:\n", " [{'article': 'Types of volcanic eruptions', 'edits': 236}, ...]\n", " Now we build the model for the whole thing.\n", " \"\"\"\n", " # At this step we ought to clean the input using conceptualize() to back-trace, but unicode breaks it:\n", " # conceptualize() does not support unicode.\n", " # cf. https://github.com/ResidentMario/watsongraph/issues/11\n", " # So this one time I cleaned the data by hand (fixed one problem point).\n", " # Once I rewrite the access methods and push this library out to version 0.3.0 this should be fixed.\n", " # cf.\n", " contributions = ConceptModel([dat['article'] for dat in data_dict])\n", " contributions.explode_edges(prune=True)\n", " for entry in data_dict:\n", " contributions.set_property(entry['article'], \"edits\", entry['edits'])\n", " # print(contributions.get_node(entry['article']).properties['edits'])\n", " return contributions\n", "\n", "def save_model(model):\n", " \"\"\"\n", " The final step: saves the model!\n", " \"\"\"\n", " with open('contributions.json', 'w') as file:\n", " file.write(json.dumps(model.to_json(), indent=4))" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [], "source": [ "dataset = parse_articles(get_top_thirty_articles(\"Resident Mario\"))\n", "# Manually heal a problem point.\n", "# cf. \n", "dataset[23]['article'] = 'Ferdinandea'" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [], "source": [ "contributions = clean_and_model(dataset)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [], "source": [ "save_model(contributions)" ] }
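, { "cell_type": "markdown", "metadata": {}, "source": [ "As a quick sanity check on the scraped data (an illustrative extra, not part of the pipeline above), the next cell prints the five most-edited articles in `dataset`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Illustrative sanity check: list the five most-edited articles in the parsed dataset.\n", "for entry in sorted(dataset, key=lambda d: d['edits'], reverse=True)[:5]:\n", " print('{}: {} edits'.format(entry['article'], entry['edits']))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To verify the save step, we can reload `contributions.json` and count what was written. This is a minimal sketch: the `nodes` and `links` keys assume a node-link layout in the serialized graph, so adjust them if `ConceptModel.to_json()` emits a different schema." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Reload the saved model and report its size.\n", "# 'nodes' and 'links' are assumed key names for the serialized layout;\n", "# adjust them if to_json() uses a different schema.\n", "with open('contributions.json') as f:\n", " graph = json.load(f)\n", "print(len(graph.get('nodes', [])), 'nodes,', len(graph.get('links', [])), 'links')" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.4.3" } }, "nbformat": 4, "nbformat_minor": 0 }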