{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook is the data munging part of the visualization of the interconnectedness of my top 30\\* most edited articles on Wikipedia (I go by [Resident Mario](https://en.wikipedia.org/wiki/User:Resident_Mario) on the encyclopedia), as reported by IBM Watson's [Concept Insight](http://www.ibm.com/smarterplanet/us/en/ibmwatson/developercloud/concept-insights.html) API service. The data is scraped from the [Supercount Wikimedia Lab tool](https://tools.wmflabs.org/supercount/) with `requests` and `beautifulsoup`, interwoven using `watsongraph`, and visualized using `d3.js`.\n", "\n", "The techniques here could eventually be easily applied to any editor! A [widget](https://github.com/jdfreder/ipython-d3networkx/blob/master/examples/demo%20simple.ipynb) for visualizing any editor's top articles is forthcoming once `watsongraph` makes it to the `0.3.0` release.\n", "\n", "\\* The cutoff is due to a [technical limitation](https://github.com/ResidentMario/watsongraph/issues/8)." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from watsongraph.conceptmodel import ConceptModel\n", "from watsongraph.node import conceptualize\n", "import json\n", "import requests\n", "import bs4\n", "\n", "def get_top_thirty_articles(username):\n", " \"\"\"\n", " Performs a raw call to the Supercount edit counter, and parses the response to get at a list of links\n", " on that page.\n", " Output looks like this:\n", " [
<li><a href=\"...\">Hawaii hotspot</a> — 603</li>,\n", " <li><a href=\"...\">Mauna Kea</a> — 543</li>,\n", " ...]\n", " \"\"\"\n", " u = username.replace(\" \", \"+\")\n", " # Currently limited to 30 articles because of a batch limitation.\n", " # cf. https://github.com/ResidentMario/watsongraph/issues/8\n", " url = \"https://tools.wmflabs.org/supercount/index.php?user={}&project=en.wikipedia.org&toplimit=30\".format(u)\n", " r = requests.get(url)\n", " # Surprisingly, requests guesses the wrong encoding and tries 'ISO-8859-1' by default.\n", " # This blows up e.g. \"Lōʻihi_Seamount\", which becomes a garbled mess.\n", " # Easy to fix by swapping the encoding, but finding this programmatic misstep took me some time.\n", " # cf. http://docs.python-requests.org/en/latest/user/quickstart/#response-content\n", " r.encoding = 'utf-8'\n", " raw_data = r.text\n", " raw_links = list(bs4.BeautifulSoup(raw_data, \"html.parser\").find_all('li'))\n", " return raw_links\n", "\n", "def parse_articles(list_of_links):\n", " \"\"\"\n", " After running get_top_thirty_articles() we get a list of links.\n", " It looks like this:\n", " [
<li><a href=\"...\">Hawaii hotspot</a> — 603</li>,\n", " <li><a href=\"...\">Mauna Kea</a> — 543</li>,\n", " ...]\n", " This method parses these links into a list of dicts of article names and edit counts that looks like this:\n", " [{'article': 'Types of volcanic eruptions', 'edits': 236}, ...]\n", " \"\"\"\n", " ret = []\n", " for link in list_of_links:\n", " # Skip list items whose markup does not contain \"/:\" (non-article links).\n", " if \"/:\" in str(link):\n", " text = link.get_text()\n", " # At this point we are parsing e.g. 'Hawaii hotspot — 603'\n", " article = text[:text.index(\"—\") - 1]\n", " edits = text[text.index(\"—\") + 2:]\n", " ret.append({'article': article, 'edits': int(edits)})\n", " return ret\n", "\n", "def clean_and_model(data_dict):\n", " \"\"\"\n", " After running get_top_thirty_articles() and parse_articles() we are left with the information we need,\n", " encoded in the following format:\n", " [{'article': 'Types of volcanic eruptions', 'edits': 236}, ...]\n", " Now we build the model for the whole thing.\n", " \"\"\"\n", " # At this step we ought to clean the input using conceptualize() to back-trace, but unicode breaks it:\n", " # conceptualize() does not support unicode.\n", " # cf. https://github.com/ResidentMario/watsongraph/issues/11\n", " # So this one time I cleaned the data by hand (fixed one problem point).\n", " # Once I rewrite the access methods and push this library out to version 0.3.0 this should be fixed.\n", " # cf.\n", " contributions = ConceptModel([dat['article'] for dat in data_dict])\n", " contributions.explode_edges(prune=True)\n", " for entry in data_dict:\n", " contributions.set_property(entry['article'], \"edits\", entry['edits'])\n", " # print(contributions.get_node(entry['article']).properties['edits'])\n", " return contributions\n", "\n", "def save_model(model):\n", " \"\"\"\n", " The final step: saves the model!\n", " \"\"\"\n", " with open('contributions.json', 'w') as file:\n", " file.write(json.dumps(model.to_json(), indent=4))" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [], "source": [ "dataset = parse_articles(get_top_thirty_articles(\"Resident Mario\"))\n", "# Manually heal a problem point.\n", "# cf. \n", "dataset[23]['article'] = 'Ferdinandea'" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [], "source": [ "contributions = clean_and_model(dataset)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [], "source": [ "save_model(contributions)" ] }
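, { "cell_type": "markdown", "metadata": {}, "source": [ "As a quick sanity check on the scraped data (an illustrative extra, not part of the pipeline above), the next cell prints the five most-edited articles in `dataset`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Illustrative sanity check: list the five most-edited articles in the parsed dataset.\n", "for entry in sorted(dataset, key=lambda d: d['edits'], reverse=True)[:5]:\n", " print('{}: {} edits'.format(entry['article'], entry['edits']))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To verify the save step, we can reload `contributions.json` and count what was written. This is a minimal sketch: the `nodes` and `links` keys assume a node-link layout in the serialized graph, so adjust them if `ConceptModel.to_json()` emits a different schema." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Reload the saved model and report its size.\n", "# 'nodes' and 'links' are assumed key names for the serialized layout;\n", "# adjust them if to_json() uses a different schema.\n", "with open('contributions.json') as f:\n", " graph = json.load(f)\n", "print(len(graph.get('nodes', [])), 'nodes,', len(graph.get('links', [])), 'links')" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.4.3" } }, "nbformat": 4, "nbformat_minor": 0 }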