{
 "metadata": {
  "name": "",
  "signature": "sha256:e31a8c6a81efd500c21e80980fa17fa68c31c0aaba3b3b38c73ae0a57aaa0315"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from __future__ import unicode_literals\n",
      "import json\n",
      "import numpy as np\n",
      "import pandas as pd\n",
      "from pandas import DataFrame, Series\n",
      "import xlrd\n",
      "from collections import defaultdict"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 1
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "# Transforming the PLOS thesaurus\n",
      "\n",
      "The PLOS thesaurus was kindly provided to us as a spreadsheet with thousands of rows, one node per row. It is a polyhierarchy represented in the form of a tree. We need to transform it into a JSON object that also includes article counts for all the nodes in the tree.\n",
      "\n",
      "An example of the desired data structure for PLOS thesaurus:\n",
      "\n",
      "```\n",
      "{\"count\": #total, \"name\": \"PLOS\", \"children\": [\n",
      "     {\"name\": \"Computer and information sciences\",\n",
      "      \"count\": ###,\n",
      "      \"children\": [\n",
      "            {\"name\": \"Information technology\",\n",
      "             \"count\": ###,\n",
      "             \"children\": [\n",
      "                {\"name\": \"Data mining\", \"count\": ###},\n",
      "                {\"name\": \"Data reduction\", \"count\": ###},\n",
      "                {\"name\":  \"Databases\",\n",
      "                  \"count\": ###,\n",
      "                  \"children\": [\n",
      "                    {\"name\": \"Relational databases\", \"count\": ###}\n",
      "              ]\n",
      "            },\n",
      "            ...,\n",
      "            {\"name\": \"Text mining\",\"count\": ###} \n",
      "          ]\n",
      "        }\n",
      "      ]\n",
      "    }, \n",
      "    ...\n",
      "  ]\n",
      "}\n",
      "```\n",
      "\n",
      "Thinking about a Python data structure that could be output to JSON, each node would be a dict. Children are specified as a list of dicts."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Import article data. This is a 400 MB file.\n",
      "df = pd.read_pickle('../data/all_plos_df.pkl')\n",
      "\n",
      "# Drop unused data\n",
      "df.drop(['author', 'title_display', 'journal', 'abstract', 'publication_date', 'score'], axis=1, inplace=True)\n",
      "df.set_index('id', inplace=True)\n",
      "df.head()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>subject</th>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>id</th>\n",
        "      <th></th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>10.1371/journal.pone.0100281</th>\n",
        "      <td> [/Medicine and health sciences/Pathology and l...</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>10.1371/journal.pone.0070935</th>\n",
        "      <td> [/Biology and life sciences/Molecular biology/...</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>10.1371/journal.pone.0066671</th>\n",
        "      <td> [/Biology and life sciences/Population biology...</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>10.1371/journal.pntd.0002444</th>\n",
        "      <td> [/Earth sciences/Geography/Geographic areas/Ru...</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>10.1371/journal.pone.0003605</th>\n",
        "      <td> [/Biology and life sciences/Cell biology/Cell ...</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "<p>5 rows \u00d7 1 columns</p>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 2,
       "text": [
        "                                                                        subject\n",
        "id                                                                             \n",
        "10.1371/journal.pone.0100281  [/Medicine and health sciences/Pathology and l...\n",
        "10.1371/journal.pone.0070935  [/Biology and life sciences/Molecular biology/...\n",
        "10.1371/journal.pone.0066671  [/Biology and life sciences/Population biology...\n",
        "10.1371/journal.pntd.0002444  [/Earth sciences/Geography/Geographic areas/Ru...\n",
        "10.1371/journal.pone.0003605  [/Biology and life sciences/Cell biology/Cell ...\n",
        "\n",
        "[5 rows x 1 columns]"
       ]
      }
     ],
     "prompt_number": 2
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Let's make sure we are counting articles correctly for each subject node.\n",
      "\n",
      "def count_articles(df, subject_path):\n",
      "    s = df.subject.apply(lambda s: str(s))\n",
      "    matching = s[s.str.contains(subject_path)]\n",
      "    return len(matching)\n",
      "\n",
      "print 'Total articles:', len(df)\n",
      "print 'Science policy:', count_articles(df, 'Science policy')\n",
      "print 'Science policy/Bioethics:', count_articles(df, 'Science policy/Bioethics')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Total articles: 121450\n",
        "Science policy: "
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "531\n",
        "Science policy/Bioethics: "
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "72\n"
       ]
      }
     ],
     "prompt_number": 3
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def tree_from_spreadsheet(f, df, verbose=False):\n",
      "    \n",
      "    subjects = df.subject.apply(lambda s: str(s))\n",
      "    \n",
      "    book = xlrd.open_workbook(f)\n",
      "    pt = book.sheet_by_index(0)\n",
      "    # spreadsheet cells : (row, col) :: cell A1 : (0, 0)\n",
      "\n",
      "    # Initialize a list to contain the thesaurus.\n",
      "    # Our test case will only have one item in this list.\n",
      "    pt_list = []\n",
      "\n",
      "    # Keep track of the path in the tree.\n",
      "    cur_path = Series([np.nan]*10)\n",
      "\n",
      "    for r in range(1, pt.nrows):\n",
      "        # Start on row two.\n",
      "\n",
      "        # Columns: the hierarchy goes up to 10 tiers.\n",
      "        for c in range(10):\n",
      "            if pt.cell_value(r, c):\n",
      "                # If this condition is satisfied, we are at the node that's in this line.\n",
      "\n",
      "                # Construct the path to this node.\n",
      "                # Clean strings because some terms (RNA nomenclature) cause unicode error\n",
      "                text = pt.cell_value(r, c).replace(u'\\u2019', \"'\")\n",
      "                cur_path[c] = text\n",
      "                cur_path[c+1:] = np.nan\n",
      "                path_list = list(cur_path.dropna())\n",
      "                tier = len(path_list)\n",
      "                path_str = '/'.join(path_list)\n",
      "                if verbose:\n",
      "                    print tier, path_str\n",
      "\n",
      "                # Add the node to the JSON-like tree structure.\n",
      "                node = defaultdict(list)\n",
      "                node['name'] = text\n",
      "                node['count']=  len(subjects[subjects.str.contains(path_str)])\n",
      "\n",
      "                # This part is completely ridiculous. But it seems to work.\n",
      "                if tier == 1:\n",
      "                    pt_list.append(node)\n",
      "                elif tier == 2:\n",
      "                    pt_list[-1]['children'].append(node)\n",
      "                elif tier == 3:\n",
      "                    pt_list[-1]['children'][-1]['children'].append(node)\n",
      "                elif tier == 4:\n",
      "                    pt_list[-1]['children'][-1]['children'][-1]['children'].append(node)\n",
      "                elif tier == 5:\n",
      "                    pt_list[-1]['children'][-1]['children'][-1]['children'][-1]['children'].append(node)\n",
      "                elif tier == 6:\n",
      "                    pt_list[-1]['children'][-1]['children'][-1]['children'][-1]['children'][-1]['children'].append(node)\n",
      "                elif tier == 7:\n",
      "                    pt_list[-1]['children'][-1]['children'][-1]['children'][-1]['children'][-1]['children'][-1]['children'].append(node)\n",
      "                elif tier == 8:\n",
      "                    pt_list[-1]['children'][-1]['children'][-1]['children'][-1]['children'][-1]['children'][-1]['children'][-1]['children'].append(node)\n",
      "                elif tier == 9:\n",
      "                    pt_list[-1]['children'][-1]['children'][-1]['children'][-1]['children'][-1]['children'][-1]['children'][-1]['children'][-1]['children'].append(node)\n",
      "                elif tier == 10:\n",
      "                    pt_list[-1]['children'][-1]['children'][-1]['children'][-1]['children'][-1]['children'][-1]['children'][-1]['children'][-1]['children'][-1]['children'].append(node)\n",
      "\n",
      "                # Go to next row after finding a term. There is only one term listed per row.\n",
      "                break\n",
      "\n",
      "    # Make a single JSON object to contain all the branches.\n",
      "    pt_obj = {'count': len(df), 'name': 'PLOS', 'children': pt_list}\n",
      "\n",
      "    return pt_obj"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 4
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Prototyping\n",
      "\n",
      "Experimenting on smaller subsets of the thesaurus."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Test 1: Science policy\n",
      "plosthes_test_file = '../data/plosthes_test.xlsx'\n",
      "json.dumps(tree_from_spreadsheet(plosthes_test_file, df, verbose=True))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Test 2: An edited subset of Earth Sciences\n",
      "plosthes_test_file = '../data/plosthes_test_2.xlsx'\n",
      "json.dumps(tree_from_spreadsheet(plosthes_test_file, df, verbose=True))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### How to check for specific paths in the article data\n",
      "\n",
      "In case something weird shows up in the tree, like a bug or something. (We fixed it.)"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "df.subject[df.subject.apply(lambda x: u'/Earth sciences/Mineralogy/Minerals/Gemstones/Diamonds' in x)]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Ready for full conversion and export?\n",
      "\n",
      "You can either run the cell below, or run the Python script `crunch_tree.py`. Warning: it takes a few hours. Don't forget to update the filename of the thesaurus spreadsheet if needed.\n",
      "\n",
      "This will not overwrite the existing JSON file used by the D3 visualisation, `data/plos_tree.json`. You need to do that yourself, if you are updating the tree."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Update this filename if you use a newer version!\n",
      "plosthes_full_file = '../data/plosthes.2014-2.full.xlsx'\n",
      "\n",
      "# Generate tree structure\n",
      "# Change to verbose=True if you want to see it happening. \n",
      "# (Fills up the output cell with ~10000 lines.)\n",
      "plos_tree = tree_from_spreadsheet(plosthes_full_file, df, verbose=False)\n",
      "\n",
      "# Export tree structure as JSON\n",
      "# Note: the D3 tree visualization uses plos_tree.json -- this script won't overwrite it.\n",
      "with open('../data/plos_hierarchy_full.json', 'wb') as f:\n",
      "    json.dump(plos_tree, f)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    }
   ],
   "metadata": {}
  }
 ]
}