{ "metadata": { "name": "", "signature": "sha256:e31a8c6a81efd500c21e80980fa17fa68c31c0aaba3b3b38c73ae0a57aaa0315" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [
{ "cell_type": "code", "collapsed": false, "input": [ "from __future__ import unicode_literals\n", "import json\n", "import numpy as np\n", "import pandas as pd\n", "from pandas import DataFrame, Series\n", "import xlrd\n", "from collections import defaultdict" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 },
{ "cell_type": "markdown", "metadata": {}, "source": [ "# Transforming the PLOS thesaurus\n", "\n", "The PLOS thesaurus was kindly provided to us as a spreadsheet with thousands of rows, one node per row. It is a polyhierarchy (a term can sit under more than one parent) laid out in the spreadsheet as a tree. We need to transform it into a JSON object that also includes an article count for every node in the tree.\n", "\n", "An example of the desired data structure for the PLOS thesaurus (`###` stands for an article count):\n", "\n", "```\n", "{\"count\": #total, \"name\": \"PLOS\", \"children\": [\n", "    {\"name\": \"Computer and information sciences\",\n", "     \"count\": ###,\n", "     \"children\": [\n", "        {\"name\": \"Information technology\",\n", "         \"count\": ###,\n", "         \"children\": [\n", "            {\"name\": \"Data mining\", \"count\": ###},\n", "            {\"name\": \"Data reduction\", \"count\": ###},\n", "            {\"name\": \"Databases\",\n", "             \"count\": ###,\n", "             \"children\": [\n", "                {\"name\": \"Relational databases\", \"count\": ###}\n", "             ]\n", "            },\n", "            ...,\n", "            {\"name\": \"Text mining\", \"count\": ###}\n", "         ]\n", "        }\n", "     ]\n", "    },\n", "    ...\n", "  ]\n", "}\n", "```\n", "\n", "In Python, each node maps naturally onto a dict with `name` and `count` keys. A node's children are stored under `children` as a list of such dicts, so the whole tree can be written out directly with the `json` module." ] },
{ "cell_type": "code", "collapsed": false, "input": [ "# Import article data. 
This is a 400 MB file.\n", "df = pd.read_pickle('../data/all_plos_df.pkl')\n", "\n", "# Drop unused data\n", "df.drop(['author', 'title_display', 'journal', 'abstract', 'publication_date', 'score'], axis=1, inplace=True)\n", "df.set_index('id', inplace=True)\n", "df.head()" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "<table>\n", "<thead><tr><th>id</th><th>subject</th></tr></thead>\n", "<tbody>\n", "<tr><th>10.1371/journal.pone.0100281</th><td>[/Medicine and health sciences/Pathology and l...</td></tr>\n", "<tr><th>10.1371/journal.pone.0070935</th><td>[/Biology and life sciences/Molecular biology/...</td></tr>\n", "<tr><th>10.1371/journal.pone.0066671</th><td>[/Biology and life sciences/Population biology...</td></tr>\n", "<tr><th>10.1371/journal.pntd.0002444</th><td>[/Earth sciences/Geography/Geographic areas/Ru...</td></tr>\n", "<tr><th>10.1371/journal.pone.0003605</th><td>[/Biology and life sciences/Cell biology/Cell ...</td></tr>\n", "</tbody>\n", "</table>\n", "<p>5 rows \u00d7 1 columns</p>
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 2, "text": [ " subject\n", "id \n", "10.1371/journal.pone.0100281 [/Medicine and health sciences/Pathology and l...\n", "10.1371/journal.pone.0070935 [/Biology and life sciences/Molecular biology/...\n", "10.1371/journal.pone.0066671 [/Biology and life sciences/Population biology...\n", "10.1371/journal.pntd.0002444 [/Earth sciences/Geography/Geographic areas/Ru...\n", "10.1371/journal.pone.0003605 [/Biology and life sciences/Cell biology/Cell ...\n", "\n", "[5 rows x 1 columns]" ] } ], "prompt_number": 2 },
{ "cell_type": "code", "collapsed": false, "input": [ "# Let's make sure we are counting articles correctly for each subject node.\n", "\n", "def count_articles(df, subject_path):\n", "    # Each subject entry is a list of paths; stringify it so it can be searched.\n", "    s = df.subject.apply(lambda s: str(s))\n", "    # regex=False matches the path literally rather than as a regular expression.\n", "    matching = s[s.str.contains(subject_path, regex=False)]\n", "    return len(matching)\n", "\n", "print 'Total articles:', len(df)\n", "print 'Science policy:', count_articles(df, 'Science policy')\n", "print 'Science policy/Bioethics:', count_articles(df, 'Science policy/Bioethics')" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Total articles: 121450\n", "Science policy: " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "531\n", "Science policy/Bioethics: " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "72\n" ] } ], "prompt_number": 3 },
{ "cell_type": "code", "collapsed": false, "input": [ "def tree_from_spreadsheet(f, df, verbose=False):\n", "\n", "    # Stringify each article's subject list once, so every node can search it.\n", "    subjects = df.subject.apply(lambda s: str(s))\n", "\n", "    book = xlrd.open_workbook(f)\n", "    pt = book.sheet_by_index(0)\n", "    # spreadsheet cells : (row, col) :: cell A1 : (0, 0)\n", "\n", "    # Initialize a list to contain the top-tier branches of the thesaurus.\n", "    # Our test case will only have one item in this list.\n", "    pt_list = []\n", "\n", "    # Keep track of the path in the tree.\n", "    cur_path = Series([np.nan]*10)\n", "\n", "    for r in range(1, pt.nrows):\n", "        # Start on row two (row index 1); row one is skipped.\n", "\n", "        # Columns: the hierarchy goes up to 10 tiers.\n", "        for c in range(10):\n", "            if pt.cell_value(r, c):\n", "                # The first non-empty cell in the row is this row's node.\n", "\n", "                # Construct the path to this node.\n", "                # Clean strings because some terms (RNA nomenclature) cause unicode error\n", "                text = pt.cell_value(r, c).replace(u'\\u2019', \"'\")\n", "                cur_path[c] = text\n", "                cur_path[c+1:] = np.nan\n", "                path_list = list(cur_path.dropna())\n", "                tier = len(path_list)\n", "                path_str = '/'.join(path_list)\n", "                if verbose:\n", "                    print tier, path_str\n", "\n", "                # Add the node to the JSON-like tree structure.\n", "                # Leaf nodes never touch 'children', so they carry no such key.\n", "                node = defaultdict(list)\n", "                node['name'] = text\n", "                # Count articles whose subject list contains this path (literal substring match).\n", "                node['count'] = len(subjects[subjects.str.contains(path_str, regex=False)])\n", "\n", "                # Walk down the most recently added branch, one tier at a time,\n", "                # to reach the list of siblings this node belongs to.\n", "                siblings = pt_list\n", "                for _ in range(tier - 1):\n", "                    siblings = siblings[-1]['children']\n", "                siblings.append(node)\n", "\n", "                # Go to next row after finding a term. There is only one term listed per row.\n", "                break\n", "\n", "    # Make a single JSON object to contain all the branches.\n", "    pt_obj = {'count': len(df), 'name': 'PLOS', 'children': pt_list}\n", "\n", "    return pt_obj" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 4 },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Prototyping\n", "\n", "Run the conversion on smaller subsets of the thesaurus before processing the full spreadsheet." ] },
{ "cell_type": "code", "collapsed": false, "input": [ "# Test 1: Science policy\n", "plosthes_test_file = '../data/plosthes_test.xlsx'\n", "json.dumps(tree_from_spreadsheet(plosthes_test_file, df, verbose=True))" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "code", "collapsed": false, "input": [ "# Test 2: An edited subset of Earth Sciences\n", "plosthes_test_file = '../data/plosthes_test_2.xlsx'\n", "json.dumps(tree_from_spreadsheet(plosthes_test_file, df, verbose=True))" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### How to check for specific paths in the article data\n", "\n", "Use this to inspect the raw subject lists when a node in the tree looks wrong, for example a count that seems off. (The bug that prompted this check has since been fixed.)" ] },
{ "cell_type": "code", "collapsed": false, "input": [ "df.subject[df.subject.apply(lambda x: u'/Earth sciences/Mineralogy/Minerals/Gemstones/Diamonds' in x)]" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Ready for full conversion and export?\n", "\n", "You can either run the cell below, or run the Python script `crunch_tree.py`. Warning: it takes a few hours. 
Don't forget to update the filename of the thesaurus spreadsheet if needed.\n", "\n", "The cell writes its output to `data/plos_hierarchy_full.json`. It does not overwrite `data/plos_tree.json`, the JSON file used by the D3 visualization. If you are updating the tree, you need to replace that file yourself." ] },
{ "cell_type": "code", "collapsed": false, "input": [ "# Update this filename if you use a newer version!\n", "plosthes_full_file = '../data/plosthes.2014-2.full.xlsx'\n", "\n", "# Generate tree structure\n", "# Change to verbose=True if you want to see it happening.\n", "# (Fills up the output cell with ~10000 lines.)\n", "plos_tree = tree_from_spreadsheet(plosthes_full_file, df, verbose=False)\n", "\n", "# Export tree structure as JSON\n", "# Note: the D3 tree visualization uses plos_tree.json -- this script won't overwrite it.\n", "with open('../data/plos_hierarchy_full.json', 'wb') as f:\n", "    json.dump(plos_tree, f)" ], "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }