{ "metadata": { "name": "", "signature": "sha256:fb602b8501cfd3896a691c06e390c63d56d5fd9d64a5cf07fc2057c44adecd4e" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "code", "collapsed": false, "input": [ "from __future__ import unicode_literals\n", "import json\n", "import numpy as np\n", "import pandas as pd\n", "from pandas import DataFrame, Series\n", "import xlrd\n", "from collections import defaultdict" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 2 }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Transforming the PLOS thesaurus\n", "\n", "The PLOS thesaurus was kindly provided to us as a spreadsheet with thousands of rows, one node per row. It is a polyhierarchy represented in the form of a tree. We need to transform it into a JSON object that also includes article counts for all the nodes in the tree.\n", "\n", "An example of the desired data structure for the PLOS thesaurus:\n", "\n", "```\n", "{\"count\": #total, \"name\": \"PLOS\", \"children\": [\n", "    {\"name\": \"Computer and information sciences\",\n", "     \"count\": ###,\n", "     \"children\": [\n", "        {\"name\": \"Information technology\",\n", "         \"count\": ###,\n", "         \"children\": [\n", "            {\"name\": \"Data mining\", \"count\": ###},\n", "            {\"name\": \"Data reduction\", \"count\": ###},\n", "            {\"name\": \"Databases\",\n", "             \"count\": ###,\n", "             \"children\": [\n", "                {\"name\": \"Relational databases\", \"count\": ###}\n", "             ]\n", "            },\n", "            ...,\n", "            {\"name\": \"Text mining\", \"count\": ###}\n", "         ]\n", "        }\n", "     ]\n", "    },\n", "    ...\n", " ]\n", "}\n", "```\n", "\n", "In Python, each node is a dict with a `name`, a `count` and, for non-leaf nodes, a `children` list of child dicts. The whole structure is therefore a single root dict (`PLOS`) whose `children` list holds the top-level nodes."
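, "\n",
"\n",
"As a hedged illustration (not necessarily how this notebook builds the tree), a structure of this shape could be assembled from slash-separated subject paths roughly as follows. The helper `build_tree`, the sample article ids, and the path format are assumptions made for this sketch; each node's `count` here is the number of distinct articles tagged at or below that node.\n",
"\n",
"```python\n",
"import json\n",
"\n",
"\n",
"def build_tree(article_subjects, root_name='PLOS'):\n",
"    # Hypothetical helper: article_subjects maps an article id to a list of\n",
"    # slash-separated subject paths (assumed input format).\n",
"    # Pass 1: collect the set of article ids seen at every path prefix.\n",
"    ids_at = {(): set()}\n",
"    for article_id, paths in article_subjects.items():\n",
"        ids_at[()].add(article_id)\n",
"        for path in paths:\n",
"            parts = tuple(p for p in path.split('/') if p)\n",
"            for depth in range(1, len(parts) + 1):\n",
"                ids_at.setdefault(parts[:depth], set()).add(article_id)\n",
"    # Pass 2: turn the prefixes into nested dicts, parents before children.\n",
"    nodes = {(): {'name': root_name, 'count': len(ids_at[()])}}\n",
"    for prefix in sorted(ids_at, key=lambda p: (len(p), p)):\n",
"        if not prefix:\n",
"            continue\n",
"        node = {'name': prefix[-1], 'count': len(ids_at[prefix])}\n",
"        nodes[prefix] = node\n",
"        nodes[prefix[:-1]].setdefault('children', []).append(node)\n",
"    return nodes[()]\n",
"\n",
"\n",
"# Made-up sample data, for illustration only.\n",
"sample = {\n",
"    'article-1': ['/Computer and information sciences/Information technology/Data mining'],\n",
"    'article-2': ['/Computer and information sciences/Information technology/Databases/Relational databases',\n",
"                  '/Computer and information sciences/Information technology/Data mining'],\n",
"}\n",
"tree = build_tree(sample)\n",
"print(json.dumps(tree, indent=1))\n",
"```\n",
"\n",
"Counting distinct article ids per node, rather than summing the counts of its children, avoids double counting an article that carries several subjects under the same branch."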
] }, { "cell_type": "code", "collapsed": false, "input": [ "# Import article data\n", "df = pd.read_pickle('../data/all_plos_df.pkl')\n", "\n", "# Drop unused data\n", "df.drop(['author', 'title_display', 'journal', 'abstract', 'publication_date', 'score'], axis=1, inplace=True)\n", "df.set_index('id', inplace=True)\n", "df.head()" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "<table>\n",
"  <thead>\n",
"    <tr><th>id</th><th>subject</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>10.1371/journal.pone.0008858</th><td>[/Biology and life sciences/Biochemistry/Prote...</td></tr>\n",
"    <tr><th>10.1371/journal.pone.0004722</th><td>[/Biology and life sciences/Cell biology/Cellu...</td></tr>\n",
"    <tr><th>10.1371/journal.pone.0076865</th><td>[/Biology and life sciences/Biochemistry/DNA/D...</td></tr>\n",
"    <tr><th>10.1371/journal.pbio.0040157</th><td>[/Research and analysis methods/Research asses...</td></tr>\n",
"    <tr><th>10.1371/journal.pone.0080851</th><td>[/Biology and life sciences/Biochemistry/Prote...</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>5 rows \u00d7 1 columns</p>\n", "