{ "metadata": { "name": "", "signature": "sha256:6d14f92bdff26eefbe3be93f0262e35304958ddcab6570dd624168a2d5567e61" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "PLOS Cloud Explorer: The Process" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook is about our process for figuring out what PLOS Cloud Explorer was going to be. It includes early code, prototypes, and dead ends.\n", "\n", "For the full story including the happy ending, read [this document](https://github.com/cmgerber/PLOS_Cloud_Explorer/blob/master/README.md) and follow the other notebook links to see the code we actually used." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First things first. All imports for this notebook:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from __future__ import unicode_literals\n", "\n", "# You need an API Key for PLOS\n", "import settings\n", "\n", "# Data analysis\n", "import numpy as np\n", "import pandas as pd\n", "from numpy import nan\n", "from pandas import Series, DataFrame\n", "\n", "# Interacting with API\n", "import requests\n", "import urllib\n", "import time\n", "from retrying import retry\n", "import os\n", "import random\n", "import json\n", "\n", "# Natural language processing\n", "import nltk\n", "from nltk.collocations import BigramCollocationFinder\n", "from nltk.metrics import BigramAssocMeasures\n", "from nltk.corpus import stopwords\n", "import string\n", "\n", "# For the IPython widgets:\n", "from IPython.display import display, Image, HTML, clear_output\n", "from IPython.html import widgets\n", "from jinja2 import Template" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Data Collection" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We began with a really simple way of getting article data from the 
PLOS Search API:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "r = requests.get('http://api.plos.org/search?q=subject:\"biotechnology\"&start=0&rows=500&api_key={%s}&wt=json' % settings.PLOS_KEY).json()\n", "len(r['response']['docs'])" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 2, "text": [ "500" ] } ], "prompt_number": 2 }, { "cell_type": "code", "collapsed": false, "input": [ "# Write out a file.\n", "with open('biotech500.json', 'wb') as fp:\n", " json.dump(r, fp)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We later developed a much more sophisticated way to get huge amounts of data from the API. To see how we collected data sets, see the \n", "[batch data collection notebook](http://nbviewer.ipython.org/github/cmgerber/PLOS_Cloud_Explorer/blob/master/ipython_notebooks/Batch_data_collection_full.ipynb)." ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Exploring Output" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we show what the output looks like, from a previously run API query. Through the magic of Python, we can pickle the resulting DataFrame and access it again now without making any API calls." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "abstract_df = pd.read_pickle('../data/abstract_df.pkl')" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 8 }, { "cell_type": "code", "collapsed": false, "input": [ "len(list(abstract_df.author))" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 9, "text": [ "1120" ] } ], "prompt_number": 9 }, { "cell_type": "code", "collapsed": false, "input": [ "print list(abstract_df.subject)[0]" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[u'/Computer and information sciences/Information technology/Data processing', u'/Computer and information sciences/Information technology/Data reduction', u'/Physical sciences/Mathematics/Statistics (mathematics)/Statistical methods', u'/Research and analysis methods/Mathematical and statistical techniques/Statistical methods', u'/Computer and information sciences/Information technology/Databases', u'/Physical sciences/Mathematics/Statistics (mathematics)/Statistical data', u'/Computer and information sciences/Computer architecture/User interfaces', u'/Medicine and health sciences/Infectious diseases/Infectious disease control', u'/Computer and information sciences/Data management']\n" ] } ], "prompt_number": 10 }, { "cell_type": "code", "collapsed": false, "input": [ "abstract_df.tail()" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
| | abstract | author | id | journal | publication_date | score | subject | title_display |
|---|---|---|---|---|---|---|---|---|
| 15 | [\\nPopulation structure can confound the ident... | [Jonathan Carlson, Carl Kadie, Simon Mallal, D... | 10.1371/journal.pone.0000591 | PLoS ONE | 2007-07-04T00:00:00Z | 0.443733 | [/Biology and life sciences/Genetics/Phenotype... | Leveraging Hierarchical Population Structure i... |
| 16 | [\\n The discrimination of thatcherized ... | [Nick Donnelly, Nicole R Zürcher, Katherine Co... | 10.1371/journal.pone.0023340 | PLoS ONE | 2011-08-31T00:00:00Z | 0.443733 | [/Medicine and health sciences/Diagnostic medi... | Discriminating Grotesque from Typical Faces: E... |
| 17 | [\\nInfluenza viruses have been responsible for... | [Zhipeng Cai, Tong Zhang, Xiu-Feng Wan] | 10.1371/journal.pcbi.1000949 | PLoS Computational Biology | 2010-10-07T00:00:00Z | 0.443733 | [/Biology and life sciences/Organisms/Viruses/... | A Computational Framework for Influenza Antige... |
| 18 | [\\n Based on previous evidence for indi... | [Luis F H Basile, João R Sato, Milkes Y Alvare... | 10.1371/journal.pone.0059595 | PLoS ONE | 2013-03-27T00:00:00Z | 0.443733 | [/Medicine and health sciences/Diagnostic medi... | Lack of Systematic Topographic Difference betw... |
| 19 | [Objective: Herpes simplex virus type 2 (HSV-2... | [Alison C Roxby, Alison L Drake, Francisca Ong... | 10.1371/journal.pone.0038622 | PLoS ONE | 2012-06-12T00:00:00Z | 0.443733 | [/Medicine and health sciences/Women's health/... | Effects of Valacyclovir on Markers of Disease ... |

5 rows × 8 columns

| | abstract | article_type | author_display | eissn | id | journal | publication_date | score | title_display |
|---|---|---|---|---|---|---|---|---|---|
| 7 | [\\nThe objective of this paper is to assess th... | Research Article | [Latifah Amin, Md. Abul Kalam Azad, Mohd Hanaf... | 1932-6203 | 10.1371/journal.pone.0086174 | PLoS ONE | 2014-01-29T00:00:00Z | 1.211935 | Determinants of Public Attitudes to Geneticall... |
| 16 | [\\n Atrazine (ATZ) and S-metolachlor (S... | Research Article | [Cristina A. Viegas, Catarina Costa, Sandra An... | 1932-6203 | 10.1371/journal.pone.0037140 | PLoS ONE | 2012-05-15T00:00:00Z | 1.119538 | Does <i>S</i>-Metolachlor Affect the Performan... |
| 17 | [\\nDue to environmental persistence and biotox... | Research Article | [Yonggang Yang, Meiying Xu, Zhili He, Jun Guo,... | 1932-6203 | 10.1371/journal.pone.0070686 | PLoS ONE | 2013-08-05T00:00:00Z | 1.119538 | Microbial Electricity Generation Enhances Deca... |
| 34 | [\\n Intensive use of chlorpyrifos has r... | Research Article | [Shaohua Chen, Chenglan Liu, Chuyan Peng, Hong... | 1932-6203 | 10.1371/journal.pone.0047205 | NaN | 2012-10-08T00:00:00Z | 1.119538 | Biodegradation of Chlorpyrifos and Its Hydroly... |
| 35 | [Background: The complex characteristics and u... | Research Article | [Zhongbo Zhou, Fangang Meng, So-Ryong Chae, Gu... | 1932-6203 | 10.1371/journal.pone.0042270 | NaN | 2012-08-09T00:00:00Z | 0.989541 | Microbial Transformation of Biomacromolecules ... |

5 rows × 9 columns

| | author_display | eissn | id | journal | publication_date | words |
|---|---|---|---|---|---|---|
| 7 | [Latifah Amin, Md. Abul Kalam Azad, Mohd Hanaf... | 1932-6203 | 10.1371/journal.pone.0086174 | PLoS ONE | 2014-01-29T00:00:00Z | [objective, paper, assess, attitude, malaysian... |
| 16 | [Cristina A. Viegas, Catarina Costa, Sandra An... | 1932-6203 | 10.1371/journal.pone.0037140 | PLoS ONE | 2012-05-15T00:00:00Z | [atrazine, atz, metolachlor, met, two, herbici... |
| 17 | [Yonggang Yang, Meiying Xu, Zhili He, Jun Guo,... | 1932-6203 | 10.1371/journal.pone.0070686 | PLoS ONE | 2013-08-05T00:00:00Z | [due, environmental, persistence, biotoxicity,... |
| 34 | [Shaohua Chen, Chenglan Liu, Chuyan Peng, Hong... | 1932-6203 | 10.1371/journal.pone.0047205 | NaN | 2012-10-08T00:00:00Z | [intensive, use, chlorpyrifos, resulted, ubiqu... |
| 35 | [Zhongbo Zhou, Fangang Meng, So-Ryong Chae, Gu... | 1932-6203 | 10.1371/journal.pone.0042270 | NaN | 2012-08-09T00:00:00Z | [background, complex, characteristics, unclear... |

5 rows × 6 columns

| | text |
|---|---|
| 0 | objective paper assess attitude malaysian stak... |
| 1 | atrazine atz metolachlor met two herbicides wi... |
| 2 | due environmental persistence biotoxicity poly... |
| 3 | intensive use chlorpyrifos resulted ubiquitous... |
| 4 | background complex characteristics unclear bio... |

5 rows × 1 columns

| | text |
|---|---|
| 0 | among developed attitude paper identify accept... |
| 1 | aquatic mineralization dose experiments still ... |
| 2 | mfc hypothesized distinctly results nitrogen s... |
| 3 | fungal contaminant tcp accumulative gc morphol... |
| 4 | origin humic mineralization show mainly result... |

5 rows × 1 columns

| | word | count |
|---|---|---|
| 0 | study | 38 |
| 1 | two | 23 |
| 2 | using | 21 |
| 3 | results | 20 |
| 4 | three | 20 |
| 5 | analysis | 20 |
| 6 | compared | 17 |
| 7 | used | 16 |
| 8 | higher | 16 |
| 9 | may | 16 |
| 10 | non | 15 |
| 11 | based | 15 |
| 12 | significantly | 14 |
| 13 | also | 14 |
| 14 | however | 14 |

15 rows × 2 columns

| | abstract | publication_date |
|---|---|---|
| 7 | [objective, paper, assess, attitude, malaysian... | 2014-01-29 |
| 16 | [atrazine, atz, metolachlor, met, two, herbici... | 2012-05-15 |
| 17 | [due, environmental, persistence, biotoxicity,... | 2013-08-05 |
| 34 | [intensive, use, chlorpyrifos, resulted, ubiqu... | 2012-10-08 |
| 35 | [background, complex, characteristics, unclear... | 2012-08-09 |

5 rows × 2 columns

| publication_date | abstract |
|---|---|
| 2014-01-29 | [objective, paper, assess, attitude, malaysian... |
| 2012-05-15 | [atrazine, atz, metolachlor, met, two, herbici... |
| 2013-08-05 | [due, environmental, persistence, biotoxicity,... |
| 2012-10-08 | [intensive, use, chlorpyrifos, resulted, ubiqu... |
| 2012-08-09 | [background, complex, characteristics, unclear... |

5 rows × 1 columns

| publication_date | abstract |
|---|---|
| 2008-04 | [according, world, health, organization, repor... |
| 2008-05 | [according, world, health, organization, repor... |
| 2008-06 | [according, world, health, organization, repor... |
| 2008-07 | [according, world, health, organization, repor... |
| 2008-08 | [according, world, health, organization, repor... |

5 rows × 1 columns
\n", "