{ "metadata": { "name": "", "signature": "sha256:6d14f92bdff26eefbe3be93f0262e35304958ddcab6570dd624168a2d5567e61" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "PLOS Cloud Explorer: The Process" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook is about our process for figuring out what PLOS Cloud Explorer was going to be. It includes early code, prototypes, and dead ends.\n", "\n", "For the full story including the happy ending, read [this document](https://github.com/cmgerber/PLOS_Cloud_Explorer/blob/master/README.md) and follow the other notebook links to see the code we actually used." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First things first. All imports for this notebook:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from __future__ import unicode_literals\n", "\n", "# You need an API Key for PLOS\n", "import settings\n", "\n", "# Data analysis\n", "import numpy as np\n", "import pandas as pd\n", "from numpy import nan\n", "from pandas import Series, DataFrame\n", "\n", "# Interacting with API\n", "import requests\n", "import urllib\n", "import time\n", "from retrying import retry\n", "import os\n", "import random\n", "import json\n", "\n", "# Natural language processing\n", "import nltk\n", "from nltk.collocations import BigramCollocationFinder\n", "from nltk.metrics import BigramAssocMeasures\n", "from nltk.corpus import stopwords\n", "import string\n", "\n", "# For the IPython widgets:\n", "from IPython.display import display, Image, HTML, clear_output\n", "from IPython.html import widgets\n", "from jinja2 import Template" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Data Collection" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We began with a really simple way of getting article data from the 
PLOS Search API:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "r = requests.get('http://api.plos.org/search?q=subject:\"biotechnology\"&start=0&rows=500&api_key={%s}&wt=json' % settings.PLOS_KEY).json()\n", "len(r['response']['docs'])" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 2, "text": [ "500" ] } ], "prompt_number": 2 }, { "cell_type": "code", "collapsed": false, "input": [ "# Write out a file.\n", "with open('biotech500.json', 'wb') as fp:\n", " json.dump(r, fp)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We later developed a much more sophisticated way to get huge amounts of data from the API. To see how we collected data sets, see the \n", "[batch data collection notebook](http://nbviewer.ipython.org/github/cmgerber/PLOS_Cloud_Explorer/blob/master/ipython_notebooks/Batch_data_collection_full.ipynb)." ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Exploring Output" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we show what the output looks like, from a previously run API query. Through the magic of Python, we can pickle the resulting DataFrame and access it again now without making any API calls." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "abstract_df = pd.read_pickle('../data/abstract_df.pkl')" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 8 }, { "cell_type": "code", "collapsed": false, "input": [ "len(list(abstract_df.author))" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 9, "text": [ "1120" ] } ], "prompt_number": 9 }, { "cell_type": "code", "collapsed": false, "input": [ "print list(abstract_df.subject)[0]" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[u'/Computer and information sciences/Information technology/Data processing', u'/Computer and information sciences/Information technology/Data reduction', u'/Physical sciences/Mathematics/Statistics (mathematics)/Statistical methods', u'/Research and analysis methods/Mathematical and statistical techniques/Statistical methods', u'/Computer and information sciences/Information technology/Databases', u'/Physical sciences/Mathematics/Statistics (mathematics)/Statistical data', u'/Computer and information sciences/Computer architecture/User interfaces', u'/Medicine and health sciences/Infectious diseases/Infectious disease control', u'/Computer and information sciences/Data management']\n" ] } ], "prompt_number": 10 }, { "cell_type": "code", "collapsed": false, "input": [ "abstract_df.tail()" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 11, "text": [ " abstract \\\n", "15 [\\nPopulation structure can confound the ident... \n", "16 [\\n The discrimination of thatcherized ... \n", "17 [\\nInfluenza viruses have been responsible for... \n", "18 [\\n Based on previous evidence for indi... \n", "19 [Objective: Herpes simplex virus type 2 (HSV-2... \n", "\n", " author \\\n", "15 [Jonathan Carlson, Carl Kadie, Simon Mallal, D... \n", "16 [Nick Donnelly, Nicole R Z\u00fcrcher, Katherine Co... \n", "17 [Zhipeng Cai, Tong Zhang, Xiu-Feng Wan] \n", "18 [Luis F H Basile, Jo\u00e3o R Sato, Milkes Y Alvare... \n", "19 [Alison C Roxby, Alison L Drake, Francisca Ong... \n", "\n", " id journal \\\n", "15 10.1371/journal.pone.0000591 PLoS ONE \n", "16 10.1371/journal.pone.0023340 PLoS ONE \n", "17 10.1371/journal.pcbi.1000949 PLoS Computational Biology \n", "18 10.1371/journal.pone.0059595 PLoS ONE \n", "19 10.1371/journal.pone.0038622 PLoS ONE \n", "\n", " publication_date score \\\n", "15 2007-07-04T00:00:00Z 0.443733 \n", "16 2011-08-31T00:00:00Z 0.443733 \n", "17 2010-10-07T00:00:00Z 0.443733 \n", "18 2013-03-27T00:00:00Z 0.443733 \n", "19 2012-06-12T00:00:00Z 0.443733 \n", "\n", " subject \\\n", "15 [/Biology and life sciences/Genetics/Phenotype... \n", "16 [/Medicine and health sciences/Diagnostic medi... \n", "17 [/Biology and life sciences/Organisms/Viruses/... \n", "18 [/Medicine and health sciences/Diagnostic medi... \n", "19 [/Medicine and health sciences/Women's health/... \n", "\n", " title_display \n", "15 Leveraging Hierarchical Population Structure i... \n", "16 Discriminating Grotesque from Typical Faces: E... \n", "17 A Computational Framework for Influenza Antige... \n", "18 Lack of Systematic Topographic Difference betw... \n", "19 Effects of Valacyclovir on Markers of Disease ... 
\n", "\n", "[5 rows x 8 columns]" ] } ], "prompt_number": 11 }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Initial attempts to make word clouds using abstracts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We wanted to use basic natural language processing (NLP) to make word clouds out of aggregated abstract text, and see how they change over time.\n", "\n", "NB: These examples use a previously collected dataset that's different and smaller than the one we generated above." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Globally define a set of stopwords.\n", "stops = set(stopwords.words('english'))\n", "# We can add science-y stuff to it as well. Just an example:\n", "stops.add('conclusions')\n", "\n", "\n", "def wordify(abs_list, min_word_len=2):\n", "    '''\n", "    Convert the abstract field from PLoS API data to a filtered list of words.\n", "    '''\n", "\n", "    # The abstract field is a list. Make it a string.\n", "    text = ' '.join(abs_list).strip(' \\n\\t')\n", "\n", "    if text == '':\n", "        return nan\n", "\n", "    else:\n", "        # Remove punctuation & replace with space,\n", "        # because we want 'metal-contaminated' => 'metal contaminated'\n", "        # ...not 'metalcontaminated', and so on.\n", "        for c in string.punctuation:\n", "            text = text.replace(c, ' ')\n", "\n", "        # Now make it a Series of words, and do some cleaning.\n", "        words = Series(text.split(' '))\n", "        words = words.str.lower()\n", "        # Filter out words less than minimum word length.\n", "        words = words[words.str.len() >= min_word_len]\n", "        # Drop any word containing a character other than a-z, '#' or '@'\n", "        # (for example digits or non-ASCII letters).\n", "        words = words[~words.str.contains(r'[^#@a-z]')]\n", "\n", "        # Filter out globally-defined stopwords\n", "        ignore = stops & set(words.unique())\n", "        words_out = [w for w in words.tolist() if w not in ignore]\n", "\n", "        return words_out\n" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 12 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Load up some data." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "with open('biotech500.json', 'rb') as fp:\n", " data = json.load(fp)\n", " \n", "articles_list = data['response']['docs']\n", "articles = DataFrame(articles_list)\n", "articles = articles[articles['abstract'].notnull()]\n", "articles.head()" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 13, "text": [ " abstract article_type \\\n", "7 [\\nThe objective of this paper is to assess th... Research Article \n", "16 [\\n Atrazine (ATZ) and S-metolachlor (S... Research Article \n", "17 [\\nDue to environmental persistence and biotox... Research Article \n", "34 [\\n Intensive use of chlorpyrifos has r... Research Article \n", "35 [Background: The complex characteristics and u... Research Article \n", "\n", " author_display eissn \\\n", "7 [Latifah Amin, Md. Abul Kalam Azad, Mohd Hanaf... 1932-6203 \n", "16 [Cristina A. Viegas, Catarina Costa, Sandra An... 1932-6203 \n", "17 [Yonggang Yang, Meiying Xu, Zhili He, Jun Guo,... 1932-6203 \n", "34 [Shaohua Chen, Chenglan Liu, Chuyan Peng, Hong... 1932-6203 \n", "35 [Zhongbo Zhou, Fangang Meng, So-Ryong Chae, Gu... 1932-6203 \n", "\n", " id journal publication_date score \\\n", "7 10.1371/journal.pone.0086174 PLoS ONE 2014-01-29T00:00:00Z 1.211935 \n", "16 10.1371/journal.pone.0037140 PLoS ONE 2012-05-15T00:00:00Z 1.119538 \n", "17 10.1371/journal.pone.0070686 PLoS ONE 2013-08-05T00:00:00Z 1.119538 \n", "34 10.1371/journal.pone.0047205 NaN 2012-10-08T00:00:00Z 1.119538 \n", "35 10.1371/journal.pone.0042270 NaN 2012-08-09T00:00:00Z 0.989541 \n", "\n", " title_display \n", "7 Determinants of Public Attitudes to Geneticall... \n", "16 Does S-Metolachlor Affect the Performan... \n", "17 Microbial Electricity Generation Enhances Deca... \n", "34 Biodegradation of Chlorpyrifos and Its Hydroly... \n", "35 Microbial Transformation of Biomacromolecules ... 
\n", "\n", "[5 rows x 9 columns]" ] } ], "prompt_number": 13 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Applying this to the whole DataFrame of articles" ] }, { "cell_type": "code", "collapsed": false, "input": [ "articles['words'] = articles.apply(lambda s: wordify(s['abstract'] + [s['title_display']]), axis=1)\n", "articles.drop(['article_type', 'score', 'title_display', 'abstract'], axis=1, inplace=True)\n", "articles.head()" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 14, "text": [ " author_display eissn \\\n", "7 [Latifah Amin, Md. Abul Kalam Azad, Mohd Hanaf... 1932-6203 \n", "16 [Cristina A. Viegas, Catarina Costa, Sandra An... 1932-6203 \n", "17 [Yonggang Yang, Meiying Xu, Zhili He, Jun Guo,... 1932-6203 \n", "34 [Shaohua Chen, Chenglan Liu, Chuyan Peng, Hong... 1932-6203 \n", "35 [Zhongbo Zhou, Fangang Meng, So-Ryong Chae, Gu... 1932-6203 \n", "\n", " id journal publication_date \\\n", "7 10.1371/journal.pone.0086174 PLoS ONE 2014-01-29T00:00:00Z \n", "16 10.1371/journal.pone.0037140 PLoS ONE 2012-05-15T00:00:00Z \n", "17 10.1371/journal.pone.0070686 PLoS ONE 2013-08-05T00:00:00Z \n", "34 10.1371/journal.pone.0047205 NaN 2012-10-08T00:00:00Z \n", "35 10.1371/journal.pone.0042270 NaN 2012-08-09T00:00:00Z \n", "\n", " words \n", "7 [objective, paper, assess, attitude, malaysian... \n", "16 [atrazine, atz, metolachlor, met, two, herbici... \n", "17 [due, environmental, persistence, biotoxicity,... \n", "34 [intensive, use, chlorpyrifos, resulted, ubiqu... \n", "35 [background, complex, characteristics, unclear... \n", "\n", "[5 rows x 6 columns]" ] } ], "prompt_number": 14 }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Doing some natural language processing" ] }, { "cell_type": "code", "collapsed": false, "input": [ "abs_df = DataFrame(articles['words'].apply(lambda x: ' '.join(x)).tolist(), columns=['text'])\n", "abs_df.head()" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 15, "text": [ " text\n", "0 objective paper assess attitude malaysian stak...\n", "1 atrazine atz metolachlor met two herbicides wi...\n", "2 due environmental persistence biotoxicity poly...\n", "3 intensive use chlorpyrifos resulted ubiquitous...\n", "4 background complex characteristics unclear bio...\n", "\n", "[5 rows x 1 columns]" ] } ], "prompt_number": 15 }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Common word pairs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This section uses all words from abstracts to find the common word pairs." ] }, { "cell_type": "code", "collapsed": false, "input": [ "#include all words from abstracts for getting common word pairs\n", "words_all = pd.Series(' '.join(abs_df['text']).split(' '))\n", "words_all.value_counts()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 16, "text": [ "study 56\n", "using 33\n", "two 32\n", "patients 31\n", "biodegradation 30\n", "non 29\n", "data 28\n", "three 28\n", "analysis 27\n", "compared 27\n", "soil 27\n", "new 27\n", "results 26\n", "species 25\n", "cell 25\n", "...\n", "engage 1\n", "thermal 1\n", "geochip 1\n", "dominant 1\n", "suggests 1\n", "third 1\n", "usually 1\n", "locomotion 1\n", "rpos 1\n", "scales 1\n", "prefer 1\n", "quite 1\n", "protocatechuate 1\n", "routine 1\n", "agr 1\n", "Length: 3028, dtype: int64" ] } ], "prompt_number": 16 }, { "cell_type": "code", "collapsed": false, "input": [ "relevant_words_pairs = words_all.copy()\n", "relevant_words_pairs.value_counts()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 17, "text": [ "study 56\n", "using 33\n", "two 32\n", "patients 31\n", "biodegradation 30\n", "non 29\n", "data 28\n", "three 28\n", "analysis 27\n", "compared 27\n", "soil 27\n", "new 27\n", "results 26\n", "species 25\n", "cell 
25\n", "...\n", "engage 1\n", "thermal 1\n", "geochip 1\n", "dominant 1\n", "suggests 1\n", "third 1\n", "usually 1\n", "locomotion 1\n", "rpos 1\n", "scales 1\n", "prefer 1\n", "quite 1\n", "protocatechuate 1\n", "routine 1\n", "agr 1\n", "Length: 3028, dtype: int64" ] } ], "prompt_number": 17 }, { "cell_type": "code", "collapsed": false, "input": [ "bcf = BigramCollocationFinder.from_words(relevant_words_pairs)\n", "for pair in bcf.nbest(BigramAssocMeasures.likelihood_ratio, 30):\n", " print ' '.join(pair)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "synthetic biology\n", "spider silk\n", "es cell\n", "adjacent segment\n", "medical imaging\n", "dp dtmax\n", "security privacy\n", "industry backgrounds\n", "removal initiation\n", "uv irradiated\n", "gm salmon\n", "persistent crsab\n", "antimicrobial therapy\n", "limb amputation\n", "cellular phone\n", "wireless powered\n", "minimally invasive\n", "phone technology\n", "heavy metals\n", "battery powered\n", "composite mesh\n", "frequency currents\n", "genetically modified\n", "tissue engineering\n", "catheter removal\n", "acting reversible\n", "brassica napus\n", "brown streak\n", "quasi stiffness\n", "data code\n" ] } ], "prompt_number": 18 }, { "cell_type": "code", "collapsed": false, "input": [ "bcf.nbest(BigramAssocMeasures.likelihood_ratio, 20)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 19, "text": [ "[(u'synthetic', u'biology'),\n", " (u'spider', u'silk'),\n", " (u'es', u'cell'),\n", " (u'adjacent', u'segment'),\n", " (u'medical', u'imaging'),\n", " (u'dp', u'dtmax'),\n", " (u'security', u'privacy'),\n", " (u'industry', u'backgrounds'),\n", " (u'removal', u'initiation'),\n", " (u'uv', u'irradiated'),\n", " (u'gm', u'salmon'),\n", " (u'persistent', u'crsab'),\n", " (u'antimicrobial', u'therapy'),\n", " (u'limb', u'amputation'),\n", " (u'cellular', u'phone'),\n", " 
(u'wireless', u'powered'),\n", " (u'minimally', u'invasive'),\n", " (u'phone', u'technology'),\n", " (u'heavy', u'metals'),\n", " (u'battery', u'powered')]" ] } ], "prompt_number": 19 }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Making word clouds: select the top words" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, we take only the unique words from each abstract, so each word is counted at most once per article." ] }, { "cell_type": "code", "collapsed": false, "input": [ "abs_set_df = DataFrame(articles['words'].apply(lambda x: ' '.join(set(x))).tolist(), columns=['text'])\n", "abs_set_df.head()" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 20, "text": [ " text\n", "0 among developed attitude paper identify accept...\n", "1 aquatic mineralization dose experiments still ...\n", "2 mfc hypothesized distinctly results nitrogen s...\n", "3 fungal contaminant tcp accumulative gc morphol...\n", "4 origin humic mineralization show mainly result...\n", "\n", "[5 rows x 1 columns]" ] } ], "prompt_number": 20 }, { "cell_type": "code", "collapsed": false, "input": [ "words = pd.Series(' '.join(abs_set_df['text']).split(' '))\n", "words.value_counts()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 21, "text": [ "study 38\n", "two 23\n", "using 21\n", "results 20\n", "three 20\n", "analysis 20\n", "compared 17\n", "used 16\n", "higher 16\n", "may 16\n", "non 15\n", "based 15\n", "significantly 14\n", "also 14\n", "however 14\n", "...\n", "septal 1\n", "recommendations 1\n", "genomes 1\n", "poking 1\n", "gck 1\n", "optimised 1\n", "varied 1\n", "counting 1\n", "monitoring 1\n", "malware 1\n", "tmc 1\n", "rape 1\n", "occur 1\n", "conversely 1\n", "cda 1\n", "Length: 3028, dtype: int64" ] } ], "prompt_number": 21 }, { "cell_type": "code", "collapsed": false, "input": [ "top_words = words.value_counts().reset_index()\n", "top_words.columns = ['word', 'count']\n", "top_words.head(15)" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 22, "text": [ " word count\n", "0 study 38\n", "1 two 23\n", "2 using 21\n", "3 results 20\n", "4 three 20\n", "5 analysis 20\n", "6 compared 17\n", "7 used 16\n", "8 higher 16\n", "9 may 16\n", "10 non 15\n", "11 based 15\n", "12 significantly 14\n", "13 also 14\n", "14 however 14\n", "\n", "[15 rows x 2 columns]" ] } ], "prompt_number": 22 }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Exporting word count data as CSV for D3 word-cloudification" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# top_words.to_csv('../wordcloud2.csv', index=False)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 23 }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Initial word cloud results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When we created the word clouds, we noticed that the most common words in these article abstracts (*study*, *two*, *using*, *results*) are generic research vocabulary rather than topic-specific terms.\n", "\n", "![cloud](../wordcloud_example_old.jpg)" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Change over time: working with article abstracts as time series data" ] }, { "cell_type": "code", "collapsed": false, "input": [ "articles_list = data['response']['docs']\n", "articles = DataFrame(articles_list)\n", "articles = articles[articles['abstract'].notnull()].ix[:,['abstract', 'publication_date']]\n", "# Pass min_word_len by keyword; a bare positional 3 would bind to Series.apply's convert_dtype parameter.\n", "articles.abstract = articles.abstract.apply(wordify, min_word_len=3)\n", "articles = articles[articles['abstract'].notnull()]\n", "articles.publication_date = pd.to_datetime(articles.publication_date)\n", "articles.head()" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 24, "text": [ " abstract publication_date\n", "7 [objective, paper, assess, attitude, malaysian... 2014-01-29\n", "16 [atrazine, atz, metolachlor, met, two, herbici... 2012-05-15\n", "17 [due, environmental, persistence, biotoxicity,... 2013-08-05\n", "34 [intensive, use, chlorpyrifos, resulted, ubiqu... 2012-10-08\n", "35 [background, complex, characteristics, unclear... 2012-08-09\n", "\n", "[5 rows x 2 columns]" ] } ], "prompt_number": 24 }, { "cell_type": "code", "collapsed": false, "input": [ "print articles.publication_date.min(), articles.publication_date.max()\n", "print len(articles)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "2008-04-30 00:00:00 2014-04-11 00:00:00\n", "57\n" ] } ], "prompt_number": 25 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The time series spans ~6 years but has only 57 irregularly spaced data points. **We need to resample!**\n", "\n", "There are probably many ways to do this..." ] }, { "cell_type": "code", "collapsed": false, "input": [ "articles_timed = articles.set_index('publication_date')\n", "articles_timed.head()" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 26, "text": [ " abstract\n", "publication_date \n", "2014-01-29 [objective, paper, assess, attitude, malaysian...\n", "2012-05-15 [atrazine, atz, metolachlor, met, two, herbici...\n", "2013-08-05 [due, environmental, persistence, biotoxicity,...\n", "2012-10-08 [intensive, use, chlorpyrifos, resulted, ubiqu...\n", "2012-08-09 [background, complex, characteristics, unclear...\n", "\n", "[5 rows x 1 columns]" ] } ], "prompt_number": 26 }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Using pandas time series resampling functions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Resampling with the `sum` aggregation method works because every value is a list of words, and adding Python lists concatenates them. The three abstracts published in 2013-05, for example, were concatenated together (see below)." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Sum (i.e. concatenate) the word lists within each calendar month.\n", "articles_monthly = articles_timed.resample('M', how='sum', fill_method='ffill', kind='period')\n", "# Months with no articles show up as 0 here; mark them as missing,\n", "# then carry the previous month's words forward.\n", "articles_monthly.abstract = articles_monthly.abstract.apply(lambda x: np.nan if x == 0 else x)\n", "articles_monthly.fillna(method='ffill', inplace=True)\n", "articles_monthly.head()" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 27, "text": [ " abstract\n", "publication_date \n", "2008-04 [according, world, health, organization, repor...\n", "2008-05 [according, world, health, organization, repor...\n", "2008-06 [according, world, health, organization, repor...\n", "2008-07 [according, world, health, organization, repor...\n", "2008-08 [according, world, health, organization, repor...\n", "\n", "[5 rows x 1 columns]" ] } ], "prompt_number": 27 }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Making a time slider for abstract text" ] }, { "cell_type": "code", "collapsed": false, "input": [ "widgetmax = len(articles_monthly) - 1\n", "\n", "def textbarf(t): \n", " html_template = \"\"\"\n", " \n", "
{{blargh}}
\"\"\"\n", "\n", " blob = ' '.join(articles_monthly.ix[t]['abstract'])\n", " html_src = Template(html_template).render(blargh=blob)\n", " display(HTML(html_src))\n" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 28 }, { "cell_type": "code", "collapsed": false, "input": [ "widgets.interact(textbarf,\n", " t=widgets.IntSliderWidget(min=0,max=widgetmax,step=1,value=42),\n", " )" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "\n", " \n", "
concerns regarding commercial release genetically engineered ge crops include naturalization introgression sexually compatible relatives transfer beneficial traits native weedy species hybridization date documented reports escape leading researchers question environmental risks biotech products study conducted systematic roadside survey canola brassica napus populations growing outside cultivation north dakota usa dominant canola growing region document presence two escaped transgenic genotypes well non ge canola provide evidence novel combinations transgenic forms wild results demonstrate feral populations large widespread moreover flowering times escaped populations well fertile condition majority collections suggest populations established persistent outside cultivation
" ], "metadata": {}, "output_type": "display_data", "text": [ "" ] }, { "metadata": {}, "output_type": "pyout", "prompt_number": 29, "text": [ "" ] } ], "prompt_number": 29 } ], "metadata": {} } ] }