{ "cells": [ { "cell_type": "markdown", "metadata": { "focus": false, "id": "e8ba98da-bfce-4650-b83e-4eb88f375205" }, "source": [ "# Parsing Pubmed XML file\n", "\n", "Data sets can be found [here](https://github.com/kescobo/gender-comp-bio/tree/master/data).\n", "\n", "## Goals\n", "\n", "Raw data from pubmed is contained in xml files, and we'd like to extract author and date information into a spreadsheet for easier analysis. \n", "\n", "The first thing to do is to make sure to set the working directory to where the data is." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false, "focus": false, "id": "7cfca9c8-3ad5-44a6-b65a-03b1ef30f66e" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/Users/ksb/computation/science/gender-comp-bio/notebooks\n" ] } ], "source": [ "import os\n", "\n", "print(os.getcwd())" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false, "focus": false, "id": "94756e02-50fa-415c-87b6-f3fcadeefb1b" }, "outputs": [ { "data": { "text/plain": [ "['arxiv-bio-data.xml', 'arxiv-data.xml', 'bio.xml', 'comp.xml']" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "os.chdir(\"../data/pubdata\")\n", "os.listdir()" ] }, { "cell_type": "markdown", "metadata": { "focus": false, "id": "b0954aba-f77d-4e3c-839a-282cc275ec6d" }, "source": [ "Next, we'll need to parse the xml files. Since several of the data fles are huge, we don't want to use the [python xml module](https://docs.python.org/3.5/library/xml.etree.elementtree.html), which would require loading the entire contents of the file into memory. Instead, we'll use [`lxml.etree.iterparse()`](http://effbot.org/zone/element-iterparse.htm), which will allow us to grab one article at a time, grab its info, then clear it from memory." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false, "focus": false, "id": "5b2040ba-4302-4ad6-a2ce-23bf40975829" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "26605382, IEEE/ACM Trans Comput Biol Bioinform:[('Huang', 'Yufei'), ('Chen', 'Yidong'), ('Qian', 'Xiaoning')]\n", "26357062, IEEE/ACM Trans Comput Biol Bioinform:[('Wang', 'Haiying'), ('Zheng', 'Huiru')]\n", "26357061, IEEE/ACM Trans Comput Biol Bioinform:[('Loohuis', 'Loes Olde'), ('Witzel', 'Andreas'), ('Mishra', 'Bud')]\n", "26357060, IEEE/ACM Trans Comput Biol Bioinform:[('Kobayashi', 'Koichi'), ('Hiraishi', 'Kunihiko')]\n", "26357059, IEEE/ACM Trans Comput Biol Bioinform:[('Ahmed', 'Hasin Afzal'), ('Mahanta', 'Priyakshi'), ('Bhattacharyya', 'Dhruba Kumar'), ('Kalita', 'Jugal Kumar')]\n", "26357058, IEEE/ACM Trans Comput Biol Bioinform:[('Disanto', 'Filippo'), ('Rosenberg', 'Noah A')]\n" ] } ], "source": [ "import lxml.etree as ET\n", "import datetime\n", "\n", "i = 0\n", "for event, element in ET.iterparse('comp.xml', tag=\"PubmedArticle\", events=(\"end\",)):\n", " i+=1\n", " element.xpath('.//DateCreated/Year')[0].text\n", "\n", " pmid = element.xpath('.//PMID')[0].text\n", " pubdate = datetime.date(\n", " int(element.xpath('.//DateCreated/Year')[0].text), # year\n", " int(element.xpath('.//DateCreated/Month')[0].text), # month\n", " int(element.xpath('.//DateCreated/Day')[0].text), #day\n", " )\n", "\n", "\n", " journal = element.xpath('.//Journal//ISOAbbreviation')\n", " if journal:\n", " journal = journal[0].text\n", " else:\n", " journal = None\n", "\n", " title = element.xpath('.//Article/ArticleTitle')\n", " if title:\n", " title = title[0].text\n", " else:\n", " title = None\n", "\n", " abstract = element.xpath('.//Article/Abstract')\n", " if abstract:\n", " abstract = abstract[0].text\n", " else:\n", " abstract = None\n", "\n", " author_records = element.xpath('.//Article/AuthorList/Author')\n", " authors = []\n", " for name in author_records:\n", " try:\n", " authors.append((name[0].text, name[1].text))\n", " except IndexError:\n", " pass\n", " \n", " print(\"{}, {}:{}\".format(pmid, journal, authors))\n", " \n", " element.clear()\n", " if i > 5:\n", " break" ] }, { "cell_type": "markdown", "metadata": { "focus": false, "id": "ff0b5b58-7cf4-42b3-962f-2654786bce29" }, "source": [ "### Class Definition\n", "\n", "Just because I need the practice, I'm going to set up an `Article` class to hold the data and make working with it easier, and an `Author` class that we can use to deal with author names" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false, "focus": false, "id": "24ad7a88-f16f-4f37-be5b-eed3d28d8eb0" }, "outputs": [], "source": [ "class Article(object):\n", " \"\"\"Container for publication info\"\"\"\n", " def __init__(self, article_id, pubdate, journal, title, abstract, authors):\n", " self.article_id = article_id\n", " self.pubdate = pubdate\n", " self.journal = journal\n", " self.title = title\n", " self.abstract = abstract\n", " self.authors = authors\n", " def __repr__(self):\n", " return \"
\".format(self.article_id)\n", "\n", " def get_authors(self):\n", " for author in self.authors:\n", " yield author\n", "\n", "class Author(object):\n", " def __init__(self, last_name, first_name=None):\n", " assert type(last_name) == str\n", " self.last_name = last_name\n", "\n", " if first_name:\n", " assert type(first_name) == str\n", " self.first_name = first_name.split()[0]\n", " try:\n", " self.initials = \" \".join(first_name.split()[1:])\n", " except IndexError:\n", " self.initials = None\n", " else:\n", " self.first_name = None\n", " self.initials = None" ] }, { "cell_type": "markdown", "metadata": { "focus": false, "id": "c3e08f0e-ee36-46ae-9ac8-28093137b546" }, "source": [ "### Generator Function\n", "\n", "And... we can turn the code above into a generator function that yields an `Article` for each document" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false, "focus": false, "id": "e9204973-3f58-4dcf-8830-4f82f1b146a6" }, "outputs": [], "source": [ "from lxml.etree import iterparse\n", "\n", "def iter_parse_pubmed(xml_file):\n", " # get an iterable\n", " for event, element in iterparse(xml_file, tag=\"PubmedArticle\", events=(\"end\",)):\n", "\n", " pmid = element.xpath('.//PMID')[0].text\n", " pubdate = datetime.date(\n", " int(element.xpath('.//DateCreated/Year')[0].text), # year\n", " int(element.xpath('.//DateCreated/Month')[0].text), # month\n", " int(element.xpath('.//DateCreated/Day')[0].text), #day\n", " )\n", "\n", "\n", " journal = element.xpath('.//Journal//ISOAbbreviation')\n", " if journal:\n", " journal = journal[0].text\n", " else:\n", " journal = None\n", "\n", " title = element.xpath('.//Article/ArticleTitle')\n", " if title:\n", " title = title[0].text\n", " else:\n", " title = None\n", "\n", " abstract = element.xpath('.//Article/Abstract')\n", " if abstract:\n", " abstract = abstract[0].text\n", " else:\n", " abstract = None\n", "\n", " author_records = element.xpath('.//Article/AuthorList/Author')\n", " authors = []\n", " for name in author_records:\n", " try:\n", " authors.append(Author(name[0].text, name[1].text))\n", " except IndexError:\n", " pass\n", "\n", " element.clear()\n", "\n", " yield Article(pmid, pubdate, journal, title, abstract, authors)\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false, "focus": false, "id": "c37928a3-710a-4284-af67-879adf9986ce" }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iter_parse_pubmed('comp.xml')" ] }, { "cell_type": "markdown", "metadata": { "focus": false, "id": "774617cc-a0d3-45dd-bc10-86e6a99e1583" }, "source": [ "Usage:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false, "focus": false, "id": "b9ec6b8d-7135-4fcd-b82a-2b2f5903754e" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "
\n", "2015-11-24\n", "Huang, Yufei \n", "Chen, Yidong \n", "Qian, Xiaoning \n", "\n", "
\n", "2015-09-11\n", "Wang, Haiying \n", "Zheng, Huiru \n", "\n", "
\n", "2015-09-11\n", "Loohuis, Loes Olde\n", "Witzel, Andreas \n", "Mishra, Bud \n", "\n", "
\n", "2015-09-11\n", "Kobayashi, Koichi \n", "Hiraishi, Kunihiko \n", "\n", "
\n", "2015-09-11\n", "Ahmed, Hasin Afzal\n", "Mahanta, Priyakshi \n", "Bhattacharyya, Dhruba Kumar\n", "Kalita, Jugal Kumar\n", "\n", "
\n", "2015-09-11\n", "Disanto, Filippo \n", "Rosenberg, Noah A\n", "\n" ] } ], "source": [ "i = 0\n", "for article in iter_parse_pubmed('comp.xml'):\n", " i+=1\n", " print(article)\n", " print(article.pubdate)\n", " for author in article.get_authors():\n", " print(\"{}, {} {}\".format(author.last_name, author.first_name, author.initials))\n", " print()\n", " if i > 5:\n", " break" ] }, { "cell_type": "markdown", "metadata": { "focus": false, "id": "4cff5edd-de0e-4ff0-b81b-599aeae1288e" }, "source": [ "### Getting Author Order\n", "\n", "Author position matters, but it matters in sort of a weird way - first author and last author are most important, then decreasing as you work your way in to the middle of the list. But practically, there's not much distinction between 3rd and 4th author (or 3rd from last and 4th from last), so we'll generate scores for first, second, last, penultimate and everyone else. The trick is to avoid index errors if the author list is smaller than 5, so we need to write up some special cases. " ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false, "focus": false, "id": "a80f22db-4ce7-4e43-a836-1d6fe1d918fc" }, "outputs": [], "source": [ "def score_authors(author_list):\n", " if not author_list:\n", " first = None\n", " else:\n", " first = author_list[0]\n", " others, penultimate, second, last = None, None, None, None\n", " \n", " list_length = len(author_list)\n", " if list_length > 4:\n", " others = [author for author in author_list[2:-2]]\n", " if list_length > 3:\n", " penultimate = author_list[-2]\n", " if list_length > 2:\n", " second = author_list[1]\n", " if list_length > 1:\n", " last = author_list[-1]\n", " \n", "\n", " return first, last, second, penultimate, others" ] }, { "cell_type": "markdown", "metadata": { "focus": false, "id": "e3c3f9d9-f7be-404f-96b0-e203a1f23b0d" }, "source": [ "### DataFrame generation\n", "\n", "In order to get the data into a usable spreadsheet-like form, and for later analysis, I'm going to use the `DataFrame`s from the [pandas](http://pandas.pydata.org/) package. This might be overkill, but I know how to use it (sort of). 
" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false, "focus": false, "id": "d671ba8e-a2ba-4fd1-97c8-2e5943f6b501" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Date Journal Author Name \\\n", "26605382 2015-11-24 IEEE/ACM Trans Comput Biol Bioinform Xiaoning \n", "26605382 2015-11-24 IEEE/ACM Trans Comput Biol Bioinform Yidong \n", "26357062 2015-09-11 IEEE/ACM Trans Comput Biol Bioinform Haiying \n", "26357062 2015-09-11 IEEE/ACM Trans Comput Biol Bioinform Huiru \n", "26357061 2015-09-11 IEEE/ACM Trans Comput Biol Bioinform Loes \n", "26357061 2015-09-11 IEEE/ACM Trans Comput Biol Bioinform Bud \n", "26357061 2015-09-11 IEEE/ACM Trans Comput Biol Bioinform Andreas \n", "26357060 2015-09-11 IEEE/ACM Trans Comput Biol Bioinform Koichi \n", "26357060 2015-09-11 IEEE/ACM Trans Comput Biol Bioinform Kunihiko \n", "\n", " Position \n", "26605382 last \n", "26605382 second \n", "26357062 first \n", "26357062 last \n", "26357061 first \n", "26357061 last \n", "26357061 second \n", "26357060 first \n", "26357060 last \n" ] } ], "source": [ "import pandas as pd\n", "\n", "col_names = [\"Date\", \"Journal\", \"Author Name\", \"Position\"]\n", "df = pd.DataFrame(columns=col_names)\n", "\n", "i = 0\n", "for article in iter_parse_pubmed('comp.xml'):\n", " i+=1\n", " first, last, second, penultimate, others = score_authors(article.authors)\n", " if first:\n", " row = pd.Series([str(article.pubdate)), article.journal, first.first_name, \"first\"], name=article.article_id, index=col_names)\n", " df = df.append(row)\n", " else:\n", " continue\n", " try:\n", " row = pd.Series([str(article.pubdate), article.journal, last.first_name, \"last\"], name=article.article_id, index=col_names)\n", " df = df.append(row)\n", " except:\n", " pass\n", " try:\n", " row = pd.Series([str(article.pubdate), article.journal, second.first_name, \"second\"], name=article.article_id, index=col_names)\n", " df = df.append(row)\n", " except:\n", " pass\n", " try:\n", " row = pd.Series([str(article.pubdate), article.journal, penultimate.first_name, \"penultimate\"], name=article.article_id, index=col_names)\n", " df = df.append(row)\n", " except:\n", " pass\n", " try:\n", " for x in others:\n", " row = pd.Series([str(article.pubdate), article.journal, x.first_name, \"other\"], name=article.article_id, index=col_names)\n", " df = df.append(row)\n", " except:\n", " pass\n", " \n", " if i > 5:\n", " break\n", " \n", "print(df[1:10])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### ArXiv Data\n", "\n", "Data downloaded from the [arXiv](http://www.arxiv.org) preprint server is formatted a bit differently, so I'll write a parser that looks a lot `like iter_parse_pubmed()` - since they don't really have journals, I'm instead going to include a list of subject tags in place of the journal. 
" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def iter_parse_arxiv(xml_file):\n", " print(\"parsing!\")\n", " ns = {\n", " \"a\":\"http://arxiv.org/OAI/arXiv/\",\n", " \"o\":\"http://www.openarchives.org/OAI/2.0/\"}\n", " for event, element in iterparse(xml_file, tag= \"{http://www.openarchives.org/OAI/2.0/}record\", events=(\"end\",)):\n", " ident = element.xpath('./o:header/o:identifier', namespaces = ns)[0].text\n", "\n", " pubdate = element.xpath('.//o:datestamp', namespaces = ns)[0].text.split(\"-\")\n", " pubdate = datetime.date(*[int(d) for d in pubdate])\n", "\n", " author_records = element.xpath('.//o:metadata//a:authors/a:author', namespaces = ns)\n", " authors = []\n", " for name in author_records:\n", " last_name = name.xpath('./a:keyname', namespaces = ns)[0].text\n", " try:\n", " first_name = name.xpath('./a:forenames', namespaces = ns)[0].text\n", " except IndexError:\n", " first_name = None\n", " try:\n", " authors.append(Author(last_name, first_name))\n", " except IndexError:\n", " pass\n", "\n", " try:\n", " title = element.xpath('.//o:metadata//a:title', namespaces = ns)[0].text\n", " except IndexError:\n", " title = None\n", " try:\n", " abstract = element.xpath('.//o:metadata//a:abstract', namespaces = ns)[0].text\n", " except IndexError:\n", " abstract = None\n", " try:\n", " cat = element.xpath('.//o:metadata//a:categories', namespaces = ns)[0].text.split(\" \")\n", " except IndexError:\n", " cat = None\n", "\n", " element.clear()\n", " yield Article(ident, pubdate, cat, title, abstract, authors)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The conclusion - getting our names dataset\n", "\n", "Now the conclusion - I'll write a function that takes a pubmed xml, parses it using `iter_parse_pubmed()` or `iter_parse_arxiv()` and `score_authors()`, puts the authors into a data frame as shown above, and writes a CSV file. 
Note: if you just want to parse a pubmed file without going through this notebook, you can use the included `xml_parsing.py` script:\n", "\n", "```\n", "$ python xml_parsing.py --pubmed /path/to/pubmed.xml /path/to/output.csv\n", "```\n", "\n", "or \n", "\n", "```\n", "$ python xml_parsing.py --arxiv /path/to/arxiv.xml /path/to/output.csv\n", "```\n" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def write_names_to_file(in_file, out_file, pub_type=\"pubmed\"):\n", " col_names = [\"Date\", \"Journal\", \"Author Name\", \"Position\"]\n", " df = pd.DataFrame(columns=col_names)\n", "\n", " with open(out_file, 'w+') as out:\n", " df.to_csv(out, columns = col_names)\n", "\n", " counter = 0\n", "\n", " if pub_type == \"arxiv\":\n", " articles = iter_parse_arxiv(in_file)\n", " elif pub_type == \"pubmed\":\n", " articles = iter_parse_pubmed(in_file)\n", " else:\n", " raise ValueError(\"pub_type must be 'pubmed' or 'arxiv'\")\n", "\n", " for article in articles:\n", " first, last, second, penultimate, others = score_authors(article.authors)\n", " if first:\n", " row = pd.Series([str(article.pubdate), article.journal, first.first_name, \"first\"], name=article.article_id, index=col_names)\n", " df = df.append(row)\n", " else:\n", " continue\n", " try:\n", " row = pd.Series([str(article.pubdate), article.journal, last.first_name, \"last\"], name=article.article_id, index=col_names)\n", " df = df.append(row)\n", " except:\n", " pass\n", " try:\n", " row = pd.Series([str(article.pubdate), article.journal, second.first_name, \"second\"], name=article.article_id, index=col_names)\n", " df = df.append(row)\n", " except:\n", " pass\n", " try:\n", " row = pd.Series([str(article.pubdate), article.journal, penultimate.first_name, \"penultimate\"], name=article.article_id, index=col_names)\n", " df = df.append(row)\n", " except:\n", " pass\n", " try:\n", " for x in others:\n", " row = pd.Series([str(article.pubdate), article.journal, x.first_name, \"other\"], name=article.article_id, index=col_names)\n", " df = df.append(row)\n", " except:\n", " pass\n", "\n", "\n", " if counter % 1000 == 0:\n", " print(counter)\n", " with open(out_file, 'a+') as out:\n", " df.to_csv(out, columns = col_names, header=False)\n", " df = pd.DataFrame(columns=col_names)\n", " counter +=1\n", " with open(out_file, 'a+') as out:\n", " df.to_csv(out, columns = col_names, header=False)" ] }, { "cell_type": "markdown", "metadata": { "focus": false, "id": "dcb30d7f-eecf-43f7-80e6-ab8723c92f13" }, "source": [ "## Next Step: Getting Genders\n", "Now the tough part - getting genders. \n", "\n", "I played around trying to get `sexmachine` and `GenderComputer` to work, but ran into some issues, and those projects don't seem like they're being maintained, so I thought I'd try [genderize.io](http://genderize.io) and [gender-api.com](gender-api.com). The trouble is that these are web APIs, which take more time than something run locally, and they have a limit on the number of requests you can make. The owners of both of these APIs generously provided me with enough requests to use them for free for this project, but I'll show how to use all three methods.\n", "\n", "On to a [new notebook](gender_detection.ipynb)..."
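, "\n", "\n", "As a tiny preview of the web API approach (just a sketch: it assumes the `requests` package is installed and that genderize.io still returns `gender` and `probability` fields):\n", "\n", "```python\n", "import requests\n", "\n", "# Toy query: ask genderize.io about a single first name (not the full pipeline)\n", "resp = requests.get('https://api.genderize.io', params={'name': 'Yufei'})\n", "info = resp.json()\n", "print(info.get('gender'), info.get('probability'))\n", "```\n"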
] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.0" } }, "nbformat": 4, "nbformat_minor": 0 }