{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Matching Publications to PubMed IDs\n", "\n", "**2017-12-10**\n", "\n", "Earlier this week, I got an e-mail from Lucia Santamaria of the \"[Gender Gap in\n", "Science][1]\" project at the International Council for Science. They are\n", "trying to systematically measure the gender gap in a number of ways, including\n", "looking at publication records. One of the important parts of their effort is to\n", "find ways to validate methods of gender assignment or inference.\n", "\n", "Lucia was writing about some data found in [the paper I wrote with Melanie\n", "Stefan][2] about women in computational biology. In particular, she\n", "wanted the dataset that we got from Filardo et al. that had a list of ~3000\n", "journal articles with known first-author genders. This is a great dataset to\n", "have, since the authors of that paper did all the hard work of actually going\n", "one-by-one through papers and finding the author genders by hand (often\n", "searching institutional websites and social media profiles for pictures, etc.).\n", "\n", "This allowed us to validate our gender inference based on author first name\n", "against a dataset where the truth is known (it did pretty well).\n", "\n", "And that's what Lucia wants to do as well. There's just one problem: the\n", "form of the data from Filardo and colleagues is human-readable, but not\n", "machine-readable. For example, the first two rows of the data look like this:\n", "\n", "| Article_ID | Full_Title | Link_to_Article | Journal | Year | Month | First Author Gender |\n", "|------------|------------|-----------------|---------|------|-------|---------------------|\n", "| Aaby et al. (2010) | Non-specific effects of standard measles vaccine at 4.5 and 9 months of age on childhood mortality: randomised controlled trial | NA | BMJ | 2010 | 12 - Dec | Male |\n", "| Aaron et al. 
(2007) | Tiotropium in Combination with Placebo Salmeterol or Fluticasone–Salmeterol for Treatment of Chronic Obstructive Pulmonary Disease: A Randomized Trial | http://annals.org/article.aspx?articleid=734106 | Annals of Internal Medicine | 2007 | 04 - Apr | Male |\n", "\n", "What we'd really like to have is some unique identifier (e.g. the PubMed ID or\n", "DOI) associated with each record. That would make it much easier to\n", "cross-reference with other datasets, including the table of hundreds of thousands of\n", "publications that we downloaded as part of this study. I had this when we\n", "published - it's how we validated our method - but I'm embarrassed to say it\n", "didn't get included with the paper, and I recently had to delete my old computer\n", "backups, which were the only place it lived.\n", "\n", "I told her how we initially did this, and it sounds like she managed to make it\n", "work herself, but I thought it would be worth documenting how I did it\n", "originally (and maybe improve it slightly). So we've got a table with titles,\n", "author last names, years, and journals, and what we want is a PubMed ID or DOI.\n", "\n", "I'm using [Julia][3] as my go-to language these days, and the first step\n", "is to load in the table as a dataframe that we can manipulate:\n", "\n", "[1]: https://icsugendergapinscience.org/work-packages/publication-patterns/\n", "[2]: https://doi.org/10.1371/journal.pcbi.1005134\n", "[3]: http://julialang.org" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ids[1] = \"Aaby et al. (2010)\"\n" ] }, { "data": { "text/plain": [ "\"Aaby et al. 
(2010)\"" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "using DataFrames\n", "using CSV\n", "\n", "df = CSV.read(\"../data/known-gender.csv\")\n", "\n", "# get an array with the \"article id\" values as strings\n", "ids = Vector{String}(df[:Article_ID])\n", "@show ids[1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that the `:Article_ID` column holds both the author name and\n", "the year, but we want to be able to handle these separately. Fortunately, they all take\n", "the same form: the author name, sometimes followed by \"et al.\", followed by the\n", "year in parentheses. So I'm going to use regular expressions, which\n", "can look like gobbledegook if you're not familiar with them. It's beyond the scope\n", "of this post to describe them, but I highly recommend [regexr][4] as a resource for\n", "learning and testing. The regex I used for this is composed of 3 parts:\n", "\n", "- The author name, which can have some number of letters, spaces, `-`, and `'`: `([\\w\\s\\-']+)`\n", "  - this is also at the beginning of the line, so I add a `^` (just to be safe)\n", "- Then \"et al.\": ` et al\\.`\n", "  - the `.` is a special character, so it's escaped with `\\`\n", "  - this is also optional, so I wrap it in parentheses and add `?`\n", "- The year, which is 4 digits, and wrapped in parentheses: `\\((\\d{4})\\)`\n", "  - the inner parentheses are so I can grab it as a group\n", "  - it should be the end, so I finish with `$`\n", "\n", "So, the [complete regex][5]: `^([\\w\\s\\-']+)( et al\\.)? 
\\((\\d{4})\\)$`\n", "\n", "And now I want to apply that search to everything in the `Article_ID` column:\n", "\n", "[4]: https://regexr.com/\n", "[5]: https://regexr.com/3hpod" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3204-element Array{Int64,1}:\n", " 2010\n", " 2007\n", " 2004\n", " 2008\n", " 2007\n", " 2011\n", " 2002\n", " 2005\n", " 2003\n", " 2013\n", " 2013\n", " 2005\n", " 1998\n", " ⋮\n", " 2012\n", " 2013\n", " 2000\n", " 2000\n", " 2010\n", " 2013\n", " 2008\n", " 2005\n", " 2001\n", " 1999\n", " 2007\n", " 2011" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# confirm this matches every row: the number of matches should equal length(ids)\n", "sum(ismatch.(r\"^([\\w\\s\\-']+)( et al\\.)? \\((\\d{4})\\)$\", ids)) # 3204\n", "\n", "# apply match across the whole ids array with `match.()`\n", "matches = match.(r\"^([\\w\\s\\-']+)( et al\\.)? \\((\\d{4})\\)$\", ids)\n", "\n", "# get the last name of first authors\n", "firstauthors = [m.captures[1] for m in matches]\n", "# get the year... yeah there's a column for this, but since we have it...\n", "years = [parse(Int, m.captures[3]) for m in matches]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We also have a column for the journal names, but PubMed searching is a bit\n", "idiosyncratic and works better if you use the right abbreviations. 
So next, I\n", "built a dictionary of abbreviations, and used an array comprehension to get a\n", "new array with the journal names:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3204-element Array{String,1}:\n", " \"BMJ\" \n", " \"Ann Intern Med\" \n", " \"Arch Intern Med\"\n", " \"N Engl J Med\" \n", " \"JAMA\" \n", " \"Ann Intern Med\" \n", " \"N Engl J Med\" \n", " \"BMJ\" \n", " \"Ann Intern Med\" \n", " \"BMJ\" \n", " \"BMJ\" \n", " \"BMJ\" \n", " \"Arch Intern Med\"\n", " ⋮ \n", " \"Arch Intern Med\"\n", " \"Lancet\" \n", " \"Arch Intern Med\"\n", " \"Arch Intern Med\"\n", " \"Ann Intern Med\" \n", " \"Arch Intern Med\"\n", " \"JAMA\" \n", " \"Arch Intern Med\"\n", " \"JAMA\" \n", " \"Lancet\" \n", " \"Lancet\" \n", " \"Ann Intern Med\" " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "journals = Vector{String}(df[:Journal])\n", "\n", "replacements = Dict(\n", " \"Archives of Internal Medicine\" => \"Arch Intern Med\",\n", " \"Annals of Internal Medicine\" => \"Ann Intern Med\",\n", " \"The Lancet\" => \"Lancet\",\n", " \"NEJM\" => \"N Engl J Med\"\n", " )\n", "\n", "journals = [in(x, keys(replacements)) ? replacements[x] : x for x in journals]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, words like \"and\", \"in\", \"over\" don't help a lot when searching, and the\n", "titles currently have special characters (like `:` or `()`) that also don't help\n", "or could even hurt our ability to search. So I took the title column, and built\n", "new strings using only words that are 5 characters or more. I did this all in\n", "one go, but to explain, the `matchall()` function finds all of the 5 or more\n", "letter words (that's `\\w{5,}` in regex) and returns an array of matches. 
Then\n", "the `join()` function puts them together in a single string (separated by a\n", "space since I passed `' '` as an argument):" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3204-element Array{String,1}:\n", " \"specific effects standard measles vaccine months childhood mortality randomised controlled trial\" \n", " \"Tiotropium Combination Placebo Salmeterol Fluticasone Salmeterol Treatment Chronic Obstructive Pulmonary Disease Randomized Trial\" \n", " \"Blocker Dialysis Patients Association Hospitalized Heart Failure Mortality\" \n", " \"Safety Immunogenicity AS02D Malaria Vaccine Infants\" \n", " \"Strain Acute Recurrent Coronary Heart Disease Events\" \n", " \"Comparative Effectiveness Management Interventions Fracture Systematic Review\" \n", " \"Cardiac Resynchronization Chronic Heart Failure\" \n", " \"Utility testing monoclonal bands serum patients suspected osteoporosis retrospective cross sectional study\" \n", " \"Short Effects Cannabinoids Patients Infection Randomized Placebo Controlled Clinical Trial\" \n", " \"Effect lower sodium intake health systematic review analyses\" \n", " \"Effect increased potassium intake cardiovascular factors disease systematic review analyses\" \n", " \"Rectal artemether versus intravenous quinine treatment cerebral malaria children Uganda randomised clinical trial\" \n", " \"Hyperkalemia Hospitalized Patients Causes Adequacy Treatment Results Attempt Improve Physician Compliance Published Therapy Guidelines\" \n", " ⋮ \n", " \"Supratherapeutic Dosing Acetaminophen Among Hospitalized Patients\" \n", " \"Efficacy safety immunology inactivated adjuvant enterovirus vaccine children China multicentre randomised double blind placebo controlled phase trial\"\n", " \"Frequency Major Hemorrhage Patients Treated Unfractionated Intravenous Heparin Venous Thrombosis Pulmonary Embolism Study Routine Clinical Practice\" \n", " \"Patients Depression Likely Follow 
Recommendations Reduce Cardiac During Recovery Myocardial Infarction\" \n", " \"Glucose Independent Black White Differences Hemoglobin Levels Cross sectional Analysis Studies\" \n", " \"Health Associated Infections analysis Costs Financial Impact Health System\" \n", " \"Effectiveness Specialized Palliative Systematic Review\" \n", " \"Regional Institutional Variation Initiation Early Resuscitate Orders\" \n", " \"Association Between Polymorphism Transforming Growth Factor Breast Cancer Among Elderly White Women\" \n", " \"Effect vitamin frequency reflex sympathetic dystrophy wrist fractures randomised trial\" \n", " \"Artemether lumefantrine versus amodiaquine sulfadoxine pyrimethamine uncomplicated falciparum malaria Burkina randomised inferiority trial\" \n", " \"Patient Interest Sharing Personal Health Record Information Based Survey\" " ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "titles = join.(matchall.(r\"\\w{5,}\", Vector{String}(df[:Full_Title])), ' ')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So now we've got all the elements of our search, and I just put them into an\n", "array of tuples to make them easier to deal with:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3204-element Array{Tuple{String,SubString{String},String,Int64},1}:\n", " (\"specific effects standard measles vaccine months childhood mortality randomised controlled trial\", \"Aaby\", \"BMJ\", 2010) \n", " (\"Tiotropium Combination Placebo Salmeterol Fluticasone Salmeterol Treatment Chronic Obstructive Pulmonary Disease Randomized Trial\", \"Aaron\", \"Ann Intern Med\", 2007) \n", " (\"Blocker Dialysis Patients Association Hospitalized Heart Failure Mortality\", \"Abbott\", \"Arch Intern Med\", 2004) \n", " (\"Safety Immunogenicity AS02D Malaria Vaccine Infants\", \"Abdulla\", \"N Engl J Med\", 2008) \n", " (\"Strain Acute Recurrent Coronary Heart Disease 
Events\", \"Aboa-Eboule\", \"JAMA\", 2007) \n", " (\"Comparative Effectiveness Management Interventions Fracture Systematic Review\", \"Abou-Setta\", \"Ann Intern Med\", 2011) \n", " (\"Cardiac Resynchronization Chronic Heart Failure\", \"Abraham\", \"N Engl J Med\", 2002) \n", " (\"Utility testing monoclonal bands serum patients suspected osteoporosis retrospective cross sectional study\", \"Abrahamsen\", \"BMJ\", 2005) \n", " (\"Short Effects Cannabinoids Patients Infection Randomized Placebo Controlled Clinical Trial\", \"Abrams\", \"Ann Intern Med\", 2003) \n", " (\"Effect lower sodium intake health systematic review analyses\", \"Aburto\", \"BMJ\", 2013) \n", " (\"Effect increased potassium intake cardiovascular factors disease systematic review analyses\", \"Aburto\", \"BMJ\", 2013) \n", " (\"Rectal artemether versus intravenous quinine treatment cerebral malaria children Uganda randomised clinical trial\", \"Aceng\", \"BMJ\", 2005) \n", " (\"Hyperkalemia Hospitalized Patients Causes Adequacy Treatment Results Attempt Improve Physician Compliance Published Therapy Guidelines\", \"Acker\", \"Arch Intern Med\", 1998) \n", " ⋮ \n", " (\"Supratherapeutic Dosing Acetaminophen Among Hospitalized Patients\", \"Zhou\", \"Arch Intern Med\", 2012) \n", " (\"Efficacy safety immunology inactivated adjuvant enterovirus vaccine children China multicentre randomised double blind placebo controlled phase trial\", \"Zhu\", \"Lancet\", 2013) \n", " (\"Frequency Major Hemorrhage Patients Treated Unfractionated Intravenous Heparin Venous Thrombosis Pulmonary Embolism Study Routine Clinical Practice\", \"Zidane\", \"Arch Intern Med\", 2000)\n", " (\"Patients Depression Likely Follow Recommendations Reduce Cardiac During Recovery Myocardial Infarction\", \"Ziegelstein\", \"Arch Intern Med\", 2000) \n", " (\"Glucose Independent Black White Differences Hemoglobin Levels Cross sectional Analysis Studies\", \"Ziemer\", \"Ann Intern Med\", 2010) \n", " (\"Health Associated Infections 
analysis Costs Financial Impact Health System\", \"Zimlichman\", \"Arch Intern Med\", 2013) \n", " (\"Effectiveness Specialized Palliative Systematic Review\", \"Zimmermann\", \"JAMA\", 2008) \n", " (\"Regional Institutional Variation Initiation Early Resuscitate Orders\", \"Zingmond\", \"Arch Intern Med\", 2005) \n", " (\"Association Between Polymorphism Transforming Growth Factor Breast Cancer Among Elderly White Women\", \"Ziv\", \"JAMA\", 2001) \n", " (\"Effect vitamin frequency reflex sympathetic dystrophy wrist fractures randomised trial\", \"Zollinger\", \"Lancet\", 1999) \n", " (\"Artemether lumefantrine versus amodiaquine sulfadoxine pyrimethamine uncomplicated falciparum malaria Burkina randomised inferiority trial\", \"Zongo\", \"Lancet\", 2007) \n", " (\"Patient Interest Sharing Personal Health Record Information Based Survey\", \"Zulman\", \"Ann Intern Med\", 2011) " ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "searches = collect(zip(titles, firstauthors, journals, years))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, I iterated through this array and composed searches, using the\n", "[`BioServices.EUtils`][6] package to do the search and retrieval.\n", "\n", "\n", "[6]: https://github.com/BioJulia/BioServices.jl" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "using BioServices.EUtils\n", "using EzXML # assumed source of the parsexml, findfirst and content calls below\n", "\n", "# one entry per search: the pmid, 0 for no hits, or 9 for multiple hits\n", "pmids = Int[]\n", "\n", "for s in searches\n", "    #= if you do too many queries in a row, esearch raises an exception. The solution\n", "    is to pause (here for 10 seconds) and then try again. 
=#\n", "    term = \"($(s[1]) [title]) AND ($(s[2]) [author]) AND ($(s[3]) [journal]) AND ($(s[4]) [pdat])\"\n", "    # assign from the try expression so that `res` is still in scope afterwards\n", "    res = try\n", "        esearch(db=\"pubmed\", term=term)\n", "    catch\n", "        sleep(10)\n", "        esearch(db=\"pubmed\", term=term)\n", "    end\n", "\n", "    doc = parsexml(res.data)\n", "    #= this returns an array of pmids, since in the xml, they're separated by\n", "    newlines. The `strip` function removes leading and trailing newlines, but\n", "    not ones in between ids. =#\n", "    i = split(content(findfirst(doc, \"//IdList\")) |> strip, '\\n')\n", "\n", "    if length(i) == 1\n", "        #= if no ids are returned, there's an array with just an empty string,\n", "        in which case I add 0 to the array =#\n", "        length(i[1]) == 0 ? push!(pmids, 0) : push!(pmids, parse(Int, i[1]))\n", "    else\n", "        # if more than one pmid is returned, I just add 9 to the array\n", "        push!(pmids, 9)\n", "    end\n", "end" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I actually did better than the last time I tried this - 2447 records are associated\n", "with exactly 1 pmid. 13 of the searches returned more than 1 pmid, and the rest (744)\n", "didn't return any hits." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "@show sum(pmids .> 9)\n", "@show sum(pmids .== 9)\n", "@show sum(pmids .== 0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I added the ids to the dataframe and saved it as a new file (so I don't have to\n", "do the searching again). 
Since I have the URLs for many of the papers, I was\n", "thinking I could try to identify the DOIs associated with them from their\n", "webpages, but that will have to wait until another time.\n", "\n", "For now, the last step is to save it as a CSV and send it off to Lucia:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# add the ids as a new column (the column name `:pmid` is an assumption)\n", "df[:pmid] = pmids\n", "\n", "CSV.write(\"../data/withpmids.csv\", df)" ] } ], "metadata": { "kernelspec": { "display_name": "Julia 0.6.1", "language": "julia", "name": "julia-0.6" }, "language_info": { "file_extension": ".jl", "mimetype": "application/julia", "name": "julia", "version": "0.6.1" } }, "nbformat": 4, "nbformat_minor": 2 }