{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Assign country codes to affiliations\n", "\n", "References:\n", "\n", "- https://github.com/greenelab/iscb-diversity/issues/8\n", "- https://towardsdatascience.com/geoparsing-with-python-c8f4c9f78940\n", "- https://github.com/elyase/geotext/issues/23" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We implemented affiliation extraction in the pubmedpy Python package for both PubMed and PMC XML records. These methods extract a sequence of textual affiliations for each author.\n", "Although, ideally, each affiliation record would refer to one and only one research organization, sometimes journals deposit multiple affiliations in a single structured affiliation.\n", "For example, we extracted the following composite affiliation for all authors of [PMC4147893](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4147893/):\n", "\n", "> 'Multimodal Computing and Interaction', Saarland University & Department for Computational Biology and Applied Computing, Max Planck Institute for Informatics, Saarbrücken, 66123 Saarland, Germany, Ray and Stephanie Lane Center for Computational Biology, Carnegie Mellon University, Pittsburgh, 15206 PA, USA, Department of Mathematics and Computer Science, Freie Universität Berlin, 14195 Berlin, Germany, Université Pierre et Marie Curie, UMR7238, CNRS-UPMC, Paris, France and CNRS, UMR7238, Laboratory of Computational and Quantitative Biology, Paris, France.\n", "\n", "We designed a method for extracting countries from affiliations that accommodated multiple countries.\n", "We relied on two Python utilities to extract countries from text: geotext and geopy.geocoders.Nominatim.\n", "The first, geotext, used regular expressions to find mentions of places from the GeoNames database.\n", "In the above text, geotext detected four mentions of places in Germany: Saarland, Saarbrücken, Saarland, Germany.\n", "Anytime geotext identified 2 or more mentions of a country, we 
labeled the affiliation as including that country.\n", "\n", "geopy.geocoders.Nominatim converts names and addresses to geographic coordinates using OpenStreetMap’s Nominatim service.\n", "We split each textual affiliation by punctuation and, working through the segments in reverse order, took the first segment that returned any Nominatim search results. For the above affiliation, the search order was “France”, “Paris”, “Laboratory of Computational and Quantitative Biology”, et cetera.\n", "Because searching “France” returns a Nominatim match, the remaining queries were not made.\n", "When a match was found, we extracted the country containing the location.\n", "This approach returns a single country for an affiliation when successful.\n", "When labeling affiliations with countries, we used the Nominatim country only when geotext returned no results, or to choose among the countries geotext detected when none of them had multiple mentions.\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pathlib\n", "import functools\n", "import re\n", "import lzma\n", "\n", "import backoff\n", "import geotext\n", "import geopy.geocoders\n", "import jsonlines\n", "import pandas\n", "import ratelimit\n", "import tqdm.notebook" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " pmcid position affiliation\n", "0 PMC100321 1 1 University of Cologne, Institute of Genetics...\n", "1 PMC100321 2 1 University of Cologne, Institute of Genetics..." ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# read PubMed Central affiliations\n", "affil_pmc_df = pandas.read_csv(\"data/pmc/affiliations.tsv.xz\", sep='\\t')\n", "assert affil_pmc_df.affiliation.notna().all()\n", "affil_pmc_df.head(2)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " pmid position affiliation\n", "0 7477412 1 Dept. of Pathology, Cornell Medical College, N...\n", "1 7479891 1 National Center for Human Genome Research, Nat..." ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# read PubMed affiliations\n", "affil_pm_df = pandas.read_csv(\"data/pubmed/affiliations.tsv.xz\", sep='\\t')\n", "assert affil_pm_df.affiliation.notna().all()\n", "affil_pm_df.head(2)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "446,551 unique affiliation strings\n" ] }, { "data": { "text/plain": [ "['\"Athena\" Research and Innovation Center, Athens, 15125, Greece.',\n", " '\"Athena\" Research and Innovation Center, Athens, 15125, Greece. kzagganas@uop.gr.',\n", " '\"Biology of Spirochetes\" Unit, Institut Pasteur, 28 Rue Du Docteur Roux, 75724, Paris Cedex 15, France, mathieu.picardeau@pasteur.fr.',\n", " '\"Cephalogenetics\" Genetic Center, Athens, Greece.',\n", " '\"Cephalogenetics\" Genetic Center, Athens, Greece. 
cyapi@med.uoa.gr.']" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# combine affiliations from PubMed Central and PubMed\n", "affiliations = sorted(set(affil_pmc_df.affiliation) | set(affil_pm_df.affiliation))\n", "print(f\"{len(affiliations):,} unique affiliation strings\")\n", "affiliations[:5]" ] },
{ "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "def get_countries_geotext(text):\n", "    \"\"\"\n", "    Use geotext package to get country mentions.\n", "    The returned counts dict maps from country code\n", "    to the number of places in the text located in that country.\n", "\n", "    For example, the following code detects GR twice, first for\n", "    \"Athens\" and second for \"Greece\":\n", "\n", "    ```\n", "    >>> text = '\"Athena\" Research and Innovation Center, Athens, 15125, Greece.'\n", "    >>> get_countries_geotext(text)\n", "    {'GR': 2}\n", "    ```\n", "\n", "    See https://github.com/elyase/geotext\n", "    \"\"\"\n", "    geo_text = geotext.GeoText(text)\n", "    return dict(geo_text.country_mentions)\n", "\n", "geopy.geocoders.options.default_user_agent = 'https://github.com/greenelab/iscb-diversity'\n", "geolocator = geopy.geocoders.Nominatim(timeout=5)\n", "\n", "@functools.lru_cache(maxsize=200_000)\n", "@backoff.on_exception(backoff.expo, ratelimit.RateLimitException, max_tries=8)\n", "@ratelimit.limits(calls=1, period=1)\n", "def _geocode(text):\n", "    \"\"\"\n", "    Geocode text with Nominatim: cached, and limited to one call per second.\n", "    https://operations.osmfoundation.org/policies/nominatim/\n", "    \"\"\"\n", "    return geolocator.geocode(text, addressdetails=True)\n", "\n", "email_pattern = re.compile(r\"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\.[a-zA-Z0-9-.]+\")\n", "split_pattern = re.compile(r\"; |, |\\. \")\n", "\n", "def get_country_geocode(affiliation):\n", "    \"\"\"\n", "    Lookup country using OSM Nominatim.\n", "    Splits the affiliation text into parts by punctuation,\n", "    searching for country matches in reverse. Parts that\n", "    look like email addresses are discarded.\n", "    \"\"\"\n", "    parts = split_pattern.split(affiliation)\n", "    for part in reversed(parts):\n", "        part = part.rstrip(\".\")\n", "        if email_pattern.fullmatch(part):\n", "            continue\n", "        location = _geocode(part)\n", "        if not location:\n", "            continue\n", "        try:\n", "            return location.raw['address']['country_code'].upper()\n", "        except KeyError:\n", "            return None\n", "\n", "\n", "def query_affiliation(affiliation: str):\n", "    return dict(\n", "        affiliation=affiliation,\n", "        country_geocode=get_country_geocode(affiliation),\n", "        countries_geotext=get_countries_geotext(affiliation),\n", "    )" ] },
{ "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "446,551 total affiliations: 475,120 already queried, 0 new\n" ] } ], "source": [ "# read already queried affiliations\n", "path = pathlib.Path('data/affiliations/geocode.jsonl.xz')\n", "lines = jsonlines.Reader(lzma.open(path, \"rt\")) if path.exists() else []\n", "existing = {row['affiliation'] for row in lines}\n", "new = sorted(set(affiliations) - existing)\n", "print(f\"{len(affiliations):,} total affiliations: {len(existing):,} already queried, {len(new):,} new\")" ] },
{ "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "ae65a1be12ac4987b69ef500f474cc72", "version_major": 2, "version_minor": 0 }, "text/plain": [ "HBox(children=(HTML(value=''), FloatProgress(value=1.0, bar_style='info', layout=Layout(width='20px'), max=1.0…" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n" ] } ], "source": [ "# query new affiliations and append to JSON Lines file\n", "with lzma.open(path, mode='at') as write_file:\n", "    with jsonlines.Writer(write_file) as writer:\n", "        for affiliation in tqdm.notebook.tqdm(new):\n", "            result = query_affiliation(affiliation)\n", "            writer.write(result)" ] },
{ "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'affiliation': \"'Multimodal Computing and Interaction', Saarland University & Department for Computational Biology and Applied Computing, Max Planck Institute for Informatics, Saarbrücken, 66123 Saarland, Germany, Ray and Stephanie Lane Center for Computational Biology, Carnegie Mellon University, Pittsburgh, 15206 PA, USA, Department of Mathematics and Computer Science, Freie Universität Berlin, 14195 Berlin, Germany, Université Pierre et Marie Curie, UMR7238, CNRS-UPMC, Paris, France and CNRS, UMR7238, Laboratory of Computational and Quantitative Biology, Paris, France.\",\n", " 'country_geocode': 'FR',\n", " 'countries_geotext': {'DE': 4, 'FR': 4, 'US': 2},\n", " 'countries': ['DE', 'FR', 'US']}" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Read the jsonlines file\n", "with jsonlines.Reader(lzma.open(path, \"rt\")) as reader:\n", "    lines = list(reader)\n", "# Show a single line\n", "lines[10]" ] },
{ "cell_type": "code", "execution_count": 9, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "{'affiliation': 'Laboratory of Computational and Quantitative Biology, Paris, France',\n", " 'country_geocode': 'FR',\n", " 'countries_geotext': {'FR': 2}}" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query_affiliation(\"Laboratory of Computational and Quantitative Biology, Paris, France\")" ] },
{ "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "def get_consensus_countries(line: dict) -> list:\n", "    \"\"\"\n", "    Get a list of countries resulting from a consensus algorithm\n", "    between countries_geotext and country_geocode.\n", "    \"\"\"\n", "    country_geotext: dict = line['countries_geotext']\n", "    country_geocode: str = line['country_geocode']\n", "    if not country_geotext:\n", "        # geotext empty, so use geocode country or nothing\n", "        return [country_geocode] if country_geocode else []\n", "    countries_gte_2 = {\n", "        country: count for country, count in\n", "        country_geotext.items() if count >= 2\n", "    }\n", "    if countries_gte_2:\n", "        # countries with multiple mentions according to geotext\n", "        return list(countries_gte_2)\n", "    if country_geocode and country_geocode in country_geotext:\n", "        # geocode country matches a geotext country\n", "        return [country_geocode]\n", "    # no consensus, so keep every country geotext detected\n", "    return list(country_geotext)" ] },
{ "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "for line in lines:\n", "    line[\"countries\"] = get_consensus_countries(line)" ] },
{ "cell_type": "code", "execution_count": 12, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "{'affiliation': \"'Multimodal Computing and Interaction', Saarland University & Department for Computational Biology and Applied Computing, Max Planck Institute for Informatics, Saarbrücken, 66123 Saarland, Germany, Ray and Stephanie Lane Center for Computational Biology, Carnegie Mellon University, Pittsburgh, 15206 PA, USA, Department of Mathematics and Computer Science, Freie Universität Berlin, 14195 Berlin, Germany, Université Pierre et Marie Curie, UMR7238, CNRS-UPMC, Paris, France and CNRS, UMR7238, Laboratory of Computational and Quantitative Biology, Paris, France.\",\n", " 'country_geocode': 'FR',\n", " 'countries_geotext': {'DE': 4, 'FR': 4, 'US': 2},\n", " 'countries': ['DE', 'FR', 'US']}" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lines[10]" ] },
{ "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "with lzma.open(path, mode='wt') as write_file:\n", "    with jsonlines.Writer(write_file) as writer:\n", "        writer.write_all(lines)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { 
"text/html": [ "
" ], "text/plain": [ " affiliation country\n", "0 \"Athena\" Research and Innovation Center, Athen... GR\n", "1 \"Athena\" Research and Innovation Center, Athen... GR" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "affil_df = pandas.DataFrame(lines)\n", "affil_df = (\n", " affil_df\n", " .explode(\"countries\")\n", " .rename(columns={\"countries\": \"country\"})\n", " .dropna(subset=['country'])\n", " [[\"affiliation\", \"country\"]]\n", " .drop_duplicates() # for safety, duplicates not expected\n", ")\n", "affil_df.head(2)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1117800" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(lines)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(488554, 2)" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "affil_df.shape" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "US 30.66%\n", "CN 13.03%\n", "GB 6.89%\n", "DE 6.08%\n", "FR 3.95%\n", "CA 3.11%\n", "IT 3.11%\n", "JP 2.89%\n", "ES 2.84%\n", "AU 2.62%\n", "NL 1.87%\n", "IN 1.69%\n", "KR 1.67%\n", "BR 1.54%\n", "CH 1.32%\n", "SE 1.26%\n", "TW 1.10%\n", "BE 1.01%\n", "DK 0.96%\n", "SG 0.69%\n", "Name: country, dtype: object" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Most common countries\n", "affil_df.country.value_counts(normalize=True).head(20).map(\"{:.02%}\".format)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "# save affiliations to a table\n", "affil_df.to_csv(\"data/affiliations/countries.tsv.xz\", sep='\\t', index=False)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": 
{ "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.8" } }, "nbformat": 4, "nbformat_minor": 4 }