{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "![FREYA Logo](https://github.com/datacite/pidgraph-notebooks-python/blob/master/images/freya_200x121.png?raw=true) | [FREYA](https://www.project-freya.eu/en) | WP2 [User Story 6](https://www.pidforum.org/t/pid-graph-graphql-example-disambiguate-researchers/931): As a researcher, I am looking for more information about another researcher with a common name, but don’t know his/her ORCID ID.\n", ":------------- | :------------- | :-------------\n", "\n", "It is important to be able to locate a researcher of interest even when their ORCID ID is unknown. For example, a reader of a scientific publication may wish to find out more about one of the authors, but the publisher has not cross-referenced that author's name to an ORCID record.\n", "\n", "This notebook uses the [DataCite GraphQL API](https://api.datacite.org/graphql) to disambiguate a researcher name via a *funnel* approach:\n", " * First, all researcher records matching the query \"John AND Smith\" are retrieved, and an alphabetically sorted list of affiliations and the corresponding researcher names is displayed;\n", " * Then the notebook simulates the user selecting one of the affiliations (in our case \"University of Arizona\") and performs a more detailed query: \"John AND Smith AND University of Arizona\". The second query retrieves and displays a much smaller set of results, now also containing each researcher's publications, which helps the user pinpoint the researcher of interest.\n", "\n", "**Goal**: By the end of this notebook, you should be able to successfully disambiguate a researcher name of interest." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Install libraries and prepare GraphQL client" ] }, { "cell_type": "code", "execution_count": 228, "metadata": {}, "outputs": [], "source": [ "%%capture\n", "# Install required Python packages\n", "!pip install gql requests" ] }, { "cell_type": "code", "execution_count": 229, "metadata": {}, "outputs": [], "source": [ "# Prepare the GraphQL client\n", "import requests\n", "from IPython.display import display, Markdown\n", "from gql import gql, Client\n", "from gql.transport.requests import RequestsHTTPTransport\n", "\n", "_transport = RequestsHTTPTransport(\n", "    url='https://api.datacite.org/graphql',\n", "    use_json=True,\n", ")\n", "\n", "client = Client(\n", "    transport=_transport,\n", "    fetch_schema_from_transport=True,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Define and run GraphQL query\n", "Define the GraphQL query to retrieve the researchers whose records match the name query \"John AND Smith\", together with their affiliations:" ] }, { "cell_type": "code", "execution_count": 231, "metadata": {}, "outputs": [], "source":
[ "# Generate the GraphQL query to retrieve researchers matching the query \"John AND Smith\", 100 per page\n", "query_params = {\n", "    \"query\" : \"John AND Smith\",\n", "    \"max_researchers\" : 100,\n", "    \"query_end_cursor\" : \"\"\n", "}\n", "\n", "query_str = \"\"\"query getResearchersByName(\n", "    $query: String!,\n", "    $max_researchers: Int!,\n", "    $query_end_cursor : String!\n", "    )\n", "{\n", "  people(query: $query, first: $max_researchers, after: $query_end_cursor) {\n", "    totalCount\n", "    pageInfo {\n", "      hasNextPage\n", "      endCursor\n", "    }\n", "    nodes {\n", "      id\n", "      givenName\n", "      familyName\n", "      name\n", "      affiliation {\n", "        name\n", "      }\n", "    }\n", "  }\n", "}\n", "\"\"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Run the above query via the GraphQL client, following the pagination cursor until all pages have been retrieved" ] }, { "cell_type": "code", "execution_count": 232, "metadata": {}, "outputs": [], "source": [ "# Initialise overall data dict that will store results\n", "data = {}\n", "\n", "# Parse the query once; only the cursor variable changes between pages\n", "query = gql(query_str)\n", "\n", "# Keep retrieving results until there are no more results left\n", "while True:\n", "    # Pass the query variables as a dict\n", "    res = client.execute(query, variable_values=query_params)\n", "    people = res[\"people\"]\n", "    if \"people\" not in data:\n", "        data = res\n", "    else:\n", "        data[\"people\"][\"nodes\"].extend(people[\"nodes\"])\n", "    pageInfo = people[\"pageInfo\"]\n", "    if pageInfo[\"hasNextPage\"] and pageInfo[\"endCursor\"] is not None:\n", "        query_params[\"query_end_cursor\"] = pageInfo[\"endCursor\"]\n", "    else:\n", "        break" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## List researcher details\n", "List the affiliations and the corresponding researcher names in tabular format. This allows the user to select one of the affiliations to use in a more detailed query (see below) that also retrieves publications." 
] }, { "cell_type": "code", "execution_count": 234, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "Total number of researchers found: **210**
The list of researchers by affiliation is as follows:" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "| Affiliation | Researcher Names |\n", "|---|---|\n", "American Chemical Society | John Smith\n", "American Science and Engineering, Inc. | Henry John Peter Smith\n", "Bank Street College of Education | John Smith\n", "Bedford Institute of Oceanography | John Smith\n", "Beecham Pharmaceuticals | John Smith\n", "Birkenhead High School Academy | John Arthur Smith\n", "Bureau of Ocean Energy Management, Pacific OCS Region | John Smith\n", "CU Sports Medicine and Performance | John-Rudolph Smith\n", "Charles Sturt University - Wagga Wagga Campus | John Smith\n", "Church of Norway | John Arthur Smith\n", "City College of New York | John Smith Del Rosario\n", "Colorado School of Mines | John Smith\n", "Cornell University | John-David Smith\n", "CottonInfo | John Smith\n", "Drew University | John Smith\n", "East Carolina University | John Smith\n", "Fairleigh Dickinson University | John Smith\n", "Federation of Liberian Youth - FLY | John Solunta Smith Jr\n", "Fire Risk Assessment Network | John Smith\n", "Flagburn Health Center | John Smith\n", "Fluent Technology | John Smith\n", "George Washington University | John Smith\n", "Georgia State University | John Smith\n", "GlaxoSmithKline Plc | John Smith\n", "Lipscomb University | John Smith\n", "London University | John Smith\n", "Louisiana State University | John F. Smith\n", "MSG Software (USA), Inc. 
| Henry John Peter Smith\n", "Manhattan College | Henry John Peter Smith\n", "Michigan State University | John Smith\n", "Millersville University | John Smith\n", "NASA Langley Research Center | John Smith\n", "New South Wales Department of Primary Industries Agriculture | John Smith\n", "Northeastern University | Henry John Peter Smith\n", "Northwestern University | John F. Smith\n", "Nova Scotia Health Authority South Western Nova Scotia | John Smith\n", "OCS Energy Consultant | John Smith\n", "Ohio State University | John R. Smith\n", "Oxford University Press | John Arthur Smith\n", "Peking University | John Solunta Smith Jr\n", "Pennsylvania State University | John Smith\n", "Proof Read My File | John Smith\n", "RMIT University City Campus | John Smith\n", "Retired | John Arthur Smith\n", "Rutgers New Jersey Medical School | John Smith Del Rosario\n", "Rutgers University Camden | John Smith\n", "Sample invited position | John Smith\n", "Sigma Xi the Scientific Research Society | John Smith\n", "TPE Associates Inc | Henry John Peter Smith\n", "Technical Support | John Smith\n", "Tennessee Technological University | John Smith\n", "The New School for Social Research | John Smith\n", "The University of St Andrews | Christopher John Smith\n", "Tufts University | Henry John Peter Smith\n", "Ulster Univeristy | John Smith\n", "Ulster University | John Smith\n", "University College London | John Smith\n", "University at Buffalo | John Smith\n", "University of Arizona | Smith, John E. 3rd\n", "University of California Davis | John R Smith\n", "University of Cambridge | John Arthur Smith\n", "University of Central Missouri | John Smith\n", "University of Colorado | John Smith\n", "University of Colorado Boulder | JOHN SMITH, John Smith\n", "University of Liverpool | John Arthur Smith, Quintin-John Smith\n", "University of Michigan | John R. 
Smith\n", "University of Missouri Columbia | John Smith\n", "University of Ottawa | John Smith\n", "University of Oxford | Christopher John Smith\n", "University of Pennsylvania | John F. Smith, John Smith\n", "University of St Andrews | Christopher John Smith\n", "University of Strathclyde | John Smith\n", "University of Toledo | John-David Smith\n", "University of Toronto | John Smith\n", "University of Virginia | Smith, John E. 3rd\n", "University of York | John Smith\n", "Vanderbilt University | John Smith\n", "Virginia Commonwealth University | John Lee Smith\n", "Visidyne, Inc. | Henry John Peter Smith\n", "Yale University | John Smith\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Collect names and affiliations for the researchers found\n", "# Test if fieldValue matches (case-insensitively) a Solr-style query (with \" AND \" representing the logical AND, and \" \" representing the logical OR)\n", "def testIfPresentCaseInsensitive(solrQuery, fieldValueLowerCase):\n", "    for orTerms in solrQuery.split(\" AND \"):\n", "        present = False\n", "        for term in orTerms.split(\" \"):\n", "            if term.lower() in fieldValueLowerCase:\n", "                present = True\n", "                break\n", "        if not present:\n", "            return False\n", "    return True\n", "\n", "people = data['people']\n", "af2Names = {}\n", "totalCount = 0\n", "for node in people['nodes']:\n", "    id = node['id']\n", "    name = node['name']\n", "    # TODO: Remove if we manage to search only individual fields\n", "    if not testIfPresentCaseInsensitive(query_params['query'], name.lower()):\n", "        continue\n", "    totalCount += 1\n", "    for af in node['affiliation']:\n", "        affiliation = af['name']\n", "        if affiliation not in af2Names:\n", "            af2Names[affiliation] = set()\n", "        af2Names[affiliation].add(name)\n", "\n", "tableBody = \"\"\n", "for af,names in sorted(af2Names.items()):\n", "    # Sort the names so that the row content is deterministic\n", "    tableBody += af + \" | \" + ', '.join(sorted(names)) + \"\\n\"\n", "display(Markdown(\"Total number of researchers found: **%d**<br>The list of researchers by affiliation is as follows:\" % totalCount))\n", "display(Markdown(\"\"))\n", "\n", "display(Markdown(\"| Affiliation | Researcher Names |\\n|---|---|\\n%s\" % tableBody))" ] }, { "cell_type": "code", "execution_count": 235, "metadata": {}, "outputs": [], "source": [ "# Generate the GraphQL query to retrieve all researchers matching query \"John AND Smith\" and affiliation \"University of Arizona\", now with works\n", "name_query = \"John AND Smith\"\n", "affiliation_query = \"\\\"University of Arizona\\\"\"\n", "query_params1 = {\n", "    \"query\" : name_query + \" AND \" + affiliation_query,\n", "    \"max_researchers\" : 10,\n", "    \"query_end_cursor\" : \"\"\n", "}\n", "\n", "query_str = \"\"\"query getResearchersByName(\n", "    $query: String!,\n", "    $max_researchers: Int!,\n", "    $query_end_cursor : String!\n", "    )\n", "{\n", "  people(query: $query, first: $max_researchers, after: $query_end_cursor) {\n", "    totalCount\n", "    pageInfo {\n", "      hasNextPage\n", "      endCursor\n", "    }\n", "    nodes {\n", "      id\n", "      givenName\n", "      familyName\n", "      name\n", "      affiliation {\n", "        name\n", "      }\n", "      works(first: 3) {\n", "        nodes {\n", "          id\n", "          publicationYear\n", "          publisher\n", "          titles {\n", "            title\n", "          }\n", "          creators {\n", "            id\n", "            name\n", "            affiliation {\n", "              id\n", "              name\n", "            }\n", "          }\n", "          subjects {\n", "            subject\n", "          }\n", "        }\n", "      }\n", "    }\n", "  }\n", "}\n", "\"\"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Run the above query via the GraphQL client" ] }, { "cell_type": "code", "execution_count": 236, "metadata": {}, "outputs": [], "source": [ "# Initialise overall data dict that will store results\n", "data1 = {}\n", "\n", "# Parse the query once; only the cursor variable changes between pages\n", "query = gql(query_str)\n", "\n", "# Keep retrieving results until there are no more results left\n", "while True:\n", "    # Pass the query variables as a dict\n", "    res = client.execute(query, variable_values=query_params1)\n", "    people = res[\"people\"]\n", "    if \"people\" not in data1:\n", "        data1 = res\n", "    else:\n", "        data1[\"people\"][\"nodes\"].extend(people[\"nodes\"])\n", "    pageInfo = people[\"pageInfo\"]\n", "    if pageInfo[\"hasNextPage\"] and pageInfo[\"endCursor\"] is not None:\n", "        # Advance the cursor in this query's own parameters (query_params1, not query_params)\n", "        query_params1[\"query_end_cursor\"] = pageInfo[\"endCursor\"]\n", "    else:\n", "        break" ] }, { "cell_type": "code", "execution_count": 237, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "| First Name | Surname | Link to ORCID | Affiliations | Works | \n", "|---|---|---|---|---|\n", "John E | Smith | [Smith, John E. 3rd](https://orcid.org/0000-0002-0888-1274) | University of Arizona<br>
University of Virginia | [Smith, John Edward](https://orcid.org/0000-0002-0888-1274) (2020) [CS_216516.sf3](https://doi.org/10.7910/dvn/ozqd7g/1a1qty) *Harvard Dataverse*<br>[Smith, John Edward](https://orcid.org/0000-0002-0888-1274) (2020) [human N2Aus PKA phosphorylation](https://doi.org/10.7910/dvn/ozqd7g) *Harvard Dataverse*<br>[Lostal, William](https://orcid.org/0000-0003-1014-1950) et al. (2019) [Titin splicing regulates cardiotoxicity...](https://doi.org/10.1126/scitranslmed.aat6072) *American Association for the Advancement of Science (AAAS)*<br>" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from textwrap import shorten\n", "\n", "# Collect all relevant details for the researchers found\n", "tableBody=set()\n", "people = data1['people']\n", "for node in people['nodes']:\n", "    id = node['id']\n", "    firstName = node['givenName']\n", "    surname = node['familyName']\n", "    name = node['name']\n", "    # TODO: Remove if we manage to search only individual fields\n", "    if not testIfPresentCaseInsensitive(name_query, name.lower()):\n", "        continue\n", "    orcidHref = \"\"\n", "    if id is not None and id != \"\":\n", "        orcidHref = \"[\"+ name +\"](\"+ id +\")\"\n", "    affiliations = []\n", "    for affiliation in node['affiliation']:\n", "        affiliations.append(affiliation['name'])\n", "    works = \"\"\n", "    if 'works' in node:\n", "        for work in node['works']['nodes']:\n", "            titles = []\n", "            for title in work['titles']:\n", "                titles.append(shorten(title['title'], width=50, placeholder=\"...\"))\n", "            creators = []\n", "            cnt = 0\n", "            for creator in work['creators']:\n", "                cnt += 1\n", "                # Restrict display to the first author only\n", "                if (cnt > 1):\n", "                    creators[-1] += \" et al.\"\n", "                    break\n", "                if creator['id'] is not None:\n", "                    creators.append(\"[\" + creator['name'] + \"](\" + creator['id'] + \")\")\n", "                else:\n", "                    creators.append(creator['name'])\n", "\n", "            # One line per work: first author, year, linked title and publisher\n", "            works += '; '.join(creators) + \" (\" + str(work['publicationYear']) + \") [\"+ ', '.join(titles) +\"](\"+ work['id'] + \") *\" + work['publisher'] + \"*<br>\"\n", "\n", "    tableBody.add(firstName + \" | \" + surname + \" | \" + orcidHref + \" | \" + '<br>'.join(sorted(affiliations)) + \" | \" + works)\n", "display(Markdown(\"| First Name | Surname | Link to ORCID | Affiliations | Works | \\n|---|---|---|---|---|\\n%s\" % '\\n'.join(tableBody)))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.1" } }, "nbformat": 4, "nbformat_minor": 4 }