{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![FREYA Logo](https://github.com/datacite/pidgraph-notebooks-python/blob/master/images/freya_200x121.png?raw=true) | [FREYA](https://www.project-freya.eu/en) | WP2 [User Story 6](https://www.pidforum.org/t/pid-graph-graphql-example-disambiguate-researchers/931): As a researcher, I am looking for more information about another researcher with a common name, but don’t know his/her ORCID ID.\n",
    ":------------- | :------------- | :-------------\n",
    "\n",
    "It is important to be able to locate a researcher of interest even though their ORCID ID is unknown. For example, a reader of a scientific publication may wish to find out more about one of the authors, whereby the publisher has not cross-referenced that author's name to ORCID.<p />\n",
    "\n",
    "This notebook uses the [DataCite GraphQL API](https://api.datacite.org/graphql) to disambiguate a researcher name via a *funnel* approach:\n",
    " * First all researcher records matching query \"John AND Smith\" and retrieved, and an alphabetically sorted list of affiliations and the corresponding researcher names is displayed;\n",
    " * Then the notebook simulates the user selecting one of the affiliations (in our case \"University of Arizona\"), and then performs a more detailed query: \"John AND Smith AND University of Arizona\". The second query retrieves and displays a much smaller set of results, now also containing the researcher's publications, thus helping the user pinpoint the researcher of interest more easily.\n",
    "\n",
    "**Goal**: By the end of this notebook, you should be able successfully disambiguate a researcher name of interest."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Install libraries and prepare GraphQL client"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 228,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%capture\n",
    "# Install required Python packages\n",
    "!pip install gql requests"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 229,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Prepare the GraphQL client\n",
    "import requests\n",
    "from IPython.display import display, Markdown\n",
    "from gql import gql, Client\n",
    "from gql.transport.requests import RequestsHTTPTransport\n",
    "\n",
    "_transport = RequestsHTTPTransport(\n",
    "    url='https://api.datacite.org/graphql',\n",
    "    use_json=True,\n",
    ")\n",
    "\n",
    "client = Client(\n",
    "    transport=_transport,\n",
    "    fetch_schema_from_transport=True,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Define and run GraphQL query\n",
    "Define the GraphQL query to find all publications including co-authors for [Dr Sarah Teichmann](https://orcid.org/0000-0002-6294-6366):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 231,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Generate the GraphQL query to retrieve up to 100 researchers matching query \"John and Smith\"\n",
    "query_params = {\n",
    "    \"query\" : \"John AND Smith\",\n",
    "    \"max_researchers\" : 100,\n",
    "    \"query_end_cursor\" : \"\"\n",
    "}\n",
    "\n",
    "query_str = \"\"\"query getResearchersByName(\n",
    "    $query: String!,\n",
    "    $max_researchers: Int!,\n",
    "    $query_end_cursor : String!\n",
    "    )\n",
    "{\n",
    "  people(query: $query, first: $max_researchers, after: $query_end_cursor) {\n",
    "    totalCount\n",
    "    pageInfo {\n",
    "      hasNextPage\n",
    "      endCursor\n",
    "    }  \n",
    "    nodes {\n",
    "      id\n",
    "      givenName\n",
    "      familyName\n",
    "      name\n",
    "      affiliation {\n",
    "        name\n",
    "      }\n",
    "    }\n",
    "  }\n",
    "}\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Run the above query via the GraphQL client"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 232,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "found_next_page = True\n",
    "\n",
    "# Initialise overall data dict that will store results\n",
    "data = {}\n",
    "\n",
    "# Keep retrieving results until there are no more results left\n",
    "while True:\n",
    "    query = gql(\"%s\" % query_str)\n",
    "    res = client.execute(query, variable_values=json.dumps(query_params))\n",
    "    if \"people\" not in data:\n",
    "        data = res\n",
    "    else:\n",
    "        people = res[\"people\"]\n",
    "        data[\"people\"][\"nodes\"].extend(people[\"nodes\"])\n",
    "        pageInfo = people[\"pageInfo\"]\n",
    "        if pageInfo[\"hasNextPage\"]:\n",
    "            if pageInfo[\"endCursor\"] is not None:\n",
    "                query_params[\"query_end_cursor\"] = pageInfo[\"endCursor\"]            \n",
    "            else:\n",
    "                break\n",
    "        else:\n",
    "            break"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## List researcher details\n",
    "List in tabular format affilitions and the corresponding researcher names. This allows the user to select one of the affiliations to use in a more detailed query (see below) that also retrieves publications."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 234,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "Total number of researchers found: **210**<br>The list of researchers by affiliation is as follows:"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/markdown": [],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/markdown": [
       "| Affiliation | Researcher Names |\n",
       "|---|---|\n",
       "American Chemical Society | John Smith\n",
       "American Science and Engineering, Inc. | Henry John Peter Smith\n",
       "Bank Street College of Education | John Smith\n",
       "Bedford Institute of Oceanography | John Smith\n",
       "Beecham Pharmaceuticals | John Smith\n",
       "Birkenhead High School Academy | John Arthur Smith\n",
       "Bureau of Ocean Energy Management, Pacific OCS Region | John Smith\n",
       "CU Sports Medicine and Performance | John-Rudolph Smith\n",
       "Charles Sturt University - Wagga Wagga Campus | John Smith\n",
       "Church of Norway | John Arthur Smith\n",
       "City College of New York | John Smith Del Rosario\n",
       "Colorado School of Mines | John Smith\n",
       "Cornell University | John-David Smith\n",
       "CottonInfo | John Smith\n",
       "Drew University | John Smith\n",
       "East Carolina University | John Smith\n",
       "Fairleigh Dickinson University | John Smith\n",
       "Federation of Liberian Youth - FLY | John Solunta Smith Jr\n",
       "Fire Risk Assessment Network | John Smith\n",
       "Flagburn Health Center | John Smith\n",
       "Fluent Technology | John Smith\n",
       "George Washington University | John Smith\n",
       "Georgia State University | John Smith\n",
       "GlaxoSmithKline Plc | John Smith\n",
       "Lipscomb University | John Smith\n",
       "London University | John Smith\n",
       "Louisiana State University | John F. Smith\n",
       "MSG Software (USA), Inc. | Henry John Peter Smith\n",
       "Manhattan College | Henry John Peter Smith\n",
       "Michigan State University | John Smith\n",
       "Millersville University | John Smith\n",
       "NASA Langley Research Center | John Smith\n",
       "New South Wales Department of Primary Industries Agriculture | John Smith\n",
       "Northeastern University | Henry John Peter Smith\n",
       "Northwestern University | John F. Smith\n",
       "Nova Scotia Health Authority South Western Nova Scotia | John Smith\n",
       "OCS Energy Consultant | John Smith\n",
       "Ohio State University | John R. Smith\n",
       "Oxford University Press | John Arthur Smith\n",
       "Peking University | John Solunta Smith Jr\n",
       "Pennsylvania State University | John Smith\n",
       "Proof Read My File | John Smith\n",
       "RMIT University City Campus | John Smith\n",
       "Retired | John Arthur Smith\n",
       "Rutgers New Jersey Medical School | John Smith Del Rosario\n",
       "Rutgers University Camden | John Smith\n",
       "Sample invited position | John Smith\n",
       "Sigma Xi the Scientific Research Society | John Smith\n",
       "TPE Associates Inc | Henry John Peter Smith\n",
       "Technical Support | John Smith\n",
       "Tennessee Technological University | John Smith\n",
       "The New School for Social Research | John Smith\n",
       "The University of St Andrews | Christopher John Smith\n",
       "Tufts University | Henry John Peter Smith\n",
       "Ulster Univeristy | John Smith\n",
       "Ulster University | John Smith\n",
       "University College London | John Smith\n",
       "University at Buffalo | John Smith\n",
       "University of Arizona | Smith, John E. 3rd\n",
       "University of California Davis | John R Smith\n",
       "University of Cambridge | John Arthur Smith\n",
       "University of Central Missouri | John Smith\n",
       "University of Colorado | John Smith\n",
       "University of Colorado Boulder | JOHN SMITH, John Smith\n",
       "University of Liverpool | John Arthur Smith, Quintin-John Smith\n",
       "University of Michigan | John R. Smith\n",
       "University of Missouri Columbia | John Smith\n",
       "University of Ottawa | John Smith\n",
       "University of Oxford | Christopher John Smith\n",
       "University of Pennsylvania | John F. Smith, John Smith\n",
       "University of St Andrews | Christopher John Smith\n",
       "University of Strathclyde | John Smith\n",
       "University of Toledo | John-David Smith\n",
       "University of Toronto | John Smith\n",
       "University of Virginia | Smith, John E. 3rd\n",
       "University of York | John Smith\n",
       "Vanderbilt University | John Smith\n",
       "Virginia Commonwealth University | John Lee Smith\n",
       "Visidyne, Inc. | Henry John Peter Smith\n",
       "Yale University | John Smith\n"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Collect names and affiliations for the researchers found\n",
    "# Test if fieldValue matches (case-insensitively) a Solr-style query (with \" AND \" representing the logical AND, and \" \" representing the logical OR)\n",
    "def testIfPresentCaseInsensitive(solrQuery, fieldValueLowerCase):\n",
    "    for orTerms in solrQuery.split(\" AND \"):\n",
    "        present = False\n",
    "        for term in orTerms.split(\" \"):\n",
    "            if term.lower() in fieldValueLowerCase:\n",
    "                present = True\n",
    "                break\n",
    "        if not present:\n",
    "            return False\n",
    "    return True\n",
    "\n",
    "people = data['people']\n",
    "af2Names = {}\n",
    "totalCount = 0\n",
    "for node in people['nodes']:\n",
    "    id = node['id']\n",
    "    name = node['name']\n",
    "#     TODO: Remove if we manage to search only individual fields\n",
    "    if not testIfPresentCaseInsensitive(query_params['query'], name.lower()):\n",
    "        continue\n",
    "    totalCount += 1\n",
    "    for af in node['affiliation']:\n",
    "        affiliation = af['name']\n",
    "        if affiliation not in af2Names:\n",
    "            af2Names[affiliation] = set()\n",
    "        af2Names[affiliation].add(name)\n",
    "\n",
    "tableBody = \"\"\n",
    "for af,names in sorted(af2Names.items()):\n",
    "    tableBody += af + \" | \" + ', '.join(names) + \"\\n\"\n",
    "display(Markdown(\"Total number of researchers found: **%d**<br>The list of researchers by affiliation is as follows:\" % totalCount))\n",
    "display(Markdown(\"\"))\n",
    "\n",
    "display(Markdown(\"| Affiliation | Researcher Names |\\n|---|---|\\n%s\" % tableBody))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 235,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Generate the GraphQL query to retrieve all researchers matching query \"John and Smith\" and affiliation \"University of Arizona\", now with works\n",
    "name_query = \"John AND Smith\"\n",
    "affiliation_query = \"\\\"University of Arizona\\\"\"\n",
    "query_params1 = {\n",
    "    \"query\" : name_query + \" AND \" + affiliation_query,\n",
    "    \"max_researchers\" : 10,\n",
    "    \"query_end_cursor\" : \"\"    \n",
    "}\n",
    "\n",
    "query_str = \"\"\"query getResearchersByName(\n",
    "    $query: String!,\n",
    "    $max_researchers: Int!,\n",
    "    $query_end_cursor : String!\n",
    "    )\n",
    "{\n",
    "  people(query: $query, first: $max_researchers, after: $query_end_cursor) {\n",
    "    totalCount\n",
    "    pageInfo {\n",
    "      hasNextPage\n",
    "      endCursor\n",
    "    }      \n",
    "    nodes {\n",
    "      id\n",
    "      givenName\n",
    "      familyName\n",
    "      name\n",
    "      affiliation {\n",
    "        name\n",
    "      }\n",
    "      works(first: 3) {\n",
    "        nodes {\n",
    "          id\n",
    "          publicationYear\n",
    "          publisher\n",
    "          titles {\n",
    "            title\n",
    "          }\n",
    "          creators {\n",
    "            id\n",
    "            name\n",
    "            affiliation {\n",
    "              id\n",
    "              name\n",
    "            }\n",
    "          }\n",
    "          subjects {\n",
    "            subject\n",
    "          }\n",
    "        }\n",
    "      }\n",
    "    }\n",
    "  }\n",
    "}\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Run the above query via the GraphQL client"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 236,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "found_next_page = True\n",
    "\n",
    "# Initialise overall data dict that will store results\n",
    "data1 = {}\n",
    "\n",
    "# Keep retrieving results until there are no more results left\n",
    "while True:\n",
    "    query = gql(\"%s\" % query_str)\n",
    "    res = client.execute(query, variable_values=json.dumps(query_params1))\n",
    "    if \"people\" not in data1:\n",
    "        data1 = res\n",
    "    else:\n",
    "        people = res[\"people\"]\n",
    "        data1[\"people\"][\"nodes\"].extend(people[\"nodes\"])\n",
    "        pageInfo = people[\"pageInfo\"]\n",
    "        if pageInfo[\"hasNextPage\"]:\n",
    "            if pageInfo[\"endCursor\"] is not None:\n",
    "                query_params[\"query_end_cursor\"] = pageInfo[\"endCursor\"]            \n",
    "            else:\n",
    "                break\n",
    "        else:\n",
    "            break"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 237,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "| First Name | Surname | Link to ORCID | Affiliations | Works | \n",
       "|---|---|---|---|---|\n",
       "John E | Smith | [Smith, John E. 3rd](https://orcid.org/0000-0002-0888-1274) | University of Arizona<br>University of Virginia | [Smith, John Edward](https://orcid.org/0000-0002-0888-1274) (2020) [CS_216516.sf3](https://doi.org/10.7910/dvn/ozqd7g/1a1qty) *Harvard Dataverse*<br>[Smith, John Edward](https://orcid.org/0000-0002-0888-1274) (2020) [human N2Aus PKA phosphorylation](https://doi.org/10.7910/dvn/ozqd7g) *Harvard Dataverse*<br>[Lostal, William](https://orcid.org/0000-0003-1014-1950) et al. (2019) [Titin splicing regulates cardiotoxicity...](https://doi.org/10.1126/scitranslmed.aat6072) *American Association for the Advancement of Science (AAAS)*<br>"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "from textwrap import shorten\n",
    "\n",
    "# Collect all relevant details for the researchers found\n",
    "tableBody=set()\n",
    "people = data1['people']\n",
    "for node in people['nodes']:\n",
    "    id = node['id']\n",
    "    firstName = node['givenName']\n",
    "    surname = node['familyName']\n",
    "    name = node['name']\n",
    "#     TODO: Remove if we manage to search only individual fields\n",
    "    if not testIfPresentCaseInsensitive(name_query, name.lower()):\n",
    "        continue    \n",
    "    orcidHref = \"\"\n",
    "    if id is not None and id != \"\":\n",
    "        orcidHref = \"[\"+ name +\"](\"+ id +\")\"    \n",
    "    affiliations = []\n",
    "    for affiliation in node['affiliation']:\n",
    "        affiliations.append(affiliation['name'])\n",
    "    works = \"\"\n",
    "    if 'works' in node:\n",
    "        for work in node['works']['nodes']:\n",
    "            titles = []\n",
    "            for title in work['titles']:\n",
    "                titles.append(shorten(title['title'], width=50, placeholder=\"...\"))\n",
    "            creators = []\n",
    "            cnt = 0\n",
    "            for creator in work['creators']:\n",
    "                cnt += 1\n",
    "                # Restrict display to the first author only                 \n",
    "                if (cnt > 1):\n",
    "                    creators[-1] += \" et al.\"\n",
    "                    break\n",
    "                if creator['id'] is not None:\n",
    "                    creators.append(\"[\" + creator['name'] + \"](\" + creator['id'] + \")\")\n",
    "                else:\n",
    "                    creators.append(creator['name'])\n",
    "            \n",
    "            works += '; '.join(creators) + \" (\" + str(work['publicationYear']) + \") [\"+ ', '.join(titles) +\"](\"+ work['id'] + \") *\" + work['publisher'] + \"*<br>\" \n",
    "        \n",
    "    tableBody.add(firstName + \" | \" + surname + \" | \" + orcidHref + \" | \" + '<br>'.join(sorted(affiliations)) + \" | \" + works)\n",
    "display(Markdown(\"| First Name | Surname | Link to ORCID | Affiliations | Works | \\n|---|---|---|---|---|\\n%s\" % '\\n'.join(tableBody)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}