{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    " ![FREYA Logo](https://github.com/datacite/pidgraph-notebooks-python/blob/master/images/freya_200x121.png?raw=true) | [FREYA](https://www.project-freya.eu/en) WP2 [User Story 9](https://github.com/datacite/freya/issues/26) | As a bibliometrician, I want to know all the co-authors of a particular researcher, so that I can do a network analysis of the researcher's collaborations.\n",
    " :------------- | :------------- | :-------------\n",
    "\n",
    "A number of useful analyses are made possible by identifying co-authorship groups of a given researcher, for example identifying other active scientists in the researcher's field of study, or groups of closely collaborating (and often co-funded) author affiliations.<p />\n",
    "This notebook uses the [DataCite GraphQL API](https://api.datacite.org/graphql) to retrieve all publications of [Dr Sarah Teichmann](https://orcid.org/0000-0002-6294-6366).\n",
    "\n",
    "**Goal**: By the end of this notebook, for a researcher of interest, you should be able to:\n",
    "- Display an interactive sankey plot of the researcher's publication co-authors, e.g. <br> <img src=\"example_plot.png\" width=\"286\" height=\"209\" />\n",
    "- Download a file containing their publication DOIs;\n",
    "- Load the above file into [VOSviewer](https://www.vosviewer.com/) and then construct and visualise the researcher's co-authorship network, following the steps listed in the notebook, e.g. <br> <img src=\"VOSviewer_network.png\" width=\"305\" height=\"188\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Install libraries and prepare GraphQL client"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%capture\n",
    "# Install required Python packages\n",
    "!pip install gql requests numpy plotly"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Prepare the GraphQL client\n",
    "import requests\n",
    "from IPython.display import display, Markdown\n",
    "from gql import gql, Client\n",
    "from gql.transport.requests import RequestsHTTPTransport\n",
    "\n",
    "_transport = RequestsHTTPTransport(\n",
    "    url='https://api.datacite.org/graphql',\n",
    "    use_json=True,\n",
    ")\n",
    "\n",
    "client = Client(\n",
    "    transport=_transport,\n",
    "    fetch_schema_from_transport=True,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Define and run GraphQL query\n",
    "Define the GraphQL query to find all publications including co-authors for [Dr Sarah Teichmann](https://orcid.org/0000-0002-6294-6366):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Generate the GraphQL query: find all publications, including co-authors or researcher id: \"https://orcid.org/0000-0002-6294-6366\"\n",
    "query_params = {\n",
    "    \"researcherId\" : \"https://orcid.org/0000-0002-6294-6366\",\n",
    "    \"maxWorks\" : 300\n",
    "}\n",
    "\n",
    "query = gql(\"\"\"query getResearcherPublication($researcherId: ID!, $maxWorks: Int!)\n",
    "{\n",
    "  person(id: $researcherId) {\n",
    "    id\n",
    "    name\n",
    "    publications(first:$maxWorks) {\n",
    "      totalCount\n",
    "      published {\n",
    "        title\n",
    "        count\n",
    "      }\n",
    "      nodes {\n",
    "        id\n",
    "        type\n",
    "        versionOfCount\n",
    "        titles {\n",
    "          title\n",
    "        }\n",
    "        creators {\n",
    "          id\n",
    "          name\n",
    "        }\n",
    "      }\n",
    "    }\n",
    "  }\n",
    "}\n",
    "\"\"\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Run the above query via the GraphQL client"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "data = client.execute(query, variable_values=json.dumps(query_params))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Display total number of publications by the researcher\n",
    "Display the total number of the researcher's outputs to date."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "89"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Get the total number of publication to date\n",
    "publications = data['person']['publications']\n",
    "display(Markdown(str(publications['totalCount'])))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Plot the researcher's publications co-authors\n",
    "Display a sankey plot of the co-authors sharing **at least two** publications with the researcher, highlighting them by frequency of co-authorship."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "### [Teichmann, Sarah](https://orcid.org/0000-0002-6294-6366)'s first degree co-authors:"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "\n",
       "        <iframe\n",
       "            width=\"1000\"\n",
       "            height=\"800\"\n",
       "            src=\"./out.html\"\n",
       "            frameborder=\"0\"\n",
       "            allowfullscreen\n",
       "        ></iframe>\n",
       "        "
      ],
      "text/plain": [
       "<IPython.lib.display.IFrame at 0x11d9e5588>"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import plotly.graph_objects as go\n",
    "import plotly.io as pio\n",
    "import plotly.express as px\n",
    "from IPython.display import IFrame\n",
    "\n",
    "# Retrieve creator names and ORCID ids from all publications\n",
    "all_creator_ids = []\n",
    "all_creator_ids_set = set([])\n",
    "creator_id2name = {}\n",
    "publications = data['person']['publications']\n",
    "for r in publications['nodes']:\n",
    "    if r['versionOfCount'] > 0:\n",
    "        # If the current output is a version of another one, exclude it\n",
    "        continue\n",
    "    creator_ids = list(filter(None, [s['id'] for s in r['creators']]))\n",
    "    all_creator_ids_set.update(creator_ids)\n",
    "    all_creator_ids.append(creator_ids)\n",
    "    for creator in r['creators']:\n",
    "        if (creator['id'] not in creator_id2name and creator['id'] is not None):\n",
    "            creator_id2name[creator['id']] = creator['name']\n",
    "            \n",
    "# Collect creator names into all_unique_creator_names - these will be labels in the sankey plot\n",
    "# Initialise coauthorship_matrix, that will be used to populate lists needed for the sankey plot\n",
    "all_unique_creator_ids = list(all_creator_ids_set)\n",
    "length = len(all_unique_creator_ids)\n",
    "coauthorship_matrix = []\n",
    "all_unique_creator_names = []\n",
    "for id in all_unique_creator_ids:\n",
    "    all_unique_creator_names.append(creator_id2name[id])\n",
    "    coauthorship_matrix.append([0] * length)\n",
    "    \n",
    "# Populate coauthorship_matrix\n",
    "for cids in all_creator_ids:\n",
    "    for cid in cids:\n",
    "        c_pos = all_unique_creator_ids.index(cid)\n",
    "        for cid in cids:\n",
    "            co_pos = all_unique_creator_ids.index(cid)\n",
    "            if c_pos != co_pos:\n",
    "                coauthorship_matrix[c_pos][co_pos] += 1\n",
    "                \n",
    "# Use coauthorship_matrix to populate lists needed for the sankey diagram: sourceIndexes, targetIndexes and linkWeights\n",
    "# For Plotly colour swatches, see: https://plotly.com/python/builtin-colorscales/\n",
    "colRange = px.colors.sequential.matter;\n",
    "maxColIndex = len(colRange)\n",
    "sourceIndexes = []\n",
    "targetIndexes = []\n",
    "linkWeights = []\n",
    "linkColours = []\n",
    "for c_pos, r in enumerate(coauthorship_matrix):\n",
    "    # On the left hand side of sankey retain only the researcher in question\n",
    "    if all_unique_creator_ids[c_pos] != query_params['researcherId']:\n",
    "        continue\n",
    "    for co_pos, weight in enumerate(r):\n",
    "            if coauthorship_matrix[c_pos][co_pos] > 1:\n",
    "                # Include links to co-authors of at least 2 publications                 \n",
    "                sourceIndexes.append(c_pos)\n",
    "                targetIndexes.append(co_pos)\n",
    "                linkWeights.append(weight)\n",
    "                linkColours.append(colRange[min(maxColIndex, weight)])\n",
    "\n",
    "# Create a sankey plot \n",
    "fig = go.Figure(data=[go.Sankey(\n",
    "    node = dict(\n",
    "      pad = 15,\n",
    "      thickness = 20,\n",
    "      line = dict(color = \"black\", width = 0.5),\n",
    "      label = all_unique_creator_names,\n",
    "      color = \"rgba(136,65,157, 0.6)\"\n",
    "    ),\n",
    "    link = dict(\n",
    "      source = sourceIndexes, # indices correspond to labels in all_unique_creator_names\n",
    "      target = targetIndexes, # ditto\n",
    "      value = linkWeights,\n",
    "      color = linkColours\n",
    "  ))])\n",
    "\n",
    "fig.update_layout(title_text=\"\", font_size=10)\n",
    "# Write interactive plot out to html file\n",
    "pio.write_html(fig, file='out.html')\n",
    "\n",
    "# Display plot from the saved html file\n",
    "display(Markdown(\"### [%s](%s)'s first degree co-authors:\" % (creator_id2name[query_params['researcherId']], query_params['researcherId'])))\n",
    "IFrame(src=\"./out.html\", width=1000, height=800)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Download a file containing publication DOIs\n",
    "This file can be loaded into [VOSviewer](https://www.vosviewer.com/) tool in order to construct and visualise the researcher's co-authorship network, using the following steps (see the image below):\n",
    "1. Select *File* tab on the right, then click on **Create** button\n",
    "2. In the *Choose type of data* window, select **Create a map based on biobliographic data**\n",
    "3. In the *Choose data source* window, select **Download data through API**\n",
    "4. In the *Specify search query or select file* select **DOI** tab, then *API*: **Crossref**, then in the *DOI files* text box type in or select the path to the file of DOIs you downloaded.\n",
    "6. Click on **Finish** button to construct and display the network. \n",
    "![VOSviewer Steps](VOSviewer_steps.png)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/javascript": [
       "\n",
       "var csv = 'https://doi.org/10.17863/cam.6611\\nhttps://doi.org/10.17863/cam.8651\\nhttps://doi.org/10.17863/cam.9293\\nhttps://doi.org/10.17863/cam.9696\\nhttps://doi.org/10.17863/cam.20316\\nhttps://doi.org/10.17863/cam.26822\\nhttps://doi.org/10.17863/cam.27576\\nhttps://doi.org/10.17863/cam.32384\\nhttps://doi.org/10.17863/cam.33936\\nhttps://doi.org/10.17863/cam.37473\\nhttps://doi.org/10.17863/cam.39438\\nhttps://doi.org/10.17863/cam.39791\\nhttps://doi.org/10.17863/cam.39961\\nhttps://doi.org/10.17863/cam.39963\\nhttps://doi.org/10.17863/cam.39964\\nhttps://doi.org/10.17863/cam.39965\\nhttps://doi.org/10.17863/cam.39966\\nhttps://doi.org/10.17863/cam.39967\\nhttps://doi.org/10.17863/cam.39969\\nhttps://doi.org/10.17863/cam.39971\\nhttps://doi.org/10.17863/cam.39972\\nhttps://doi.org/10.17863/cam.39984\\nhttps://doi.org/10.17863/cam.39985\\nhttps://doi.org/10.17863/cam.39986\\nhttps://doi.org/10.17863/cam.39987\\nhttps://doi.org/10.17863/cam.40022\\nhttps://doi.org/10.17863/cam.40032\\nhttps://doi.org/10.17863/cam.40023\\nhttps://doi.org/10.17863/cam.40025\\nhttps://doi.org/10.17863/cam.40026\\nhttps://doi.org/10.17863/cam.40027\\nhttps://doi.org/10.17863/cam.40029\\nhttps://doi.org/10.17863/cam.40039\\nhttps://doi.org/10.17863/cam.40030\\nhttps://doi.org/10.17863/cam.40034\\nhttps://doi.org/10.17863/cam.40036\\nhttps://doi.org/10.17863/cam.40037\\nhttps://doi.org/10.17863/cam.40038\\nhttps://doi.org/10.17863/cam.40040\\nhttps://doi.org/10.17863/cam.40041\\nhttps://doi.org/10.17863/cam.40043\\nhttps://doi.org/10.17863/cam.40045\\nhttps://doi.org/10.17863/cam.40046\\nhttps://doi.org/10.17863/cam.40047\\nhttps://doi.org/10.17863/cam.40048\\nhttps://doi.org/10.17863/cam.40044\\nhttps://doi.org/10.17863/cam.40049\\nhttps://doi.org/10.17863/cam.40050\\nhttps://doi.org/10.17863/cam.40051\\nhttps://doi.org/10.17863/cam.44557\\nhttps://doi.org/10.17863/cam.44717\\nhttps://doi.org/10.1038/s41591-019-0468-5\\nhttps://doi.org/10.17863/cam.46893\\nhttps://doi.org/10.17863/cam.47089\\nhttps://doi.org/10.17863/cam.47221\\nhttps://doi.org/10.17863/cam.47223\\nhttps://doi.org/10.17863/cam.47246\\nhttps://doi.org/10.17863/cam.47621\\nhttps://doi.org/10.17863/cam.48303\\nhttps://doi.org/10.1038/s41592-019-0692-4\\nhttps://doi.org/10.1038/s41467-019-14171-5\\nhttps://doi.org/10.1101/2020.01.28.911115\\nhttps://doi.org/10.1101/2019.12.12.871657\\nhttps://doi.org/10.1126/science.aat1699\\nhttps://doi.org/10.1038/s41467-018-07307-6\\nhttps://doi.org/10.1126/science.aan6828\\nhttps://doi.org/10.1101/709998\\nhttps://doi.org/10.1038/s41556-019-0333-2\\nhttps://doi.org/10.1101/413047\\nhttps://doi.org/10.1126/science.aat5031\\nhttps://doi.org/10.1038/s41467-019-11266-x\\nhttps://doi.org/10.1101/338178\\nhttps://doi.org/10.1038/s41467-018-07771-0\\nhttps://doi.org/10.1101/309831\\nhttps://doi.org/10.1101/397042\\nhttps://doi.org/10.26508/lsa.201800124\\nhttps://doi.org/10.1038/s41592-018-0254-1\\nhttps://doi.org/10.1038/s41592-018-0082-3\\nhttps://doi.org/10.1126/sciimmunol.aal2192\\nhttps://doi.org/10.1242/dev.152561\\nhttps://doi.org/10.1126/science.aah4115\\nhttps://doi.org/10.17863/cam.50163\\nhttps://doi.org/10.17863/cam.50702\\nhttps://doi.org/10.1101/gr.207704.116\\nhttps://doi.org/10.1038/s41590-020-0602-z\\nhttps://doi.org/10.1126/science.aay3224\\nhttps://doi.org/10.17863/cam.52009\\nhttps://doi.org/10.17863/cam.52896\\nhttps://doi.org/10.17863/cam.53893\\n';\n",
       "\n",
       "var filename = '0000-0002-6294-6366_dois.csv';\n",
       "var blob = new Blob([csv], { type: 'application/x-bibtex;charset=utf-8;' });\n",
       "if (navigator.msSaveBlob) { // IE 10+\n",
       "    navigator.msSaveBlob(blob, filename);\n",
       "} else {\n",
       "    var link = document.createElement(\"a\");\n",
       "    if (link.download !== undefined) { // feature detection\n",
       "        // Browsers that support HTML5 download attribute\n",
       "        var url = URL.createObjectURL(blob);\n",
       "        link.setAttribute(\"href\", url);\n",
       "        link.setAttribute(\"download\", filename);\n",
       "        link.style.visibility = 'hidden';\n",
       "        document.body.appendChild(link);\n",
       "        link.click();\n",
       "        document.body.removeChild(link);\n",
       "    }\n",
       "}\n"
      ],
      "text/plain": [
       "<IPython.core.display.Javascript object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "from IPython.display import Javascript\n",
    "from requests.utils import requote_uri\n",
    "\n",
    "# Collect publication DOIs so that it can be downloaded\n",
    "dois = []\n",
    "publications = data['person']['publications']\n",
    "for n in publications['nodes']:\n",
    "    if n['versionOfCount'] > 0:\n",
    "        # If the current output is a version of another one, exclude it\n",
    "        continue\n",
    "    dois.append(n['id'])\n",
    "df = pd.DataFrame(dois, columns = None)\n",
    "file_name = \"%s_dois.csv\" % query_params['researcherId'].split(\"/\")[-1]\n",
    "\n",
    "js_download = \"\"\"\n",
    "var csv = '%s';\n",
    "\n",
    "var filename = '%s';\n",
    "var blob = new Blob([csv], { type: 'application/x-bibtex;charset=utf-8;' });\n",
    "if (navigator.msSaveBlob) { // IE 10+\n",
    "    navigator.msSaveBlob(blob, filename);\n",
    "} else {\n",
    "    var link = document.createElement(\"a\");\n",
    "    if (link.download !== undefined) { // feature detection\n",
    "        // Browsers that support HTML5 download attribute\n",
    "        var url = URL.createObjectURL(blob);\n",
    "        link.setAttribute(\"href\", url);\n",
    "        link.setAttribute(\"download\", filename);\n",
    "        link.style.visibility = 'hidden';\n",
    "        document.body.appendChild(link);\n",
    "        link.click();\n",
    "        document.body.removeChild(link);\n",
    "    }\n",
    "}\n",
    "\"\"\" % (df.to_csv(index=False, header=False).replace('\\n','\\\\n').replace(\"\\'\",\"\\\\'\").replace(\"\\\"\",\"\").replace(\"\\r\",\"\"), file_name)\n",
    "    \n",
    "display(Javascript(js_download))\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "\n",
       "## [Dr Sarah Teichmann](https://orcid.org/0000-0002-6294-6366)'s co-authorship network as shown in VOSviewer\n",
       "Interestingly, the network (excluding publications with author lists longer than 25) shows clusters with at least three versions of the researcher's author name:\n",
       "- Teichmann Sarah A.\n",
       "- Teichmann Sarah A\n",
       "- Teichmann Sarah\n",
       "![VOSviewer Network](VOSviewer_network.png)\n"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# This section contains an example of co-authorship network for Dr Sarah Teichmann's publications - hence the conditional logic below\n",
    "if query_params['researcherId'] == \"https://orcid.org/0000-0002-6294-6366\":\n",
    "    display(Markdown(\"\"\"\n",
    "## [Dr Sarah Teichmann](https://orcid.org/0000-0002-6294-6366)'s co-authorship network as shown in VOSviewer\n",
    "Interestingly, the network (excluding publications with author lists longer than 25) shows clusters with at least three versions of the researcher's author name:\n",
    "- Teichmann Sarah A.\n",
    "- Teichmann Sarah A\n",
    "- Teichmann Sarah\n",
    "![VOSviewer Network](VOSviewer_network.png)\n",
    "\"\"\"))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}