{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ " ![FREYA Logo](https://github.com/datacite/pidgraph-notebooks-python/blob/master/images/freya_200x121.png?raw=true) | [FREYA](https://www.project-freya.eu/en) WP2 [User Story 9](https://github.com/datacite/freya/issues/26) | As a bibliometrician, I want to know all the co-authors of a particular researcher, so that I can do a network analysis of the researcher's collaborations.\n", " :------------- | :------------- | :-------------\n", "\n", "A number of useful analyses are made possible by identifying co-authorship groups of a given researcher, for example identifying other active scientists in the researcher's field of study, or groups of closely collaborating (and often co-funded) author affiliations.

\n", "This notebook uses the [DataCite GraphQL API](https://api.datacite.org/graphql) to retrieve all publications of [Dr Sarah Teichmann](https://orcid.org/0000-0002-6294-6366).\n", "\n", "**Goal**: By the end of this notebook, for a researcher of interest, you should be able to:\n", "- Display an interactive sankey plot of the researcher's publication co-authors;
\n", "- Download a file containing their publication DOIs;\n", "- Load the above file into [VOSviewer](https://www.vosviewer.com/) and then construct and visualise the researcher's co-authorship network, following the steps listed in the notebook." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Install libraries and prepare GraphQL client" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%capture\n", "# Install required Python packages (pandas is needed for the DOI download cell)\n", "!pip install gql requests numpy plotly pandas" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Prepare the GraphQL client\n", "import requests\n", "from IPython.display import display, Markdown\n", "from gql import gql, Client\n", "from gql.transport.requests import RequestsHTTPTransport\n", "\n", "_transport = RequestsHTTPTransport(\n", "    url='https://api.datacite.org/graphql',\n", "    use_json=True,\n", ")\n", "\n", "client = Client(\n", "    transport=_transport,\n", "    fetch_schema_from_transport=True,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Define and run GraphQL query\n", "Define the GraphQL query to find all publications, including co-authors, for [Dr Sarah Teichmann](https://orcid.org/0000-0002-6294-6366):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Define the GraphQL query: find all publications, including co-authors, for researcher id \"https://orcid.org/0000-0002-6294-6366\"\n", "query_params = {\n", "    \"researcherId\" : \"https://orcid.org/0000-0002-6294-6366\",\n", "    \"maxWorks\" : 300\n", "}\n", "\n", "query = gql(\"\"\"query getResearcherPublication($researcherId: ID!, $maxWorks: Int!)\n", "{\n", "  person(id: $researcherId) {\n", "    id\n", "    name\n", "    publications(first:$maxWorks) {\n", "      totalCount\n", "      published {\n", "        title\n", "        count\n", "      }\n", "      nodes {\n", "        id\n", "        type\n", "        versionOfCount\n", "        titles {\n", "          title\n", "        }\n", "        creators {\n", "          id\n", "          name\n", "        }\n", "      }\n", "    }\n", "  }\n", "}\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Run the above query via the GraphQL client:" ] }, { "cell_type": "code", 
"execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Pass the query variables as a plain dict\n", "data = client.execute(query, variable_values=query_params)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Display total number of publications by the researcher\n", "Display the total number of the researcher's outputs to date." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Get the total number of publications to date\n", "publications = data['person']['publications']\n", "display(Markdown(str(publications['totalCount'])))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Plot the researcher's publication co-authors\n", "Display a sankey plot of the co-authors sharing **at least two** publications with the researcher, highlighting them by frequency of co-authorship." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import plotly.graph_objects as go\n", "import plotly.io as pio\n", "import plotly.express as px\n", "from IPython.display import IFrame\n", "\n", "# Retrieve creator names and ORCID ids from all publications\n", "all_creator_ids = []\n", "all_creator_ids_set = set()\n", "creator_id2name = {}\n", "publications = data['person']['publications']\n", "for r in publications['nodes']:\n", "    if r['versionOfCount'] > 0:\n", "        # If the current output is a version of another one, exclude it\n", "        continue\n", "    # Keep only creators that have an id\n", "    creator_ids = list(filter(None, [s['id'] for s in r['creators']]))\n", "    all_creator_ids_set.update(creator_ids)\n", "    all_creator_ids.append(creator_ids)\n", "    for creator in r['creators']:\n", "        if (creator['id'] not in creator_id2name and creator['id'] is not None):\n", "            creator_id2name[creator['id']] = creator['name']\n", "\n", "# Collect creator names into all_unique_creator_names - these will be labels in the sankey plot\n", "# Initialise coauthorship_matrix, that will be used to populate lists needed for the sankey 
plot\n", "all_unique_creator_ids = list(all_creator_ids_set)\n", "length = len(all_unique_creator_ids)\n", "coauthorship_matrix = []\n", "all_unique_creator_names = []\n", "for creator_id in all_unique_creator_ids:\n", "    all_unique_creator_names.append(creator_id2name[creator_id])\n", "    coauthorship_matrix.append([0] * length)\n", "\n", "# Populate coauthorship_matrix: count shared publications for every pair of creators\n", "for cids in all_creator_ids:\n", "    for cid in cids:\n", "        c_pos = all_unique_creator_ids.index(cid)\n", "        for co_cid in cids:\n", "            co_pos = all_unique_creator_ids.index(co_cid)\n", "            if c_pos != co_pos:\n", "                coauthorship_matrix[c_pos][co_pos] += 1\n", "\n", "# Use coauthorship_matrix to populate lists needed for the sankey diagram: sourceIndexes, targetIndexes and linkWeights\n", "# For Plotly colour swatches, see: https://plotly.com/python/builtin-colorscales/\n", "colRange = px.colors.sequential.matter\n", "# Highest valid index into colRange; larger weights are capped at the last colour\n", "maxColIndex = len(colRange) - 1\n", "sourceIndexes = []\n", "targetIndexes = []\n", "linkWeights = []\n", "linkColours = []\n", "for c_pos, r in enumerate(coauthorship_matrix):\n", "    # On the left hand side of sankey retain only the researcher in question\n", "    if all_unique_creator_ids[c_pos] != query_params['researcherId']:\n", "        continue\n", "    for co_pos, weight in enumerate(r):\n", "        if weight > 1:\n", "            # Include links to co-authors of at least 2 publications\n", "            sourceIndexes.append(c_pos)\n", "            targetIndexes.append(co_pos)\n", "            linkWeights.append(weight)\n", "            linkColours.append(colRange[min(maxColIndex, weight)])\n", "\n", "# Create a sankey plot\n", "fig = go.Figure(data=[go.Sankey(\n", "    node = dict(\n", "        pad = 15,\n", "        thickness = 20,\n", "        line = dict(color = \"black\", width = 0.5),\n", "        label = all_unique_creator_names,\n", "        color = \"rgba(136,65,157, 0.6)\"\n", "    ),\n", "    link = dict(\n", "        source = sourceIndexes, # indices correspond to labels in all_unique_creator_names\n", "        target = targetIndexes, # ditto\n", "        value = linkWeights,\n", "        color = 
linkColours\n", "    ))])\n", "\n", "fig.update_layout(title_text=\"\", font_size=10)\n", "# Write interactive plot out to html file\n", "pio.write_html(fig, file='out.html')\n", "\n", "# Display plot from the saved html file\n", "display(Markdown(\"### [%s](%s)'s first-degree co-authors:\" % (creator_id2name[query_params['researcherId']], query_params['researcherId'])))\n", "IFrame(src=\"./out.html\", width=1000, height=800)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Download a file containing publication DOIs\n", "This file can be loaded into the [VOSviewer](https://www.vosviewer.com/) tool to construct and visualise the researcher's co-authorship network, using the following steps (see the image below):\n", "1. Select the *File* tab on the right, then click the **Create** button.\n", "2. In the *Choose type of data* window, select **Create a map based on bibliographic data**.\n", "3. In the *Choose data source* window, select **Download data through API**.\n", "4. In the *Specify search query or select file* window, select the **DOI** tab, then *API*: **Crossref**, then in the *DOI files* text box type in or select the path to the file of DOIs you downloaded.\n", "5. Click the **Finish** button to construct and display the network.
\n", "![VOSviewer Steps](VOSviewer_steps.png)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "from IPython.display import Javascript\n", "\n", "# Collect publication DOIs so that they can be downloaded\n", "dois = []\n", "publications = data['person']['publications']\n", "for n in publications['nodes']:\n", "    if n['versionOfCount'] > 0:\n", "        # If the current output is a version of another one, exclude it\n", "        continue\n", "    dois.append(n['id'])\n", "df = pd.DataFrame(dois)\n", "# Name the file after the ORCID iD at the end of the researcher id\n", "file_name = \"%s_dois.csv\" % query_params['researcherId'].split(\"/\")[-1]\n", "\n", "# JavaScript that offers the CSV content as a browser download\n", "js_download = \"\"\"\n", "var csv = '%s';\n", "\n", "var filename = '%s';\n", "var blob = new Blob([csv], { type: 'text/csv;charset=utf-8;' });\n", "if (navigator.msSaveBlob) { // IE 10+\n", "    navigator.msSaveBlob(blob, filename);\n", "} else {\n", "    var link = document.createElement(\"a\");\n", "    if (link.download !== undefined) { // feature detection\n", "        // Browsers that support HTML5 download attribute\n", "        var url = URL.createObjectURL(blob);\n", "        link.setAttribute(\"href\", url);\n", "        link.setAttribute(\"download\", filename);\n", "        link.style.visibility = 'hidden';\n", "        document.body.appendChild(link);\n", "        link.click();\n", "        document.body.removeChild(link);\n", "    }\n", "}\n", "\"\"\" % (df.to_csv(index=False, header=False).replace('\\n','\\\\n').replace(\"\\'\",\"\\\\'\").replace(\"\\\"\",\"\").replace(\"\\r\",\"\"), file_name)\n", "\n", "display(Javascript(js_download))\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# This section contains an example co-authorship network for Dr Sarah Teichmann's publications - hence the conditional logic below\n", "if query_params['researcherId'] == \"https://orcid.org/0000-0002-6294-6366\":\n", "    display(Markdown(\"\"\"\n", "## [Dr Sarah 
Teichmann](https://orcid.org/0000-0002-6294-6366)'s co-authorship network as shown in VOSviewer\n", "Interestingly, the network (excluding publications with author lists longer than 25) shows clusters with at least three versions of the researcher's author name:\n", "- Teichmann Sarah A.\n", "- Teichmann Sarah A\n", "- Teichmann Sarah\n", "![VOSviewer Network](VOSviewer_network.png)\n", "\"\"\"))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.1" } }, "nbformat": 4, "nbformat_minor": 4 }