{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "![FREYA Logo](https://github.com/datacite/pidgraph-notebooks-python/blob/master/images/freya_200x121.png?raw=true) | [FREYA](https://www.project-freya.eu/en) | WP2 [User Story 7]( https://www.pidforum.org/t/pid-graph-graphql-example-second-degree-citations/939): As a data center, I want to see the citations of publications that use my repository for the underlying data, so that I can demonstrate the impact of our repository.\n", ":------------- | :------------- | :-------------\n", "\n", "It is important for repositories of scientific data to monitor and report on the impact of the data they store. One useful proxy of that impact are secondary citations, i.e. citations of publications which use the deposited data. This notebook focuses on visualisation of these citations by means of a force-directed graph.

\n", "This notebook uses the [DataCite GraphQL API](https://api.datacite.org/graphql) to retrieve the citations of the following different datasets: \n", "- [Effects of varying food-availability on ecology and distribution of smallest benthic organisms in sediments of the arctic Fram Strait during POLARSTERN cruise ARK-XV/2, supplement to: Schewe, Ingo; Soltwedel, Thomas (2003): Benthic response to ice-edge-induced particle flux in the Arctic Ocean. Polar Biology, 26(9), 610-620](https://doi.org/10.1594/pangaea.314690);\n", "- [Data from: Towards a worldwide wood economics spectrum](https://doi.org/10.5061/dryad.234); and\n", "- [rmca-albertine-rift-cichlids](https://doi.org/10.15468/n6ftyd).\n", "\n", "**Goal**: By the end of this notebook, for a given list of datasets, you should be able to display:\n", "- Total citation count for each retrieved dataset;\n", "- An interactive force-directed graph of the datasets and their citations, in which:\n", " - Pink nodes at the centre of each radial shape corresponds to a dataset;\n", " - Blue nodes correspond to citations (note that some citations may be shared by more than one dataset);\n", " - Larger node size represents more citations of the dataset or citation represented by that node. Note that to increase node visibility, node sizes between datasets and citations are not comparable to each other.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Install libraries and prepare GraphQL client" ] }, { "cell_type": "code", "execution_count": 123, "metadata": {}, "outputs": [], "source": [ "%%capture\n", "# Install required Python packages\n", "!pip install gql requests pyvis jsonpickle" ] }, { "cell_type": "code", "execution_count": 124, "metadata": {}, "outputs": [], "source": [ "# Prepare the GraphQL client\n", "import requests\n", "from IPython.display import display, Markdown\n", "from gql import gql, Client\n", "from gql.transport.requests import RequestsHTTPTransport\n", "\n", "_transport = RequestsHTTPTransport(\n", " url='https://api.datacite.org/graphql',\n", " use_json=True,\n", ")\n", "\n", "client = Client(\n", " transport=_transport,\n", " fetch_schema_from_transport=True,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Define and run GraphQL query\n", "Define the GraphQL query to find all publications including co-authors for [Dr Sarah Teichmann](https://orcid.org/0000-0002-6294-6366):" ] }, { "cell_type": "code", "execution_count": 119, "metadata": {}, "outputs": [], "source": [ "# Generate the GraphQL query to retrieve up to 100 researchers matching query \"John and Smith\"\n", "query_params = {\n", " \"ids\" : [\"10.5061/dryad.234\",\"10.15468/n6ftyd\",\"10.1594/pangaea.314690\"]\n", "}\n", "\n", "query = gql(\"\"\"query getDatasetCitations($ids: [String!]) {\n", " datasets(ids: $ids) {\n", " nodes {\n", " id\n", " titles {\n", " title\n", " }\n", " citationCount\n", " citations {\n", " nodes {\n", " id\n", " publisher\n", " titles {\n", " title\n", " }\n", " citationCount\n", " }\n", " }\n", " }\n", " }\n", "}\n", "\n", "\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Run the above query via the GraphQL client" ] }, { "cell_type": "code", "execution_count": 120, "metadata": {}, "outputs": [], "source": [ "import json\n", "data = client.execute(query, variable_values=json.dumps(query_params))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Display total number of citations per dataset" ] }, { "cell_type": "code", "execution_count": 131, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "| Dataset | Citation Count|\n", "|---|---|\n", "[Effects of varying food-availability on ecology and distribution of smallest benthic organisms in sediments of the arctic Fram Strait during POLARSTERN cruise ARK-XV/2, supplement to: Schewe, Ingo; Soltwedel, Thomas (2003): Benthic response to ice-edge-induced particle flux in the Arctic Ocean. Polar Biology, 26(9), 610-620](https://doi.org/10.1594/pangaea.314690) | [**164**](https://search.datacite.org/works/10.1594/pangaea.314690)\n", "[Data from: Towards a worldwide wood economics spectrum](https://doi.org/10.5061/dryad.234) | [**4**](https://search.datacite.org/works/10.5061/dryad.234)\n", "[rmca-albertine-rift-cichlids](https://doi.org/10.15468/n6ftyd) | [**150**](https://search.datacite.org/works/10.15468/n6ftyd)\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Get total citation counts for each dataset in the query\n", "datasets = data['datasets']\n", "tableBody=\"\"\n", "for dataset in datasets['nodes']:\n", " id = dataset['id']\n", " doi = \"/\".join(id.split(\"/\")[3:])\n", " titles = []\n", " for title in dataset['titles']:\n", " titles.append(title['title'])\n", " citationCount = dataset['citationCount']\n", " tableBody += \"[%s](%s) | [**%s**](%s/%s)\\n\" % (', '.join(titles), id, citationCount, \"https://search.datacite.org/works\",doi)\n", "if tableBody:\n", " display(Markdown(\"| Dataset | Citation Count|\\n|---|---|\\n%s\" % tableBody)) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Plot a force-directed graph connecting datasets to their publications and citations of those publications\n", "Plot an interactive force-directed graph of connecting the datasets to their citations (first-degree) and the citations of those citations (second-degree).\n", " - Pink nodes at the centre of each radial shape corresponds to a dataset;\n", " - Blue nodes correspond to citations (note that some citations may be shared by more than one dataset);\n", " - Larger node size represents more citations of the dataset or citation represented by that node. Note that to increase node visibility, node sizes between datasets and citations are not comparable to each other." ] }, { "cell_type": "code", "execution_count": 135, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "N.B. Click on the plot, then use down/up mouse scroll to zoom in/out respectively.
When zoomed in, you will notice the DOI label against each node.
Click on any node to see the list of 'neighbour' citations, and on the citation node to also see the number of its citations." ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 135, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from pyvis.network import Network\n", "import pandas as pd\n", "from IPython.display import IFrame\n", "import math\n", "\n", "# Colour swatch for the network nodes\n", "dataset_node_colour = \"#FB8072\"\n", "citation_node_colour = \"#80B1D3\"\n", "\n", "got_net = Network(height=\"750px\", width=\"100%\", bgcolor=\"#ffffff\", font_color=\"black\", notebook=True)\n", "got_net.options.edges.inherit_colors(False)\n", "\n", "# set the physics layout of the network\n", "got_net.barnes_hut()\n", "\n", "# ------------------------------\n", "# Initialise intermediate data structure to store: (src, trg) -> citation count of the target, where:\n", "# src - dataset or citation; trg - citation\n", "srcTrg2Count = {}\n", "# Initialise intermediate data structure to store: src --> Set of connected trg's\n", "# Note that the number of connected trgs will determine the colour of each src\n", "src2OtherTrgs = {}\n", "\n", "datasets = data['datasets']\n", "\n", "# Populate srcTrg2Count\n", "allNodes = set()\n", "for node in datasets['nodes']:\n", " nodeSet = set()\n", " datasetDOI = \"/\".join(node['id'].split(\"/\")[3:])\n", " nodeSet.add(datasetDOI)\n", " for citation in node['citations']['nodes']:\n", " citationDOI = \"/\".join(citation['id'].split(\"/\")[3:])\n", " citationCount = citation['citationCount']\n", " nodeSet.add(citationDOI)\n", " if datasetDOI not in src2OtherTrgs:\n", " src2OtherTrgs[datasetDOI] = set()\n", " src2OtherTrgs[datasetDOI].add(citationDOI)\n", " if citationDOI not in src2OtherTrgs:\n", " src2OtherTrgs[citationDOI] = set() \n", " src2OtherTrgs[citationDOI].add(datasetDOI) \n", " srcTrg2Count[(datasetDOI, citationDOI)] = citationCount \n", " nodes = sorted(list(nodeSet))\n", " allNodes.update(nodes)\n", "\n", "# Populate data structures needed for the graph\n", "sources, targets, weights = [], [], []\n", "for tuple in srcTrg2Count:\n", " if srcTrg2Count[tuple] >= 0:\n", " sources.append(tuple[0])\n", " targets.append(tuple[1])\n", " weights.append(srcTrg2Count[tuple])\n", "\n", "edge_data = zip(sources, targets, weights)\n", "\n", "for e in edge_data:\n", " src = e[0]\n", " dst = e[1]\n", " w = e[2]\n", " src_node_size = 5 * math.log2(len(src2OtherTrgs[src]) * 5000)\n", " got_net.add_node(src, src, title=\"Dataset: %s;\" % src, color=dataset_node_colour, size=src_node_size) \n", " # We're adding 1 below to make edges representing 0 citations of the target appear in the force-directed graph \n", " dst_node_size = 10 * math.log2((w+1) * 10)\n", " got_net.add_node(dst, dst, title=\"Citation: %s; Number of citations: %d;\" % (dst, w), color=citation_node_colour, size=dst_node_size)\n", " got_net.add_edge(src, dst, value=1)\n", " \n", "neighbor_map = got_net.get_adj_list()\n", "# add neighbor data to node hover data\n", "for node in got_net.nodes:\n", " node[\"title\"] += \" Neighbours:
\" + \"
\".join(neighbor_map[node[\"id\"]])\n", "\n", "got_net.show(\"out.html\")\n", "display(Markdown(\"N.B. Click on the plot, then use down/up mouse scroll to zoom in/out respectively.
When zoomed in, you will notice the DOI label against each node.
Click on any node to see the list of 'neighbour' citations, and on the citation node to also see the number of its citations.\"))\n", "IFrame(src=\"./out.html\", width=1000, height=800)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.1" } }, "nbformat": 4, "nbformat_minor": 4 }