{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ " ![FREYA Logo](https://github.com/datacite/pidgraph-notebooks-python/blob/master/images/freya_200x121.png?raw=true) | [FREYA](https://www.project-freya.eu/en) WP2 [User Story 8](https://github.com/datacite/freya/issues/38) | As a longitudinal study, I want to be able to deduplicate the metrics/impact for our data, so that I can see the impact of our study’s data as a whole.\n", " :------------- | :------------- | :-------------\n", "\n", "Scientific datasets may be composed of individual components, whereby the parent and each component are identified by a different DOI and hence can be cited, viewed and downloaded individually. In order to assess the reuse such datasets, their authors must be able to aggregate views, downloads and citations metrics across all the dataset components.

\n", "This notebook uses the [DataCite GraphQL API](https://api.datacite.org/graphql) to retrieve all parts of the dataset: [2014 TCCON Data Release dataset](https://doi.org/10.14291/tccon.ggg2014), so that its overall impact can be quantified.\n", "\n", "**Goal**: By the end of this notebook, for a given dataset with constituent parts, you should be able to display:\n", "- Counts of citations, views and downloads metrics, aggregated across the parent dataset and all its parts;\n", "- An interactive stacked bar plot showing how the metric counts of each part contribute to the corresponding aggregated metric counts, e.g.
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Install libraries and prepare GraphQL client" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "%%capture\n", "# Install required Python packages\n", "!pip install gql requests numpy plotly" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# Prepare the GraphQL client\n", "import requests\n", "from IPython.display import display, Markdown\n", "from gql import gql, Client\n", "from gql.transport.requests import RequestsHTTPTransport\n", "\n", "_transport = RequestsHTTPTransport(\n", " url='https://api.datacite.org/graphql',\n", " use_json=True,\n", ")\n", "\n", "client = Client(\n", " transport=_transport,\n", " fetch_schema_from_transport=True,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Define and run GraphQL query\n", "Define the GraphQL query to retrieve [2014 TCCON Data Release dataset](https://doi.org/10.14291/tccon.ggg2014)." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# Generate the GraphQL query to retrieve all parts of the 2014 TCCON Data Release dataset.\n", "query_params = {\n", " \"datasetId\" : \"https://doi.org/10.14291/tccon.ggg2014\"\n", "}\n", "\n", "query = gql(\"\"\"query getDataset($datasetId: ID!)\n", "{\n", " dataset(id: $datasetId) {\n", " id\n", " titles {\n", " title\n", " }\n", " publicationYear\n", " descriptions {\n", " description\n", " descriptionType\n", " }\n", " citationCount\n", " viewCount\n", " downloadCount\n", " partCount\n", " parts {\n", " nodes {\n", " id\n", " titles {\n", " title\n", " }\n", " publicationYear\n", " descriptions {\n", " description\n", " descriptionType\n", " }\n", " citationCount\n", " viewCount\n", " downloadCount\n", " }\n", " }\n", " }\n", "}\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Run the above query via the GraphQL client" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "import json\n", "data = client.execute(query, variable_values=json.dumps(query_params))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Display total dataset metrics\n", "Display total number of citations, views and downloads of [2014 TCCON Data Release dataset](https://doi.org/10.14291/tccon.ggg2014), aggregated across all the parts." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "Aggregated metric counts for [2014 TCCON Data Release dataset](https://doi.org/10.14291/tccon.ggg2014) and its 2 parts:" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "|Metric | Aggregated Count|\n", "|---|---|\n", "citationCount | **6**\n", "viewCount | **210**\n", "downloadCount | **54**\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Get the total count per metric, aggregated for the parent dataset and all its parts\n", "dataset = data['dataset']\n", "# Initialise metric counts for the parent dataset\n", "metricCounts = {}\n", "for metric in ['citationCount', 'viewCount', 'downloadCount']:\n", " metricCounts[metric] = dataset[metric]\n", " \n", "# Aggregate metric counts across all the parts\n", "for node in dataset['parts']['nodes']:\n", " for metric in metricCounts:\n", " metricCounts[metric] += node[metric]\n", " \n", "# Display the aggregated metric counts\n", "tableBody=\"\"\n", "for metric in metricCounts: \n", " tableBody += \"%s | **%s**\\n\" % (metric, str(metricCounts[metric]))\n", "if tableBody:\n", " display(Markdown(\"Aggregated metric counts for [2014 TCCON Data Release dataset](https://doi.org/10.14291/tccon.ggg2014) and its %d parts:\" % dataset['partCount']))\n", " display(Markdown(\"|Metric | Aggregated Count|\\n|---|---|\\n%s\" % tableBody)) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Plot metric counts per part\n", "Plot stacked bar plot showing how the individual parts of [2014 TCCON Data Release dataset](https://doi.org/10.14291/tccon.ggg2014) contribute their metric counts to the corresponding aggregated total." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "Citations, views and downloads counts for [2014 TCCON Data Release dataset](https://doi.org/10.14291/tccon.ggg2014) and individual parts, shown as stacked bar plot:" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import plotly.io as pio\n", "import plotly.express as px\n", "from IPython.display import IFrame\n", "import pandas as pd\n", "\n", "# Adapted from: https://stackoverflow.com/questions/58766305/is-there-any-way-to-implement-stacked-or-grouped-bar-charts-in-plotly-express\n", "def px_stacked_bar(df, color_name='Metric', y_name='Metrics', **pxargs):\n", " idx_col = df.index.name\n", " m = pd.melt(df.reset_index(), id_vars=idx_col, var_name=color_name, value_name=y_name)\n", " # For Plotly colour sequences see: https://plotly.com/python/discrete-color/ \n", " return px.bar(m, x=idx_col, y=y_name, color=color_name, **pxargs, \n", " color_discrete_sequence=px.colors.qualitative.Pastel1)\n", "\n", "# Collect metric counts\n", "dataset = data['dataset']\n", "numParts = dataset['partCount']\n", "\n", "# Initialise dicts for the stacked bar plot\n", "labels = {0: 'Dataset and Parts', 1: 'Dataset (%s)' % dataset['publicationYear']}\n", "citationCounts = {}\n", "viewCounts = {}\n", "downloadCounts = {}\n", "\n", "# Collect dataset/part labels\n", "partCnt = 0\n", "for node in dataset['parts']['nodes']:\n", " labels[2 + partCnt] = 'Part %d (%s)' % ((partCnt + 1), node['publicationYear'])\n", " partCnt += 1\n", " \n", "# Initialise aggregated metric counts (key: 0) and populate parent dataset metric counts (key: 1)\n", "i = 0\n", "while (i < 2):\n", " citationCounts[i] = dataset['citationCount']\n", " viewCounts[i] = dataset['viewCount']\n", " downloadCounts[i] = dataset['downloadCount']\n", " i += 1\n", " \n", "# Populate metric counts for individual parts (key: 2 + partCnt) and add them to the aggregated counts (key: 0)\n", "partCnt = 0\n", "for node in dataset['parts']['nodes']:\n", " citationCounts[0] += node['citationCount']\n", " viewCounts[0] += node['viewCount']\n", " downloadCounts[0] += node['downloadCount']\n", " citationCounts[2 + partCnt] = node['citationCount']\n", " viewCounts[2 + partCnt] = node['viewCount']\n", " downloadCounts[2 + partCnt] = node['downloadCount']\n", " partCnt += 1\n", "\n", "# Create stacked bar plot\n", "df = pd.DataFrame({'Dataset/Parts': labels,\n", " 'Citations': citationCounts,\n", " 'Views': viewCounts,\n", " 'Downloads': downloadCounts})\n", "fig = px_stacked_bar(df.set_index('Dataset/Parts'), y_name = \"Counts\")\n", "\n", "# Set plot background to transparent\n", "fig.update_layout({\n", "'plot_bgcolor': 'rgba(0, 0, 0, 0)',\n", "'paper_bgcolor': 'rgba(0, 0, 0, 0)'\n", "})\n", "\n", "# Write interactive plot out to html file\n", "pio.write_html(fig, file='out.html')\n", "\n", "# Display plot from the saved html file\n", "display(Markdown(\"Citations, views and downloads counts for [2014 TCCON Data Release dataset](https://doi.org/10.14291/tccon.ggg2014) and individual parts, shown as stacked bar plot:\"))\n", "IFrame(src=\"./out.html\", width=500, height=500)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.1" } }, "nbformat": 4, "nbformat_minor": 4 }