{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ " ![FREYA Logo](https://github.com/datacite/pidgraph-notebooks-python/blob/master/images/freya_200x121.png?raw=true) | [FREYA](https://www.project-freya.eu/en) WP2 [User Story 5](https://github.com/datacite/freya/issues/35) | As a student using the British Library's EThOS database, I want to be able to find all dissertations on a given topic. \n", " :------------- | :------------- | :-------------\n", "\n", "It is important for postgraduate students to be able to easily identify existing dissertations on a research topic of interest.

\n", "This notebook uses the [DataCite GraphQL API](https://api.datacite.org/graphql) to retrieve all dissertations for three different queries: *Shakespeare*, *Machine learning* and *Ebola*. These queries were chosen because they show distinct trends in the number of dissertations over time.\n", "\n", "**Goal**: By the end of this notebook you should be able to:\n", "- Retrieve all dissertations (across multiple repositories) matching a specific query; \n", "- For each query:\n", "  - Display a bar plot of the number of dissertations per year, including a trend line;
\n", "  - Display a pie chart showing the number of dissertations per repository;\n", "  - Display a word cloud of words from dissertation titles and descriptions;
\n", "  - Download all dissertations in a single BibTeX file." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Install libraries and prepare GraphQL client" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%capture\n", "# Install required Python packages (scikit-learn is the PyPI name of the sklearn module)\n", "!pip install gql requests scikit-learn matplotlib wordcloud numpy pandas" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Prepare the GraphQL client\n", "import requests\n", "from IPython.display import display, Markdown\n", "from gql import gql, Client\n", "from gql.transport.requests import RequestsHTTPTransport\n", "\n", "_transport = RequestsHTTPTransport(\n", "    url='https://api.datacite.org/graphql',\n", "    use_json=True,\n", ")\n", "\n", "client = Client(\n", "    transport=_transport,\n", "    fetch_schema_from_transport=True,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Define and run GraphQL query\n", "Define a GraphQL query that retrieves all dissertations for three search terms, chosen because they yield distinct trends in the number of dissertations over time: *Shakespeare*, *Machine learning* and *Ebola*." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Define the GraphQL query: retrieve all dissertations for three different queries: \n", "# Shakespeare, Machine learning and Ebola.\n", "query_params = {\n", " \"query1\" : \"shakespeare\",\n", " \"query2\" : \"Machine learning\",\n", " \"query3\" : \"ebola\",\n", " \"query1_end_cursor\" : \"\",\n", " \"query2_end_cursor\" : \"\",\n", " \"query3_end_cursor\" : \"\",\n", " \"max_dissertations\" : 100\n", "}\n", "\n", "queryStr = \"\"\"query getDissertationsByQuery(\n", " $query1: String!,\n", " $query2: String!,\n", " $query3: String!,\n", " $query1_end_cursor: String!,\n", " $query2_end_cursor: String!,\n", " $query3_end_cursor: String!,\n", " $max_dissertations: Int!\n", " )\n", "{\n", " query1: dissertations(query: $query1, first: $max_dissertations, after: $query1_end_cursor) {\n", " totalCount\n", " pageInfo {\n", " hasNextPage\n", " endCursor\n", " } \n", " published {\n", " count\n", " title\n", " }\n", " nodes {\n", " id\n", " titles {\n", " title\n", " }\n", " descriptions {\n", " description\n", " }\n", " repository {\n", " name\n", " }\n", " versionOfCount\n", " identifiers {\n", " identifier\n", " }\n", " publicationYear\n", " bibtex\n", " repository {\n", " id\n", " }\n", " publisher\n", " creators {\n", " id\n", " name\n", " }\n", " }\n", " }, \n", " query2: dissertations(query: $query2, first: $max_dissertations, after: $query2_end_cursor) {\n", " totalCount\n", " pageInfo {\n", " hasNextPage\n", " endCursor\n", " } \n", " published {\n", " count\n", " title\n", " }\n", " nodes {\n", " id\n", " titles {\n", " title\n", " }\n", " descriptions {\n", " description\n", " }\n", " repository {\n", " name\n", " } \n", " versionOfCount\n", " identifiers {\n", " identifier\n", " }\n", " publicationYear\n", " bibtex\n", " repository {\n", " id\n", " }\n", " publisher\n", " creators {\n", " id\n", " name\n", " }\n", " }\n", " },\n", " query3: dissertations(query: 
$query3, first: $max_dissertations, after: $query3_end_cursor) {\n", " totalCount\n", " pageInfo {\n", " hasNextPage\n", " endCursor\n", " } \n", " published {\n", " count\n", " title\n", " }\n", " nodes {\n", " id\n", " titles {\n", " title\n", " }\n", " descriptions {\n", " description\n", " }\n", " repository {\n", " name\n", " } \n", " versionOfCount\n", " bibtex\n", " identifiers {\n", " identifier\n", " }\n", " publicationYear\n", " repository {\n", " id\n", " }\n", " publisher\n", " creators {\n", " id\n", " name\n", " }\n", " }\n", " }\n", "}\n", "\"\"\"\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Run the above query via the GraphQL client." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# queries_with_more_results lists the queries that still have further pages of results to collect\n", "queries_with_more_results = ['query1', 'query2', 'query3']\n", "# Initialise overall data dict that will store results across all queries\n", "data = {}\n", "\n", "# Keep retrieving results until there are no more results left for any of the three queries\n", "while len(queries_with_more_results) > 0:\n", "    query = gql(queryStr)\n", "    # Pass the GraphQL variables as a dict\n", "    res = client.execute(query, variable_values=query_params)\n", "    for query in queries_with_more_results:\n", "        if query not in data:\n", "            data[query] = res[query]\n", "        else:\n", "            data[query][\"nodes\"].extend(res[query][\"nodes\"])\n", "\n", "    # Advance the cursor of each query that has another page; retire the ones that are finished\n", "    for query in ['query1', 'query2', 'query3']:\n", "        if query not in queries_with_more_results:\n", "            continue\n", "        cursor_params_key = query + \"_end_cursor\"\n", "        dissertations = res[query]\n", "        pageInfo = dissertations[\"pageInfo\"]\n", "        if pageInfo[\"hasNextPage\"]:\n", "            if pageInfo[\"endCursor\"] is not None:\n", "                query_params[cursor_params_key] = pageInfo[\"endCursor\"]\n", "            else:\n", "                query_params[cursor_params_key] = \"\"\n", "                queries_with_more_results.remove(query)\n", "        else:\n", "            query_params[cursor_params_key] = \"\"\n", "            queries_with_more_results.remove(query)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Display total number of dissertations\n", "For each query, display the total number of dissertations." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Get the total number of dissertations per query\n", "for query in ['query1', 'query2', 'query3']:\n", "    print(\"The total number of dissertations for query '%s':\\n%s\" % (query_params[query], str(data[query]['totalCount'])))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Plot number of dissertations per year\n", "For each query, display a bar plot of the number of dissertations per year between *start_year* and *end_year* (both defined in the code below). A trend line highlights the general direction of change in dissertation numbers over that period." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Plot the number of dissertations by year\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import pandas as pd\n", "from sklearn.linear_model import Ridge\n", "\n", "start_year = 1990\n", "end_year = 2020\n", "for query in ['query1', 'query2', 'query3']:\n", "    dissertations = data[query]\n", "    plt.rcdefaults()\n", "    years = [int(s['title']) for s in dissertations['published']]\n", "    num_outputs4years = [s['count'] for s in dissertations['published']]\n", "    # Get a list of all consecutive years between start_year and end_year (inclusive)\n", "    all_years = list(range(start_year, end_year + 1))\n", "    # Populate output counts (into num_outputs) for all consecutive years\n", "    num_outputs = []\n", "    for year in all_years:\n", "        if year in years:\n", "            idx = years.index(year)\n", "            num_outputs.append(num_outputs4years[idx])\n", "        else:\n", "            num_outputs.append(0)\n", "\n", "    df = pd.DataFrame({'Year': all_years, 'Count': num_outputs})\n", "    # Fit a ridge regression to use as the trend line for the plot\n", "    lr = Ridge()\n", "    lr.fit(df[['Year']], df['Count'])\n", "\n", "    fig, ax = plt.subplots(1, 1, figsize = (10, 5))\n", "    ax.bar(df['Year'], df['Count'], align='center', color='blue', edgecolor='black', linewidth=1, alpha=0.5)\n", "    ax.set_xticks(df['Year'])\n", "    ax.set_xticklabels(all_years, rotation='vertical')\n", "    ax.set_ylabel('Number of dissertations per Year')\n", "    ax.set_xlabel('Year')\n", "    ax.set_title(\"Number of dissertations found by query: '%s' since %s, with trend line\" % (query_params[query], str(start_year)))\n", "    ax.plot(df['Year'], lr.predict(df[['Year']]), color='orange')\n", "    plt.show()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Plot number of dissertations per repository\n", "For each query, display a pie chart showing the number of 
dissertations per repository." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Plot a pie chart of dissertation counts per repository\n", "import matplotlib.pyplot as plt\n", "import operator\n", "\n", "for query in ['query1', 'query2', 'query3']:\n", "    num_outputs_dict = {}\n", "    dissertations = data[query]\n", "    for r in dissertations['nodes']:\n", "        repo = r['repository']['name']\n", "        if repo not in num_outputs_dict:\n", "            num_outputs_dict[repo] = 0\n", "        num_outputs_dict[repo] += 1\n", "\n", "    # Sort repositories by dissertation count, descending\n", "    sorted_num_outputs = sorted(num_outputs_dict.items(), key=operator.itemgetter(1), reverse=True)\n", "    # Populate lists needed for pie chart\n", "    repositories = [s[0] for s in sorted_num_outputs]\n", "    num_outputs = [s[1] for s in sorted_num_outputs]\n", "\n", "    # Generate a pie chart of the number of dissertations per repository\n", "    fig = plt.figure()\n", "    ax = fig.add_axes([0,0,1,1])\n", "    ax.set_title(\"Number of Dissertations found by query: '%s' Per Repository\" % query_params[query])\n", "    ax.axis('equal')\n", "    ax.pie(num_outputs, labels = repositories, autopct='%1.0f%%')\n", "    plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Display a word cloud of dissertation titles and descriptions\n", "For each query, display a word cloud of the words in dissertation titles and descriptions." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from wordcloud import WordCloud, STOPWORDS\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "\n", "stopWords = set(STOPWORDS)\n", "stopWords.add('_')\n", "\n", "for query in ['query1', 'query2', 'query3']:\n", "    titleWords = []\n", "    dissertations = data[query]\n", "    for r in dissertations['nodes']:\n", "        # Collect lower-cased tokens from every title and every description\n", "        for title in r['titles']:\n", "            titleWords += [t.lower() for t in str(title['title']).split()]\n", "        for description in r['descriptions']:\n", "            titleWords += [t.lower() for t in str(description['description']).split()]\n", "\n", "    # Circular mask: pixels outside the circle are set to 255 and left blank\n", "    x, y = np.ogrid[:800, :800]\n", "    mask = (x - 400) ** 2 + (y - 400) ** 2 > 345 ** 2\n", "    mask = 255 * mask.astype(int)\n", "\n", "    wordcloud = WordCloud(width = 600, height = 600,\n", "                          background_color ='white',\n", "                          stopwords = stopWords,\n", "                          min_font_size = 10,\n", "                          mask = mask).generate(\" \".join(titleWords))\n", "\n", "    fig, ax = plt.subplots(1, 1, figsize = (8, 8), facecolor = None)\n", "    ax.set_title(\"Word cloud of titles and descriptions of %d dissertations found by query: '%s'\" % (len(dissertations['nodes']), query_params[query]))\n", "    plt.imshow(wordcloud, interpolation=\"bilinear\")\n", "    plt.axis(\"off\")\n", "    plt.tight_layout(pad = 0)\n", "    plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Download file of dissertation entries in BibTeX format\n", "For each query, download a file of dissertation entries in BibTeX format." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "from IPython.display import Javascript\n", "from requests.utils import requote_uri\n", "\n", "# For each query, download a file containing BibTeX entries of all dissertations\n", "for query in ['query1', 'query2', 'query3']:\n", "    dissertations = data[query]\n", "    bibtex_data = []\n", "    for r in dissertations['nodes']:\n", "        bibtex_data.append([r['bibtex']])\n", "    df = pd.DataFrame(bibtex_data, columns = None)\n", "\n", "    js_download = \"\"\"\n", "var csv = '%s';\n", "\n", "var filename = '%s.bib';\n", "var blob = new Blob([csv], { type: 'application/x-bibtex;charset=utf-8;' });\n", "if (navigator.msSaveBlob) { // IE 10+\n", "    navigator.msSaveBlob(blob, filename);\n", "} else {\n", "    var link = document.createElement(\"a\");\n", "    if (link.download !== undefined) { // feature detection\n", "        // Browsers that support HTML5 download attribute\n", "        var url = URL.createObjectURL(blob);\n", "        link.setAttribute(\"href\", url);\n", "        link.setAttribute(\"download\", filename);\n", "        link.style.visibility = 'hidden';\n", "        document.body.appendChild(link);\n", "        link.click();\n", "        document.body.removeChild(link);\n", "    }\n", "}\n", "\"\"\" % (df.to_csv(index=False, header=False).replace('\\n','\\\\n').replace(\"\\'\",\"\\\\'\").replace(\"\\\"\",\"\").replace(\"\\r\",\"\"), requote_uri(query_params[query]))\n", "\n", "    display(Javascript(js_download))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.1" } }, "nbformat": 4, "nbformat_minor": 4 }