{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    " ![FREYA Logo](https://github.com/datacite/pidgraph-notebooks-python/blob/master/images/freya_200x121.png?raw=true) | [FREYA](https://www.project-freya.eu/en) WP2 [User Story 5](https://github.com/datacite/freya/issues/35) | As a student using the British Library's EThOS database, I want to be able to find all dissertations on a given topic.  \n",
    " :------------- | :------------- | :-------------\n",
    "\n",
    "It is important for postgraduate students to identify easily existing dissertations on a research topic of interest.<p />\n",
    "This notebook uses the [DataCite GraphQL API](https://api.datacite.org/graphql) to retrieve all dissertations for three different queries: *Shakespeare*, *Machine learning* and *Ebola*. These queries illustrate trends in the number of dissertations created over time.\n",
    "\n",
    "**Goal**: By the end of this notebook you should be able to:\n",
    "- Retrieve all dissertations (across multiple repositories) matching a specific query; \n",
    "- For each query:\n",
    " - Display a bar plot of the number of dissertations per year, including a trend line, e.g. <br> <img src=\"example_plot.png\" width=\"290\" height=\"163\" />\n",
    " - Display a pie chart showing the number of dissertations per repository;\n",
    " - Display a word cloud of words from dissertation titles and descriptions, e.g. <br> <img src=\"example_plot1.png\" width=\"220\" height=\"216\" />\n",
    " - Download all dissertations in a single BibTeX file."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Install libraries and prepare GraphQL client"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%capture\n",
    "# Install required Python packages\n",
    "!pip install gql requests sklearn wordcloud numpy pandas"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Prepare the GraphQL client\n",
    "import requests\n",
    "from IPython.display import display, Markdown\n",
    "from gql import gql, Client\n",
    "from gql.transport.requests import RequestsHTTPTransport\n",
    "\n",
    "_transport = RequestsHTTPTransport(\n",
    "    url='https://api.datacite.org/graphql',\n",
    "    use_json=True,\n",
    ")\n",
    "\n",
    "client = Client(\n",
    "    transport=_transport,\n",
    "    fetch_schema_from_transport=True,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Define and run GraphQL query\n",
    "Define the GraphQL query to retrieve all dissertations using three different queries (that yield distinct trends in number of dissertations across time): *shakespeare*, *Machine learning* and *ebola*."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Define the GraphQL query: retrieve all dissertations for three different queries: \n",
    "# Shakespeare, Machine learning and Ebola.\n",
    "query_params = {\n",
    "    \"query1\" : \"shakespeare\",\n",
    "    \"query2\" : \"Machine learning\",\n",
    "    \"query3\" : \"ebola\",\n",
    "    \"query1_end_cursor\" : \"\",\n",
    "    \"query2_end_cursor\" : \"\",\n",
    "    \"query3_end_cursor\" : \"\",\n",
    "    \"max_dissertations\" : 100\n",
    "}\n",
    "\n",
    "queryStr = \"\"\"query getDissertationsByQuery(\n",
    "    $query1: String!,\n",
    "    $query2: String!,\n",
    "    $query3: String!,\n",
    "    $query1_end_cursor: String!,\n",
    "    $query2_end_cursor: String!,\n",
    "    $query3_end_cursor: String!,\n",
    "    $max_dissertations: Int!\n",
    "    )\n",
    "{\n",
    "  query1: dissertations(query: $query1, first: $max_dissertations, after: $query1_end_cursor) {\n",
    "    totalCount\n",
    "    pageInfo {\n",
    "      hasNextPage\n",
    "      endCursor\n",
    "    }    \n",
    "    published {\n",
    "      count\n",
    "      title\n",
    "    }\n",
    "    nodes {\n",
    "      id\n",
    "      titles {\n",
    "        title\n",
    "      }\n",
    "      descriptions {\n",
    "         description\n",
    "      }\n",
    "      repository {\n",
    "        name\n",
    "      }\n",
    "      versionOfCount\n",
    "      identifiers {\n",
    "        identifier\n",
    "      }\n",
    "      publicationYear\n",
    "      bibtex\n",
    "      repository {\n",
    "        id\n",
    "      }\n",
    "      publisher\n",
    "      creators {\n",
    "        id\n",
    "        name\n",
    "      }\n",
    "    }\n",
    "  },  \n",
    "  query2: dissertations(query: $query2, first: $max_dissertations, after: $query2_end_cursor) {\n",
    "    totalCount\n",
    "    pageInfo {\n",
    "      hasNextPage\n",
    "      endCursor\n",
    "    }     \n",
    "    published {\n",
    "      count\n",
    "      title\n",
    "    }\n",
    "    nodes {\n",
    "      id\n",
    "      titles {\n",
    "        title\n",
    "      }\n",
    "      descriptions {\n",
    "         description\n",
    "      }\n",
    "      repository {\n",
    "        name\n",
    "      }      \n",
    "      versionOfCount\n",
    "      identifiers {\n",
    "        identifier\n",
    "      }\n",
    "      publicationYear\n",
    "      bibtex\n",
    "      repository {\n",
    "        id\n",
    "      }\n",
    "      publisher\n",
    "      creators {\n",
    "        id\n",
    "        name\n",
    "      }\n",
    "    }\n",
    "  },\n",
    "  query3: dissertations(query: $query3, first: $max_dissertations, after: $query3_end_cursor) {\n",
    "    totalCount\n",
    "    pageInfo {\n",
    "      hasNextPage\n",
    "      endCursor\n",
    "    }     \n",
    "    published {\n",
    "      count\n",
    "      title\n",
    "    }\n",
    "    nodes {\n",
    "      id\n",
    "      titles {\n",
    "        title\n",
    "      }\n",
    "      descriptions {\n",
    "         description\n",
    "      }\n",
    "      repository {\n",
    "        name\n",
    "      }      \n",
    "      versionOfCount\n",
    "      bibtex\n",
    "      identifiers {\n",
    "        identifier\n",
    "      }\n",
    "      publicationYear\n",
    "      repository {\n",
    "        id\n",
    "      }\n",
    "      publisher\n",
    "      creators {\n",
    "        id\n",
    "        name\n",
    "      }\n",
    "    }\n",
    "  }\n",
    "}\n",
    "\"\"\"\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Run the above query via the GraphQL client."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "found_next_page = True\n",
    "\n",
    "# queries_with_more_results controls data for which query still needs to be collected from retrieved results\n",
    "queries_with_more_results = ['query1', 'query2', 'query3']\n",
    "# Initialise overall data dict that will store results across all queries\n",
    "data = {}\n",
    "\n",
    "# Keep retrieving results until there are no more results left for any of the three queries\n",
    "while len(queries_with_more_results) > 0:\n",
    "    query = gql(\"%s\" % queryStr)\n",
    "    res = client.execute(query, variable_values=json.dumps(query_params))\n",
    "    for query in queries_with_more_results:\n",
    "        if query not in data:\n",
    "            data[query] = res[query]\n",
    "        else:\n",
    "            data[query][\"nodes\"].extend(res[query][\"nodes\"])\n",
    "        \n",
    "    for query in ['query1', 'query2', 'query3']:\n",
    "        if query not in queries_with_more_results:\n",
    "            continue\n",
    "        cursor_params_key = query + \"_end_cursor\"        \n",
    "        dissertations = res[query]\n",
    "        pageInfo = dissertations[\"pageInfo\"]\n",
    "        if pageInfo[\"hasNextPage\"]:\n",
    "            if pageInfo[\"endCursor\"] is not None:\n",
    "                query_params[cursor_params_key] = pageInfo[\"endCursor\"]            \n",
    "            else:\n",
    "                query_params[cursor_params_key] = \"\"\n",
    "                queries_with_more_results.remove(query)\n",
    "        else:\n",
    "            query_params[cursor_params_key] = \"\"\n",
    "            queries_with_more_results.remove(query)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Display total number of dissertations\n",
    "For each query, display the total number of dissertations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get the total number of dissertations per query\n",
    "for query in ['query1', 'query2', 'query3']:\n",
    "    print(\"The total number of dissertations for query '%s':\\n%s\" % (query_params[query], str(data[query]['totalCount'])))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Plot number of dissertations per year\n",
    "For each query, display a bar plot of number of dissertations per year, between *start_year* and *end_year* defined in code below. Also shown is a trend line, highlighting the general direction of change in dissertation numbers between *start_year* and *end_year*."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Plot the number of dissertations by year\n",
    "import matplotlib.pyplot as plt\n",
    "from matplotlib.ticker import FormatStrFormatter\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "from sklearn.linear_model import Ridge\n",
    "\n",
    "start_year = 1990\n",
    "end_year = 2020\n",
    "for query in ['query1', 'query2', 'query3']:\n",
    "    dissertations = data[query]\n",
    "    queryName = query_params[query]\n",
    "    plt.rcdefaults()\n",
    "    years = [int(s['title']) for s in dissertations['published']]\n",
    "    num_outputs4years = [s['count'] for s in dissertations['published']]\n",
    "    # Get a list of all consecutive years between min and max year (inclusive)\n",
    "    all_years = list(range(start_year, end_year))\n",
    "    # Populate output counts (into num_counts) for all consecutive years\n",
    "    num_outputs = []\n",
    "    for year in all_years:\n",
    "        if year in years:\n",
    "            idx = years.index(year)\n",
    "            num_outputs.append(num_outputs4years[idx])\n",
    "        else:\n",
    "            num_outputs.append(0)     \n",
    "\n",
    "    df = pd.DataFrame({'Year': all_years, 'Count': num_outputs} )\n",
    "    # Create trend line for the plot     \n",
    "    lr = Ridge()\n",
    "    lr.fit(df[['Year']], df['Count'])\n",
    "\n",
    "    fig, ax = plt.subplots(1, 1, figsize = (10, 5))\n",
    "    ax.bar(df['Year'],  df['Count'], align='center', color='blue', edgecolor='black', linewidth=1, alpha=0.5)\n",
    "    ax.set_xticks(df['Year'])\n",
    "    ax.set_xticklabels(all_years, rotation='vertical')\n",
    "    ax.set_ylabel('Number of dissertations per Year')\n",
    "    ax.set_xlabel('Year')\n",
    "    ax.set_title(\"Number of dissertations found by query: '%s' since %s, with trend line\" % (query_params[query], str(start_year)))\n",
    "    ax.plot(df['Year'], lr.coef_*df['Year']+lr.intercept_, color='orange')\n",
    "    plt.show()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Plot number of dissertations per repository\n",
    "For each query, display a pie chart showing the number of dissertations per repository."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Plot a pie chart of dissertation counts per repository\n",
    "import matplotlib.pyplot as plt\n",
    "from matplotlib.ticker import FormatStrFormatter\n",
    "import numpy as np\n",
    "import operator\n",
    "\n",
    "for query in ['query1', 'query2', 'query3']:\n",
    "# for query in ['query1']:\n",
    "    num_outputs_dict = {}\n",
    "    dissertations = data[query]\n",
    "    for r in dissertations['nodes']:\n",
    "        repo = r['repository']['name']\n",
    "        if repo not in num_outputs_dict:\n",
    "            num_outputs_dict[repo] = 0\n",
    "        num_outputs_dict[repo] += 1\n",
    "    \n",
    "    # Sort resource types by count of work desc\n",
    "    sorted_num_outputs = sorted(num_outputs_dict.items(),key=operator.itemgetter(1),reverse=True)\n",
    "    # Populate lists needed for pie chart\n",
    "    repositories = [s[0] for s in sorted_num_outputs] \n",
    "    num_outputs = [s[1] for s in sorted_num_outputs] \n",
    "\n",
    "    # Generate a pie chart of number of grant outputs by resource type\n",
    "    fig = plt.figure()\n",
    "    ax = fig.add_axes([0,0,1,1])\n",
    "    ax.set_title(\"Number of Dissertations found by query: '%s' Per Repository\" % query_params[query])\n",
    "    ax.axis('equal')\n",
    "    ax.pie(num_outputs, labels = repositories, autopct='%1.0f%%')\n",
    "    plt.show()\n",
    "        "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Display a word cloud of dissertation titles and descriptions\n",
    "For each query, display a wordcloud of words in dissertation titles and descriptions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from wordcloud import WordCloud, STOPWORDS \n",
    "import matplotlib.pyplot as plt \n",
    "import pandas as pd\n",
    "\n",
    "stopWords = set(STOPWORDS)\n",
    "stopWords.add('_')\n",
    "\n",
    "for query in ['query1', 'query2', 'query3']:\n",
    "    titleWords=[]\n",
    "    dissertations = data[query]\n",
    "    for r in dissertations['nodes']:\n",
    "        for title in r['titles']:\n",
    "            tokens = [t.lower() for t in str(title['title']).split()] \n",
    "        for title in r['descriptions']:\n",
    "            tokens = [t.lower() for t in str(title['description']).split()]     \n",
    "        titleWords += tokens\n",
    "     \n",
    "    x, y = np.ogrid[:800, :800]\n",
    "    mask = (x - 400) ** 2 + (y - 400) ** 2 > 345 ** 2\n",
    "    mask = 255 * mask.astype(int)\n",
    "    \n",
    "    wordcloud = WordCloud(width = 600, height = 600, \n",
    "                background_color ='white', \n",
    "                stopwords = stopWords, \n",
    "                min_font_size = 10, \n",
    "                mask = mask).generate(\" \".join(titleWords))\n",
    "    \n",
    "    fig, ax = plt.subplots(1, 1, figsize = (8, 8), facecolor = None)\n",
    "    ax.set_title(\"Word cloud of titles of maximum %d dissertations found by query: '%s'\" % (query_params['max_dissertations'],query_params[query]))\n",
    "    plt.imshow(wordcloud, interpolation=\"bilinear\") \n",
    "    plt.axis(\"off\") \n",
    "    plt.tight_layout(pad = 0)\n",
    "    plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Download file of dissertation entries in BibTeX format\n",
    "For each query, download a file of dissertation entries in BibTeX format."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "from IPython.display import Javascript\n",
    "from requests.utils import requote_uri\n",
    "\n",
    "# For each query, download a file containing BibTeX entries of all dissertations\n",
    "for query in ['query1', 'query2', 'query3']:    \n",
    "    dissertations = data[query]\n",
    "    bibtex_data = []\n",
    "    for r in dissertations['nodes']:\n",
    "        bibtex_data.append([r['bibtex']])\n",
    "    df = pd.DataFrame(bibtex_data, columns = None)\n",
    "\n",
    "    js_download = \"\"\"\n",
    "var csv = '%s';\n",
    "\n",
    "var filename = '%s.bib';\n",
    "var blob = new Blob([csv], { type: 'application/x-bibtex;charset=utf-8;' });\n",
    "if (navigator.msSaveBlob) { // IE 10+\n",
    "    navigator.msSaveBlob(blob, filename);\n",
    "} else {\n",
    "    var link = document.createElement(\"a\");\n",
    "    if (link.download !== undefined) { // feature detection\n",
    "        // Browsers that support HTML5 download attribute\n",
    "        var url = URL.createObjectURL(blob);\n",
    "        link.setAttribute(\"href\", url);\n",
    "        link.setAttribute(\"download\", filename);\n",
    "        link.style.visibility = 'hidden';\n",
    "        document.body.appendChild(link);\n",
    "        link.click();\n",
    "        document.body.removeChild(link);\n",
    "    }\n",
    "}\n",
    "\"\"\" % (df.to_csv(index=False, header=False).replace('\\n','\\\\n').replace(\"\\'\",\"\\\\'\").replace(\"\\\"\",\"\").replace(\"\\r\",\"\"), requote_uri(query_params[query]))\n",
    "    \n",
    "    display(Javascript(js_download))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}