{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Harvest records from the NMA API\n",
    "\n",
    "The National Museum of Australia provides access to its [collection data through an API](https://www.nma.gov.au/about/our-collection/our-apis). But if you're going to do any large-scale analysis of the data, you probably want to harvest and save it locally. This notebook helps you do just that.\n",
    "\n",
    "According to the [API documentation](https://github.com/NationalMuseumAustralia/Collection-API/wiki/Getting-started), the possible endpoints are:\n",
    "\n",
    "* `/object` - the museum catalogue plus images/media\n",
    "* `/narrative` - narratives by Museum staff about featured topics\n",
    "* `/party` - people and organisations associated with collection items\n",
    "* `/place` - locations associated with collection items\n",
    "* `/collection` - sub-collections within the museum catalogue\n",
    "* `/media` - images and other media associated with collection items\n",
    "\n",
    "This notebook should harvest records from any of these endpoints, though I've only tested `object`, `party`, and `place`.\n",
    "\n",
    "It harvests records in the [simple JSON format](https://github.com/NationalMuseumAustralia/Collection-API/wiki/Getting-started#simple-json) and saves them as they are to a file-based database using [TinyDB](https://tinydb.readthedocs.io/en/latest/). See the other notebooks in this repository for examples of loading the JSON data into a DataFrame for manipulation and analysis."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-block alert-warning\">\n",
    "<p>If you haven't used one of these notebooks before, they're basically web pages in which you can write, edit, and run live code. They're meant to encourage experimentation, so don't feel nervous. Just try running a few cells and see what happens!</p>\n",
    "\n",
    "<p>\n",
    "    Some tips:\n",
    "    <ul>\n",
    "        <li>Code cells have boxes around them.</li>\n",
    "        <li>To run a code cell click on the cell and then hit <b>Shift+Enter</b>. The <b>Shift+Enter</b> combo will also move you to the next cell, so it's a quick way to work through the notebook.</li>\n",
    "        <li>While a cell is running a <b>*</b> appears in the square brackets next to the cell. Once the cell has finished running the asterix will be replaced with a number.</li>\n",
    "        <li>In most cases you'll want to start from the top of notebook and work your way down running each cell in turn. Later cells might depend on the results of earlier ones.</li>\n",
    "        <li>To edit a code cell, just click on it and type stuff. Remember to run the cell once you've finished editing.</li>\n",
    "    </ul>\n",
    "</p>\n",
    "\n",
    "<p><b>Is this thing on?</b> If you can't edit or run any of the code cells, you might be viewing a static (read only) version of this notebook. Click here to <a href=\"https://mybinder.org/v2/gh/GLAM-Workbench/national-museum-australia/master?urlpath=lab%2Ftree%2Fharvest_records.ipynb\">load a <b>live</b> version</a> running on Binder.</p>\n",
    "\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Import what we need"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests\n",
    "from tinydb import TinyDB, Query\n",
    "from requests.adapters import HTTPAdapter\n",
    "from requests.packages.urllib3.util.retry import Retry\n",
    "from tqdm.auto import tqdm\n",
    "\n",
    "s = requests.Session()\n",
    "retries = Retry(total=5, backoff_factor=1, status_forcelist=[ 502, 503, 504 ])\n",
    "s.mount('http://', HTTPAdapter(max_retries=retries))\n",
    "s.mount('https://', HTTPAdapter(max_retries=retries))\n",
    "\n",
    "API_BASE_URL = 'https://data.nma.gov.au/{}'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Set our API key"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make full use of the NMA API and avoid rate limits, you should go and [get yourself and API key](https://www.nma.gov.au/about/our-collection/our-apis/register-for-an-api-key). Once you have your key, paste it in below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Paste your key in between the quotes\n",
    "API_KEY = 'YOUR API KEY'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create some functions to do the work"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_total(endpoint, params, headers):\n",
    "    '''\n",
    "    Get the total number of results.\n",
    "    '''\n",
    "    response = s.get(endpoint, headers=headers, params=params)\n",
    "    data = response.json()\n",
    "    return data['meta']['results']\n",
    "\n",
    "def harvest_records(record_type):\n",
    "    # Put api key in request headers\n",
    "    headers = {\n",
    "        'apikey': API_KEY\n",
    "    }\n",
    "    \n",
    "    # Set basic params\n",
    "    params = {\n",
    "        'text': '*',\n",
    "        'limit': 100, # Number of records per request\n",
    "        'offset': 0 # We'll change this as we loop through\n",
    "    }\n",
    "    \n",
    "    # Create a db to hold the results\n",
    "    db = TinyDB('nma_{}_db.json'.format(record_type))\n",
    "    \n",
    "    # Get the endpoint for this type of record\n",
    "    endpoint = API_BASE_URL.format(record_type)\n",
    "    \n",
    "    # Are there more records? We'll check this on each request.\n",
    "    more = True\n",
    "    \n",
    "    # Get the total number of records\n",
    "    total_records = get_total(endpoint, params, headers)\n",
    "    \n",
    "    # Make a progress bar\n",
    "    with tqdm(total=total_records) as pbar:\n",
    "        \n",
    "        # Continue while 'more' is True\n",
    "        while more:\n",
    "            \n",
    "            # Get the data\n",
    "            response = s.get(endpoint, headers=headers, params=params)\n",
    "            data = response.json()\n",
    "            \n",
    "            # Insert the records (in the 'data' field) into the db\n",
    "            db.insert_multiple(data['data'])\n",
    "            \n",
    "            # If there's not a 'next' link, set more to False\n",
    "            more = data.get('links', {}).get('next', False)\n",
    "            \n",
    "            # Update the offset value\n",
    "            params['offset'] += 100\n",
    "            \n",
    "            # Update the progress bar\n",
    "            pbar.update(len(data['data']))\n",
    "        "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Harvest records!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "harvest_records('place')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "harvest_records('party')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "harvest_records('object')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "----\n",
    "\n",
    "Created by [Tim Sherratt](https://timsherratt.org/) for the [GLAM Workbench](https://glam-workbench.github.io/).\n",
    "\n",
    "Work on this notebook was supported by the [Humanities, Arts and Social Sciences (HASS) Data Enhanced Virtual Lab](https://tinker.edu.au/)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}