{ "cells": [ { "cell_type": "markdown", "id": "5f4a7d08-d2ed-4d84-adf2-d512c7cde35e", "metadata": {}, "source": [ "# Create a flat list of organisations contributing metadata to Trove\n", "\n", "The Trove API includes the `contributor` endpoint for retrieving information about organisations whose metadata is aggregated into Trove. If you include the `reclevel=full` parameter, you can get details of all contributors with a single API request like this:\n", "\n", "```\n", "https://api.trove.nla.gov.au/v2/contributor?encoding=json&reclevel=full&key=[YOUR API KEY]\n", "```\n", "\n", "However, the data can be difficult to use because of its nested structure, with some organisations having several levels of subsidiaries. There's also some inconsistency in the way nested records are named. This notebook aims to work around these problems by converting the nested data into a single flat list of organisations. \n", "\n", "This code is used to make weekly harvests of the contributor data which are saved [in this repository](https://github.com/wragge/trove-contributor-totals)." ] }, { "cell_type": "markdown", "id": "fd5d1c15-7324-42ec-822b-75600c2376a0", "metadata": {}, "source": [ "## Set things up" ] }, { "cell_type": "code", "execution_count": 170, "id": "6ceec3e3-ce45-4cad-a78a-00548cc85082", "metadata": { "tags": [] }, "outputs": [], "source": [ "import datetime\n", "import json\n", "import os\n", "from pathlib import Path\n", "\n", "import pandas as pd\n", "import requests\n", "from requests.adapters import HTTPAdapter\n", "from requests.packages.urllib3.util.retry import Retry\n", "\n", "# Create a session that will automatically retry on server errors\n", "s = requests.Session()\n", "retries = Retry(total=5, backoff_factor=1, status_forcelist=[502, 503, 504])\n", "s.mount(\"http://\", HTTPAdapter(max_retries=retries))\n", "s.mount(\"https://\", HTTPAdapter(max_retries=retries))" ] }, { "cell_type": "code", "execution_count": 171, "id": "3f47f03f-2a5e-477d-a4c4-3d7ff5c5a02a", "metadata": { "tags": [] }, "outputs": [], "source": [ "%%capture\n", "# Load variables from the .env file if it exists\n", "# Use %%capture to suppress messages\n", "%load_ext dotenv\n", "%dotenv" ] }, { "cell_type": "code", "execution_count": 172, "id": "75f9a462-457f-483a-b9c3-5571290dbcaf", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Insert your Trove API key\n", "API_KEY = \"YOUR API KEY\"\n", "\n", "if os.getenv(\"TROVE_API_KEY\"):\n", " API_KEY = os.getenv(\"TROVE_API_KEY\")" ] }, { "cell_type": "markdown", "id": "08abaa9d-b9ef-4452-830d-b78ea164b70e", "metadata": {}, "source": [ "## Define some functions" ] }, { "cell_type": "code", "execution_count": 173, "id": "47745c20-08c4-433e-961f-6767af0b4a0a", "metadata": { "tags": [] }, "outputs": [], "source": [ "def get_contrib_details(record, parent=None):\n", " \"\"\"\n", " Get the details of a contributor, recursing through children if present.\n", " \"\"\"\n", " records = []\n", " # Get the basic details\n", " details = {\n", " \"id\": record[\"id\"],\n", " \"name\": record[\"name\"],\n", " \"total_items\": int(record[\"totalholdings\"]),\n", " \"parent\": None,\n", " }\n", " # Add nuc if present\n", " if \"nuc\" in record:\n", " details[\"nuc\"] = record[\"nuc\"][0]\n", " else:\n", " details[\"nuc\"] = None\n", " # If this is a child record, combine parent and child names\n", " if parent:\n", " if not record[\"name\"].startswith(parent[\"name\"]):\n", " details[\"name\"] = f\"{parent['name']} {record['name']}\"\n", " # Add parent id\n", " details[\"parent\"] = parent[\"id\"]\n", " records = [details]\n", " if \"children\" in record:\n", " # Pass forward combined names for deeply nested orgs\n", " record[\"name\"] = details[\"name\"]\n", " records += get_children(record)\n", " return records\n", "\n", "\n", "def get_children(parent):\n", " \"\"\"\n", " Process child records.\n", " \"\"\"\n", " children = []\n", " for child in parent[\"children\"][\"contributor\"]:\n", " children += get_contrib_details(child, parent)\n", " return children\n", "\n", "\n", "def get_contributors(save_json=True):\n", " \"\"\"\n", " Get all Trove contributors, flattening the nested structure and optionally saving the original JSON.\n", " \"\"\"\n", " contributors = []\n", " params = {\"encoding\": \"json\", \"reclevel\": \"full\", \"key\": API_KEY}\n", " response = s.get(\"https://api.trove.nla.gov.au/v2/contributor\", params=params)\n", " data = response.json()\n", " # Save the original nested JSON response\n", " if save_json:\n", " Path(\n", " f\"trove-contributors-{datetime.datetime.now().strftime('%Y%m%d')}.json\"\n", " ).write_text(json.dumps(data))\n", " # Get details of each contributor\n", " for contrib in data[\"response\"][\"contributor\"]:\n", " contributors += get_contrib_details(contrib)\n", " return contributors" ] }, { "cell_type": "markdown", "id": "05c89b1e-3710-4dec-8012-33afb4ae4022", "metadata": {}, "source": [ "## Get the data" ] }, { "cell_type": "code", "execution_count": 174, "id": "a07b3365-c86d-433d-a59e-34e5be4397dd", "metadata": { "tags": [] }, "outputs": [], "source": [ "contributors = get_contributors()" ] }, { "cell_type": "markdown", "id": "ecf9bdbb-dfab-42b7-baa4-8e4ed555cbc3", "metadata": {}, "source": [ "Convert the data to a dataframe." ] }, { "cell_type": "code", "execution_count": 175, "id": "fc4b05ac-a410-4f0d-81c3-0d19101cdb67", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idnametotal_itemsparentnuc
0VPWLH4th/19th Prince of Wales' Light Horse Regiment Unit. History Room.1570NoneVPWLH
1NBALAbbotsleigh. Betty Archdale Library.0NoneNBAL
2ADFAAcademy Library, UNSW Canberra.289474NoneADFA
3ACTACT Legislative Assembly Library.11290NoneACT
4SACCAdelaide City Libraries. Adelaide City Libraries - City Library.79056NoneSACC
\n", "
" ], "text/plain": [ " id name \\\n", "0 VPWLH 4th/19th Prince of Wales' Light Horse Regiment Unit. History Room. \n", "1 NBAL Abbotsleigh. Betty Archdale Library. \n", "2 ADFA Academy Library, UNSW Canberra. \n", "3 ACT ACT Legislative Assembly Library. \n", "4 SACC Adelaide City Libraries. Adelaide City Libraries - City Library. \n", "\n", " total_items parent nuc \n", "0 1570 None VPWLH \n", "1 0 None NBAL \n", "2 289474 None ADFA \n", "3 11290 None ACT \n", "4 79056 None SACC " ] }, "execution_count": 175, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame(contributors)\n", "df.head()" ] }, { "cell_type": "markdown", "id": "2b467ab4-e644-47e4-a357-94dd7cb046d1", "metadata": {}, "source": [ "How many contributors are listed?" ] }, { "cell_type": "code", "execution_count": 178, "id": "b8483f05-9013-4858-8137-90a6f33eadac", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "2693" ] }, "execution_count": 178, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.shape[0]" ] }, { "cell_type": "markdown", "id": "edd67cd5-bc55-463c-b173-c57c80c5d997", "metadata": {}, "source": [ "How many of the contributor records include NUCs?" ] }, { "cell_type": "code", "execution_count": 179, "id": "7a25488f-85c7-419c-9f68-c9b5adf76870", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "1765" ] }, "execution_count": 179, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[df[\"nuc\"].notnull()].shape[0]" ] }, { "cell_type": "markdown", "id": "7c10f796-70b3-49f5-807e-66ac1eec63e5", "metadata": {}, "source": [ "Save the data to a CSV file." ] }, { "cell_type": "code", "execution_count": 177, "id": "837939ed-4462-4391-83e8-114ce342d50d", "metadata": { "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "df[[\"id\", \"nuc\", \"name\", \"parent\", \"total_items\"]].to_csv(\n", " f\"trove-contributors-{datetime.datetime.now().strftime('%Y%m%d')}.csv\", index=False\n", ")" ] }, { "cell_type": "markdown", "id": "85b36652-c689-4174-b68e-af953e1a0c01", "metadata": {}, "source": [ "----\n", "\n", "Created by [Tim Sherratt](https://timsherratt.org/) for the [GLAM Workbench](https://glam-workbench.net/). Support this project by becoming a [GitHub sponsor](https://github.com/sponsors/wragge)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.2" }, "vscode": { "interpreter": { "hash": "830a82eb7dea304d2620966a757085a27772f251eec0147d9ba49327e586e1bd" } } }, "nbformat": 4, "nbformat_minor": 5 }