{ "cells": [ { "cell_type": "markdown", "id": "3e96daca-497e-40f4-a38f-eeab1a1fd31d", "metadata": {}, "source": [ "# Harvesting the complete set of data from the People and Organisations zone\n", "\n", "There are two methods of harvesting the complete set of data from the People and Organisations zone – using the [OAI-PMH API](http://www.nla.gov.au/apps/peopleaustralia-oai/), or using the main Trove API in conjunction with the [SRU interface](http://www.nla.gov.au/apps/srw/search/peopleaustralia). The [OAI-PMH method](complete_harvest_oai.ipynb) is *much* faster, but includes duplicate records that you'll need to filter out afterwards. This notebook demonstrates the API/SRU method.\n", "\n", "You can't use the SRU interface on its own as the SRU interface limits the lifespan of results sets, so attempting to traverse the complete database produces unexpected results. The main Trove API doesn't include full details of People and Organisations, but it does include identifiers, and does support bulk harvests. So you can harvest a complete list of identifiers from the main Trove API and then use these identifiers to request the full EAC-CPF records from the SRU interface. It's slow, but it seems to work.\n", "\n", "I've saved a complete harvest of [all the people and organisations data](https://cloudstor.aarnet.edu.au/plus/s/3gyTJDJMsWWSKBi) on CloudStor (700mb zip file). The harvest was run on 23 January 2023. Each row in the dataset is a separate [EAC-CPF](https://eac.staatsbibliothek-berlin.de/) encoded XML file." ] }, { "cell_type": "code", "execution_count": null, "id": "3c3873ca-6761-4aa0-a688-7e396f9df4d5", "metadata": {}, "outputs": [], "source": [ "import json\n", "import os\n", "import re\n", "import time\n", "from datetime import datetime\n", "from pathlib import Path\n", "\n", "import requests_cache\n", "from bs4 import BeautifulSoup\n", "from requests.adapters import HTTPAdapter\n", "from requests.packages.urllib3.util.retry import Retry\n", "from tqdm.auto import tqdm\n", "\n", "s = requests_cache.CachedSession()\n", "retries = Retry(total=5, backoff_factor=1, status_forcelist=[502, 503, 504])\n", "s.mount(\"https://\", HTTPAdapter(max_retries=retries))\n", "s.mount(\"http://\", HTTPAdapter(max_retries=retries))" ] }, { "cell_type": "code", "execution_count": null, "id": "a1659814-ea9e-4b3a-86d4-389f69035ced", "metadata": {}, "outputs": [], "source": [ "%%capture\n", "# Load variables from the .env file if it exists\n", "# Use %%capture to suppress messages\n", "%load_ext dotenv\n", "%dotenv" ] }, { "cell_type": "code", "execution_count": null, "id": "07193c2b-c5d0-4303-ad55-bbde10014da6", "metadata": {}, "outputs": [], "source": [ "# Insert your Trove API key\n", "API_KEY = \"YOUR API KEY\"\n", "\n", "# Use api key value from environment variables if it is available\n", "if os.getenv(\"TROVE_API_KEY\"):\n", " API_KEY = os.getenv(\"TROVE_API_KEY\")" ] }, { "cell_type": "markdown", "id": "b499a623-eea1-4bbd-95f0-4ae6794f47e6", "metadata": {}, "source": [ "## Harvest identifiers from the Trove API\n", "\n", "First we need to get the identifiers for all the people and organisations record from the main Trove API. We'll use a 'blank' search to get everything. The `bulkHarvest` parameter is necessary for large data harvests as it maintains the results set in a fixed order so you don't end up with duplicates." ] }, { "cell_type": "code", "execution_count": null, "id": "aaf51d0f-03d5-4cd7-b1eb-927658c7de08", "metadata": {}, "outputs": [], "source": [ "params = {\n", " \"zone\": \"people\",\n", " \"q\": \" \", # Blank search to get everything\n", " \"bulkHarvest\": \"true\",\n", " \"encoding\": \"json\",\n", " \"key\": API_KEY,\n", " \"n\": 100,\n", "}\n", "\n", "api_url = \"https://api.trove.nla.gov.au/v2/result\"" ] }, { "cell_type": "code", "execution_count": null, "id": "1e233756-045b-4127-9f4d-fa72e5ee07b0", "metadata": {}, "outputs": [], "source": [ "def get_total_results(params):\n", " params[\"n\"] = 0\n", " response = s.get(api_url, params=params, timeout=30)\n", " data = response.json()\n", " return int(data[\"response\"][\"zone\"][0][\"records\"][\"total\"])" ] }, { "cell_type": "code", "execution_count": null, "id": "23bf2a7b-2eec-48a2-b8c9-96cd10f7ce87", "metadata": { "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "peau_ids = []\n", "total = get_total_results(params.copy())\n", "start = \"*\"\n", "with tqdm(total=total) as pbar:\n", " while start:\n", " params[\"s\"] = start\n", " response = s.get(api_url, params=params)\n", " data = response.json()\n", " for record in data[\"response\"][\"zone\"][0][\"records\"][\"people\"]:\n", " peau_ids.append(record[\"id\"])\n", " # If there's more results there'll be a value for `nextStart`\n", " # that we use as the `start` vaue in the next request.\n", " try:\n", " start = data[\"response\"][\"zone\"][0][\"records\"][\"nextStart\"]\n", " # If there's no nextStart value then we've finished!\n", " except KeyError:\n", " start = None\n", " pbar.update(len(data[\"response\"][\"zone\"][0][\"records\"][\"people\"]))\n", " time.sleep(0.2)" ] }, { "cell_type": "code", "execution_count": null, "id": "417fbc9c-ee79-4836-a532-742ffeb8c3d8", "metadata": { "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "# Write the identifiers to a file as backup\n", "with Path(f\"peau_ids_{datetime.now().strftime('%Y%m%d')}.json\").open(\"w\") as json_file:\n", " json.dump(peau_ids, json_file)" ] }, { "cell_type": "markdown", "id": "a5545f34-3aae-42ed-b4ec-b384a81a2afe", "metadata": {}, "source": [ "## Harvest EAC-CPF records\n", "\n", "Now we have a big list of identifiers, we can use them to request the full records from the SRU interface." ] }, { "cell_type": "code", "execution_count": null, "id": "acc298d3-9c65-4a3e-a53d-ac814557dcd3", "metadata": {}, "outputs": [], "source": [ "# Basic params for SRU requests\n", "\n", "p_params = {\n", " \"version\": \"1.1\",\n", " \"operation\": \"searchRetrieve\",\n", " \"recordSchema\": \"urn:isbn:1-931666-33-4\", # EAC-CPF encoding\n", " \"maximumRecords\": 10,\n", " \"startRecord\": 1,\n", " \"resultSetTTL\": 300,\n", " \"recordPacking\": \"xml\",\n", " \"recordXPath\": \"\",\n", " \"sortKeys\": \"\",\n", "}\n", "\n", "p_api_url = \"http://www.nla.gov.au/apps/srw/search/peopleaustralia\"" ] }, { "cell_type": "markdown", "id": "bbfef6f8-b1ab-49b7-900d-cda2ea081402", "metadata": {}, "source": [ "This retrieves a single EAC-CPF encoded record at a time, appending it to the `peau-data.xml` file. If the harvest is interrupted, delete `peau-data.xml` before restarting to avoid creating duplicates. Query results are cached, so a restarted harvest will grab results from the cache if possible.\n", "\n", "The `peau-data.xml` file has one EAC-CPF encoded record per line. This makes it easier to save and process the records efficiently." ] }, { "cell_type": "code", "execution_count": null, "id": "ca29c03b-b303-4467-ae5c-ebd0044806be", "metadata": { "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "for p_id in tqdm(peau_ids):\n", " # Construct a party id using the identifier and use it to query the SRU interface using the rec.identifier field\n", " p_params[\"query\"] = f'rec.identifier=\"http://nla.gov.au/nla.party-{p_id}\"'\n", " response = s.get(p_api_url, params=p_params)\n", " soup = BeautifulSoup(response.content, \"xml\")\n", " with Path(f\"peau-data-{datetime.now().strftime('%Y%m%d')}.xml\").open(\n", " \"a\"\n", " ) as xml_file:\n", " for record in soup.find_all(\"record\"):\n", " # Extract the EAC-CPF record\n", " eac_cpf = str(record.find(\"eac-cpf\"))\n", " # Strip out any line breaks within the record\n", " eac_cpf = eac_cpf.replace(\"\\n\", \"\")\n", " eac_cpf = re.sub(r\"\\s+\", \" \", eac_cpf)\n", " # Write record as a new line\n", " xml_file.write(eac_cpf + \"\\n\")\n", " if not response.from_cache:\n", " time.sleep(0.2)" ] }, { "cell_type": "markdown", "id": "e6b93607-f6ef-4721-85ae-bb1b96fc0a18", "metadata": {}, "source": [ "This will take a long time. But it works. If you don't want to run your own harvest, just [download my pre-harvested dataset](https://cloudstor.aarnet.edu.au/plus/s/oshEZJPK3hL0JdQ) from CloudStor (700mb zip file). " ] }, { "cell_type": "markdown", "id": "4096065d-14a3-4c5f-95f3-33d5c914181d", "metadata": {}, "source": [ "Now we can try [extracting some aggregate data from the people and organisations harvest](extract_aggregated_data_from_harvest.ipynb)." ] }, { "cell_type": "markdown", "id": "1cbaf95b-604d-421f-80b3-d48f1ebcbfec", "metadata": {}, "source": [ "----\n", "\n", "Created by [Tim Sherratt](http://timsherratt.org/) for the [GLAM Workbench](https://glam-workbench.net/).\n", "\n", "The development of this notebook was supported by the [Australian Cultural Data Engine](https://www.acd-engine.org/)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.12" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": { "008655f876524e83a5c34d0ca691eb1c": { "model_module": "@jupyter-widgets/base", "model_module_version": "1.2.0", "model_name": "LayoutModel", "state": {} }, "04f845783a6b4b1f96efecff2d2b2576": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "HBoxModel", "state": { "children": [ "IPY_MODEL_0ea00769974046829b24d63ae37c7e48", "IPY_MODEL_ae7a0b3c99a04cd682abd41629fa335a", "IPY_MODEL_f68698dd825e42dea260b6c2358b3161" ], "layout": "IPY_MODEL_99e182d56b3c4024ba9dad2fec0b0574" } }, "0ac98a8cf27a4aebb719f1eccb9dd675": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "DescriptionStyleModel", "state": { "description_width": "" } }, "0b15047f43cc4f92bf4ce64cc40e67b2": { "model_module": "@jupyter-widgets/base", "model_module_version": "1.2.0", "model_name": "LayoutModel", "state": {} }, "0ea00769974046829b24d63ae37c7e48": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "HTMLModel", "state": { "layout": "IPY_MODEL_20a1fe9729e64d75b9b1c4c4f71259d4", "style": "IPY_MODEL_55a1f711ef4d4086b9e3ba0043e9310f", "value": "100%" } }, "198f52b6d6b14fe49076b6f2d980b411": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "ProgressStyleModel", "state": { "description_width": "" } }, "20a1fe9729e64d75b9b1c4c4f71259d4": { "model_module": "@jupyter-widgets/base", "model_module_version": "1.2.0", "model_name": "LayoutModel", "state": {} }, "2bd48247912c4df88a84401bc8d14a2b": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "DescriptionStyleModel", "state": { "description_width": "" } }, "3044ecc7f2d34d58bb45ba667c2ae364": { "model_module": "@jupyter-widgets/base", "model_module_version": "1.2.0", "model_name": "LayoutModel", "state": {} }, "3997e3d1667c40ba844c133f6c643a06": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "DescriptionStyleModel", "state": { "description_width": "" } }, "3c98c3501d8947498c079e14f04caf1a": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "ProgressStyleModel", "state": { "description_width": "" } }, "3d2500550bcd4c73b2b7817483282302": { "model_module": "@jupyter-widgets/base", "model_module_version": "1.2.0", "model_name": "LayoutModel", "state": {} }, "55a1f711ef4d4086b9e3ba0043e9310f": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "DescriptionStyleModel", "state": { "description_width": "" } }, "5fa53d83853e40bc929871cc4f65a4cb": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "FloatProgressModel", "state": { "bar_style": "success", "layout": "IPY_MODEL_3044ecc7f2d34d58bb45ba667c2ae364", "max": 1310058, "style": "IPY_MODEL_198f52b6d6b14fe49076b6f2d980b411", "value": 1310058 } }, "8cfbaa97b6b347b4a93fa7eb859b40a9": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "HTMLModel", "state": { "layout": "IPY_MODEL_940dedf844fb42cd90a47a250a15eb1e", "style": "IPY_MODEL_2bd48247912c4df88a84401bc8d14a2b", "value": " 1310058/1310058 [1:46:51<00:00, 182.15it/s]" } }, "940dedf844fb42cd90a47a250a15eb1e": { "model_module": "@jupyter-widgets/base", "model_module_version": "1.2.0", "model_name": "LayoutModel", "state": {} }, "99e182d56b3c4024ba9dad2fec0b0574": { "model_module": "@jupyter-widgets/base", "model_module_version": "1.2.0", "model_name": "LayoutModel", "state": {} }, "ae7a0b3c99a04cd682abd41629fa335a": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "FloatProgressModel", "state": { "bar_style": "success", "layout": "IPY_MODEL_0b15047f43cc4f92bf4ce64cc40e67b2", "max": 1310058, "style": "IPY_MODEL_3c98c3501d8947498c079e14f04caf1a", "value": 1310058 } }, "bd794649955f4d6e8c3b2fcd9399e0c8": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "HTMLModel", "state": { "layout": "IPY_MODEL_e6cb3d6de80b433381041d03fe592ad6", "style": "IPY_MODEL_3997e3d1667c40ba844c133f6c643a06", "value": "100%" } }, "da8730cd88364ceab0d1838854d89373": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "HBoxModel", "state": { "children": [ "IPY_MODEL_bd794649955f4d6e8c3b2fcd9399e0c8", "IPY_MODEL_5fa53d83853e40bc929871cc4f65a4cb", "IPY_MODEL_8cfbaa97b6b347b4a93fa7eb859b40a9" ], "layout": "IPY_MODEL_008655f876524e83a5c34d0ca691eb1c" } }, "e6cb3d6de80b433381041d03fe592ad6": { "model_module": "@jupyter-widgets/base", "model_module_version": "1.2.0", "model_name": "LayoutModel", "state": {} }, "f68698dd825e42dea260b6c2358b3161": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "HTMLModel", "state": { "layout": "IPY_MODEL_3d2500550bcd4c73b2b7817483282302", "style": "IPY_MODEL_0ac98a8cf27a4aebb719f1eccb9dd675", "value": " 1310058/1310058 [109:54:15<00:00, 3.43it/s]" } } }, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 5 }