# Harvesting the complete set of data from the People and Organisations zone

There are two methods of harvesting the complete set of data from the People and Organisations zone – using the [OAI-PMH API](http://www.nla.gov.au/apps/peopleaustralia-oai/), or using the main Trove API in conjunction with the [SRU interface](http://www.nla.gov.au/apps/srw/search/peopleaustralia). The [OAI-PMH method](complete_harvest_oai.ipynb) is *much* faster, but includes duplicate records that you'll need to filter out afterwards. This notebook demonstrates the API/SRU method.

You can't use the SRU interface on its own because it limits the lifespan of result sets, so attempting to traverse the complete database produces unexpected results. The main Trove API doesn't include full details of People and Organisations, but it does include identifiers, and it supports bulk harvests. So you can harvest a complete list of identifiers from the main Trove API and then use these identifiers to request the full EAC-CPF records from the SRU interface. It's slow, but it seems to work.

I've saved a complete harvest of [all the people and organisations data](https://cloudstor.aarnet.edu.au/plus/s/3gyTJDJMsWWSKBi) on CloudStor (700MB zip file). The harvest was run on 23 January 2023. Each row in the dataset is a separate [EAC-CPF](https://eac.staatsbibliothek-berlin.de/) encoded XML record.

In [None]:
import json
import os
import re
import time
from datetime import datetime
from pathlib import Path

import requests_cache
from bs4 import BeautifulSoup
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from tqdm.auto import tqdm

s = requests_cache.CachedSession()
retries = Retry(total=5, backoff_factor=1, status_forcelist=[502, 503, 504])
s.mount("https://", HTTPAdapter(max_retries=retries))
s.mount("http://", HTTPAdapter(max_retries=retries))

In [None]:
%%capture
# Load variables from the .env file if it exists
# Use %%capture to suppress messages
%load_ext dotenv
%dotenv

In [None]:
# Insert your Trove API key
API_KEY = "YOUR API KEY"

# Use api key value from environment variables if it is available
if os.getenv("TROVE_API_KEY"):
    API_KEY = os.getenv("TROVE_API_KEY")

## Harvest identifiers from the Trove API

First we need to get the identifiers for all the people and organisations records from the main Trove API. We'll use a 'blank' search to get everything. The `bulkHarvest` parameter is necessary for large data harvests as it keeps the result set in a fixed order so you don't end up with duplicates.

In [None]:
params = {
    "zone": "people",
    "q": " ",  # Blank search to get everything
    "bulkHarvest": "true",
    "encoding": "json",
    "key": API_KEY,
    "n": 100,
}

api_url = "https://api.trove.nla.gov.au/v2/result"

In [None]:
def get_total_results(params):
    # Request zero records so the response only carries the total
    params["n"] = 0
    response = s.get(api_url, params=params, timeout=30)
    data = response.json()
    return int(data["response"]["zone"][0]["records"]["total"])

In [None]:
peau_ids = []
total = get_total_results(params.copy())
start = "*"
with tqdm(total=total) as pbar:
    while start:
        params["s"] = start
        response = s.get(api_url, params=params, timeout=30)
        data = response.json()
        for record in data["response"]["zone"][0]["records"]["people"]:
            peau_ids.append(record["id"])
        # If there are more results there'll be a value for `nextStart`
        # that we use as the `start` value in the next request.
        try:
            start = data["response"]["zone"][0]["records"]["nextStart"]
        # If there's no nextStart value then we've finished!
        except KeyError:
            start = None
        pbar.update(len(data["response"]["zone"][0]["records"]["people"]))
        time.sleep(0.2)

In [None]:
# Write the identifiers to a file as backup
with Path(f"peau_ids_{datetime.now().strftime('%Y%m%d')}.json").open("w") as json_file:
    json.dump(peau_ids, json_file)
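
If the session is interrupted after this point, the saved identifiers can be reloaded instead of repeating the identifier harvest. The following is a minimal sketch of that round trip. The `load_ids` helper, the sample file name, and the sample identifiers are made up for illustration and aren't part of the original notebook.

```python
import json
import tempfile
from pathlib import Path


def load_ids(path):
    """Read a JSON list of identifiers saved by an earlier harvest."""
    with Path(path).open() as json_file:
        return json.load(json_file)


# Round trip using made-up identifiers in a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir, "peau_ids_sample.json")
    with sample_path.open("w") as json_file:
        json.dump(["1234", "5678"], json_file)
    reloaded_ids = load_ids(sample_path)
```

To use it with a real backup, you'd pass the path of the date-stamped `peau_ids_*.json` file and assign the result to `peau_ids`.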

## Harvest EAC-CPF records

Now we have a big list of identifiers, we can use them to request the full records from the SRU interface.

In [None]:
# Basic params for SRU requests

p_params = {
    "version": "1.1",
    "operation": "searchRetrieve",
    "recordSchema": "urn:isbn:1-931666-33-4",  # EAC-CPF encoding
    "maximumRecords": 10,
    "startRecord": 1,
    "resultSetTTL": 300,
    "recordPacking": "xml",
    "recordXPath": "",
    "sortKeys": "",
}

p_api_url = "http://www.nla.gov.au/apps/srw/search/peopleaustralia"

This retrieves a single EAC-CPF encoded record at a time, appending it to a date-stamped output file named `peau-data-YYYYMMDD.xml`. If the harvest is interrupted, delete the output file before restarting to avoid creating duplicates. Query results are cached, so a restarted harvest will grab results from the cache if possible.

The output file has one EAC-CPF encoded record per line. This makes it easier to save and process the records efficiently.
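
Because of the one-record-per-line layout, the file can be processed as a stream without loading the whole harvest into memory. Here's a rough sketch of that idea using the standard library's XML parser. It assumes each line is well-formed XML on its own and that the file is UTF-8 encoded. The `iter_records` helper and the sample records are made up for illustration.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def iter_records(path):
    """Yield one parsed XML element per non-empty line of the file."""
    with Path(path).open(encoding="utf-8") as xml_file:
        for line in xml_file:
            line = line.strip()
            if line:
                yield ET.fromstring(line)


# Made-up sample data standing in for a real harvest file
sample_lines = [
    '<eac-cpf xmlns="urn:isbn:1-931666-33-4"><control><recordId>a</recordId></control></eac-cpf>',
    '<eac-cpf xmlns="urn:isbn:1-931666-33-4"><control><recordId>b</recordId></control></eac-cpf>',
]
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir, "sample.xml")
    sample_path.write_text("\n".join(sample_lines) + "\n", encoding="utf-8")
    # Collect the root tag of each record to show one element per line
    record_tags = [record.tag for record in iter_records(sample_path)]
```

Only one record is held in memory at a time, so the same loop should scale to the full harvest.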

In [None]:
# Set the output file name once, so a harvest that runs past midnight
# doesn't split its results across two date-stamped files
output_path = Path(f"peau-data-{datetime.now().strftime('%Y%m%d')}.xml")

for p_id in tqdm(peau_ids):
    # Construct a party id using the identifier and use it to query the SRU interface using the rec.identifier field
    p_params["query"] = f'rec.identifier="http://nla.gov.au/nla.party-{p_id}"'
    response = s.get(p_api_url, params=p_params, timeout=30)
    soup = BeautifulSoup(response.content, "xml")
    with output_path.open("a", encoding="utf-8") as xml_file:
        for record in soup.find_all("record"):
            # Extract the EAC-CPF record, skipping any result without one
            eac_cpf = record.find("eac-cpf")
            if eac_cpf is None:
                continue
            # Collapse line breaks and runs of whitespace so the record fits on one line
            eac_cpf = re.sub(r"\s+", " ", str(eac_cpf))
            # Write record as a new line
            xml_file.write(eac_cpf + "\n")
    # Only pause if the request actually hit the server
    if not response.from_cache:
        time.sleep(0.2)
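
As an alternative to deleting the output file after an interrupted harvest, exact duplicate lines could be filtered out afterwards. This is a sketch only, not something the original notebook does. It assumes that a record harvested twice produces byte-identical lines and that the file is UTF-8 encoded. The `dedupe_lines` helper and the sample file names are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def dedupe_lines(source_path, target_path):
    """Copy source to target, keeping only the first copy of each line."""
    seen = set()
    kept = 0
    with Path(source_path).open(encoding="utf-8") as source, Path(
        target_path
    ).open("w", encoding="utf-8") as target:
        for line in source:
            text = line.rstrip("\n")
            # Store a fixed-size digest rather than the full record text
            digest = hashlib.sha256(text.encode("utf-8")).digest()
            if digest not in seen:
                seen.add(digest)
                target.write(text + "\n")
                kept += 1
    return kept


# Made-up sample lines standing in for harvested records
with tempfile.TemporaryDirectory() as tmp_dir:
    source_file = Path(tmp_dir, "with-duplicates.xml")
    target_file = Path(tmp_dir, "deduped.xml")
    source_file.write_text("<a/>\n<b/>\n<a/>\n", encoding="utf-8")
    kept_count = dedupe_lines(source_file, target_file)
    deduped_text = target_file.read_text(encoding="utf-8")
```

Hashing each line keeps the memory needed for the `seen` set small, even when individual records are large.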

This will take a long time, but it works. If you don't want to run your own harvest, just [download my pre-harvested dataset](https://cloudstor.aarnet.edu.au/plus/s/oshEZJPK3hL0JdQ) from CloudStor (700MB zip file).

Now we can try [extracting some aggregate data from the people and organisations harvest](extract_aggregated_data_from_harvest.ipynb).

----

Created by [Tim Sherratt](http://timsherratt.org/) for the [GLAM Workbench](https://glam-workbench.net/).

The development of this notebook was supported by the [Australian Cultural Data Engine](https://www.acd-engine.org/).