# Harvesting the complete set of data from the People and Organisations zone using OAI-PMH

There are two methods of harvesting the complete set of data from the People and Organisations zone – using the [OAI-PMH API](http://www.nla.gov.au/apps/peopleaustralia-oai/), or using the main Trove API in conjunction with the [SRU interface](http://www.nla.gov.au/apps/srw/search/peopleaustralia). The OAI-PMH method is *much* faster, but includes duplicate records that you'll need to filter out afterwards. This notebook demonstrates the OAI-PMH method.

Using the `ListRecords` method with the OAI-PMH API will cause an error unless you provide a `set` parameter. There is a set for each organisation contributing data to the People and Organisations zone. So to harvest everything you need to loop through the list of sets, downloading the records for each. However, records are matched and merged across sets, so the same record can appear in multiple sets. This means your harvest will contain duplicate records.
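OAI-PMH responses report protocol-level problems, such as a missing argument, in `<error>` elements within the XML body. The sketch below shows one way to check a response for them using only the standard library. The helper name `oai_errors` and the sample response are illustrative; the exact error code this service returns when `set` is omitted is not confirmed here.

```python
import xml.etree.ElementTree as ET

# Namespace used by all OAI-PMH 2.0 responses
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def oai_errors(xml_text):
    """Return (code, message) pairs for any <error> elements in an OAI-PMH response."""
    root = ET.fromstring(xml_text)
    return [
        (error.get("code"), (error.text or "").strip())
        for error in root.iter(f"{OAI_NS}error")
    ]


# Illustrative response only, not captured from the live service
sample = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <request>http://example.org/oai</request>
  <error code="badArgument">Illustrative error message</error>
</OAI-PMH>"""

print(oai_errors(sample))
```

In a harvest you could call something like this on each response and stop if the returned list is not empty.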

The list of records also includes records that have been deleted. These are effectively empty, containing only an identifier and a date. They're not saved as part of the harvest, but you could adjust the code below to save them to a file if you wanted to.
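The OAI-PMH specification marks deleted records with a `status="deleted"` attribute on the record's `<header>`. Assuming this service follows that convention (the harvesting code in this notebook instead checks for a missing `eac-cpf` element), a minimal sketch for collecting the identifiers of deleted records might look like this. The function name and the sample data are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace used by all OAI-PMH 2.0 responses
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def deleted_identifiers(xml_text):
    """List the identifiers of records whose header is flagged as deleted."""
    root = ET.fromstring(xml_text)
    deleted = []
    for header in root.iter(f"{OAI_NS}header"):
        # Assumption: deleted records carry status="deleted" on the header
        if header.get("status") == "deleted":
            deleted.append(header.findtext(f"{OAI_NS}identifier"))
    return deleted


# Made-up sample: one deleted record, one live record
sample = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header status="deleted">
        <identifier>oai:example.org:1</identifier>
        <datestamp>2020-01-01</datestamp>
      </header>
    </record>
    <record>
      <header>
        <identifier>oai:example.org:2</identifier>
        <datestamp>2020-01-02</datestamp>
      </header>
      <metadata/>
    </record>
  </ListRecords>
</OAI-PMH>"""

print(deleted_identifiers(sample))
```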

The results of the harvest are saved to a file named with the current date – `peau-oai-data-YYYYMMDD.xml`. Each line in the file is a separate EAC-CPF encoded XML record.
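Because each line holds one record, duplicates can be filtered in a single pass over the file. The sketch below keeps the first line seen for each record identifier. It assumes the identifier sits in the EAC-CPF `<recordId>` element and that duplicated records share the same value there; check this against your own harvest before relying on it. The function name `unique_records` and the sample lines are illustrative.

```python
import re

# Assumption: each record carries its identifier in an EAC-CPF <recordId> element
# (an optional namespace prefix is allowed for)
RECORD_ID = re.compile(r"<(?:\w+:)?recordId>(.*?)</(?:\w+:)?recordId>")


def unique_records(lines):
    """Yield one line per record identifier, keeping the first occurrence."""
    seen = set()
    for line in lines:
        match = RECORD_ID.search(line)
        # Fall back to the whole line if no identifier is found
        key = match.group(1).strip() if match else line.strip()
        if key not in seen:
            seen.add(key)
            yield line


# Made-up sample lines standing in for a harvested file
sample_lines = [
    "<eac-cpf><control><recordId>a1</recordId></control></eac-cpf>\n",
    "<eac-cpf><control><recordId>b2</recordId></control></eac-cpf>\n",
    "<eac-cpf><control><recordId>a1</recordId></control></eac-cpf>\n",
]
deduped = list(unique_records(sample_lines))
print(len(deduped))
```

To apply this to a harvest, pass the open file object in place of `sample_lines` and write the yielded lines to a new file.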

In [None]:
import re
from datetime import datetime
from pathlib import Path

import requests
from bs4 import BeautifulSoup
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from tqdm.auto import tqdm

# Not using requests_cache as caching results causes problems with resumptionToken
s = requests.Session()
retries = Retry(total=5, backoff_factor=1, status_forcelist=[502, 503, 504])
s.mount("https://", HTTPAdapter(max_retries=retries))
s.mount("http://", HTTPAdapter(max_retries=retries))

## Get the list of sets

First use the `ListSets` method to get a list of all the available sets.

In [None]:
# Get a list of sets
response = s.get(
    "http://www.nla.gov.au/apps/peopleaustralia-oai/OAIHandler?verb=ListSets"
)
set_soup = BeautifulSoup(response.text, features="xml")
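A `ListSets` response contains one `<set>` element per set, each with a `<setSpec>` (the value to pass as the `set` parameter) and a `<setName>`. As a rough illustration of that structure, here is a sketch that pulls out both values. It uses the standard library rather than BeautifulSoup so it runs without a network request, and the set values in the sample are invented.

```python
import xml.etree.ElementTree as ET

# Namespace used by all OAI-PMH 2.0 responses
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def list_sets(xml_text):
    """Return (setSpec, setName) pairs from an OAI-PMH ListSets response."""
    root = ET.fromstring(xml_text)
    return [
        (set_el.findtext(f"{OAI_NS}setSpec"), set_el.findtext(f"{OAI_NS}setName"))
        for set_el in root.iter(f"{OAI_NS}set")
    ]


# Invented sample, not real sets from the People and Organisations zone
sample = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListSets>
    <set><setSpec>EXAMPLE1</setSpec><setName>Example contributor one</setName></set>
    <set><setSpec>EXAMPLE2</setSpec><setName>Example contributor two</setName></set>
  </ListSets>
</OAI-PMH>"""

for spec, name in list_sets(sample):
    print(spec, name)
```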

## Get the records

Next loop through the list of sets, using the `ListRecords` method with the `set` parameter to get all the records in each set.

Each EAC-CPF encoded record is written as a single line, appended to the `peau-oai-data-YYYYMMDD.xml` file. Because the file is opened in append mode, delete the output file before restarting an interrupted harvest to avoid creating duplicates.
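If you'd rather not delete the file by hand, a small helper can clear it before a fresh harvest. This is only a sketch; `reset_output` is a hypothetical name, and `missing_ok` requires Python 3.8 or later.

```python
from pathlib import Path


def reset_output(path):
    """Delete a partial harvest file if it exists, so appended records aren't doubled up."""
    # missing_ok=True means no error is raised if the file isn't there
    Path(path).unlink(missing_ok=True)
```

You could call `reset_output(output)` just after the output path is set and before the harvesting loop starts.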

In [None]:
# Set the output file
output = Path(f"peau-oai-data-{datetime.now().strftime('%Y%m%d')}.xml")

# Loop through the sets
for source in set_soup.find_all("set"):
    source_id = source.setSpec.string
    print(source_id)
    # Get the records in this set
    with tqdm() as pbar:
        params = {"verb": "ListRecords", "set": source_id, "metadataPrefix": "eac-cpf"}
        # OAI-PMH uses a resumption token to paginate through the complete results set
        # We'll continue harvesting until there's no resumption token
        while params:
            response = s.get(
                "http://www.nla.gov.au/apps/peopleaustralia-oai/OAIHandler",
                params=params,
            )
            soup = BeautifulSoup(response.text, features="xml")
            with output.open("a") as xml_file:
                for record in soup.find_all("record"):
                    # Extract the EAC-CPF record
                    # First check there is an eac-cpf record inside
                    # If there's not, it's probably a deleted record
                    if not record.find("eac-cpf"):
                        # You could change this to display or save deleted records
                        # print(record)
                        # Skip this record but keep processing the rest of the page
                        continue
                    eac_cpf = str(record.find("eac-cpf"))
                    # Strip out any line breaks within the record
                    eac_cpf = eac_cpf.replace("\n", "")
                    eac_cpf = re.sub(r"\s+", " ", eac_cpf)
                    # Write record as a new line
                    xml_file.write(eac_cpf + "\n")
                    pbar.update(1)
            # Get the resumption token and add to request params
            if soup.find("resumptionToken") and soup.find("resumptionToken").string:
                if not pbar.total:
                    pbar.total = int(soup.find("resumptionToken")["completeListSize"])
                params = {
                    "verb": "ListRecords",
                    "resumptionToken": soup.find("resumptionToken").string,
                }
            else:
                params = None

----

Created by [Tim Sherratt](http://timsherratt.org/) for the [GLAM Workbench](https://glam-workbench.net/).

The development of this notebook was supported by the [Australian Cultural Data Engine](https://www.acd-engine.org/).