# Harvest oral histories metadata


Many of the National Library of Australia's oral histories are being made available online. This notebook harvests metadata describing the oral history collection from Trove and saves the results as a CSV file for further exploration.

For an [overview of the oral history collection](https://tdg.glam-workbench.net/other-digitised-resources/oral-histories/overview.html), see the *Trove Data Guide*.

If you're using data from the oral histories in Trove, you should read the section on [licensing of oral histories](https://tdg.glam-workbench.net/other-digitised-resources/oral-histories/overview.html#licensing-of-oral-histories) in the Trove Data Guide.

## Harvesting method

Harvesting information about digitised resources (other than newspapers) from Trove is complex. Individual records are often grouped into 'works', and only some of the metadata is available through the API. To work around these problems, this notebook includes the following processing steps:

- harvest search results from the Trove API, saving the individual version records if the resource is held by the NLA
- if the oral history is digitised, scrape additional metadata from the Trove audio player, and save information about summaries, transcripts, and audio files
- merge duplicate records

See the Trove Data Guide for more information on [accessing data about oral histories](https://tdg.glam-workbench.net/other-digitised-resources/oral-histories/accessing-data.html).

## Search parameters

You can find oral histories by setting the `l-format` facet to `Sound/Interview, lecture, talk`. Usually I'd combine this with the standard `"nla.obj"` search for digitised resources, but I thought it would be interesting to look at which oral histories were available online, and which weren't. So instead of `"nla.obj"`, I've used the `nuc:` index to limit results to resources held by the NLA – `nuc:ANL OR nuc:"ANL:DL"`.
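
As a rough illustration of how these search parameters might be encoded into a request to the API's `/result` endpoint (the same endpoint used by the harvesting code in this notebook), here's a minimal sketch using only the standard library. The `build_search_url` helper is hypothetical and isn't part of the notebook, and the `category` value is an assumption borrowed from the harvest defaults.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

API_URL = "https://api.trove.nla.gov.au/v3/result"


def build_search_url(params):
    """Hypothetical helper: encode search parameters into a query string."""
    return f"{API_URL}?{urlencode(params)}"


params = {
    # Limit results to resources held by the NLA
    "q": 'nuc:ANL OR nuc:"ANL:DL"',
    # Limit results to oral histories
    "l-format": "Sound/Interview, lecture, talk",
    # Assumption: search across all categories, as the harvest code does
    "category": "all",
}

url = build_search_url(params)
# Decode the query string again to check that nothing was lost in encoding
decoded = parse_qs(urlsplit(url).query)
print(url)
```

A real request also needs an API key, which the notebook sends in the `X-API-KEY` header.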

## Pre-harvested dataset

You can [download a dataset](https://github.com/GLAM-Workbench/trove-oral-histories-data/blob/main/trove-oral-histories.csv) created by this notebook from the [trove-oral-histories-data](https://github.com/GLAM-Workbench/trove-oral-histories-data) GitHub repository.


In [None]:
import json
import os
import re
from functools import reduce
from pathlib import Path

import pandas as pd
import requests_cache
from bs4 import BeautifulSoup
from dotenv import load_dotenv
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from tqdm.auto import tqdm

load_dotenv()

In [None]:
s = requests_cache.CachedSession(timeout=60)
retries = Retry(total=5, backoff_factor=1, status_forcelist=[502, 503, 504])
s.mount("https://", HTTPAdapter(max_retries=retries))
s.mount("http://", HTTPAdapter(max_retries=retries))

In [None]:
# Insert your Trove API key
API_KEY = "YOUR API KEY"

# Use api key value from environment variables if it is available
if os.getenv("TROVE_API_KEY"):
    API_KEY = os.getenv("TROVE_API_KEY")

## Harvest metadata from the Trove API

The code below is based on [HOW TO: Harvest data relating to digitised resources](https://tdg.glam-workbench.net/other-digitised-resources/how-to/harvest-digitised-resources.html) in the *Trove Data Guide*. It harvests search results using the Trove API, saving individual version records if the `holding` value includes one of the NUCs `ANL` or `ANL:DL` (which means that it's in the NLA's collection). Each version record is saved on a new line as a JSON object in a `ndjson` (Newline Delimited JSON) file.
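
If you haven't used `ndjson` files before, this minimal sketch shows the idea: each record is written as a JSON object on its own line, so the file can be read back one record at a time. The records and the file name here are invented placeholders, not real Trove data.

```python
import json
import tempfile
from pathlib import Path

# Invented placeholder records, only loosely shaped like the harvested data
records = [
    {"title": ["Example interview one"], "fulltext_url": ""},
    {"title": ["Example interview two"], "fulltext_url": "https://example.org/two"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir, "example.ndjson")

    # Write one JSON object per line
    with path.open("w") as ndjson_file:
        for record in records:
            ndjson_file.write(f"{json.dumps(record)}\n")

    # Read the file back line by line
    with path.open() as ndjson_file:
        loaded = [json.loads(line) for line in ndjson_file]

print(len(loaded))
```

Because each line is a complete JSON object, a harvest that's interrupted part way through still leaves a file that can be read up to the last complete line.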

In [None]:
def get_total_results(params, headers):
    """
    Get the total number of results for a search.
    """
    these_params = params.copy()
    these_params["n"] = 0
    response = s.get(
        "https://api.trove.nla.gov.au/v3/result", params=these_params, headers=headers
    )
    data = response.json()
    return int(data["category"][0]["records"]["total"])


def get_value(record, field, keys=["value"]):
    """
    Get the values of a field.
    Some fields are lists of dicts, if so use the first of `keys` that matches to get the values.
    """
    value = record.get(field, [])
    if value and isinstance(value[0], dict):
        for key in keys:
            try:
                return [re.sub(r"\s+", " ", v[key]) for v in value]
            except KeyError:
                pass
        # None of the keys matched, return an empty list so that
        # merge_values() can still concatenate the result
        return []
    else:
        return value


def merge_values(record, fields, keys=["value"]):
    """
    Merges values from multiple fields, removing any duplicates.
    """
    values = []
    for field in fields:
        values += get_value(record, field, keys)
    # Remove duplicates and None value
    return list(set([v for v in values if v is not None]))


def flatten_values(record, field, key="type"):
    """
    If a field has a value and type, return the values as strings with this format: 'type: value'
    """
    flattened = []
    values = record.get(field, [])
    for value in values:
        if key in value:
            flattened.append(f"{value[key]}: {value['value']}")
        else:
            flattened.append(value["value"])
    return flattened


def flatten_identifiers(record):
    """
    Get a list of control numbers from the identifier field and flatten the values.
    """
    ids = {
        "identifier": [
            v
            for v in record.get("identifier", [])
            if "type" in v and v["type"] == "control number"
        ]
    }
    return flatten_values(ids, "identifier", "source")


def get_fulltext_url(links):
    """
    Loop through the identifiers to find a link to the full text version of the book.
    """
    urls = []
    for link in links:
        if (
            "linktype" in link
            and link["linktype"] == "fulltext"
            and "nla.obj" in link["value"]
        ):
            url = re.sub(r"^http\b", "https", link["value"])
            link_text = link.get("linktext", "")
            urls.append({"url": url, "link_text": link_text})
    return urls


def get_catalogue_url(links):
    """
    Loop through the identifiers to find a link to the NLA catalogue.
    """
    for link in links:
        if (
            "linktype" in link
            and link["linktype"] == "notonline"
            and "nla.cat" in link["value"]
        ):
            return link["value"]
    return ""


def has_fulltext_link(links):
    """
    Check if a list of identifiers includes a fulltext url pointing to an NLA resource.
    """
    for link in links:
        if (
            "linktype" in link
            and link["linktype"] == "fulltext"
            and "nla.obj" in link["value"]
        ):
            return True


def has_holding(holdings, nucs):
    """
    Check if a list of holdings includes one of the supplied nucs.
    """
    for holding in holdings:
        if holding.get("nuc") in nucs:
            return True


def get_digitised_versions(work):
    """
    Get the versions from the given work that have a fulltext url pointing to an NLA resource
    in the `identifier` field.
    """
    versions = []
    for version in work["version"]:
        if "identifier" in version and has_fulltext_link(version["identifier"]):
            versions.append(version)
    return versions


def get_nuc_versions(work, nucs=["ANL", "ANL:DL"]):
    """
    Get the versions from the given work that are held by the NLA.
    """
    versions = []
    for version in work["version"]:
        # Use the supplied nucs rather than a hard-coded list
        if "holding" in version and has_holding(version["holding"], nucs):
            versions.append(version)
    return versions


def harvest_works(
    params,
    filter_by="url",
    nucs=["ANL", "ANL:DL"],
    output="oral-histories-metadata.ndjson",
):
    """
    Harvest metadata relating to digitised works.
    The filter_by parameter selects records for inclusion in the dataset, options:
        * url -- only include versions that have an NLA fulltext url
        * nuc -- only include versions that have an NLA nuc (ANL or ANL:DL)
    """
    default_params = {
        "category": "all",
        "bulkHarvest": "true",
        "n": 100,
        "encoding": "json",
        "include": ["links", "workversions", "holdings"],
    }
    params.update(default_params)
    headers = {"X-API-KEY": API_KEY}
    total = get_total_results(params, headers)
    start = "*"
    with Path(output).open("w") as ndjson_file:
        with tqdm(total=total) as pbar:
            while start:
                params["s"] = start
                response = s.get(
                    "https://api.trove.nla.gov.au/v3/result",
                    params=params,
                    headers=headers,
                )
                data = response.json()
                items = data["category"][0]["records"]["item"]
                for item in items:
                    for category, record in item.items():
                        if category == "work":
                            if filter_by == "nuc":
                                versions = get_nuc_versions(record, nucs)
                            else:
                                versions = get_digitised_versions(record)
                            for version in versions:
                                for sub_version in version["record"]:
                                    metadata = sub_version["metadata"]["dc"]
                                    # Sometimes fulltext identifiers are only available on the
                                    # version rather than the sub version. So we'll look in the
                                    # sub version first, and if they're not there use the url from
                                    # the version.
                                    # Sometimes there are multiple fulltext urls associated with a version:
                                    # eg a collection page and a publication. If so add records for both urls.
                                    # They could end up pointing to the same digitised publication, but
                                    # we can sort that out later. Aim here is to try and not miss any possible
                                    # routes to digitised publications!
                                    urls = get_fulltext_url(
                                        metadata.get("identifier", [])
                                    )
                                    if len(urls) == 0:
                                        urls = get_fulltext_url(
                                            version.get("identifier", [])
                                        )
                                    if len(urls) == 0 and filter_by == "nuc":
                                        urls = [{"url": "", "link_text": ""}]
                                    for url in urls:
                                        work = {
                                            # This is not the full set of available fields,
                                            # adjust as necessary.
                                            "title": get_value(metadata, "title"),
                                            "work_url": record.get("troveUrl"),
                                            "work_type": record.get("type", []),
                                            "contributor": merge_values(
                                                metadata,
                                                ["creator", "contributor"],
                                                ["value", "name"],
                                            ),
                                            "publisher": get_value(
                                                metadata, "publisher"
                                            ),
                                            "date": merge_values(
                                                metadata, ["date", "issued"]
                                            ),
                                            # Using merge here because I've noticed some duplicate values
                                            "type": merge_values(metadata, ["type"]),
                                            "format": get_value(metadata, "format"),
                                            "rights": merge_values(
                                                metadata, ["rights", "licenseRef"]
                                            ),
                                            "language": get_value(metadata, "language"),
                                            "extent": get_value(metadata, "extent"),
                                            "subject": merge_values(
                                                metadata, ["subject"]
                                            ),
                                            "spatial": get_value(metadata, "spatial"),
                                            # Flattened type/value
                                            "is_part_of": flatten_values(
                                                metadata, "isPartOf"
                                            ),
                                            # Only get control numbers and flatten
                                            "identifier": flatten_identifiers(metadata),
                                            "fulltext_url": url["url"],
                                            "fulltext_url_text": url["link_text"],
                                            # Use .get() in case the record has no identifiers
                                            "catalogue_url": get_catalogue_url(
                                                metadata.get("identifier", [])
                                            )
                                            # Could also add in data from bibliographicCitation
                                            # Although the types used in citations seem to vary by work and format.
                                        }
                                        ndjson_file.write(f"{json.dumps(work)}\n")
                # The nextStart parameter is used to get the next page of results.
                # If there's no nextStart then it means we're on the last page of results.
                try:
                    start = data["category"][0]["records"]["nextStart"]
                except KeyError:
                    start = None
                pbar.update(len(items))

In [None]:
# Do the harvest!
params = {
    "q": 'nuc:ANL OR nuc:"ANL:DL"',
    "l-format": "Sound/Interview, lecture, talk",
}

harvest_works(params, filter_by="nuc")

## Scrape additional metadata from the audio player

If the oral histories are digitised, they'll have a `fulltext_url` value which points to the Trove audio player. We can use this url to extract some additional metadata. See [HOW TO: Scrape metadata from the Trove audio player](https://tdg.glam-workbench.net/other-digitised-resources/how-to/scrape-metadata-audio-player.html) in the *Trove Data Guide*. The audio player also uses a JavaScript file that lists details of sessions, audio files, and whether there is an associated summary or transcript. By downloading the JavaScript file we can add this information to the dataset.
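
To give a sense of how the session data can be pulled out of that file, here's a minimal sketch that extracts a JSON object wrapped in a `define(...)` call and adds up the duration of the first file in each session. The `sample_js` string is invented for illustration and real files will contain more fields. The `extract_json` helper is hypothetical, though the field names match the ones used by the code in this notebook.

```python
import json
import re

# Invented sample, loosely imitating the structure of the audio player's file
sample_js = (
    'define({"anySummary": true, "anyTranscript": false, "sessionFiles": ['
    '{"files": [{"href": "/nla.obj-000000001", "duration": 90}]}, '
    '{"files": [{"href": "/nla.obj-000000002", "duration": 30}]}]})'
)


def extract_json(js_text):
    """Hypothetical helper: return the JSON object passed to define()."""
    match = re.search(r"define\((\{.*\})\)", js_text, re.DOTALL)
    if match is None:
        return {}
    return json.loads(match.group(1))


data = extract_json(sample_js)
# Add up the duration of the first file in each session
total_duration = sum(
    session["files"][0]["duration"] for session in data["sessionFiles"]
)
print(data["anySummary"], data["anyTranscript"], total_duration)
```

Returning an empty dictionary when nothing matches lets the calling code skip the record without raising an error.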

In [None]:
def scrape_metadata(url):
    """
    Scrape metadata about an oral history from the audio player page.
    """
    response = s.get(url)
    # If this is a collection page you'll get a 404
    if response.status_code != 200:
        return {}
    # Name the parser explicitly so results don't depend on which parsers are installed
    soup = BeautifulSoup(response.text, "html.parser")
    # Get the metadata container
    details = soup.find("div", class_="workdetails")
    if not details:
        return {}
    # Get link to NLA catalogue
    catalogue = details.find("section", class_="catalogue")
    catalogue_link = catalogue.find("a", href=re.compile("nla.cat-vn"))["href"]
    # Get oral history id
    oral_history_id = ""
    for string in catalogue.stripped_strings:
        if string.startswith("ORAL TRC"):
            oral_history_id = string
    # Get extent, description and notes
    extent = []
    description = []
    for section in details.find_all("section", class_="extent"):
        if section.string.startswith("Recorded"):
            description.append(section.string.strip())
        else:
            extent.append(section.string)
    try:
        notes = details.find("section", class_="notes").string
    except AttributeError:
        notes = ""
    # Get contributors and role
    contributors = []
    for div in details.find_all("div", class_="contributor"):
        role = div.find("span", class_="role")
        if role:
            contributors.append(f"{list(div.stripped_strings)[0]} {role.string}")
        else:
            contributors.append(f"{list(div.stripped_strings)[0]}")
    return {
        "catalogue_url": catalogue_link,
        "identifier": oral_history_id,
        "description": description,
        "extent": extent,
        "notes": notes,
        "contributor": contributors,
    }


def get_download_data(url):
    """
    Get information about sessions and files from a javascript file used by the audio player.
    """
    id = re.search(r"(nla\.obj\-\d+)", url).group(1)
    response = s.get(f"https://nla.gov.au/tarkine/listen/transcript/{id}.js")
    if response.status_code != 200:
        return {}
    # Extract the JSON data embedded in the JS function
    data = re.search(r"define\((\{.*)\)", response.text, re.DOTALL).group(1)
    # print(data)
    json_data = json.loads(data)
    return json_data


def enrich_metadata(
    input="oral-histories-metadata.ndjson",
    output="oral-histories-metadata-files.ndjson",
):
    """
    Enrich records for online oral histories by extracting additional metadata from the audio player.
    """
    # Count the records first so the progress bar has a total
    with Path(input).open("r") as ndjson_in:
        total = sum(1 for _ in ndjson_in)
    with Path(output).open("w") as ndjson_out:
        with Path(input).open("r") as ndjson_in:
            for line in tqdm(ndjson_in, total=total):
                work = json.loads(line)
                if url := work["fulltext_url"]:
                    # Scrape additional metadata from audio player UI
                    metadata = scrape_metadata(url)
                    if metadata:
                        work["catalogue_url"] = metadata["catalogue_url"]
                        work["identifier"].append(metadata["identifier"])
                        work["description"] = metadata["description"]
                        work["notes"] = [metadata["notes"]]
                        work["extent"] = list(set(work["extent"] + metadata["extent"]))
                        if metadata["contributor"]:
                            work["contributor"] = metadata["contributor"]
                        # Get data about sessions, files, and transcripts from JS file
                        downloads = get_download_data(url)
                        if downloads:
                            work["summary"] = downloads["anySummary"]
                            work["transcript"] = downloads["anyTranscript"]
                            sessions = downloads["sessionFiles"]
                            work["sessions"] = len(sessions)
                            file_ids = []
                            duration = 0
                            # Loop through all the sessions in this oral history
                            for session in sessions:
                                try:
                                    file = session["files"][0]
                                except KeyError:
                                    # print(url)
                                    pass
                                else:
                                    # Add id for this session's audio files
                                    file_ids.append(
                                        re.search(r"nla\.obj-\d+", file["href"]).group(
                                            0
                                        )
                                    )
                                    # Add the duration of this file to the total duration
                                    duration += file["duration"]
                            work["duration"] = duration
                            work["audio_file_ids"] = file_ids
                        ndjson_out.write(f"{json.dumps(work)}\n")
                    # If there's no metadata the fulltext url is probably giving a 404.
                    # This is the case for 'collection' pages that don't actually seem to exist.
                    # There are also a couple of fulltext urls that go to the image viewer.
                    # These records are dropped from the dataset, but urls displayed for checking.
                    else:
                        print(work["fulltext_url"])
                else:
                    ndjson_out.write(f"{json.dumps(work)}\n")

In [None]:
enrich_metadata()

## Merge duplicate records

The harvested data will contain duplicates. Some duplicates result from splitting apart the version groupings, but others are already in Trove. These duplicates are not *exactly* the same – they refer to the same thing, but can contain slightly different metadata. We want to combine them without losing any of this metadata. The strategy is to divide the columns into two sets: columns that we know have only one value and don't need to be merged, and columns that could contain multiple values that we want to deduplicate and merge. Then we can:

- create a deduplicated dataframe from the first set of columns
- process the second set of columns by merging duplicate values and saving into a new dataframe
- combine the dataframes using a shared, unique identifier

The harvested data includes oral histories that haven't been digitised as well as those that have. We need to handle these separately, as the mix of columns and identifiers will be different. Once the digitised and not-digitised records are deduplicated, we can join them back together again.
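
To make the merging idea concrete, here's a minimal sketch in plain Python rather than pandas. It groups records by a shared identifier and combines the values of multi-valued fields into pipe-separated strings. The records and the `join_unique` and `merge_duplicates` helpers are invented for illustration; this is not how the notebook does it, as the real code works on dataframes.

```python
from collections import defaultdict


def join_unique(values):
    """Flatten lists, drop empty and duplicate values, and join with ' | '."""
    flat = []
    for value in values:
        if isinstance(value, list):
            flat += [str(v) for v in value if v]
        elif value:
            flat.append(str(value))
    return " | ".join(sorted(set(flat)))


def merge_duplicates(records, link_field, merge_fields):
    """Hypothetical helper: combine records that share a link_field value."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record[link_field]].append(record)
    merged = []
    for link_value, group in grouped.items():
        row = {link_field: link_value}
        for field in merge_fields:
            row[field] = join_unique([r.get(field) for r in group])
        merged.append(row)
    return merged


# Invented records: the first two describe the same resource
records = [
    {"fulltext_url": "https://example.org/a", "title": ["Interview A"], "subject": ["Farming"]},
    {"fulltext_url": "https://example.org/a", "title": ["Interview A"], "subject": ["Rural life"]},
    {"fulltext_url": "https://example.org/b", "title": ["Interview B"], "subject": []},
]

merged = merge_duplicates(records, "fulltext_url", ["title", "subject"])
print(merged)
```

The two records that share a url collapse into one row, with the identical titles deduplicated and the differing subjects both kept.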

In [None]:
def merge_column(columns):
    """
    Combine values from multiple columns, removing duplicates, and returning as a pipe-separated string.
    """
    values = []
    for value in columns:
        if isinstance(value, list):
            values += [str(v) for v in value if v]
        elif value:
            values.append(str(value))
    return " | ".join(sorted(set(values)))


def merge_records(df, int_columns, keep_columns, merge_columns, link_column):
    """
    Remove duplicate records in the supplied dataset by:
    - creating a deduplicated dataframe with columns in `keep_columns`
    - merging values of columns in the `merge_columns` list, creating a new dataframe for each column
    - combining the deduplicated and merged column dataframes, linking on `link_column`
    """
    # Before I get rid of NANs set int cols to 0
    # (assign the result back rather than using inplace on a column selection)
    for int_col in int_columns:
        df[int_col] = df[int_col].fillna(0).astype("Int64")
    # Get rid of NANs so they don't cause problems when merging
    df.fillna("", inplace=True)

    # Add base dataset with columns that will always have only one value
    dfs = [df[keep_columns].drop_duplicates()]

    # Merge values from each column in turn, creating a new dataframe from each
    for column in merge_columns:
        dfs.append(df.groupby([link_column])[column].apply(merge_column).reset_index())

    # Merge all the individual dataframes into one, linking on the `link_column` value
    df_merged = reduce(
        lambda left, right: pd.merge(left, right, on=[link_column], how="left"), dfs
    )
    return df_merged

First we load the harvested metadata.

In [None]:
df = pd.read_json("oral-histories-metadata-files.ndjson", lines=True)

Then we create lists of columns that have a single value, and those that can have multiple values and will be merged.

In [None]:
# Not digitised

int_columns = ["summary", "transcript", "sessions", "duration"]
keep_columns_nd = [
    "fulltext_url",
    "work_url",
    "summary",
    "transcript",
    "sessions",
    "duration",
]
merge_columns_nd = [
    "title",
    "work_type",
    "contributor",
    "publisher",
    "date",
    "type",
    "format",
    "extent",
    "language",
    "subject",
    "spatial",
    "is_part_of",
    "identifier",
    "rights",
    "fulltext_url_text",
    "catalogue_url",
    "audio_file_ids",
]

# Digitised oral histories

keep_columns_d = ["fulltext_url", "summary", "transcript", "sessions", "duration"]
merge_columns_d = [
    "title",
    "work_url",
    "work_type",
    "contributor",
    "publisher",
    "date",
    "type",
    "format",
    "extent",
    "language",
    "subject",
    "spatial",
    "is_part_of",
    "identifier",
    "rights",
    "fulltext_url_text",
    "catalogue_url",
    "audio_file_ids",
]

We merge the digitised and not-digitised oral histories separately, then combine the results.

In [None]:
# Select the rows first, then copy, so merge_records() works on an independent dataframe
df_merged_digitised = merge_records(
    df.loc[df["fulltext_url"] != ""].copy(),
    int_columns,
    keep_columns_d,
    merge_columns_d,
    "fulltext_url",
)

df_merged_not_digitised = merge_records(
    df.loc[df["fulltext_url"] == ""].copy(),
    int_columns,
    keep_columns_nd,
    merge_columns_nd,
    "work_url",
)

df_merged = pd.concat([df_merged_not_digitised, df_merged_digitised])

Save the merged dataset as a CSV file.

In [None]:
df_merged[
    [
        "title",
        "contributor",
        "publisher",
        "date",
        "type",
        "format",
        "extent",
        "language",
        "subject",
        "spatial",
        "is_part_of",
        "identifier",
        "rights",
        "work_url",
        "work_type",
        "fulltext_url",
        "fulltext_url_text",
        "catalogue_url",
        "summary",
        "transcript",
        "sessions",
        "duration",
    ]
].sort_values("title").to_csv("trove-oral-histories.csv", index=False)

In [None]:
# FOR TESTING ONLY -- PLEASE IGNORE

md_output = Path("test-metadata.ndjson")
md_output_head = Path("test-metadata-head.ndjson")
md_files_output = Path("test-metadata-files.ndjson")

# Do the API harvest!
params = {
    "q": 'nuc:ANL OR nuc:"ANL:DL"',
    "l-format": "Sound/Interview, lecture, talk",
}

with s.cache_disabled():
    harvest_works(params, filter_by="nuc", output=md_output)

# Create a subset for enriching
with md_output.open("r") as input_file:
    head = [next(input_file) for _ in range(100)]

with md_output_head.open("w") as head_file:
    for line in head:
        head_file.write(line)

# Enrich
with s.cache_disabled():
    enrich_metadata(input=md_output_head, output=md_files_output)

df = pd.read_json(md_files_output, lines=True)

df_merged_not_digitised = merge_records(
    df.loc[df["fulltext_url"] == ""].copy(),
    int_columns,
    keep_columns_nd,
    merge_columns_nd,
    "work_url",
)

df_merged_digitised = merge_records(
    df.loc[df["fulltext_url"] != ""].copy(),
    int_columns,
    keep_columns_d,
    merge_columns_d,
    "fulltext_url",
)

df_merged = pd.concat([df_merged_not_digitised, df_merged_digitised])

md_output.unlink()
md_output_head.unlink()
md_files_output.unlink()

----

Created by [Tim Sherratt](https://timsherratt.org) for the [GLAM Workbench](https://glam-workbench.net/).