{ "cells": [ { "cell_type": "markdown", "id": "21cee6b0-eb70-4034-a0ea-62272485ec9d", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "# Harvest oral histories metadata\n", "\n", "\n", "Many of the National Library of Australia's oral histories are being made available online. This notebook harvests metadata describing the oral history collection from Trove and saves the results as a CSV file for further exploration.\n", "\n", "For an [overview of the oral history collection](https://tdg.glam-workbench.net/other-digitised-resources/oral-histories/overview.html), see the *Trove Data Guide*.\n", "\n", "If you're using data from the oral histories in Trove, you should read the section on [licensing of oral histories](https://tdg.glam-workbench.net/other-digitised-resources/oral-histories/overview.html#licensing-of-oral-histories) in the Trove Data Guide.\n", "\n", "## Harvesting method\n", "\n", "Harvesting information about digitised resources (other than newspapers) from Trove is complex. Individual records are often grouped into 'works', and only some of the metadata is available through the API. To work around these problems, this notebook includes the following processing steps:\n", "\n", "- harvest search results from the Trove API, saving the individual version records if the resource is held by the NLA\n", "- if the oral history is digitised, scrape additional metadata from the Trove audio player, and save information about summaries, transcripts, and audio files\n", "- merge duplicate records\n", "\n", "See the Trove Data Guide for more information on [accessing data about oral histories](https://tdg.glam-workbench.net/other-digitised-resources/oral-histories/accessing-data.html).\n", "\n", "## Search parameters\n", "\n", "You can find oral histories by setting the `l-format` facet to `Sound/Interview, lecture, talk`. Usually I'd combine this with the standard `\"nla.obj\"` search for digitised resources, but I thought it would be interesting to look at which oral histories were available online, and which weren't. 
So instead of `\"nla.obj\"`, I've used the `nuc:` index to limit results to resources held by the NLA – `nuc:ANL OR nuc:\"ANL:DL\"`.\n", "\n", "## Pre-harvested dataset\n", "\n", "You can [download a dataset](https://github.com/GLAM-Workbench/trove-oral-histories-data/blob/main/trove-oral-histories.csv) created by this notebook from the [trove-oral-histories-data](https://github.com/GLAM-Workbench/trove-oral-histories-data) GitHub repository.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "ec2f99b5-eacd-4c45-827d-a154bbe6c188", "metadata": {}, "outputs": [], "source": [ "import json\n", "import os\n", "import re\n", "from functools import reduce\n", "from pathlib import Path\n", "\n", "import pandas as pd\n", "import requests_cache\n", "from bs4 import BeautifulSoup\n", "from dotenv import load_dotenv\n", "from requests.adapters import HTTPAdapter\n", "from requests.packages.urllib3.util.retry import Retry\n", "from tqdm.auto import tqdm\n", "\n", "load_dotenv()" ] }, { "cell_type": "code", "execution_count": null, "id": "cf2b9d63-ed83-4466-8b84-fed5edb26058", "metadata": {}, "outputs": [], "source": [ "s = requests_cache.CachedSession(timeout=60)\n", "retries = Retry(total=5, backoff_factor=1, status_forcelist=[502, 503, 504])\n", "s.mount(\"https://\", HTTPAdapter(max_retries=retries))\n", "s.mount(\"http://\", HTTPAdapter(max_retries=retries))" ] }, { "cell_type": "code", "execution_count": null, "id": "384e0102-4e92-4566-9fa4-e2fc6306fad8", "metadata": {}, "outputs": [], "source": [ "# Insert your Trove API key\n", "API_KEY = \"YOUR API KEY\"\n", "\n", "# Use API key value from environment variables if it is available\n", "if os.getenv(\"TROVE_API_KEY\"):\n", "    API_KEY = os.getenv(\"TROVE_API_KEY\")" ] }, { "cell_type": "markdown", "id": "88b092fc-4e12-40ea-8a70-961115e9d980", "metadata": {}, "source": [ "## Harvest metadata from the Trove API\n", "\n", "The code below is based on [HOW TO: Harvest data relating to digitised resources](https://tdg.glam-workbench.net/other-digitised-resources/how-to/harvest-digitised-resources.html) in the *Trove Data Guide*. It harvests search results using the Trove API, saving individual version records if the `holding` value includes one of the NUCs `ANL` or `ANL:DL` (which means that it's in the NLA's collection). Each version record is saved on a new line as a JSON object in an `ndjson` (Newline Delimited JSON) file."
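, "\n", "\n", "Once the harvest below has finished, each line of the output file holds one version record as a JSON object. Here's a minimal sketch (assuming the default `oral-histories-metadata.ndjson` filename used by `harvest_works()`) for a quick check of the results:\n", "\n", "```python\n", "import json\n", "from pathlib import Path\n", "\n", "# Count the harvested version records and peek at the first title\n", "with Path(\"oral-histories-metadata.ndjson\").open() as f:\n", "    records = [json.loads(line) for line in f]\n", "print(len(records))\n", "print(records[0][\"title\"])\n", "```"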
] }, { "cell_type": "code", "execution_count": null, "id": "a8fc2569-141a-4749-808f-2bd8b8405467", "metadata": {}, "outputs": [], "source": [ "def get_total_results(params, headers):\n", " \"\"\"\n", " Get the total number of results for a search.\n", " \"\"\"\n", " these_params = params.copy()\n", " these_params[\"n\"] = 0\n", " response = s.get(\n", " \"https://api.trove.nla.gov.au/v3/result\", params=these_params, headers=headers\n", " )\n", " data = response.json()\n", " return int(data[\"category\"][0][\"records\"][\"total\"])\n", "\n", "\n", "def get_value(record, field, keys=[\"value\"]):\n", " \"\"\"\n", " Get the values of a field.\n", " Some fields are lists of dicts, if so use the `key` to get the value.\n", " \"\"\"\n", " value = record.get(field, [])\n", " if value and isinstance(value[0], dict):\n", " for key in keys:\n", " try:\n", " return [re.sub(r\"\\s+\", \" \", v[key]) for v in value]\n", " except KeyError:\n", " pass\n", " else:\n", " return value\n", "\n", "\n", "def merge_values(record, fields, keys=[\"value\"]):\n", " \"\"\"\n", " Merges values from multiple fields, removing any duplicates.\n", " \"\"\"\n", " values = []\n", " for field in fields:\n", " values += get_value(record, field, keys)\n", " # Remove duplicates and None value\n", " return list(set([v for v in values if v is not None]))\n", "\n", "\n", "def flatten_values(record, field, key=\"type\"):\n", " \"\"\"\n", " If a field has a value and type, return the values as strings with this format: 'type: value'\n", " \"\"\"\n", " flattened = []\n", " values = record.get(field, [])\n", " for value in values:\n", " if key in value:\n", " flattened.append(f\"{value[key]}: {value['value']}\")\n", " else:\n", " flattened.append(value[\"value\"])\n", " return flattened\n", "\n", "\n", "def flatten_identifiers(record):\n", " \"\"\"\n", " Get a list of control numbers from the identifier field and flatten the values.\n", " \"\"\"\n", " ids = {\n", " \"identifier\": [\n", " v\n", " for v in record.get(\"identifier\", [])\n", " if \"type\" in v and v[\"type\"] == \"control number\"\n", " ]\n", " }\n", " return flatten_values(ids, \"identifier\", \"source\")\n", "\n", "\n", "def get_fulltext_url(links):\n", " \"\"\"\n", " Loop through the identifiers to find a link to the full text version of the book.\n", " \"\"\"\n", " urls = []\n", " for link in links:\n", " if (\n", " \"linktype\" in link\n", " and link[\"linktype\"] == \"fulltext\"\n", " and \"nla.obj\" in link[\"value\"]\n", " ):\n", " url = re.sub(r\"^http\\b\", \"https\", link[\"value\"])\n", " link_text = link.get(\"linktext\", \"\")\n", " urls.append({\"url\": url, \"link_text\": link_text})\n", " return urls\n", "\n", "\n", "def get_catalogue_url(links):\n", " \"\"\"\n", " Loop through the identifiers to find a link to the NLA catalogue.\n", " \"\"\"\n", " for link in links:\n", " if (\n", " \"linktype\" in link\n", " and link[\"linktype\"] == \"notonline\"\n", " and \"nla.cat\" in link[\"value\"]\n", " ):\n", " return link[\"value\"]\n", " return \"\"\n", "\n", "\n", "def has_fulltext_link(links):\n", " \"\"\"\n", " Check if a list of identifiers includes a fulltext url pointing to an NLA resource.\n", " \"\"\"\n", " for link in links:\n", " if (\n", " \"linktype\" in link\n", " and link[\"linktype\"] == \"fulltext\"\n", " and \"nla.obj\" in link[\"value\"]\n", " ):\n", " return True\n", "\n", "\n", "def has_holding(holdings, nucs):\n", " \"\"\"\n", " Check if a list of holdings includes one of the supplied nucs.\n", " \"\"\"\n", " for holding in 
holdings:\n", " if holding.get(\"nuc\") in nucs:\n", " return True\n", "\n", "\n", "def get_digitised_versions(work):\n", " \"\"\"\n", " Get the versions from the given work that have a fulltext url pointing to an NLA resource\n", " in the `identifier` field.\n", " \"\"\"\n", " versions = []\n", " for version in work[\"version\"]:\n", " if \"identifier\" in version and has_fulltext_link(version[\"identifier\"]):\n", " versions.append(version)\n", " return versions\n", "\n", "\n", "def get_nuc_versions(work, nucs=[\"ANL\", \"ANL:DL\"]):\n", " \"\"\"\n", " Get the versions from the given work that are held by the NLA.\n", " \"\"\"\n", " versions = []\n", " for version in work[\"version\"]:\n", " if \"holding\" in version and has_holding(version[\"holding\"], [\"ANL\", \"ANL:DL\"]):\n", " versions.append(version)\n", " return versions\n", "\n", "\n", "def harvest_works(\n", " params,\n", " filter_by=\"url\",\n", " nucs=[\"ANL\", \"ANL:DL\"],\n", " output=\"oral-histories-metadata.ndjson\",\n", "):\n", " \"\"\"\n", " Harvest metadata relating to digitised works.\n", " The filter_by parameter selects records for inclusion in the dataset, options:\n", " * url -- only include versions that have an NLA fulltext url\n", " * nuc -- only include versions that have an NLA nuc (ANL or ANL:DL)\n", " \"\"\"\n", " default_params = {\n", " \"category\": \"all\",\n", " \"bulkHarvest\": \"true\",\n", " \"n\": 100,\n", " \"encoding\": \"json\",\n", " \"include\": [\"links\", \"workversions\", \"holdings\"],\n", " }\n", " params.update(default_params)\n", " headers = {\"X-API-KEY\": API_KEY}\n", " total = get_total_results(params, headers)\n", " start = \"*\"\n", " with Path(output).open(\"w\") as ndjson_file:\n", " with tqdm(total=total) as pbar:\n", " while start:\n", " params[\"s\"] = start\n", " response = s.get(\n", " \"https://api.trove.nla.gov.au/v3/result\",\n", " params=params,\n", " headers=headers,\n", " )\n", " data = response.json()\n", " items = data[\"category\"][0][\"records\"][\"item\"]\n", " for item in items:\n", " for category, record in item.items():\n", " if category == \"work\":\n", " if filter_by == \"nuc\":\n", " versions = get_nuc_versions(record, nucs)\n", " else:\n", " versions = get_digitised_versions(record)\n", " for version in versions:\n", " for sub_version in version[\"record\"]:\n", " metadata = sub_version[\"metadata\"][\"dc\"]\n", " # Sometimes fulltext identifiers are only available on the\n", " # version rather than the sub version. So we'll look in the\n", " # sub version first, and if they're not there use the url from\n", " # the version.\n", " # Sometimes there are multiple fulltext urls associated with a version:\n", " # eg a collection page and a publication. If so add records for both urls.\n", " # They could end up pointing to the same digitised publication, but\n", " # we can sort that out later. 
Aim here is to try and not miss any possible\n", " # routes to digitised publications!\n", " urls = get_fulltext_url(\n", " metadata.get(\"identifier\", [])\n", " )\n", " if len(urls) == 0:\n", " urls = get_fulltext_url(\n", " version.get(\"identifier\", [])\n", " )\n", " if len(urls) == 0 and filter_by == \"nuc\":\n", " urls = [{\"url\": \"\", \"link_text\": \"\"}]\n", " for url in urls:\n", " work = {\n", " # This is not the full set of available fields,\n", " # adjust as necessary.\n", " \"title\": get_value(metadata, \"title\"),\n", " \"work_url\": record.get(\"troveUrl\"),\n", " \"work_type\": record.get(\"type\", []),\n", " \"contributor\": merge_values(\n", " metadata,\n", " [\"creator\", \"contributor\"],\n", " [\"value\", \"name\"],\n", " ),\n", " \"publisher\": get_value(\n", " metadata, \"publisher\"\n", " ),\n", " \"date\": merge_values(\n", " metadata, [\"date\", \"issued\"]\n", " ),\n", " # Using merge here because I've noticed some duplicate values\n", " \"type\": merge_values(metadata, [\"type\"]),\n", " \"format\": get_value(metadata, \"format\"),\n", " \"rights\": merge_values(\n", " metadata, [\"rights\", \"licenseRef\"]\n", " ),\n", " \"language\": get_value(metadata, \"language\"),\n", " \"extent\": get_value(metadata, \"extent\"),\n", " \"subject\": merge_values(\n", " metadata, [\"subject\"]\n", " ),\n", " \"spatial\": get_value(metadata, \"spatial\"),\n", " # Flattened type/value\n", " \"is_part_of\": flatten_values(\n", " metadata, \"isPartOf\"\n", " ),\n", " # Only get control numbers and flatten\n", " \"identifier\": flatten_identifiers(metadata),\n", " \"fulltext_url\": url[\"url\"],\n", " \"fulltext_url_text\": url[\"link_text\"],\n", " \"catalogue_url\": get_catalogue_url(\n", " metadata[\"identifier\"]\n", " )\n", " # Could also add in data from bibliographicCitation\n", " # Although the types used in citations seem to vary by work and format.\n", " }\n", " ndjson_file.write(f\"{json.dumps(work)}\\n\")\n", " # The nextStart parameter is used to get the next page of results.\n", " # If there's no nextStart then it means we're on the last page of results.\n", " try:\n", " start = data[\"category\"][0][\"records\"][\"nextStart\"]\n", " except KeyError:\n", " start = None\n", " pbar.update(len(items))" ] }, { "cell_type": "code", "execution_count": null, "id": "42e0e826-f218-4e1c-97a9-8a85a3263cb3", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "# Do the harvest!\n", "params = {\n", " \"q\": 'nuc:ANL OR nuc:\"ANL:DL\"',\n", " \"l-format\": \"Sound/Interview, lecture, talk\",\n", "}\n", "\n", "harvest_works(params, filter_by=\"nuc\")" ] }, { "cell_type": "markdown", "id": "ca071ac3-895a-4cc6-85fc-e1a867d3d917", "metadata": {}, "source": [ "## Scrape additional metadata from the audio player\n", "\n", "If the oral histories are digitised, they'll have a `fulltext_url` value which points to the Trove audio player. We can use this url to extract some additional metadata. See [HOW TO: Scrape metadata from the Trove audio player](https://tdg.glam-workbench.net/other-digitised-resources/how-to/scrape-metadata-audio-player.html) in the *Trove Data Guide*. The audio player also uses a Javascript file that lists details of sessions, audio files, and whether there is an associated summary or transcript. By downloading the Javascript file we can add this information to the dataset." 
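, "\n", "\n", "The Javascript data file follows a predictable URL pattern based on the oral history's `nla.obj` identifier. Here's a minimal sketch of retrieving it directly (the identifier below is illustrative only; `get_download_data()` in the next cell does the same thing using the cached session):\n", "\n", "```python\n", "import re\n", "\n", "import requests\n", "\n", "# A hypothetical fulltext_url pointing to the Trove audio player\n", "url = \"https://nla.gov.au/nla.obj-123456789\"\n", "obj_id = re.search(r\"nla\\.obj-\\d+\", url).group(0)\n", "# Session, file, summary and transcript details are embedded in this JS file\n", "response = requests.get(f\"https://nla.gov.au/tarkine/listen/transcript/{obj_id}.js\")\n", "```"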
] }, { "cell_type": "code", "execution_count": null, "id": "d58da952-6a20-442e-b3fc-faa0478de503", "metadata": {}, "outputs": [], "source": [ "def scrape_metadata(url):\n", " \"\"\"\n", " Scrape metadata about an oral history from the audio player page.\n", " \"\"\"\n", " response = s.get(url)\n", " # If this is a collection page you'll get a 404\n", " if response.status_code != 200:\n", " return {}\n", " soup = BeautifulSoup(response.text)\n", " # Get the metadata container\n", " details = soup.find(\"div\", class_=\"workdetails\")\n", " if not details:\n", " return {}\n", " # Get link to NLA catalogue\n", " catalogue = details.find(\"section\", class_=\"catalogue\")\n", " catalogue_link = catalogue.find(\"a\", href=re.compile(\"nla.cat-vn\"))[\"href\"]\n", " # Get oral history id\n", " oral_history_id = \"\"\n", " for string in catalogue.stripped_strings:\n", " if string.startswith(\"ORAL TRC\"):\n", " oral_history_id = string\n", " # Get extent, description and notes\n", " extent = []\n", " description = []\n", " for section in details.find_all(\"section\", class_=\"extent\"):\n", " if section.string.startswith(\"Recorded\"):\n", " description.append(section.string.strip())\n", " else:\n", " extent.append(section.string)\n", " try:\n", " notes = details.find(\"section\", class_=\"notes\").string\n", " except AttributeError:\n", " notes = \"\"\n", " # Get contributors and role\n", " contributors = []\n", " for div in details.find_all(\"div\", class_=\"contributor\"):\n", " role = div.find(\"span\", class_=\"role\")\n", " if role:\n", " contributors.append(f\"{list(div.stripped_strings)[0]} {role.string}\")\n", " else:\n", " contributors.append(f\"{list(div.stripped_strings)[0]}\")\n", " return {\n", " \"catalogue_url\": catalogue_link,\n", " \"identifier\": oral_history_id,\n", " \"description\": description,\n", " \"extent\": extent,\n", " \"notes\": notes,\n", " \"contributor\": contributors,\n", " }\n", "\n", "\n", "def get_download_data(url):\n", " \"\"\"\n", " Get information about sessions and files from a javascript file used by the audio player.\n", " \"\"\"\n", " id = re.search(r\"(nla\\.obj\\-\\d+)\", url).group(1)\n", " response = s.get(f\"https://nla.gov.au/tarkine/listen/transcript/{id}.js\")\n", " if response.status_code != 200:\n", " return {}\n", " # Extract the JSON data embedded in the JS function\n", " data = re.search(r\"define\\((\\{.*)\\)\", response.text, re.DOTALL).group(1)\n", " # print(data)\n", " json_data = json.loads(data)\n", " return json_data\n", "\n", "\n", "def enrich_metadata(\n", " input=\"oral-histories-metadata.ndjson\",\n", " output=\"oral-histories-metadata-files.ndjson\",\n", "):\n", " \"\"\"\n", " Enrich records for online oral histories by extracting additional metadata from the audio player.\n", " \"\"\"\n", " total = sum(1 for _ in open(input))\n", " with Path(output).open(\"w\") as ndjson_out:\n", " with Path(input).open(\"r\") as ndjson_in:\n", " for line in tqdm(ndjson_in, total=total):\n", " work = json.loads(line)\n", " if url := work[\"fulltext_url\"]:\n", " # Scrape additional metadata from audio player UI\n", " metadata = scrape_metadata(url)\n", " if metadata:\n", " work[\"catalogue_url\"] = metadata[\"catalogue_url\"]\n", " work[\"identifier\"].append(metadata[\"identifier\"])\n", " work[\"description\"] = metadata[\"description\"]\n", " work[\"notes\"] = [metadata[\"notes\"]]\n", " work[\"extent\"] = list(set(work[\"extent\"] + metadata[\"extent\"]))\n", " if metadata[\"contributor\"]:\n", " work[\"contributor\"] = 
metadata[\"contributor\"]\n", " # Get data about sessions, files, and transcripts from JS file\n", " downloads = get_download_data(url)\n", " if downloads:\n", " work[\"summary\"] = downloads[\"anySummary\"]\n", " work[\"transcript\"] = downloads[\"anyTranscript\"]\n", " sessions = downloads[\"sessionFiles\"]\n", " work[\"sessions\"] = len(sessions)\n", " file_ids = []\n", " duration = 0\n", " # Loop through all the sessions in this oral history\n", " for session in sessions:\n", " try:\n", " file = session[\"files\"][0]\n", " except KeyError:\n", " # print(url)\n", " pass\n", " else:\n", " # Add id for this session's audio files\n", " file_ids.append(\n", " re.search(r\"nla\\.obj-\\d+\", file[\"href\"]).group(\n", " 0\n", " )\n", " )\n", " # Add the duration of this file to the total duration\n", " duration += file[\"duration\"]\n", " work[\"duration\"] = duration\n", " work[\"audio_file_ids\"] = file_ids\n", " ndjson_out.write(f\"{json.dumps(work)}\\n\")\n", " # If there's no metadata the fulltext url is probably giving a 404.\n", " # This is the case for 'collection' pages that don't actually seem to exist.\n", " # There are also a couple of fulltext urls that go to the image viewer.\n", " # These records are dropped from the dataset, but urls displayed for checking.\n", " else:\n", " print(work[\"fulltext_url\"])\n", " else:\n", " ndjson_out.write(f\"{json.dumps(work)}\\n\")" ] }, { "cell_type": "code", "execution_count": null, "id": "ff19f5f0-e69b-4d31-910a-1cd23b1b65a1", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "enrich_metadata()" ] }, { "cell_type": "markdown", "id": "cf9dcab8-5f7e-44f2-a722-5216e57c7cb1", "metadata": {}, "source": [ "## Merge duplicate records\n", "\n", "The harvested data will contain duplicates. Some duplicates will be a result of splitting apart all the version groupings, but others are just in Trove to begin with. These duplicates are not *exactly* the same – they refer to the same thing, but can contain slightly different metadata. We want to combine them without losing any of this metadata. The strategy for this is to divide the columns into two sets – columns which we know only have one value and don't need to be merged, and columns that could contain multiple values that we want to deduplicate and merge, then we can:\n", "\n", "- create a deduplicated dataframe from the first set of columns\n", "- process the second set of columns by merging duplicate values and saving into a new dataframe\n", "- combine the dataframes using a shared, unique identifier\n", "\n", "The harvested data includes oral histories that haven't been digitised as well as those that have. We need to handle these separately, as the mix of columns and identifiers will be different. Once the digitised and not-digitised records are deduplicated, we can join them back together again." 
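, "\n", "\n", "As a hypothetical illustration, the `merge_column()` function defined below combines the values from a group of duplicate records into a single pipe-separated string:\n", "\n", "```python\n", "# Two duplicate records with slightly different title values (made-up examples)\n", "merge_column([[\"Interview with A\"], [\"Interview with A [sound recording]\"]])\n", "# 'Interview with A | Interview with A [sound recording]'\n", "```"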
] }, { "cell_type": "code", "execution_count": null, "id": "00d2f7b9-8b7b-4a88-b190-3a9edf823d3c", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "def merge_column(columns):\n", "    \"\"\"\n", "    Combine values from multiple columns, removing duplicates, and returning as a pipe-separated string.\n", "    \"\"\"\n", "    values = []\n", "    for value in columns:\n", "        if isinstance(value, list):\n", "            values += [str(v) for v in value if v]\n", "        elif value:\n", "            values.append(str(value))\n", "    return \" | \".join(sorted(set(values)))\n", "\n", "\n", "def merge_records(df, int_columns, keep_columns, merge_columns, link_column):\n", "    \"\"\"\n", "    Remove duplicate records in the supplied dataset by:\n", "    - creating a deduplicated dataframe with the columns in `keep_columns`\n", "    - merging values of columns in the `merge_columns` list, creating a new dataframe for each column\n", "    - combining the deduplicated and merged column dataframes, linking on `link_column`\n", "    \"\"\"\n", "    # Before I get rid of NaNs, set int cols to 0\n", "    for int_col in int_columns:\n", "        df[int_col] = df[int_col].fillna(0)\n", "        df[int_col] = df[int_col].astype(\"Int64\")\n", "    # Get rid of NaNs so they don't cause problems when merging\n", "    df.fillna(\"\", inplace=True)\n", "\n", "    # Add base dataset with columns that will always have only one value\n", "    dfs = [df[keep_columns].drop_duplicates()]\n", "\n", "    # Merge values from each column in turn, creating a new dataframe from each\n", "    for column in merge_columns:\n", "        dfs.append(df.groupby([link_column])[column].apply(merge_column).reset_index())\n", "\n", "    # Merge all the individual dataframes into one, linking on the `link_column` value\n", "    df_merged = reduce(\n", "        lambda left, right: pd.merge(left, right, on=[link_column], how=\"left\"), dfs\n", "    )\n", "    return df_merged" ] }, { "cell_type": "markdown", "id": "f707cc4f-cb7a-4a99-8ed2-49092440716a", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "First we load the harvested metadata." ] }, { "cell_type": "code", "execution_count": null, "id": "6d2c2d79-015d-4bcc-9cfa-4cdf501fbdd1", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "df = pd.read_json(\"oral-histories-metadata-files.ndjson\", lines=True)" ] }, { "cell_type": "markdown", "id": "be8c65a3-6f2f-4c60-bbb2-fd6607ef9af6", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "Then we create lists of columns that have a single value, and those that can have multiple values and will be merged."
] }, { "cell_type": "code", "execution_count": null, "id": "34f6decf-bd03-4978-ae8d-6bac17f9f9df", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "# Not digitised\n", "\n", "int_columns = [\"summary\", \"transcript\", \"sessions\", \"duration\"]\n", "keep_columns_nd = [\n", " \"fulltext_url\",\n", " \"work_url\",\n", " \"summary\",\n", " \"transcript\",\n", " \"sessions\",\n", " \"duration\",\n", "]\n", "merge_columns_nd = [\n", " \"title\",\n", " \"work_type\",\n", " \"contributor\",\n", " \"publisher\",\n", " \"date\",\n", " \"type\",\n", " \"format\",\n", " \"extent\",\n", " \"language\",\n", " \"subject\",\n", " \"spatial\",\n", " \"is_part_of\",\n", " \"identifier\",\n", " \"rights\",\n", " \"fulltext_url_text\",\n", " \"catalogue_url\",\n", " \"audio_file_ids\",\n", "]\n", "\n", "# Digitised oral histories\n", "\n", "keep_columns_d = [\"fulltext_url\", \"summary\", \"transcript\", \"sessions\", \"duration\"]\n", "merge_columns_d = [\n", " \"title\",\n", " \"work_url\",\n", " \"work_type\",\n", " \"contributor\",\n", " \"publisher\",\n", " \"date\",\n", " \"type\",\n", " \"format\",\n", " \"extent\",\n", " \"language\",\n", " \"subject\",\n", " \"spatial\",\n", " \"is_part_of\",\n", " \"identifier\",\n", " \"rights\",\n", " \"fulltext_url_text\",\n", " \"catalogue_url\",\n", " \"audio_file_ids\",\n", "]" ] }, { "cell_type": "markdown", "id": "a0f7d17e-cf1a-4130-8499-cd39ae2b5622", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "We merge the digitised and not-digitised oral histories separately, then combine the results." ] }, { "cell_type": "code", "execution_count": null, "id": "d320afda-8595-49f5-b4b1-78846ffc7c89", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "df_merged_digitised = merge_records(\n", " df.copy().loc[df[\"fulltext_url\"] != \"\"],\n", " int_columns,\n", " keep_columns_d,\n", " merge_columns_d,\n", " \"fulltext_url\",\n", ")\n", "\n", "df_merged_not_digitised = merge_records(\n", " df.copy().loc[df[\"fulltext_url\"] == \"\"],\n", " int_columns,\n", " keep_columns_nd,\n", " merge_columns_nd,\n", " \"work_url\",\n", ")\n", "\n", "df_merged = pd.concat([df_merged_not_digitised, df_merged_digitised])" ] }, { "cell_type": "markdown", "id": "0a684367-38df-42bb-9bec-c2fe33aef4f9", "metadata": { "editable": true, "raw_mimetype": "", "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "Save the merged dataset as a CSV file." 
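, "\n", "\n", "Here's a small sketch (assuming the `trove-oral-histories.csv` file created in the next cell) for reloading the dataset and checking how many oral histories link to the online audio player:\n", "\n", "```python\n", "import pandas as pd\n", "\n", "# Reload the saved dataset; multi-valued fields are pipe-separated strings\n", "df_oh = pd.read_csv(\"trove-oral-histories.csv\")\n", "print(df_oh.shape[0], \"oral histories\")\n", "print(df_oh[\"fulltext_url\"].notna().sum(), \"with a link to the audio player\")\n", "```"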
] }, { "cell_type": "code", "execution_count": null, "id": "aee76576-9743-4dfb-b355-c2fca6de0a83", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "df_merged[\n", " [\n", " \"title\",\n", " \"contributor\",\n", " \"publisher\",\n", " \"date\",\n", " \"type\",\n", " \"format\",\n", " \"extent\",\n", " \"language\",\n", " \"subject\",\n", " \"spatial\",\n", " \"is_part_of\",\n", " \"identifier\",\n", " \"rights\",\n", " \"work_url\",\n", " \"work_type\",\n", " \"fulltext_url\",\n", " \"fulltext_url_text\",\n", " \"catalogue_url\",\n", " \"summary\",\n", " \"transcript\",\n", " \"sessions\",\n", " \"duration\",\n", " ]\n", "].sort_values(\"title\").to_csv(\"trove-oral-histories.csv\", index=False)" ] }, { "cell_type": "code", "execution_count": null, "id": "c39dc82a-654a-46d8-86ab-9d27dd8331ae", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "# FOR TESTING ONLY -- PLEASE IGNORE\n", "\n", "md_output = Path(\"test-metadata.ndjson\")\n", "md_output_head = Path(\"test-metadata-head.ndjson\")\n", "md_files_output = Path(\"test-metadata-files.ndjson\")\n", "\n", "# Do the API harvest!\n", "params = {\n", " \"q\": 'nuc:ANL OR nuc:\"ANL:DL\"',\n", " \"l-format\": \"Sound/Interview, lecture, talk\",\n", "}\n", "\n", "with s.cache_disabled():\n", " harvest_works(params, filter_by=\"nuc\", output=md_output)\n", "\n", "# Create a subset for enriching\n", "with md_output.open(\"r\") as input_file:\n", " head = [next(input_file) for _ in range(100)]\n", "\n", "with md_output_head.open(\"w\") as head_file:\n", " for line in head:\n", " head_file.write(line)\n", "\n", "# Enrich\n", "with s.cache_disabled():\n", " enrich_metadata(input=md_output_head, output=md_files_output)\n", "\n", "df = pd.read_json(md_files_output, lines=True)\n", "\n", "df_merged_not_digitised = merge_records(\n", " df.copy().loc[df[\"fulltext_url\"] == \"\"],\n", " int_columns,\n", " keep_columns_nd,\n", " merge_columns_nd,\n", " \"work_url\",\n", ")\n", "\n", "df_merged_digitised = merge_records(\n", " df.copy().loc[df[\"fulltext_url\"] != \"\"],\n", " int_columns,\n", " keep_columns_d,\n", " merge_columns_d,\n", " \"fulltext_url\",\n", ")\n", "\n", "df_merged = pd.concat([df_merged_not_digitised, df_merged_digitised])\n", "\n", "md_output.unlink()\n", "md_output_head.unlink()\n", "md_files_output.unlink()" ] }, { "cell_type": "markdown", "id": "48b0f40a-18c5-413e-a975-477a9d6d623f", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "----\n", "\n", "Created by [Tim Sherratt](https://timsherratt.org) for the [GLAM Workbench](https://glam-workbench.net/)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.2" }, "rocrate": { "author": [ { "name": "Sherratt, Tim", "orcid": "https://orcid.org/0000-0001-7956-4498" } ], "description": "Many of the National Library of Australia's oral histories are being made available online. 
This notebook harvests metadata describing the oral history collection from Trove and saves the results as a CSV file for further exploration.", "name": "Harvest oral histories metadata", "result": "https://github.com/GLAM-Workbench/trove-oral-histories-data/blob/main/trove-oral-histories.csv" } }, "nbformat": 4, "nbformat_minor": 5 }