{ "cells": [ { "cell_type": "markdown", "id": "electoral-stocks", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "# Harvest ABC Radio National records from Trove\n", "\n", "Trove harvests details of programs and segments broadcast on ABC Radio National. You can find them by [searching](https://trove.nla.gov.au/search/category/music?keyword=nuc%3A%22ABC%3ARN%22) for `nuc:\"ABC:RN\"` in the Music & Audio category. The records include basic metadata such as titles, dates, and contributors, but not full transcripts or audio.\n", "\n", "This notebook harvests metadata describing ABCRN programs and segments using the Trove API. Note that no new records seem to have been added to the data since early 2022.\n", "\n", "As of December 2023, there are **427,141** records (after removing duplicates) from about **163 programs** (the actual number of programs is lower than this, as the names used for some programs vary). See [this notebook](explore-abcrn-data.ipynb) for some examples of how you can start exploring the data.\n", "\n", "The harvested data is available in this GitHub repository. 
You can download the full dataset as a **340MB [NDJSON file](https://github.com/GLAM-Workbench/trove-abcrn-data/blob/main/abcrn-metadata.ndjson)** (with a separate JSON object for each record, separated by line breaks) and as a **216MB [CSV file](https://github.com/GLAM-Workbench/trove-abcrn-data/blob/main/abcrn-metadata.csv)** (with lists saved as pipe-separated strings).\n", "\n", "For convenience, I've also created separate CSV files for the programs with the most records:\n", "\n", "* [RN Breakfast](https://github.com/GLAM-Workbench/trove-abcrn-data/blob/main/abcrn-breakfast-metadata.csv)\n", "* [RN Drive](https://github.com/GLAM-Workbench/trove-abcrn-data/blob/main/abcrn-drive-metadata.csv)\n", "* [AM](https://github.com/GLAM-Workbench/trove-abcrn-data/blob/main/abcrn-am-metadata.csv)\n", "* [PM](https://github.com/GLAM-Workbench/trove-abcrn-data/blob/main/abcrn-pm-metadata.csv)\n", "* [The World Today](https://github.com/GLAM-Workbench/trove-abcrn-data/blob/main/abcrn-worldtoday-metadata.csv)\n", "* [Late Night Live](https://github.com/GLAM-Workbench/trove-abcrn-data/blob/main/abcrn-latenight-metadata.csv)\n", "* [Life Matters](https://github.com/GLAM-Workbench/trove-abcrn-data/blob/main/abcrn-lifematters-metadata.csv)\n", "* [The Science Show](https://github.com/GLAM-Workbench/trove-abcrn-data/blob/main/abcrn-scienceshow-metadata.csv)\n", "\n", "There's also a [harvest from 2016](https://github.com/wragge/radio-national-data) available in this repository.\n", "\n", "## Data fields\n", "\n", "Any of the fields other than `work_id` and `version_id` might be empty, though in most cases there should at least be values for `title`, `date`, `creator`, `contributor` and `isPartOf`.\n", "\n", "* `work_id` – identifier for the containing work in Trove (you can use this to create a URL to the item)\n", "* `version_id` – an identifier for the version within the work\n", "* `title` – title for the program or segment\n", "* `isPartOf` – name of the program this is a part 
of\n", "* `date` – ISO-formatted date\n", "* `creator` – usually just the ABC\n", "* `contributor` – a list of names of those involved, such as the host, reporter or guest\n", "* `type` – list of types\n", "* `format` – list of formats\n", "* `abstract` – text providing a summary of the program or segment (may include multiple values)\n", "* `fulltext_url` – link to the page on the ABC website where you can find more information\n", "* `thumbnail_url` – link to a related thumbnail image on the ABC website\n", "* `notonline_url` – purpose unclear" ] }, { "cell_type": "markdown", "id": "simplified-brighton", "metadata": {}, "source": [ "## Import what we need" ] }, { "cell_type": "code", "execution_count": 1, "id": "cooked-vietnamese", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [
"import json\n",
"import os\n",
"from datetime import datetime\n",
"from pathlib import Path\n",
"\n",
"import pandas as pd\n",
"import requests_cache\n",
"from dotenv import load_dotenv\n",
"from requests.adapters import HTTPAdapter\n",
"from requests.packages.urllib3.util.retry import Retry\n",
"from tqdm.auto import tqdm\n",
"\n",
"# Cache responses and retry requests that fail with gateway errors\n",
"s = requests_cache.CachedSession()\n",
"retries = Retry(total=5, backoff_factor=1, status_forcelist=[502, 503, 504])\n",
"s.mount(\"https://\", HTTPAdapter(max_retries=retries))\n",
"s.mount(\"http://\", HTTPAdapter(max_retries=retries))\n",
"\n",
"load_dotenv()"
] }, { "cell_type": "code", "execution_count": 2, "id": "intimate-momentum", "metadata": { "tags": [] }, "outputs": [], "source": [
"# Insert your Trove API key\n",
"API_KEY = \"YOUR API KEY\"\n",
"\n",
"# Use api key value from environment variables if it is available\n",
"if os.getenv(\"TROVE_API_KEY\"):\n",
"    API_KEY = os.getenv(\"TROVE_API_KEY\")"
] }, { "cell_type": "markdown", "id": "advanced-diary", "metadata": {}, "source": [ "## Define some functions" ] }, { "cell_type": "code", "execution_count": 5, "id": "gentle-arabic", "metadata": { "tags": [] }, "outputs": [], "source": [
"def get_total(params):\n",
"    \"\"\"\n",
"    Get the total number of results for a search.\n",
"    \"\"\"\n",
"    params[\"n\"] = 0\n",
"    response = s.get(\"https://api.trove.nla.gov.au/v3/result\", params=params)\n",
"    data = response.json()\n",
"    return int(data[\"category\"][0][\"records\"][\"total\"])\n",
"\n",
"\n",
"def get_metadata_source(record):\n",
"    \"\"\"\n",
"    Get the metadata source of a record. The value may be a dict or a plain string.\n",
"    \"\"\"\n",
"    try:\n",
"        source = record[\"metadataSource\"][\"value\"]\n",
"    except TypeError:\n",
"        source = record[\"metadataSource\"]\n",
"    return source\n",
"\n",
"\n",
"def extract_values(value, key=\"value\"):\n",
"    \"\"\"\n",
"    Some fields mix dicts and lists. Try to extract values from dicts and return only lists.\n",
"    \"\"\"\n",
"    values = []\n",
"    value_list = [v for v in value if v]\n",
"    for v in value_list:\n",
"        try:\n",
"            values.append(v[key].strip())\n",
"        except (TypeError, KeyError):\n",
"            values.append(v.strip())\n",
"    return values\n",
"\n",
"\n",
"def get_links(identifiers):\n",
"    \"\"\"\n",
"    Flatten the identifiers list of dicts into a dict with linktype as key.\n",
"    \"\"\"\n",
"    links = {}\n",
"    for link in identifiers:\n",
"        try:\n",
"            links[f'{link[\"linktype\"]}_url'] = link[\"value\"]\n",
"        except (TypeError, KeyError):\n",
"            pass\n",
"    return links\n",
"\n",
"\n",
"def harvest(output_file=None, year=None):\n",
"    \"\"\"\n",
"    Harvest ABC RN records from the Trove API and save them as NDJSON in the data directory.\n",
"    \"\"\"\n",
"    Path(\"data\").mkdir(exist_ok=True)\n",
"    if not output_file:\n",
"        output_file = f'abcrn-{datetime.now().strftime(\"%Y%m%d\")}.ndjson'\n",
"    output_file = Path(\"data\", output_file)\n",
"    params = {\n",
"        \"q\": 'nuc:\"ABC:RN\"',\n",
"        \"category\": \"music\",\n",
"        \"include\": \"workversions\",\n",
"        \"n\": 100,\n",
"        \"bulkHarvest\": \"true\",\n",
"        \"encoding\": \"json\",\n",
"        \"key\": API_KEY,\n",
"    }\n",
"    if year:\n",
"        params[\"l-year\"] = year\n",
"        params[\"l-decade\"] = year[:3]\n",
"    start = \"*\"\n",
"    total = get_total(params.copy())\n",
"\n",
"    with output_file.open(\"w\") as data_file:\n",
"        with tqdm(total=total) as pbar:\n",
"            while start:\n",
"                params[\"s\"] = start\n",
"                response = s.get(\n",
"                    \"https://api.trove.nla.gov.au/v3/result\", params=params\n",
"                )\n",
"                data = response.json()\n",
"                # Loop through the work records\n",
"                records = data[\"category\"][0][\"records\"][\"work\"]\n",
"                for record in records:\n",
"                    # Now loop through the version records\n",
"                    for version in record[\"version\"]:\n",
"                        # Sometimes versions can themselves contain multiple records and ids\n",
"                        # First we'll try splitting the ids in case there are multiple values\n",
"                        ids = version[\"id\"].split()\n",
"                        # Then we'll try looping through any sub-version records\n",
"                        for i, subr in enumerate(version[\"record\"]):\n",
"                            # Get the metadata source so we can filter out any records we don't want\n",
"                            subv = subr[\"metadata\"][\"dc\"]\n",
"                            source = get_metadata_source(subr)\n",
"                            if source == \"ABC:RN\":\n",
"                                # Add work id to the record\n",
"                                metadata = {\n",
"                                    \"work_id\": record[\"id\"],\n",
"                                    \"version_id\": ids[i],\n",
"                                    \"title\": extract_values(subv[\"title\"]),\n",
"                                    \"date\": extract_values(subv[\"date\"]),\n",
"                                    \"isPartOf\": extract_values(subv[\"isPartOf\"]),\n",
"                                    \"creator\": extract_values(\n",
"                                        subv[\"creator\"], key=\"name\"\n",
"                                    ),\n",
"                                    \"contributor\": extract_values(subv[\"contributor\"]),\n",
"                                    \"abstract\": extract_values(subv[\"abstract\"]),\n",
"                                    \"type\": extract_values(subv[\"type\"]),\n",
"                                    \"format\": extract_values(subv[\"format\"]),\n",
"                                }\n",
"                                # Get links by flattening the identifiers field and add to record\n",
"                                links = get_links(subv[\"identifier\"])\n",
"                                metadata.update(links)\n",
"                                # Write the record to the file as a single line of JSON\n",
"                                data_file.write(f\"{json.dumps(metadata)}\\n\")\n",
"                try:\n",
"                    start = data[\"category\"][0][\"records\"][\"nextStart\"]\n",
"                except KeyError:\n",
"                    start = None\n",
"                pbar.update(len(records))"
] }, { "cell_type": "markdown", "id": "eastern-trailer", "metadata": {}, "source": [ "## Harvest the data!" 
] }, { "cell_type": "code", "execution_count": 6, "id": "engaged-salon", "metadata": { "tags": [ "nbval-skip" ] }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "86aa2d98805b4db4900a7031dd65ec1e", "version_major": 2, "version_minor": 0 }, "text/plain": [ " 0%| | 0/438838 [00:00, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "output_file = f'abcrn-{datetime.now().strftime(\"%Y%m%d\")}.ndjson'\n", "\n", "harvest(output_file=output_file)" ] }, { "cell_type": "markdown", "id": "opened-lindsay", "metadata": {}, "source": [ "## Remove duplicate records\n", "\n", "How many records have we harvested? Let's load the `ndjson` file into a dataframe and explore." ] }, { "cell_type": "code", "execution_count": 7, "id": "uniform-parks", "metadata": { "tags": [ "nbval-skip" ] }, "outputs": [ { "data": { "text/html": [ "
\n",
"| | work_id | version_id | title | date | isPartOf | creator | contributor | abstract | type | format | fulltext_url | thumbnail_url | notonline_url |\n",
"|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n",
"| 0 | 14882967 | 195385238 | [RU 486] | [1997-09-22] | [ABC Radio National. Health Report] | [Australian Broadcasting Corporation. Radio Na... | [Dr Norman Swan] | [What politicians believe is good for women's ... | [Sound, Transcript, Radio Broadcast] | [text/html, Transcript] | http://www.abc.net.au/radionational/programs/h... | http://www.abc.net.au/radionational/image/3699... | NaN |\n",
"| 1 | 151422764 | 195400866 | [Copyright and the courts] | [2011-05-12] | [ABC Radio National. Law Report] | [Australian Broadcasting Corporation. Radio Na... | [David, Sabiene Heindl, Jock Given, Ross Steve... | [There's an on-going courtroom war between cop... | [Sound, Transcript, Radio Broadcast] | [Audio, Transcript] | http://www.abc.net.au/radionational/programs/l... | http://www.abc.net.au/radionational/image/3699... | NaN |\n",
"| 2 | 15426408 | 206893518 | [The Law Report] | [2014-03-25] | [ABC Radio National. RN Breakfast] | [Australian Broadcasting Corporation. Radio Na... | [Damien Carrick, Fran Kelly] | [Disability rights lawyer and endurance athlet... | [Sound, Transcript, Radio Broadcast] | [text/html] | http://www.abc.net.au/radionational/programs/b... | http://www.abc.net.au/radionational/image/3699... | NaN |\n",
"| 3 | 15426408 | 206591783 | [The Law Report] | [2014-02-11] | [ABC Radio National. RN Breakfast] | [Australian Broadcasting Corporation. Radio Na... | [Damien Carrick, Fran Kelly] | [Professor Andrew Ashworth, one of the United ... | [Sound, Transcript, Radio Broadcast] | [text/html] | http://www.abc.net.au/radionational/programs/b... | http://www.abc.net.au/radionational/image/3699... | NaN |\n",
"| 4 | 156082218 | 209405411 | [East Timor Since Independence] | [2006-06-29] | [ABC Radio National. Rear Vision] | [Australian Broadcasting Corporation. Radio Na... | [Dr Dennis Shoesmith, Rob Wesley Smith, James ... | [What has happened in East Timor since indepen... | [Text, Transcript, Radio Broadcast] | [Audio] | http://www.abc.net.au/radionational/programs/r... | http://www.abc.net.au/radionational/image/3699... | NaN |\n", "