{ "cells": [ { "cell_type": "markdown", "id": "5b7277b7-1e6a-47b9-ae17-56f0d973c9c1", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "# Download summaries and transcripts from oral histories\n", "\n", "If oral histories have summaries or transcripts, they can be downloaded as text or PDF files using their `nla.obj` identifiers. See [Accessing data from digitised oral histories](https://tdg.glam-workbench.net/other-digitised-resources/oral-histories/accessing-data.html#transcripts-and-summaries) in the Trove Data Guide for more details.\n", "\n", "This notebook downloads all the available transcripts and summaries from digitised oral histories available in Trove. It uses a [pre-harvested dataset](https://github.com/GLAM-Workbench/trove-oral-histories-data/blob/main/trove-oral-histories.csv) of oral histories to obtain a list of `nla.obj` identifiers. It then constructs a download url using each identifier, and downloads the file.\n", "\n", "If you're using data from the oral histories in Trove, you should read the section on [licensing of oral histories](https://tdg.glam-workbench.net/other-digitised-resources/oral-histories/overview.html#licensing-of-oral-histories) in the Trove Data Guide." ] }, { "cell_type": "code", "execution_count": null, "id": "6152dd49-77f4-4bf5-9897-7392a4d7d0a1", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "import re\n", "import time\n", "from pathlib import Path\n", "\n", "import pandas as pd\n", "import requests_cache\n", "from requests.adapters import HTTPAdapter\n", "from requests.packages.urllib3.util.retry import Retry\n", "from tqdm.auto import tqdm" ] }, { "cell_type": "code", "execution_count": null, "id": "6b1b4b8c-2e2e-4d5e-ae3e-e192a6d0d997", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "s = requests_cache.CachedSession(timeout=60)\n", "retries = Retry(total=5, backoff_factor=1, status_forcelist=[502, 503, 504])\n", "s.mount(\"https://\", HTTPAdapter(max_retries=retries))\n", "s.mount(\"http://\", HTTPAdapter(max_retries=retries))" ] }, { "cell_type": "code", "execution_count": null, "id": "94c06952-1544-4bd1-8da0-1bad35b25164", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "def download_transcripts(output_dir=\"transcripts\", max=None):\n", " # Create a directory to save the transcripts\n", " output_path = Path(output_dir)\n", " output_path.mkdir(exist_ok=True)\n", "\n", " # Load the pre-harvested dataset\n", " df = pd.read_csv(\"https://github.com/GLAM-Workbench/trove-oral-histories-data/raw/main/trove-oral-histories.csv\", keep_default_na=False)\n", "\n", " # Filter to records that have either a transcript or summary (or both)\n", " transcripts = df.loc[(df[\"summary\"] == 1) | (df[\"transcript\"] == 1)]\n", "\n", " # Loop through a list of fulltext_url values from the filtered dataset\n", " for ts in tqdm(transcripts[\"fulltext_url\"].to_list()[:max]):\n", " # Extract the nla.obj id from the fulltext url\n", " ts_id = re.search(r\"nla\\.obj-\\d+\", ts).group(0)\n", "\n", " # Construct a download url\n", " ts_url = f\"https://nla.gov.au/tarkine/listen/download/transcript/{ts_id}\"\n", "\n", " # Download and save the text file\n", " response = s.get(ts_url)\n", " with Path(output_path, f\"{ts_id}.txt\").open(\"w\") as text_file:\n", " text_file.write(response.text)\n", "\n", " # Pause if necessary\n", " if not response.from_cache:\n", " time.sleep(0.5)" ] }, { "cell_type": "code", "execution_count": null, "id": "c65124de-1153-4c60-8d4b-ae04a51c4148", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "download_transcripts()" ] }, { "cell_type": "code", "execution_count": null, "id": "6024bcb8-8b2a-4855-9618-871bb1aa5cb2", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "# TESTING -- PLEASE IGNORE\n", "\n", "with s.cache_disabled():\n", " download_transcripts(max=10)" ] }, { "cell_type": "markdown", "id": "2799acb6-8525-4e30-b73c-6d6af25e82ae", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "----\n", "\n", "Created by [Tim Sherratt](https://timsherratt.org) for the [GLAM Workbench](https://glam-workbench.net/)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.2" }, "rocrate": { "author": [ { "name": "Sherratt, Tim", "orcid": "https://orcid.org/0000-0001-7956-4498" } ], "description": "This notebook downloads all the available transcripts and summaries from digitised oral histories available in Trove. It uses a pre-harvested dataset of oral histories to obtain a list of `nla.obj` identifiers. It then constructs a download url using each identifier, and downloads the file.", "name": "Download summaries and transcripts from oral histories", "object": "https://github.com/GLAM-Workbench/trove-oral-histories-data/blob/main/trove-oral-histories.csv" } }, "nbformat": 4, "nbformat_minor": 5 }