{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "5b7277b7-1e6a-47b9-ae17-56f0d973c9c1",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "source": [
    "# Download summaries and transcripts from oral histories\n",
    "\n",
    "If oral histories have summaries or transcripts, they can be downloaded as text or PDF files using their `nla.obj` identifiers. See [Accessing data from digitised oral histories](https://tdg.glam-workbench.net/other-digitised-resources/oral-histories/accessing-data.html#transcripts-and-summaries) in the Trove Data Guide for more details.\n",
    "\n",
    "This notebook downloads all the available transcripts and summaries from digitised oral histories available in Trove. It uses a [pre-harvested dataset](https://github.com/GLAM-Workbench/trove-oral-histories-data/blob/main/trove-oral-histories.csv) of oral histories to obtain a list of `nla.obj` identifiers. It then constructs a download url using each identifier, and downloads the file.\n",
    "\n",
    "If you're using data from the oral histories in Trove, you should read the section on [licensing of oral histories](https://tdg.glam-workbench.net/other-digitised-resources/oral-histories/overview.html#licensing-of-oral-histories) in the Trove Data Guide."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6152dd49-77f4-4bf5-9897-7392a4d7d0a1",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "import re\n",
    "import time\n",
    "from pathlib import Path\n",
    "\n",
    "import pandas as pd\n",
    "import requests_cache\n",
    "from requests.adapters import HTTPAdapter\n",
    "from requests.packages.urllib3.util.retry import Retry\n",
    "from tqdm.auto import tqdm"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6b1b4b8c-2e2e-4d5e-ae3e-e192a6d0d997",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "s = requests_cache.CachedSession(timeout=60)\n",
    "retries = Retry(total=5, backoff_factor=1, status_forcelist=[502, 503, 504])\n",
    "s.mount(\"https://\", HTTPAdapter(max_retries=retries))\n",
    "s.mount(\"http://\", HTTPAdapter(max_retries=retries))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "94c06952-1544-4bd1-8da0-1bad35b25164",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "def download_transcripts(output_dir=\"transcripts\", max=None):\n",
    "    # Create a directory to save the transcripts\n",
    "    output_path = Path(output_dir)\n",
    "    output_path.mkdir(exist_ok=True)\n",
    "\n",
    "    # Load the pre-harvested dataset\n",
    "    df = pd.read_csv(\"https://github.com/GLAM-Workbench/trove-oral-histories-data/raw/main/trove-oral-histories.csv\", keep_default_na=False)\n",
    "\n",
    "    # Filter to records that have either a transcript or summary (or both)\n",
    "    transcripts = df.loc[(df[\"summary\"] == 1) | (df[\"transcript\"] == 1)]\n",
    "\n",
    "    # Loop through a list of fulltext_url values from the filtered dataset\n",
    "    for ts in tqdm(transcripts[\"fulltext_url\"].to_list()[:max]):\n",
    "        # Extract the nla.obj id from the fulltext url\n",
    "        ts_id = re.search(r\"nla\\.obj-\\d+\", ts).group(0)\n",
    "\n",
    "        # Construct a download url\n",
    "        ts_url = f\"https://nla.gov.au/tarkine/listen/download/transcript/{ts_id}\"\n",
    "\n",
    "        # Download and save the text file\n",
    "        response = s.get(ts_url)\n",
    "        with Path(output_path, f\"{ts_id}.txt\").open(\"w\") as text_file:\n",
    "            text_file.write(response.text)\n",
    "\n",
    "        # Pause if necessary\n",
    "        if not response.from_cache:\n",
    "            time.sleep(0.5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c65124de-1153-4c60-8d4b-ae04a51c4148",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": [
     "nbval-skip"
    ]
   },
   "outputs": [],
   "source": [
    "download_transcripts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6024bcb8-8b2a-4855-9618-871bb1aa5cb2",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "# TESTING -- PLEASE IGNORE\n",
    "\n",
    "with s.cache_disabled():\n",
    "    download_transcripts(max=10)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2799acb6-8525-4e30-b73c-6d6af25e82ae",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "source": [
    "----\n",
    "\n",
    "Created by [Tim Sherratt](https://timsherratt.org) for the [GLAM Workbench](https://glam-workbench.net/)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.2"
  },
  "rocrate": {
   "author": [
    {
     "name": "Sherratt, Tim",
     "orcid": "https://orcid.org/0000-0001-7956-4498"
    }
   ],
   "description": "This notebook downloads all the available transcripts and summaries from digitised oral histories available in Trove. It uses a pre-harvested dataset of oral histories to obtain a list of `nla.obj` identifiers. It then constructs a download url using each identifier, and downloads the file.",
   "name": "Download summaries and transcripts from oral histories",
   "object": "https://github.com/GLAM-Workbench/trove-oral-histories-data/blob/main/trove-oral-histories.csv"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}