{ "cells": [ { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "# Download the OCRd text for ALL the digitised periodicals in Trove!\n", "\n", "Many of the digitised periodicals available in Trove make OCRd text available for download. This notebook helps you download all the OCRd text from all (or most of?) Trove's digitised periodicals, creating one text file for each issue. It also saves a CSV-formatted list of the issues in each periodical. If you want to harvest all the text of a *single* periodical, see [Get OCRd text from a digitised journal in Trove](Get-text-from-a-Trove-journal.ipynb).\n", "\n", "While you can download the complete text of an issue using the web interface, there's no option to do this with the API alone. The workaround is to mimic the web interface by constructing a download link using the issue identifier and number of pages. \n", "\n", "This notebook works through the dataset of digitised periodical issues created by [Enrich the list of periodicals from the Trove API](periodicals-enrich-for-datasette.ipynb), requesting and saving the OCRd text for each issue. However, not every issue has text available to download." 
] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "## Setting things up" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "# Let's import the libraries we need.\n", "import os\n", "import re\n", "import shutil\n", "import time\n", "from datetime import timedelta\n", "from pathlib import Path\n", "\n", "import pandas as pd\n", "import requests\n", "import requests_cache\n", "from dotenv import load_dotenv\n", "from requests.adapters import HTTPAdapter\n", "from requests.exceptions import ConnectionError, Timeout\n", "from requests.packages.urllib3.util.retry import Retry\n", "from slugify import slugify\n", "from tqdm.auto import tqdm\n", "\n", "s = requests_cache.CachedSession(expire_after=timedelta(days=30))\n", "retries = Retry(total=5, backoff_factor=1, status_forcelist=[502, 503, 504])\n", "s.mount(\"https://\", HTTPAdapter(max_retries=retries))\n", "s.mount(\"http://\", HTTPAdapter(max_retries=retries))\n", "\n", "load_dotenv()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "def check_for_file(issue, texts_path):\n", "    issue_date = issue[\"date\"] if issue[\"date\"] else \"nd\"\n", "    file_name = f\"{issue_date}-{issue['id']}.txt\"\n", "    file_path = Path(texts_path, file_name)\n", "    if file_path.exists():\n", "        return file_name\n", "    return \"\"\n", "\n", "\n", "def download_issue(issue_id, last_page, file_path):\n", "    url = f\"https://trove.nla.gov.au/{issue_id}/download?downloadOption=ocr&firstPage=0&lastPage={last_page}\"\n", "    # print(url)\n", "    # Get the file\n", "    try:\n", "        response = s.get(url, timeout=180)\n", "    except (Timeout, ConnectionError) as err:\n", "        print(f\"{type(err).__name__}: {url}\")\n", "    else:\n", "        # Check there was no error\n", "        if 
response.status_code == requests.codes.ok:\n", "            # Check that the file's not empty\n", "            response.encoding = \"utf-8\"\n", "            # Check that the file isn't HTML (some not found pages don't return 404s)\n", "            # BS is too lax and will pass text files that happen to have html tags in them\n", "            # if BeautifulSoup(r.text, \"html.parser\").find(\"html\") is None:\n", "            if (\n", "                len(response.text) > 0\n", "                and not response.text.isspace()\n", "                and not re.search(r\"<html.*>\", response.text, re.IGNORECASE)\n", "            ):\n", "                file_path.write_text(response.text)\n", "        if not response.from_cache:\n", "            time.sleep(0.5)\n", "\n", "\n", "def download_all_issues(df, output_dir=\"periodicals\"):\n", "    # Group issues by title, then loop through the titles/issues\n", "    for title, issues in tqdm(df.groupby([\"title_id\", \"title\"])):\n", "        output_path = Path(output_dir, f\"{slugify(title[1])[:50]}-{title[0]}\")\n", "        texts_path = Path(output_path, \"texts\")\n", "        texts_path.mkdir(exist_ok=True, parents=True)\n", "        issues_with_pages = issues.loc[issues[\"pages\"] > 0]\n", "        for issue in tqdm(\n", "            issues_with_pages.itertuples(),\n", "            total=issues_with_pages.shape[0],\n", "            leave=False,\n", "        ):\n", "            last_page = issue.pages - 1\n", "            issue_date = issue.date if issue.date else \"nd\"\n", "            file_name = f\"{issue_date}-{issue.id}.txt\"\n", "            file_path = Path(texts_path, file_name)\n", "            # Check to see if the file has already been harvested\n", "            if not file_path.exists():\n", "                download_issue(issue.id, last_page, file_path)\n", "        issues[\"text_file\"] = issues.apply(check_for_file, args=(texts_path,), axis=1)\n", "        issues.to_csv(Path(output_path, \"issues.csv\"), index=False)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "# Load issues dataset\n", "df = pd.read_csv(\n", "    \"https://github.com/GLAM-Workbench/trove-periodicals-data/raw/main/periodical-issues.csv\",\n", 
"    keep_default_na=False,\n", ")\n", "\n", "# Download texts\n", "download_all_issues(df)" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "Results for each journal are saved in a separate directory in the output directory (which defaults to `periodicals`). The name of the journal directory is created using the journal title and journal id. Inside this directory is a CSV-formatted file containing details of all the available issues, and a `texts` sub-directory to contain the downloaded text files.\n", "\n", "The individual file names are created using the issue date (or `nd` if the issue has no date) and the issue identifier. So the resulting hierarchy might look something like this, where `<date>` stands for the issue date:\n", "\n", "```\n", "periodicals\n", "    - angry-penguins-nla.obj-320790312\n", "        - issues.csv\n", "        - texts\n", "            - <date>-nla.obj-320791009.txt\n", "```\n", "\n", "The `issues.csv` will contain details of all the issues in the journal. Its structure is the same as the `periodical-issues.csv` loaded from the [trove-periodicals-data](https://github.com/GLAM-Workbench/trove-periodicals-data) repository, except that a new `text_file` column is added. If text was successfully downloaded from an issue, the `text_file` column will include the name of the saved text file." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "editable": true, "jupyter": { "source_hidden": true }, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "# IGNORE THIS CELL -- FOR TESTING ONLY\n", "if os.getenv(\"GW_STATUS\") == \"dev\":\n", "    df = pd.read_csv(\n", "        \"https://github.com/GLAM-Workbench/trove-periodicals-data/raw/main/periodical-issues.csv\",\n", "        keep_default_na=False,\n", "    )\n", "\n", "download_all_issues(df[:10], output_dir=\"texts_test\")\n", "shutil.rmtree(\"texts_test\")" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "----\n", "\n", "Created by [Tim Sherratt](https://timsherratt.org/) for the [GLAM Workbench](https://glam-workbench.github.io/).\n", "\n", "Work on this notebook was supported by the [Humanities, Arts and Social Sciences (HASS) Data Enhanced Virtual Lab](https://tinker.edu.au/)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" }, "rocrate": { "action": [ { "local_path": "periodicals", "mainEntityOfPage": "https://glam-workbench.net/trove-journals/ocrd-text-all-journals/", "name": "OCRd text from Trove's digitised periodicals", "object": [ { "url": "https://github.com/GLAM-Workbench/trove-periodicals-data/raw/main/periodical-issues.csv" } ], "result": [ { "description": "This dataset contains OCRd text and metadata harvested from digitised periodicals in Trove.\n\nThe zip file contains a directory for each periodical which is named using its title and identifier, eg: `14th-company-magazine-nla.obj-15956697`. 
Each directory contains a CSV-formatted list of issues and a subdirectory named `texts` that contains a text file for each issue with OCRd text. The text files are named using the issue's date and identifier, eg: `1918-06-14-nla.obj-15967449.txt`. If text was successfully downloaded from an issue, the `issues.csv` file will include the name of the text file in the `text_file` column.", "url": "https://trove-journals.s3.ap-southeast-2.amazonaws.com/trove-periodicals.zip" } ] } ], "author": [ { "mainEntityOfPage": "https://timsherratt.au", "name": "Sherratt, Tim", "orcid": "https://orcid.org/0000-0001-7956-4498" } ], "category": "Harvesting text", "description": "This notebook helps you download all the OCRd text from all (or most of?) Trove's digitised periodicals, creating one text file for each issue. It also saves a CSV-formatted list of the issues in each periodical.", "mainEntityOfPage": "https://glam-workbench.net/trove-journals/get-ocrd-text-from-all-journals/", "name": "Download the OCRd text for ALL the digitised periodicals in Trove!", "position": 5 } }, "nbformat": 4, "nbformat_minor": 4 }