{ "cells": [ { "cell_type": "markdown", "id": "555cf3b9-3380-4208-863e-1624499063bf", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "# Create a database to search across each line of text in a series of volumes\n", "\n", "The code here was used to create the [NSW Post Office Directories search interface](https://nsw-post-office-directories-yajhxrvxsa-ts.a.run.app/) which helps you search across 54 volumes from 1886 to 1950. The same code, with minor modifications, could be used to index any publication where it would be useful to search by line (rather than Trove's default 'article') – for example, lists, directories and gazetteers – turning them into searchable databases..\n", "\n", "All you really need to get started is the top level identifier – that is, the Trove identifier for the publication series, under which individual volumes are listed. Where noted in the code, you'll also want to change a few names and details.\n", "\n", "Because the harvesting process can be quite slow, and the amount of data processed quite large, you would probably want to run this notebook within a persistent environment rather than on Binder – either on your own computer, or in a cloud service like Reclaim Cloud or Nectar." ] }, { "cell_type": "code", "execution_count": null, "id": "afd5cbdd-2dd5-4f3b-ae89-e611d6ec2c10", "metadata": {}, "outputs": [], "source": [ "# Let's import the libraries we need.\n", "import json\n", "import re\n", "import time\n", "from pathlib import Path\n", "\n", "import requests\n", "from bs4 import BeautifulSoup\n", "from natsort import natsorted, ns\n", "from requests.adapters import HTTPAdapter\n", "from requests.packages.urllib3.util.retry import Retry\n", "from slugify import slugify\n", "from sqlite_utils import Database\n", "from tqdm.auto import tqdm\n", "\n", "s = requests.Session()\n", "retries = Retry(total=5, backoff_factor=1, status_forcelist=[502, 503, 504])\n", "s.mount(\"https://\", HTTPAdapter(max_retries=retries))\n", "s.mount(\"http://\", HTTPAdapter(max_retries=retries))" ] }, { "cell_type": "markdown", "id": "988f516c-fd75-45e2-a40c-745549be7cb9", "metadata": {}, "source": [ "## Get the data!\n", "\n", "The first step is to download all of the OCRd text for each volume in the series. If they're large volumes this could take a long time." 
] }, { "cell_type": "code", "execution_count": null, "id": "b9d2f932-6c1e-4a3d-95a0-2512335d4bf2", "metadata": {}, "outputs": [], "source": [ "def harvest_metadata(obj_id):\n", " \"\"\"\n", " This calls an internal API from a journal landing page to extract a list of available issues.\n", " \"\"\"\n", " # The initial startIdx value\n", " start = 0\n", " # Number of results per page\n", " n = 20\n", " issues = []\n", " with tqdm(desc=\"Issues\", leave=False) as pbar:\n", " # If there aren't 20 results on the page then we've reached the end, so continue harvesting until that happens.\n", " while n == 20:\n", " # Get the browse page\n", " response = s.get(\n", " f\"https://nla.gov.au/{obj_id}/browse?startIdx={start}&rows=20&op=c\",\n", " timeout=60,\n", " )\n", " # Beautifulsoup turns the HTML into an easily navigable structure\n", " soup = BeautifulSoup(response.text, \"lxml\")\n", " # Find all the divs containing issue details and loop through them\n", " details = soup.find_all(class_=\"l-item-info\")\n", " for detail in details:\n", " issue = {}\n", " title = detail.find(\"h3\")\n", " if title:\n", " issue[\"title\"] = title.text\n", " issue[\"id\"] = title.parent[\"href\"].strip(\"/\")\n", " else:\n", " issue[\"title\"] = \"No title\"\n", " issue[\"id\"] = detail.find(\"a\")[\"href\"].strip(\"/\")\n", " try:\n", " # Get the issue details\n", " issue[\"details\"] = detail.find(\n", " class_=\"obj-reference content\"\n", " ).string.strip()\n", " except (AttributeError, IndexError):\n", " issue[\"details\"] = \"issue\"\n", " # Get the number of pages\n", " try:\n", " issue[\"pages\"] = int(\n", " re.search(\n", " r\"^(\\d+)\",\n", " detail.find(\"a\", attrs={\"data-pid\": issue[\"id\"]}).text,\n", " flags=re.MULTILINE,\n", " ).group(1)\n", " )\n", " except AttributeError:\n", " issue[\"pages\"] = 0\n", " issues.append(issue)\n", " # print(issue)\n", " time.sleep(0.2)\n", " # Increment the startIdx\n", " start += n\n", " # Set n to the number of results on the current page\n", " n = len(details)\n", " pbar.update(n)\n", " return issues\n", "\n", "\n", "def save_ocr(issues, obj_id, title=None, output_dir=\"directories\"):\n", " \"\"\"\n", " Download the OCRd text for each page in each issue.\n", " \"\"\"\n", " if not title:\n", " title = issues[0][\"title\"]\n", " # output_path = os.path.join(output_dir, '{}-{}'.format(slugify(title)[:50], obj_id))\n", " # output_path = Path(output_dir, f\"{slugify(title)[:50]}-{obj_id}\")\n", " # output_path.mkdir(exist_ok=True)\n", " # texts_path = os.path.join(output_path, 'texts')\n", " # os.makedirs(texts_path, exist_ok=True)\n", " for issue in tqdm(issues, desc=\"Texts\", leave=False):\n", " # Default values\n", " if issue[\"pages\"] != 0:\n", " issue_name = f'{slugify(issue[\"title\"])[:50]}-{slugify(issue[\"details\"])}-{issue[\"id\"]}'\n", " issue_path = Path(output_dir, issue_name)\n", " issue_path.mkdir(exist_ok=True, parents=True)\n", " for page in tqdm(range(0, issue[\"pages\"])):\n", " # file_name = '{}-{}-{}.txt'.format(slugify(issue['title'])[:50], slugify(issue['details'])[:50], issue['id'])\n", "\n", " file_path = Path(issue_path, f\"{issue_name}-page{page}.txt\")\n", " # Check to see if the file has already been harvested\n", " if not file_path.exists():\n", " url = f'https://trove.nla.gov.au/{issue[\"id\"]}/download?downloadOption=ocr&firstPage={page}&lastPage={page}'\n", " # print(url)\n", " # Get the file\n", " r = s.get(url, timeout=120)\n", " # Check there was no error\n", " if r.status_code == requests.codes.ok:\n", " # Check that the 
file's not empty\n", " r.encoding = \"utf-8\"\n", " if len(r.text) > 0 and not r.text.isspace():\n", " # Check that the file isn't HTML (some not found pages don't return 404s)\n", " if (\n", " BeautifulSoup(r.text, \"html.parser\").find(\"html\")\n", " is None\n", " ):\n", " # If everything's ok, save the file\n", " with open(\n", " file_path, \"w\", encoding=\"utf-8\"\n", " ) as text_file:\n", " text_file.write(r.text)\n", " time.sleep(1)" ] }, { "cell_type": "markdown", "id": "f378b3ad-8d25-48c5-b0ae-e0cf0d881bd5", "metadata": {}, "source": [ "Get a list of available issues for this title." ] }, { "cell_type": "code", "execution_count": null, "id": "f9416332-58c1-4ba8-b8ff-6d52b11378b7", "metadata": {}, "outputs": [], "source": [ "issues = harvest_metadata(\"nla.obj-522689844\")" ] }, { "cell_type": "markdown", "id": "38730d50-c382-41f5-a856-075970b77920", "metadata": {}, "source": [ "Save the OCRd text of each issue page by page. This means there'll be one text file for every page in every volume." ] }, { "cell_type": "code", "execution_count": null, "id": "cf7fc29e-950c-465e-8fce-3f41a67599ae", "metadata": { "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "save_ocr(issues, \"nla.obj-522689844\")" ] }, { "cell_type": "markdown", "id": "c36864a6-abc8-4364-a109-9d7ddc50bc85", "metadata": {}, "source": [ "## Assemble the database!\n", "\n", "Now we've got all the text, we can process each line and add everything into an SQLite database." ] }, { "cell_type": "code", "execution_count": null, "id": "1b967e0c-a2b7-40d6-b201-14183ba6be81", "metadata": {}, "outputs": [], "source": [ "def get_work_metadata(obj_id):\n", " \"\"\"\n", " Extracts metadata embedded as a JSON string in a work's HTML page.\n", " See: https://glam-workbench.net/trove-books/metadata-for-digital-works/\n", " \"\"\"\n", " # Get the HTML page\n", " response = requests.get(f\"https://nla.gov.au/{obj_id}\")\n", " # Search for the JSON string using regex\n", " try:\n", " work_data = re.search(\n", " r\"var work = JSON\\.parse\\(JSON\\.stringify\\((\\{.*\\})\", response.text\n", " ).group(1)\n", " except AttributeError:\n", " # Just in case it's not there...\n", " work_data = \"{}\"\n", " print(\"No data found!\")\n", " # Return the JSON data\n", " return json.loads(work_data)" ] }, { "cell_type": "code", "execution_count": null, "id": "73f3babc-a8c4-417f-9ccb-78c2c3ed64cd", "metadata": { "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "db_path = Path(\"datasette-test\")\n", "db_path.mkdir(exist_ok=True)\n", "\n", "# This the basic metadata file that will be used by Datasette\n", "# The processing steps below will add details for each table/volume into the metadata file\n", "# Obviously you'd modify this for a different publication!\n", "\n", "metadata = {\n", " \"title\": \"New South Wales Post Office Directories\",\n", " \"description_html\": \"

This is an experimental interface to the NSW Post Office Directories, which have been digitised from the collections of the State Library of NSW and are now available on Trove. Searching for entries in the directories on Trove is difficult because Trove's results are grouped by 'article', which is not very helpful for publications like these where information is organised by row. This interface searches individual lines of text, rather than pages or articles, so it points you straight to entries of interest. Once you've found something, you can view the entry within the context of the complete page, or click back to Trove to explore further.
You can currently search across 15 volumes from 1886 to 1908.\",\n", " \"databases\": {\n", " \"post-office-directories\": {\n", " \"title\": \"New South Wales Post Office Directories, 1886 to 1908\",\n", " \"source\": \"Trove\",\n", " \"source_url\": \"https://nla.gov.au/nla.obj-518308191\",\n", " \"tables\": {},\n", " }\n", " },\n", "}" ] }, { "cell_type": "markdown", "id": "1c1c10d8-25cd-4cb7-b904-fe9a85cba5fa", "metadata": {}, "source": [ "Now we'll process the data and add it to an SQLite database." ] }, { "cell_type": "code", "execution_count": null, "id": "360267d7-8329-4632-8664-971b7e043bef", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "# Create the database\n", "# Change the name as apporpriate!\n", "db = Database(Path(db_path, \"post-office-directories.db\"))\n", "\n", "# Create database tables for pages and volumes\n", "page_table = db[\"pages\"]\n", "vols_table = db[\"volumes\"]\n", "\n", "# Loop through the directories of each volume created by the harvesting process (above)\n", "# Use natsorted so that they're processed in date order\n", "vols = natsorted(\n", " [\n", " d\n", " for d in Path(\"directories\").glob(\n", " \"the-new-south-wales-post-office-directory-collection*\"\n", " )\n", " if d.is_dir()\n", " ]\n", ")\n", "\n", "for vol in vols:\n", " print(vol)\n", "\n", " # In the case of the PO Directories each volume has a different year (or year range) in the title.\n", " # Here we're extracting the year and id, but extracting the year in this way might not work for other titles.\n", " # The year is used as the title of the table in Datasette\n", " year, obj_id = re.search(r\"-([0-9\\-]+?)-(nla.obj-\\d+)\", vol.name).groups()\n", "\n", " # Add a record for this volume to the database\n", " vols_table.insert({\"vol_id\": obj_id, \"year\": year}, pk=\"vol_id\")\n", "\n", " # Get the embedded JSON data for this volume and extract a list of pages\n", " work_data = get_work_metadata(obj_id)\n", " work_pages = work_data[\"children\"][\"page\"]\n", "\n", " # Get the title from the work metadata\n", " title = work_data[\"title\"]\n", "\n", " # Update the metadata file with details of this volume\n", " metadata[\"databases\"][\"post-office-directories\"][\"tables\"][year] = {\n", " \"title\": f\"{title} {year}\",\n", " \"source\": \"Trove\",\n", " \"source_url\": f\"https://nla.gov.au/{obj_id}\",\n", " \"searchmode\": \"raw\",\n", " }\n", "\n", " # Create a table for this volume. 
For the PO directories I'm using the year as the table name.\n", " # If year isn't available, some other way of naming the table would be necessary, such as the full title.\n", " vol_table = db[year]\n", "\n", " # Loop through all the text page by page for this volume\n", " pages = natsorted([p for p in vol.glob(\"*\") if p.is_file()], alg=ns.PATH)\n", " for page in pages:\n", " lines = []\n", "\n", " # Insert details about this page into the pages table\n", " page_num = int(re.search(r\"-page(\\d+)$\", page.name).group(1))\n", " page_table.insert(\n", " {\n", " \"vol_id\": obj_id,\n", " \"page_id\": work_pages[page_num][\"pid\"],\n", " \"page\": page_num,\n", " },\n", " pk=(\"vol_id\", \"page\"),\n", " foreign_keys=[(\"vol_id\", \"volumes\")],\n", " )\n", "\n", " # Open the text file and loop through the lines\n", " with page.open(\"r\") as txt:\n", " for num, line in enumerate(txt):\n", " # Get rid of blank lines\n", " line = line.replace(\"\\n\", \"\")\n", " # If line is not blank, add details to a list of lines from this page\n", " if line:\n", " lines.append(\n", " {\n", " \"page\": page_num,\n", " \"line\": num + 1,\n", " \"text\": line,\n", " \"page_id\": work_pages[page_num][\"pid\"],\n", " }\n", " )\n", " # Insert all the lines from this page into the db\n", " vol_table.insert_all(\n", " lines, pk=(\"page\", \"line\"), foreign_keys=[(\"page_id\", \"pages\", \"page_id\")]\n", " )\n", " # Add a full text index on the line text\n", " vol_table.enable_fts([\"text\"])\n", "\n", "# Save the updated metadata file\n", "with open(Path(db_path, \"metadata.json\"), \"w\") as md_file:\n", " json.dump(metadata, md_file, indent=4)" ] }, { "cell_type": "markdown", "id": "c2080f86-6a82-43b2-9e6f-29f7297abc9f", "metadata": {}, "source": [ "## Opening the db in Datasette" ] }, { "cell_type": "markdown", "id": "57ef2e22-fa52-48b2-9ddc-237ea2d0a150", "metadata": {}, "source": [ "You should now have an SQLite database containing the indexed text of all the volumes in the collection. You can explore the database using Datasette. In a Python environment you'll need to install:\n", "\n", "* [Datasette](https://datasette.io/)\n", "* [datasette-search-all](https://github.com/simonw/datasette-search-all)\n", "* [datasette-template-sql](https://datasette.io/plugins/datasette-template-sql)\n", "\n", "Then, from within the directory containing the database, run:\n", "\n", "```shell\n", "datasette post-office-directories.db -m metadata.json\n", "```\n", "\n", "To share your database publicly, look at [Datasette's publishing options](https://docs.datasette.io/en/stable/publish.html)." ] }, { "cell_type": "markdown", "id": "eab2e43c-71d4-43b7-a9a2-ac6b7eacf95c", "metadata": {}, "source": [ "----\n", "\n", "Created by [Tim Sherratt](https://timsherratt.org/) for the [GLAM Workbench](https://glam-workbench.net/) as part of the [Everyday Heritage](https://everydayheritage.au/) project." 
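 ] }, { "cell_type": "markdown", "id": "added-fts-check-note", "metadata": {}, "source": [ "As a quick check of the full-text indexes before (or instead of) starting Datasette, you can also query a volume table directly from Python. The cell below is a minimal sketch, not part of the original workflow: it assumes the database built above, a reasonably recent version of sqlite-utils, and that a table named \"1886\" exists; adjust the table name and search term to suit your own data." ] },
{ "cell_type": "code", "execution_count": null, "id": "added-fts-check", "metadata": { "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "# A minimal sketch: query one volume's full-text index directly with sqlite-utils.\n", "# Assumes the database created above and a table named \"1886\" (adjust as needed).\n", "from itertools import islice\n", "from pathlib import Path\n", "\n", "from sqlite_utils import Database\n", "\n", "db = Database(Path(\"datasette-test\", \"post-office-directories.db\"))\n", "\n", "# Tables indexed with enable_fts() support .search(); print the first few matches\n", "for row in islice(db[\"1886\"].search(\"blacksmith\"), 5):\n", "    print(row[\"page\"], row[\"line\"], row[\"text\"])"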
] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" }, "rocrate": { "author": [ { "mainEntityOfPage": "https://timsherratt.au", "name": "Sherratt, Tim", "orcid": "https://orcid.org/0000-0001-7956-4498" } ], "category": "Harvesting text", "description": "The code here was used to create the NSW Post Office Directories search interface which helps you search across 54 volumes from 1886 to 1950. The same code, with minor modifications, could be used to index any publication where it would be useful to search by line (rather than Trove's default 'article') – for example, lists, directories and gazetteers – turning them into searchable databases.", "mainEntityOfPage": "https://glam-workbench.net/trove-journals/create-text-db-indexed-by-line/", "name": "Create a database to search across each line of text in a series of volumes", "position": 6, "workExample": [ { "name": "NSW Post Office Directories, 1886 to 1950", "url": "https://glam-workbench.net/nsw-post-office-directories/" }, { "name": "Sydney Telephone Directories, 1926 to 1954", "url": "https://glam-workbench.net/trove-journals/sydney-telephone-directories/" } ] } }, "nbformat": 4, "nbformat_minor": 5 }