{ "cells": [ { "cell_type": "markdown", "id": "692ae7bb-d079-4162-953e-0df83f19497d", "metadata": {}, "source": [ "# Download and process Tasmanian Post Office Directory PDFs\n", "\n", "PDFs of the Tasmanian Post Office Directory from 1890 to 1948 are [available from Libraries Tasmania](https://stors.tas.gov.au/ILS/SD_ILS-981598). This notebook downloads all 48 PDFs, then extracts images and text from the PDFs using PyMuPDF.\n", "\n", "Further processing:\n", "- I wasn't happy with the quality of the text extracted from the PDFs, so I decided to [re-OCR the images using Tesseract](tas-pod-ocr-with-tesseract.ipynb).\n", "- The images were [uploaded to Amazon S3](tas-pod-upload-images.ipynb) for delivery via IIIF.\n", "- Once the text and images were ready, I [loaded everything into an SQLite database](tas-pod-add-to-datasette.ipynb) for delivery via Datasette." ] }, { "cell_type": "code", "execution_count": 1, "id": "503ec386-84d0-4b51-a55e-dc92d3767219", "metadata": {}, "outputs": [], "source": [ "import re\n", "from pathlib import Path\n", "\n", "import fitz\n", "import requests" ] }, { "cell_type": "code", "execution_count": 18, "id": "aead70a5-0174-467b-907c-7975b5d2a02f", "metadata": {}, "outputs": [], "source": [ "# The base URL for downloads from Libraries Tas\n", "download_url = \"https://stors.tas.gov.au/download/\"\n", "\n", "# This HTML list of PDFs was just copied from the page source of the Libraries Tas viewer. It could of course be scraped automatically.\n", "pdf_list = \"\"\"\n", "
  • Tasmanian Post Office Directory 1896-97
  • Wise's Tasmanian Directory 1900
  • Tasmanian Post Office Directory 1890-91
  • Tasmanian Post Office Directory 1892-93
  • Tasmanian Post Office Directory 1894-95
  • Wise's Tasmanian Directory 1898
  • Wise's Tasmanian Directory 1906
  • Wise's Tasmanian Directory 1899
  • Wise's Tasmanian Directory 1907
  • Wise's Tasmanian Directory 1908
  • Wise's Tasmanian Directory 1901
  • Wise's Tasmanian Directory 1909
  • Wise's Tasmanian Directory 1902
  • Wise's Tasmanian Directory 1910
  • Wise's Tasmanian Directory 1903
  • Wise's Tasmanian Directory 1911
  • Wise's Tasmanian Directory 1904
  • Wise's Tasmanian Directory 1912
  • Wise's Tasmanian Directory 1905
  • Wise's Tasmanian Directory 1913
  • Wise's Tasmanian Directory 1914
  • Wise's Tasmanian Directory 1915
  • Wise's Tasmanian Directory 1916
  • Wise's Tasmanian Directory 1917
  • Wise's Tasmanian Directory 1918
  • Wise's Tasmanian Directory 1919
  • Wise's Tasmanian Directory 1920
  • Wise's Tasmanian Directory 1921
  • Wise's Tasmanian Directory 1922
  • Wise's Tasmanian Directory 1923
  • Wise's Tasmanian Directory 1924
  • Wise's Tasmanian Directory 1925
  • Wise's Tasmanian Directory 1926
  • Wise's Tasmanian Directory 1927
  • Wise's Tasmanian Directory 1928
  • Wise's Tasmanian Directory 1929
  • Wise's Tasmanian Directory 1930
  • Wise's Tasmanian Directory 1931
  • Wise's Tasmanian Directory 1932
  • Wise\\'s Tasmanian Directory 1933-34
  • Wise's Tasmanian Directory 1935
  • Wise's Tasmanian Directory 1936
  • Wise's Tasmanian Directory 1937
  • Wise's Tasmanian Directory 1938
  • Wise's Tasmanian Directory 1939
  • Wise\\'s Tasmanian Directory 1940-41
  • Wise\\'s Tasmanian Directory 1941-42
  • Wise's Tasmanian Directory 1942-43
  • Wise\\'s Tasmanian Directory 1943-44
  • Wise\\'s Tasmanian Directory 1944-45
  • Wise\\'s Tasmanian Directory 1945-46
  • Wise's Tasmanian Directory 1945
  • Wise's Tasmanian Directory 1947
  • Wise's Tasmanian Directory 1948
  • \n", "\"\"\"" ] }, { "cell_type": "markdown", "id": "309b1abd-22ed-41c2-af90-69be2d5b7aaa", "metadata": {}, "source": [ "## Download all of the PDFs\n", "\n", "We'll extract all the volume identifiers from the HTML, then download each PDF." ] }, { "cell_type": "code", "execution_count": null, "id": "57883c43-6193-49e5-acb6-89dc22d3abd2", "metadata": { "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "# Download all the PDFs\n", "# Extract all the volume identifiers from the HTML fragment\n", "pids = re.findall(r\"href=\\\"/([A-Z0-9\\-]+)\\\"\", pdf_list)\n", "# Make sure the output directory exists before saving anything to it\n", "Path(\"tasmania\").mkdir(exist_ok=True)\n", "# Loop through the identifiers, downloading and saving each PDF\n", "for pid in pids:\n", "    print(pid)\n", "    response = requests.get(f\"{download_url}{pid}\")\n", "    # Stop on a failed request rather than saving an error page as a PDF\n", "    response.raise_for_status()\n", "    Path(\"tasmania\", f\"{pid}.pdf\").write_bytes(response.content)" ] }, { "cell_type": "markdown", "id": "2a166fc9-cb45-4a41-981b-6377c02524ed", "metadata": {}, "source": [ "## Extract text and images from the PDFs\n", "\n", "We're using [PyMuPDF](https://pymupdf.readthedocs.io/en/latest/) to extract information from the PDFs. Once processed, we'll end up with a folder for each volume, within which are folders labelled 'text' and 'images' containing all the extracted text and images." 
] }, { "cell_type": "code", "execution_count": null, "id": "7681d880-cb16-45a5-bd08-07f288ee9bac", "metadata": { "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "# Loop through all the PDFs\n", "for pdf in Path(\"tasmania\").glob(\"*.pdf\"):\n", "    print(pdf.name)\n", "    pid = pdf.name.split(\".\")[0]\n", "    # Create directory for volume\n", "    data_dir = Path(\"tasmania\", pid)\n", "    data_dir.mkdir(exist_ok=True)\n", "    # Create directories for text and images\n", "    text_dir = Path(data_dir, \"text\")\n", "    image_dir = Path(data_dir, \"images\")\n", "    text_dir.mkdir(exist_ok=True)\n", "    image_dir.mkdir(exist_ok=True)\n", "    # Open the PDF with PyMuPDF\n", "    doc = fitz.open(pdf)\n", "    for i, page in enumerate(doc):\n", "        # Get images\n", "        # Note: the file name depends only on the page number, so if a page holds more than one image, later images overwrite earlier ones.\n", "        for xref in page.get_images():\n", "            pix = fitz.Pixmap(doc, xref[0])\n", "            image_file = Path(image_dir, f\"{pid}-{i+1}.jpg\")\n", "            pix.save(image_file)\n", "        # Get text\n", "        text_path = Path(text_dir, f\"{pid}-{i+1}.txt\")\n", "        # The sort option tries to organise the text into a natural reading order.\n", "        # However, this doesn't always manage to identify column boundaries, so values from adjacent columns can be munged together.\n", "        text = page.get_text(sort=True)\n", "        text_path.write_text(text)" ] }, { "cell_type": "markdown", "id": "754150bc-fb02-4709-9b78-15318f51d1d4", "metadata": {}, "source": [ "----\n", "Created by [Tim Sherratt](https://timsherratt.org/) for the [GLAM Workbench](https://glam-workbench.net/) as part of the [Everyday Heritage](https://everydayheritage.au/) project." 
] } ], "metadata": { "kernelspec": { "display_name": "Python 3.9.9 64-bit", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.9" }, "vscode": { "interpreter": { "hash": "e7370f93d1d0cde622a1f8e1c04877d8463912d04d973331ad4851f04de6915a" } }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 5 }