{ "cells": [ { "cell_type": "markdown", "id": "692ae7bb-d079-4162-953e-0df83f19497d", "metadata": {}, "source": [ "# Download and process Tasmanian Post Office Directory PDFs\n", "\n", "PDFs of the Tasmanian Post Office Directory from 1890 to 1948 are [available from Libraries Tasmania](https://stors.tas.gov.au/ILS/SD_ILS-981598). This notebook downloads all 48 PDFs, then extracts images and text from them using PyMuPDF.\n", "\n", "Further processing:\n", "- I wasn't happy with the quality of the text extracted from the PDFs, so I decided to [re-OCR the images using Tesseract](tas-pod-ocr-with-tesseract.ipynb).\n", "- The images were [uploaded to Amazon S3](tas-pod-upload-images.ipynb) for delivery via IIIF.\n", "- Once the text and images were ready, I [loaded everything into an SQLite database](tas-pod-add-to-datasette.ipynb) for delivery via Datasette." ] }, { "cell_type": "code", "execution_count": 1, "id": "503ec386-84d0-4b51-a55e-dc92d3767219", "metadata": {}, "outputs": [], "source": [ "import re\n", "from pathlib import Path\n", "\n", "import fitz  # PyMuPDF\n", "import requests" ] }, { "cell_type": "code", "execution_count": 18, "id": "aead70a5-0174-467b-907c-7975b5d2a02f", "metadata": {}, "outputs": [], "source": [ "# The base URL for downloads from Libraries Tas\n", "download_url = \"https://stors.tas.gov.au/download/\"\n", "\n", "# This HTML list of PDFs was copied from the page source of the Libraries Tas viewer. It could, of course, be scraped automatically.\n", "pdf_list = \"\"\"\n", "