{ "cells": [ { "cell_type": "markdown", "id": "3a492270-78da-4a2a-9252-d353b6a2ac48", "metadata": {}, "source": [ "# Extract text from PDF images using Tesseract\n", "\n", "Although I was able to [extract text from the PDFs directly](tas-pod-save-text-images.ipynb), I wasn't happy with the quality. In particular, column layout detection was quite variable, munging values from different columns together. After a few tests, I decided that re-OCRing the images using [Tesseract](https://pypi.org/project/pytesseract/) would produce better results. Tesseract's automatic page layout detection does a pretty good job of identifying the columns, and the OCR quality in general seems better. There's still some munging of values across columns and various other errors, but I think the quality is good enough for searching." ] }, { "cell_type": "code", "execution_count": 36, "id": "c51d719d-e205-4f74-9b2b-2bd797406f36", "metadata": {}, "outputs": [], "source": [ "from pathlib import Path\n", "\n", "import pytesseract\n", "from natsort import natsorted, ns\n", "from PIL import Image" ] }, { "cell_type": "code", "execution_count": null, "id": "d6dd3bd7-5a48-45d3-80f9-145b17aaffc9", "metadata": { "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "# Get a list of volumes\n", "vols = natsorted(\n", "    [d for d in Path(\"tasmania\").glob(\"AUTAS*\") if d.is_dir()], alg=ns.PATH\n", ")\n", "\n", "# Loop through each volume\n", "for vol in vols:\n", "    print(vol.name)\n", "    # Create a directory for the OCR'd text\n", "    ocr_path = Path(vol, \"tesseract\")\n", "    ocr_path.mkdir(exist_ok=True)\n", "    # Loop through all the images in the volume\n", "    vol_images = natsorted(Path(vol, \"images\").glob(\"*.jpg\"), alg=ns.PATH)\n", "    for img_file in vol_images:\n", "        with Image.open(img_file) as img:\n", "            # Extract the text from the image\n", "            # This is the simplest text-extraction method; you can get a lot more info about positions if you need it.\n", "            text = pytesseract.image_to_string(img)\n", "            # Save the text (explicit UTF-8 so non-ASCII characters are written consistently across platforms)\n", "            Path(ocr_path, f\"{img_file.stem}.txt\").write_text(text, encoding=\"utf-8\")" ] }, { "cell_type": "markdown", "id": "dffaa91b-6dc1-4b50-bb21-7e79fc1d316d", "metadata": {}, "source": [ "----\n", "Created by [Tim Sherratt](https://timsherratt.org/) for the [GLAM Workbench](https://glam-workbench.net/) as part of the [Everyday Heritage](https://everydayheritage.au/) project." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.9.9 64-bit", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.9" }, "vscode": { "interpreter": { "hash": "e7370f93d1d0cde622a1f8e1c04877d8463912d04d973331ad4851f04de6915a" } }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 5 }