{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "3a492270-78da-4a2a-9252-d353b6a2ac48",
   "metadata": {},
   "source": [
    "# Extract text from PDF images using Tesseract\n",
    "\n",
    "Although I was able to [extract text from the PDFs directly](tas-pod-save-text-images.ipynb), I wasn't happy with the quality. In particular, column layout detection was quite variable, munging values from different columns together. After a few tests, I decided that re-OCRing the images using [Tesseract](https://pypi.org/project/pytesseract/) would produce better results. Tesseract's automatic page layout detection does a pretty good job of identifying the columns, and the OCR quality in general seems better. There's still some munging of values across columns and various other errors, but I think the quality is good enough for searching."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "id": "c51d719d-e205-4f74-9b2b-2bd797406f36",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "\n",
    "import pytesseract\n",
    "from natsort import natsorted, ns\n",
    "from PIL import Image"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d6dd3bd7-5a48-45d3-80f9-145b17aaffc9",
   "metadata": {
    "tags": [
     "nbval-skip"
    ]
   },
   "outputs": [],
   "source": [
    "# Get a list of volumes\n",
    "vols = natsorted(\n",
    "    [d for d in Path(\"tasmania\").glob(\"AUTAS*\") if d.is_dir()], alg=ns.PATH\n",
    ")\n",
    "\n",
    "# Loop through each volume\n",
    "for vol in vols:\n",
    "    print(vol.name)\n",
    "    # Create a directory for the OCRd text\n",
    "    ocr_path = Path(vol, \"tesseract\")\n",
    "    ocr_path.mkdir(exist_ok=True)\n",
    "    # Loop through all the images in the volume\n",
    "    vol_images = natsorted(Path(vol, \"images\").glob(\"*.jpg\"), alg=ns.PATH)\n",
    "    for img_file in vol_images:\n",
    "        with Image.open(img_file) as img:\n",
    "            # Extract the text from the image\n",
    "            # This is the simplest text-extraction method, you can get a lot more info about positions if you need it.\n",
    "            text = pytesseract.image_to_string(img)\n",
    "            # Save the text\n",
    "            Path(ocr_path, f\"{img_file.stem}.txt\").write_text(text)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dffaa91b-6dc1-4b50-bb21-7e79fc1d316d",
   "metadata": {},
   "source": [
    "----\n",
    "Created by [Tim Sherratt](https://timsherratt.org/) for the [GLAM Workbench](https://glam-workbench.net/) as part of the [Everyday Heritage](https://everydayheritage.au/) project."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.9.9 64-bit",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.9"
  },
  "vscode": {
   "interpreter": {
    "hash": "e7370f93d1d0cde622a1f8e1c04877d8463912d04d973331ad4851f04de6915a"
   }
  },
  "widgets": {
   "application/vnd.jupyter.widget-state+json": {
    "state": {},
    "version_major": 2,
    "version_minor": 0
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}