{ "cells": [ { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "# Make composite images from lots of Trove newspaper thumbnails\n", "\n", "This notebook starts with a search in Trove's newspapers. It uses the Trove API to work its way through the search results. For each article it creates a thumbnail image using the [code from this notebook](Get-article-thumbnail.ipynb). Once this first stage is finished, you have a directory full of thumbnails.\n", "\n", "The next stage takes all those thumbnails and pastes them one by one into a BIG image to create a composite, or mosaic.\n", "\n", "You'll need to think carefully about the number of results in your search, and the size of the image you want to create. Harvesting all the thumbnails can take a long time.\n", "\n", "You also need to set a path to a font file, so it's probably best to run this notebook on your local machine rather than in a cloud service; that gives you more control over things like fonts. 
You might also need to adjust the font size depending on the font you choose.\n", "\n", "Some examples:\n", "\n", "* [White Australia Policy](https://easyzoom.com/image/139535)\n", "* [Australian aviators, pilots, flyers, and airmen](https://www.easyzoom.com/imageaccess/9d26953ccdf5475cad9c11f308cd7988)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "import re\n", "from io import BytesIO\n", "from pathlib import Path\n", "\n", "import requests\n", "from bs4 import BeautifulSoup\n", "from dotenv import load_dotenv\n", "from PIL import Image, ImageDraw, ImageFont\n", "from requests.adapters import HTTPAdapter\n", "from requests.packages.urllib3.util.retry import Retry\n", "from tqdm.auto import tqdm\n", "\n", "s = requests.Session()\n", "retries = Retry(total=5, backoff_factor=1, status_forcelist=[502, 503, 504])\n", "s.mount(\"https://\", HTTPAdapter(max_retries=retries))\n", "s.mount(\"http://\", HTTPAdapter(max_retries=retries))\n", "\n", "Path(\"thumbs\").mkdir(exist_ok=True)\n", "load_dotenv()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Set your parameters\n", "\n", "Edit the values below as required." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "# Path to a TrueType font file. The second assignment overrides the first,\n", "# so keep only the line that suits your system (macOS example, then Linux example).\n", "font_path = \"/Library/Fonts/Courier New.ttf\"\n", "font_path = \"/usr/share/fonts/truetype/freefont/FreeMono.ttf\"\n", "font_size = 12\n", "# Insert your search query below\n", "query = 'title:\"white australia policy\" date:[1960 TO 1969]'\n", "\n", "size = 200  # Size of the thumbnails\n", "cols = 90  # The width of the final image will be cols x size\n", "rows = 55  # The height of the final image will be rows x size\n", "\n", "# Insert your Trove API key\n", "api_key = \"YOUR API KEY\"\n", "\n", "# Use api key value from environment variables if it is available\n", "if os.getenv(\"TROVE_API_KEY\"):\n", "    api_key = os.getenv(\"TROVE_API_KEY\")\n", "\n", "headers = {\"X-API-KEY\": api_key}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Define some functions" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "def get_article_top(article_url):\n", "    \"\"\"\n", "    Positional information about the article is attached to each line of the OCR output in data attributes.\n", "    This function loads the HTML version of the article and scrapes the x, y, and width values for the\n", "    top line of text (i.e. the top of the article).\n", "    \"\"\"\n", "    # Use the shared session so that failed requests are retried\n", "    response = s.get(article_url, timeout=60)\n", "    soup = BeautifulSoup(response.text, \"lxml\")\n", "    # Lines of OCR are in divs with the class 'zone'\n", "    # 'onPage' limits to those on the current page\n", "    zones = soup.select(\"div.zone.onPage\")\n", "    # Start with the first element, but...\n", "    top_element = zones[0]\n", "    top_y = int(top_element[\"data-y\"])\n", "    # Illustrations might come after text even if they're above them on the page\n", "    # So loop through the zones to find the element with the lowest 'y' attribute\n", "    for zone in 
zones:\n", "        if int(zone[\"data-y\"]) < top_y:\n", "            top_y = int(zone[\"data-y\"])\n", "            top_element = zone\n", "    top_x = int(top_element[\"data-x\"])\n", "    top_w = int(top_element[\"data-w\"])\n", "    return {\"x\": top_x, \"y\": top_y, \"w\": top_w}\n", "\n", "\n", "def get_thumbnail(article, size, font_path, font_size):\n", "    \"\"\"\n", "    Crop a square thumbnail from the top of an article and label it with the article id.\n", "    Returns None if the thumbnail can't be created.\n", "    \"\"\"\n", "    buffer = 0\n", "    try:\n", "        page_id = re.search(r\"news-page(\\d+)\", article[\"trovePageUrl\"]).group(1)\n", "    except (AttributeError, KeyError):\n", "        thumb = None\n", "    else:\n", "        # Get position of top line of article\n", "        article_top = get_article_top(article[\"troveUrl\"])\n", "        # Construct the url we need to download the image\n", "        page_url = (\n", "            \"https://trove.nla.gov.au/ndp/imageservice/nla.news-page{}/level{}\".format(\n", "                page_id, 7\n", "            )\n", "        )\n", "        # Download the page image\n", "        response = s.get(page_url, timeout=120)\n", "        # Open download as an image for editing\n", "        img = Image.open(BytesIO(response.content))\n", "        # Use coordinates of top line to create a square box to crop thumbnail\n", "        box = (\n", "            article_top[\"x\"] - buffer,\n", "            article_top[\"y\"] - buffer,\n", "            article_top[\"x\"] + article_top[\"w\"] + buffer,\n", "            article_top[\"y\"] + article_top[\"w\"] + buffer,\n", "        )\n", "        try:\n", "            # Crop image to create thumb\n", "            thumb = img.crop(box)\n", "        except OSError:\n", "            thumb = None\n", "        else:\n", "            # Resize thumb (Image.LANCZOS replaces Image.ANTIALIAS, which was removed in Pillow 10)\n", "            thumb.thumbnail((size, size), Image.LANCZOS)\n", "            article_id = \"nla.news-article{}\".format(article[\"id\"])\n", "            fnt = ImageFont.truetype(font_path, font_size)\n", "            draw = ImageDraw.Draw(thumb)\n", "            try:\n", "                # Try RGB fill values first\n", "                draw.rectangle(\n", "                    [(0, size - font_size), (size, size)], fill=(255, 255, 255, 255)\n", "                )\n", "                draw.text(\n", "                    (0, size - font_size), article_id, font=fnt, fill=(0, 0, 0, 255)\n", "                )\n", "            except TypeError:\n", "                # Must be grayscale\n", "                draw.rectangle([(0, size - font_size), (size, size)], fill=(255))\n", "                draw.text((0, size - font_size), article_id, font=fnt, fill=(0))\n", "    return 
thumb\n", "\n", "\n", "def get_total_results(params):\n", "    \"\"\"\n", "    Get the total number of results for a search.\n", "    \"\"\"\n", "    these_params = params.copy()\n", "    these_params[\"n\"] = 0\n", "    response = s.get(\n", "        \"https://api.trove.nla.gov.au/v3/result\",\n", "        params=these_params,\n", "        headers=headers,\n", "        timeout=60,\n", "    )\n", "    # print(response.url)\n", "    data = response.json()\n", "    return int(data[\"category\"][0][\"records\"][\"total\"])\n", "\n", "\n", "def get_thumbnails(query, size, font_path, font_size):\n", "    \"\"\"\n", "    Harvest the search results and save a thumbnail for each article in the 'thumbs' directory.\n", "    \"\"\"\n", "    # im = Image.new('RGB', (cols*size, rows*size))\n", "    params = {\n", "        \"q\": query,\n", "        \"category\": \"newspaper\",\n", "        \"l-artType\": \"newspaper\",\n", "        \"encoding\": \"json\",\n", "        \"bulkHarvest\": \"true\",\n", "        \"n\": 100,\n", "        \"reclevel\": \"full\",\n", "    }\n", "    start = \"*\"\n", "    total = get_total_results(params)\n", "    with tqdm(total=total) as pbar:\n", "        while start:\n", "            params[\"s\"] = start\n", "            response = s.get(\n", "                \"https://api.trove.nla.gov.au/v3/result\",\n", "                params=params,\n", "                headers=headers,\n", "                timeout=60,\n", "            )\n", "            data = response.json()\n", "            # The nextStart parameter is used to get the next page of results.\n", "            # If there's no nextStart then it means we're on the last page of results.\n", "            try:\n", "                start = data[\"category\"][0][\"records\"][\"nextStart\"]\n", "            except KeyError:\n", "                start = None\n", "            for article in data[\"category\"][0][\"records\"][\"article\"]:\n", "                thumb_file = \"thumbs/{}-nla.news-article{}.jpg\".format(\n", "                    article[\"date\"], article[\"id\"]\n", "                )\n", "                # Skip articles that already have a thumbnail, so an interrupted harvest can be resumed\n", "                if not os.path.exists(thumb_file):\n", "                    thumb = get_thumbnail(article, size, font_path, font_size)\n", "                    if thumb:\n", "                        thumb.save(thumb_file)\n", "                pbar.update(1)\n", "\n", "\n", "def create_composite(cols, rows, size):\n", "    \"\"\"\n", "    Paste the saved thumbnails into a single image, filling it row by row.\n", "    \"\"\"\n", "    im = Image.new(\"RGB\", (cols * size, rows * size))\n", "    thumbs = [t for t in os.listdir(\"thumbs\") if t[-4:] == \".jpg\"]\n",
"    # Filenames start with the article date, so uncomment the next line to sort by date\n", "    # thumbs = sorted(thumbs)\n", "    x = 0\n", "    y = 0\n", "    for thumb_file in tqdm(thumbs):\n", "        thumb = Image.open(\"thumbs/{}\".format(thumb_file))\n", "        try:\n", "            im.paste(thumb, (x, y, x + size, y + size))\n", "        except ValueError:\n", "            # Skip thumbnails that aren't exactly size x size\n", "            pass\n", "        else:\n", "            # Move along the row, wrapping to the next row once it is full.\n", "            # Tracking position rather than a counter keeps rows aligned when a thumbnail is skipped.\n", "            x += size\n", "            if x >= cols * size:\n", "                x = 0\n", "                y += size\n", "    im.save(\"composite-{}-{}.jpg\".format(cols, rows), quality=90)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create all the thumbnails" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "get_thumbnails(query, size, font_path, font_size)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Turn the thumbnails into one big image" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "create_composite(cols, rows, size)" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "----\n", "\n", "Created by [Tim Sherratt](https://timsherratt.org/) for the [GLAM Workbench](https://glam-workbench.github.io/).  \n", "Support this project by becoming a [GitHub sponsor](https://github.com/sponsors/wragge?o=esb)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" }, "rocrate": { "author": [ { "mainEntityOfPage": "https://timsherratt.au", "name": "Sherratt, Tim", "orcid": "https://orcid.org/0000-0001-7956-4498" } ], "description": "This notebook starts with a search in Trove's newspapers. It uses the Trove API to work its way through the search results. 
For each article it creates a thumbnail image using the code from this notebook. Once this first stage is finished, you have a directory full of lots of thumbnails. The next stage takes all those thumbnails and pastes them one by one into a BIG image to create a composite, or mosaic.", "mainEntityOfPage": "https://glam-workbench.net/trove-newspapers/Composite-thumbnails/", "name": "Make composite images from lots of Trove newspaper thumbnails" } }, "nbformat": 4, "nbformat_minor": 4 }