{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Make composite images from lots of Trove newspaper thumbnails\n",
    "\n",
    "This notebook starts with a search in Trove's newspapers. It uses the Trove API to work its way through the search results. For each article it creates a thumbnail image using the [code from this notebook](Get-article-thumbnail.ipynb). Once this first stage is finished, you have a directory full of thumbnails.\n",
    "\n",
    "The next stage takes all those thumbnails and pastes them one by one into a BIG image to create a composite, or mosaic.\n",
    "\n",
    "You'll need to think carefully about the number of results in your search, and the size of the image you want to create. Harvesting all the thumbnails can take a long time.\n",
    "\n",
    "You also need to set a path to a font file, so it's probably best to run this notebook on your local machine rather than in a cloud service; that way you have more control over things like fonts. You might also need to adjust the font size depending on the font you choose.\n",
    "\n",
    "Some examples:\n",
    "\n",
    "* [White Australia Policy](https://easyzoom.com/image/139535)\n",
    "* [Australian aviators, pilots, flyers, and airmen](https://www.easyzoom.com/imageaccess/9d26953ccdf5475cad9c11f308cd7988)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import re\n",
    "from io import BytesIO\n",
    "from pathlib import Path\n",
    "\n",
    "import requests\n",
    "from bs4 import BeautifulSoup\n",
    "from PIL import Image, ImageDraw, ImageFont\n",
    "from requests.adapters import HTTPAdapter\n",
    "from tqdm.auto import tqdm\n",
    "from urllib3.util.retry import Retry\n",
    "\n",
    "# Use a session that retries automatically on temporary server errors\n",
    "s = requests.Session()\n",
    "retries = Retry(total=5, backoff_factor=1, status_forcelist=[502, 503, 504])\n",
    "s.mount(\"https://\", HTTPAdapter(max_retries=retries))\n",
    "s.mount(\"http://\", HTTPAdapter(max_retries=retries))\n",
    "\n",
    "Path(\"thumbs\").mkdir(exist_ok=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%capture\n",
    "# Load variables from the .env file if it exists\n",
    "# Use %%capture to suppress messages\n",
    "%load_ext dotenv\n",
    "%dotenv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Set your parameters\n",
    "\n",
    "Edit the values below as required."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# The last font_path assignment wins, so comment out the one you don't need\n",
    "font_path = \"/Library/Fonts/Courier New.ttf\"  # macOS\n",
    "font_path = \"/usr/share/fonts/truetype/freefont/FreeMono.ttf\"  # Linux\n",
    "font_size = 12\n",
    "# Insert your search query below\n",
    "query = 'title:\"white australia policy\" date:[1960 TO 1969]'\n",
    "\n",
    "size = 200  # Size of the thumbnails\n",
    "cols = 90  # The width of the final image will be cols x size\n",
    "rows = 55  # The height of the final image will be rows x size\n",
    "\n",
    "# Insert your Trove API key\n",
    "api_key = \"YOUR API KEY\"\n",
    "\n",
    "# Use api key value from environment variables if it is available\n",
    "if os.getenv(\"TROVE_API_KEY\"):\n",
    "    api_key = os.getenv(\"TROVE_API_KEY\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Define some functions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_article_top(article_url):\n",
    "    \"\"\"\n",
    "    Positional information about the article is attached to each line of the OCR output in data attributes.\n",
    "    This function loads the HTML version of the article and scrapes the x, y, and width values for the\n",
    "    top line of text (ie the top of the article).\n",
    "    \"\"\"\n",
    "    # Use the shared session so this request gets the same retry behaviour\n",
    "    response = s.get(article_url, timeout=60)\n",
    "    soup = BeautifulSoup(response.text, \"lxml\")\n",
    "    # Lines of OCR are in divs with the class 'zone'\n",
    "    # 'onPage' limits to those on the current page\n",
    "    zones = soup.select(\"div.zone.onPage\")\n",
    "    # Start with the first element, but...\n",
    "    top_element = zones[0]\n",
    "    top_y = int(top_element[\"data-y\"])\n",
    "    # Illustrations might come after text even if they're above them on the page\n",
    "    # So loop through the zones to find the element with the lowest 'y' attribute\n",
    "    for zone in zones:\n",
    "        if int(zone[\"data-y\"]) < top_y:\n",
    "            top_y = int(zone[\"data-y\"])\n",
    "            top_element = zone\n",
    "    top_x = int(top_element[\"data-x\"])\n",
    "    top_w = int(top_element[\"data-w\"])\n",
    "    return {\"x\": top_x, \"y\": top_y, \"w\": top_w}\n",
    "\n",
    "\n",
    "def get_thumbnail(article, size, font_path, font_size):\n",
    "    \"\"\"\n",
    "    Create a square thumbnail from the top of an article and label it with the article id.\n",
    "    Returns None if the page id can't be found or the crop fails.\n",
    "    \"\"\"\n",
    "    buffer = 0\n",
    "    try:\n",
    "        page_id = re.search(r\"page\\/(\\d+)\", article[\"trovePageUrl\"]).group(1)\n",
    "    except (AttributeError, KeyError):\n",
    "        thumb = None\n",
    "    else:\n",
    "        # Get position of top line of article\n",
    "        article_top = get_article_top(article[\"troveUrl\"])\n",
    "        # Construct the url we need to download the image\n",
    "        page_url = (\n",
    "            \"https://trove.nla.gov.au/ndp/imageservice/nla.news-page{}/level{}\".format(\n",
    "                page_id, 7\n",
    "            )\n",
    "        )\n",
    "        # Download the page image\n",
    "        response = s.get(page_url, timeout=120)\n",
    "        # Open download as an image for editing\n",
    "        img = Image.open(BytesIO(response.content))\n",
    "        # Use coordinates of top line to create a square box to crop thumbnail\n",
    "        box = (\n",
    "            article_top[\"x\"] - buffer,\n",
    "            article_top[\"y\"] - buffer,\n",
    "            article_top[\"x\"] + article_top[\"w\"] + buffer,\n",
    "            article_top[\"y\"] + article_top[\"w\"] + buffer,\n",
    "        )\n",
    "        try:\n",
    "            # Crop image to create thumb\n",
    "            thumb = img.crop(box)\n",
    "        except OSError:\n",
    "            thumb = None\n",
    "        else:\n",
    "            # Resize thumb (Image.ANTIALIAS was removed in Pillow 10; LANCZOS is the same filter)\n",
    "            thumb.thumbnail((size, size), Image.LANCZOS)\n",
    "            article_id = \"nla.news-article{}\".format(article[\"id\"])\n",
    "            fnt = ImageFont.truetype(font_path, font_size)\n",
    "            draw = ImageDraw.Draw(thumb)\n",
    "            # Draw a white strip along the bottom and write the article id on it\n",
    "            try:\n",
    "                # Check if RGB\n",
    "                draw.rectangle(\n",
    "                    [(0, size - font_size), (size, size)], fill=(255, 255, 255, 255)\n",
    "                )\n",
    "                draw.text(\n",
    "                    (0, size - font_size), article_id, font=fnt, fill=(0, 0, 0, 255)\n",
    "                )\n",
    "            except TypeError:\n",
    "                # Must be grayscale\n",
    "                draw.rectangle([(0, size - font_size), (size, size)], fill=255)\n",
    "                draw.text((0, size - font_size), article_id, font=fnt, fill=0)\n",
    "    return thumb\n",
    "\n",
    "\n",
    "def get_total_results(params):\n",
    "    \"\"\"\n",
    "    Get the total number of results for a search.\n",
    "    \"\"\"\n",
    "    these_params = params.copy()\n",
    "    these_params[\"n\"] = 0\n",
    "    response = s.get(\n",
    "        \"https://api.trove.nla.gov.au/v2/result\", params=these_params, timeout=60\n",
    "    )\n",
    "    # print(response.url)\n",
    "    data = response.json()\n",
    "    return int(data[\"response\"][\"zone\"][0][\"records\"][\"total\"])\n",
    "\n",
    "\n",
    "def get_thumbnails(query, api_key, size, font_path, font_size):\n",
    "    \"\"\"\n",
    "    Harvest the search results and save a thumbnail for each article in the 'thumbs' directory.\n",
    "    \"\"\"\n",
    "    params = {\n",
    "        \"q\": query,\n",
    "        \"zone\": \"newspaper\",\n",
    "        \"encoding\": \"json\",\n",
    "        \"bulkHarvest\": \"true\",\n",
    "        \"n\": 100,\n",
    "        \"key\": api_key,\n",
    "        \"reclevel\": \"full\",\n",
    "    }\n",
    "    start = \"*\"\n",
    "    total = get_total_results(params)\n",
    "    with tqdm(total=total) as pbar:\n",
    "        while start:\n",
    "            params[\"s\"] = start\n",
    "            response = s.get(\n",
    "                \"https://api.trove.nla.gov.au/v2/result\", params=params, timeout=60\n",
    "            )\n",
    "            data = response.json()\n",
    "            # The nextStart parameter is used to get the next page of results.\n",
    "            # If there's no nextStart then it means we're on the last page of results.\n",
    "            try:\n",
    "                start = data[\"response\"][\"zone\"][0][\"records\"][\"nextStart\"]\n",
    "            except KeyError:\n",
    "                start = None\n",
    "            for article in data[\"response\"][\"zone\"][0][\"records\"][\"article\"]:\n",
    "                thumb_file = \"thumbs/{}-nla.news-article{}.jpg\".format(\n",
    "                    article[\"date\"], article[\"id\"]\n",
    "                )\n",
    "                # Skip articles that already have a thumbnail, so an interrupted harvest can be resumed\n",
    "                if not os.path.exists(thumb_file):\n",
    "                    thumb = get_thumbnail(article, size, font_path, font_size)\n",
    "                    if thumb:\n",
    "                        thumb.save(thumb_file)\n",
    "                pbar.update(1)\n",
    "\n",
    "\n",
    "def create_composite(cols, rows, size):\n",
    "    \"\"\"\n",
    "    Paste the saved thumbnails, row by row, into one big image.\n",
    "    \"\"\"\n",
    "    im = Image.new(\"RGB\", (cols * size, rows * size))\n",
    "    thumbs = [t for t in os.listdir(\"thumbs\") if t[-4:] == \".jpg\"]\n",
    "    # Filenames start with the article date, so uncomment the next line to sort by date\n",
    "    # thumbs = sorted(thumbs)\n",
    "    # Count only the thumbnails actually pasted, so a skipped file doesn't shorten a row\n",
    "    placed = 0\n",
    "    for thumb_file in tqdm(thumbs):\n",
    "        thumb = Image.open(\"thumbs/{}\".format(thumb_file))\n",
    "        x = (placed % cols) * size\n",
    "        y = (placed // cols) * size\n",
    "        try:\n",
    "            im.paste(thumb, (x, y, x + size, y + size))\n",
    "        except ValueError:\n",
    "            # Raised when the thumbnail isn't exactly size x size, so skip it\n",
    "            continue\n",
    "        placed += 1\n",
    "    im.save(\"composite-{}-{}.jpg\".format(cols, rows), quality=90)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create all the thumbnails"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "get_thumbnails(query, api_key, size, font_path, font_size)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Turn the thumbnails into one big image"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "create_composite(cols, rows, size)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "----\n",
    "\n",
    "Created by [Tim Sherratt](https://timsherratt.org/) for the [GLAM Workbench](https://glam-workbench.github.io/).  \n",
    "Support this project by becoming a [GitHub sponsor](https://github.com/sponsors/wragge?o=esb)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}