{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Make composite images from lots of Trove newspaper thumbnails\n",
    "\n",
    "This notebook starts with a search in Trove's newspapers. It uses the Trove API to work its way through the search results. For each article it creates a thumbnail image using the [code from this notebook](Get-article-thumbnail.ipynb). Once this first stage is finished, you'll have a directory full of thumbnails.\n",
    "\n",
    "The next stage takes all those thumbnails and pastes them one by one into a BIG image to create a composite, or mosaic.\n",
    "\n",
    "You'll need to think carefully about the number of results in your search, and the size of the image you want to create. Harvesting all the thumbnails can take a long time.\n",
    "\n",
    "Also, you need to be able to set a path to a font file, so it's probably best to run this notebook on your local machine rather than in a cloud service. That gives you more control over things like fonts. You might also need to adjust the font size depending on the font you choose.\n",
    "\n",
    "Some examples:\n",
    "\n",
    "* [White Australia Policy](https://easyzoom.com/image/139535)\n",
    "* [Australian aviators, pilots, flyers, and airmen](https://www.easyzoom.com/imageaccess/9d26953ccdf5475cad9c11f308cd7988)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import ipywidgets as widgets\n",
    "import requests\n",
    "import random\n",
    "import re\n",
    "import os\n",
    "from IPython.display import display, HTML, FileLink, clear_output\n",
    "from bs4 import BeautifulSoup\n",
    "from PIL import Image, ImageDraw, ImageFont\n",
    "from io import BytesIO\n",
    "from requests.adapters import HTTPAdapter\n",
    "from urllib3.util.retry import Retry\n",
    "from tqdm.auto import tqdm\n",
    "\n",
    "# Use a session that retries requests on temporary server errors\n",
    "s = requests.Session()\n",
    "retries = Retry(total=5, backoff_factor=1, status_forcelist=[ 502, 503, 504 ])\n",
    "s.mount('https://', HTTPAdapter(max_retries=retries))\n",
    "s.mount('http://', HTTPAdapter(max_retries=retries))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Set your parameters\n",
    "\n",
    "Edit the values below as required."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "font_path = '/Library/Fonts/Courier New.ttf'\n",
    "font_size = 12\n",
    "# Insert your search query\n",
    "query = 'title:\"white australia policy\"'\n",
    "# Insert your Trove API key\n",
    "api_key = '6pi5hht0d2umqcro'\n",
    "size = 200 # Size of the thumbnails\n",
    "cols = 90 # The width of the final image will be cols x size\n",
    "rows = 55 # The height of the final image will be rows x size"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Define some functions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_article_top(article_url):\n",
    "    '''\n",
    "    Positional information about the article is attached to each line of the OCR output in data attributes.\n",
    "    This function loads the HTML version of the article and scrapes the x, y, and width values for the\n",
    "    top line of text (ie the top of the article).\n",
    "    '''\n",
    "    response = requests.get(article_url)\n",
    "    soup = BeautifulSoup(response.text, 'lxml')\n",
    "    # Lines of OCR are in divs with the class 'zone'\n",
    "    # 'onPage' limits to those on the current page\n",
    "    zones = soup.select('div.zone.onPage')\n",
    "    # Start with the first element, but...\n",
    "    top_element = zones[0]\n",
    "    top_y = int(top_element['data-y'])\n",
    "    # Illustrations might come after text even if they're above them on the page\n",
    "    # So loop through the zones to find the element with the lowest 'y' attribute\n",
    "    for zone in zones:\n",
    "        if int(zone['data-y']) < top_y:\n",
    "            top_y = int(zone['data-y'])\n",
    "            top_element = zone\n",
    "    top_x = int(top_element['data-x'])\n",
    "    top_w = int(top_element['data-w'])\n",
    "    return {'x': top_x, 'y': top_y, 'w': top_w}\n",
    "\n",
    "def get_thumbnail(article, size, font_path, font_size):\n",
    "    '''\n",
    "    Create a labelled, square thumbnail from the top of an article.\n",
    "    Returns None if the page id can't be found or the image can't be cropped.\n",
    "    '''\n",
    "    buffer = 0\n",
    "    try:\n",
    "        page_id = re.search(r'page\\/(\\d+)', article['trovePageUrl']).group(1)\n",
    "    except (AttributeError, KeyError):\n",
    "        thumb = None\n",
    "    else:\n",
    "        # Get position of top line of article\n",
    "        article_top = get_article_top(article['troveUrl'])\n",
    "        # Construct the url we need to download the image\n",
    "        page_url = 'https://trove.nla.gov.au/ndp/imageservice/nla.news-page{}/level{}'.format(page_id, 7)\n",
    "        # Download the page image\n",
    "        response = s.get(page_url, timeout=120)\n",
    "        # Open download as an image for editing\n",
    "        img = Image.open(BytesIO(response.content))\n",
    "        # Use coordinates of top line to create a square box to crop thumbnail\n",
    "        box = (article_top['x'] - buffer, article_top['y'] - buffer, article_top['x'] + article_top['w'] + buffer, article_top['y'] + article_top['w'] + buffer)\n",
    "        try:\n",
    "            # Crop image to create thumb\n",
    "            thumb = img.crop(box)\n",
    "        except OSError:\n",
    "            thumb = None\n",
    "        else:\n",
    "            # Resize thumb (LANCZOS is the current name for the old ANTIALIAS filter)\n",
    "            thumb.thumbnail((size, size), Image.LANCZOS)\n",
    "            article_id = 'nla.news-article{}'.format(article['id'])\n",
    "            fnt = ImageFont.truetype(font_path, font_size)\n",
    "            draw = ImageDraw.Draw(thumb)\n",
    "            try:\n",
    "                # Check if RGB\n",
    "                draw.rectangle([(0, size-10), (size, size)], fill=(255, 255, 255, 255))\n",
    "                draw.text((0,size-10), article_id, font=fnt, fill=(0, 0, 0, 255))\n",
    "            except TypeError:\n",
    "                # Must be grayscale\n",
    "                draw.rectangle([(0, size-10), (size, size)], fill=(255))\n",
    "                draw.text((0,size-10), article_id, font=fnt, fill=(0))\n",
    "    return thumb\n",
    "\n",
    "def get_total_results(params):\n",
    "    '''\n",
    "    Get the total number of results for a search.\n",
    "    '''\n",
    "    these_params = params.copy()\n",
    "    these_params['n'] = 0\n",
    "    response = s.get('https://api.trove.nla.gov.au/v2/result', params=these_params, timeout=60)\n",
    "    # print(response.url)\n",
    "    data = response.json()\n",
    "    return int(data['response']['zone'][0]['records']['total'])\n",
    "\n",
    "def get_thumbnails(query, api_key, size, font_path, font_size):\n",
    "    '''\n",
    "    Harvest the search results and save a thumbnail for each article in the 'thumbs' directory.\n",
    "    '''\n",
    "    # Make sure the output directory exists before saving anything\n",
    "    os.makedirs('thumbs', exist_ok=True)\n",
    "    params = {\n",
    "        'q': query,\n",
    "        'zone': 'newspaper',\n",
    "        'encoding': 'json',\n",
    "        'bulkHarvest': 'true',\n",
    "        'n': 100,\n",
    "        'key': api_key,\n",
    "        'reclevel': 'full'\n",
    "    }\n",
    "    start = '*'\n",
    "    total = get_total_results(params)\n",
    "    with tqdm(total=total) as pbar:\n",
    "        while start:\n",
    "            params['s'] = start\n",
    "            response = s.get('https://api.trove.nla.gov.au/v2/result', params=params, timeout=60)\n",
    "            data = response.json()\n",
    "            # The nextStart parameter is used to get the next page of results.\n",
    "            # If there's no nextStart then it means we're on the last page of results.\n",
    "            try:\n",
    "                start = data['response']['zone'][0]['records']['nextStart']\n",
    "            except KeyError:\n",
    "                start = None\n",
    "            for article in data['response']['zone'][0]['records']['article']:\n",
    "                thumb_file = 'thumbs/{}-nla.news-article{}.jpg'.format(article['date'], article['id'])\n",
    "                # Skip articles that already have a thumbnail, so the harvest can be restarted\n",
    "                if not os.path.exists(thumb_file):\n",
    "                    try:\n",
    "                        # Get page id\n",
    "                        page_id = re.search(r'page\\/(\\d+)', article['trovePageUrl']).group(1)\n",
    "                    except (AttributeError, KeyError):\n",
    "                        pass\n",
    "                    else:\n",
    "                        thumb = get_thumbnail(article, size, font_path, font_size)\n",
    "                        if thumb:\n",
    "                            thumb.save(thumb_file)\n",
    "                pbar.update(1)\n",
    "\n",
    "def create_composite(cols, rows, size):\n",
    "    '''\n",
    "    Paste the saved thumbnails into one big image, filling each row from left to right.\n",
    "    '''\n",
    "    im = Image.new('RGB', (cols*size, rows*size))\n",
    "    thumbs = [t for t in os.listdir('thumbs') if t[-4:] == '.jpg']\n",
    "    # Uncomment the next line to sort the thumbnails by date\n",
    "    # thumbs = sorted(thumbs)\n",
    "    x = 0\n",
    "    y = 0\n",
    "    pasted = 0\n",
    "    for thumb_file in tqdm(thumbs):\n",
    "        thumb = Image.open('thumbs/{}'.format(thumb_file))\n",
    "        try:\n",
    "            im.paste(thumb, (x, y, x+size, y+size))\n",
    "        except ValueError:\n",
    "            # Skip thumbnails that aren't exactly size x size\n",
    "            pass\n",
    "        else:\n",
    "            # Only count thumbnails that were pasted, so rows wrap in the right place\n",
    "            pasted += 1\n",
    "            if (pasted % cols) == 0:\n",
    "                x = 0\n",
    "                y += size\n",
    "            else:\n",
    "                x += size\n",
    "    im.save('composite-{}-{}.jpg'.format(cols, rows), quality=90)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create all the thumbnails"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "get_thumbnails(query, api_key, size, font_path, font_size)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Turn the thumbnails into one big image"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "create_composite(cols, rows, size)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "----\n",
    "\n",
    "Created by [Tim Sherratt](https://timsherratt.org/) for the [GLAM Workbench](https://glam-workbench.github.io/)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}