{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "source": [
    "# Make composite images from lots of Trove newspaper thumbnails\n",
    "\n",
    "This notebook starts with a search in Trove's newspapers. It uses the Trove API to work it's way through the search results. For each article it creates a thumbnail image using the [code from this notebook](Get-article-thumbnail.ipynb). Once this first stage is finished, you have a directory full of lots of thumbnails.\n",
    "\n",
    "The next stage takes all those thumbnails and pastes them one by one into a BIG image to create a composite, or mosaic.\n",
    "\n",
    "You'll need to think carefully about the number of results in your search, and the size of the image you want to create. Harvesting all the thumbnails can take a long time.\n",
    "\n",
    "Also, you need to be able to set a path to a font file, so it's probably best to run this notebook on your local machine rather than in a cloud service, so you have more control over things like font. You might also need to adjust the font size depending on the font you choose.\n",
    "\n",
    "Some examples:\n",
    "\n",
    "* [White Australia Policy](https://easyzoom.com/image/139535)\n",
    "* [Australian aviators, pilots, flyers, and airmen](https://www.easyzoom.com/imageaccess/9d26953ccdf5475cad9c11f308cd7988)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import re\n",
    "from io import BytesIO\n",
    "from pathlib import Path\n",
    "\n",
    "import requests\n",
    "from bs4 import BeautifulSoup\n",
    "from dotenv import load_dotenv\n",
    "from PIL import Image, ImageDraw, ImageFont\n",
    "from requests.adapters import HTTPAdapter\n",
    "from requests.packages.urllib3.util.retry import Retry\n",
    "from tqdm.auto import tqdm\n",
    "\n",
    "s = requests.Session()\n",
    "retries = Retry(total=5, backoff_factor=1, status_forcelist=[502, 503, 504])\n",
    "s.mount(\"https://\", HTTPAdapter(max_retries=retries))\n",
    "s.mount(\"http://\", HTTPAdapter(max_retries=retries))\n",
    "\n",
    "Path(\"thumbs\").mkdir(exist_ok=True)\n",
    "load_dotenv()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Set your parameters\n",
    "\n",
    "Edit the values below as required."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "font_path = \"/Library/Fonts/Courier New.ttf\"\n",
    "font_path = \"/usr/share/fonts/truetype/freefont/FreeMono.ttf\"\n",
    "font_size = 12\n",
    "# Insert your search query below\n",
    "query = 'title:\"white australia policy\" date:[1960 TO 1969]'\n",
    "\n",
    "size = 200  # Size of the thumbnails\n",
    "cols = 90  # The width of the final image will be cols x size\n",
    "rows = 55  # The height of the final image will be cols x size\n",
    "\n",
    "# Insert your Trove API key\n",
    "api_key = \"YOUR API KEY\"\n",
    "\n",
    "# Use api key value from environment variables if it is available\n",
    "if os.getenv(\"TROVE_API_KEY\"):\n",
    "    api_key = os.getenv(\"TROVE_API_KEY\")\n",
    "\n",
    "headers = {\"X-API-KEY\": api_key}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Define some functions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "def get_article_top(article_url):\n",
    "    \"\"\"\n",
    "    Positional information about the article is attached to each line of the OCR output in data attributes.\n",
    "    This function loads the HTML version of the article and scrapes the x, y, and width values for the\n",
    "    top line of text (ie the top of the article).\n",
    "    \"\"\"\n",
    "    response = requests.get(article_url)\n",
    "    soup = BeautifulSoup(response.text, \"lxml\")\n",
    "    # Lines of OCR are in divs with the class 'zone'\n",
    "    # 'onPage' limits to those on the current page\n",
    "    zones = soup.select(\"div.zone.onPage\")\n",
    "    # Start with the first element, but...\n",
    "    top_element = zones[0]\n",
    "    top_y = int(top_element[\"data-y\"])\n",
    "    # Illustrations might come after text even if they're above them on the page\n",
    "    # So loop through the zones to find the element with the lowest 'y' attribute\n",
    "    for zone in zones:\n",
    "        if int(zone[\"data-y\"]) < top_y:\n",
    "            top_y = int(zone[\"data-y\"])\n",
    "            top_element = zone\n",
    "    top_x = int(top_element[\"data-x\"])\n",
    "    top_w = int(top_element[\"data-w\"])\n",
    "    return {\"x\": top_x, \"y\": top_y, \"w\": top_w}\n",
    "\n",
    "\n",
    "def get_thumbnail(article, size, font_path, font_size):\n",
    "    buffer = 0\n",
    "    try:\n",
    "        page_id = re.search(r\"news-page(\\d+)\", article[\"trovePageUrl\"]).group(1)\n",
    "    except (AttributeError, KeyError):\n",
    "        thumb = None\n",
    "    else:\n",
    "        # Get position of top line of article\n",
    "        article_top = get_article_top(article[\"troveUrl\"])\n",
    "        # Construct the url we need to download the image\n",
    "        page_url = (\n",
    "            \"https://trove.nla.gov.au/ndp/imageservice/nla.news-page{}/level{}\".format(\n",
    "                page_id, 7\n",
    "            )\n",
    "        )\n",
    "        # Download the page image\n",
    "        response = s.get(page_url, timeout=120)\n",
    "        # Open download as an image for editing\n",
    "        img = Image.open(BytesIO(response.content))\n",
    "        # Use coordinates of top line to create a square box to crop thumbnail\n",
    "        box = (\n",
    "            article_top[\"x\"] - buffer,\n",
    "            article_top[\"y\"] - buffer,\n",
    "            article_top[\"x\"] + article_top[\"w\"] + buffer,\n",
    "            article_top[\"y\"] + article_top[\"w\"] + buffer,\n",
    "        )\n",
    "        try:\n",
    "            # Crop image to create thumb\n",
    "            thumb = img.crop(box)\n",
    "        except OSError:\n",
    "            thumb = None\n",
    "        else:\n",
    "            # Resize thumb\n",
    "            thumb.thumbnail((size, size), Image.ANTIALIAS)\n",
    "            article_id = \"nla.news-article{}\".format(article[\"id\"])\n",
    "            fnt = ImageFont.truetype(font_path, 12)\n",
    "            draw = ImageDraw.Draw(thumb)\n",
    "            try:\n",
    "                # Check if RGB\n",
    "                draw.rectangle(\n",
    "                    [(0, size - 12), (size, size)], fill=(255, 255, 255, 255)\n",
    "                )\n",
    "                draw.text((0, size - 12), article_id, font=fnt, fill=(0, 0, 0, 255))\n",
    "            except TypeError:\n",
    "                # Must be grayscale\n",
    "                draw.rectangle([(0, size - 12), (200, 200)], fill=(255))\n",
    "                draw.text((0, size - 12), article_id, font=fnt, fill=(0))\n",
    "    return thumb\n",
    "\n",
    "\n",
    "def get_total_results(params):\n",
    "    \"\"\"\n",
    "    Get the total number of results for a search.\n",
    "    \"\"\"\n",
    "    these_params = params.copy()\n",
    "    these_params[\"n\"] = 0\n",
    "    response = s.get(\n",
    "        \"https://api.trove.nla.gov.au/v3/result\",\n",
    "        params=these_params,\n",
    "        headers=headers,\n",
    "        timeout=60,\n",
    "    )\n",
    "    # print(response.url)\n",
    "    data = response.json()\n",
    "    return int(data[\"category\"][0][\"records\"][\"total\"])\n",
    "\n",
    "\n",
    "def get_thumbnails(query, size, font_path, font_size):\n",
    "    # im = Image.new('RGB', (cols*size, rows*size))\n",
    "    params = {\n",
    "        \"q\": query,\n",
    "        \"category\": \"newspaper\",\n",
    "        \"l-artType\": \"newspaper\",\n",
    "        \"encoding\": \"json\",\n",
    "        \"bulkHarvest\": \"true\",\n",
    "        \"n\": 100,\n",
    "        \"reclevel\": \"full\",\n",
    "    }\n",
    "    start = \"*\"\n",
    "    total = get_total_results(params)\n",
    "    with tqdm(total=total) as pbar:\n",
    "        while start:\n",
    "            params[\"s\"] = start\n",
    "            response = s.get(\n",
    "                \"https://api.trove.nla.gov.au/v3/result\",\n",
    "                params=params,\n",
    "                headers=headers,\n",
    "                timeout=60,\n",
    "            )\n",
    "            data = response.json()\n",
    "            # The nextStart parameter is used to get the next page of results.\n",
    "            # If there's no nextStart then it means we're on the last page of results.\n",
    "            try:\n",
    "                start = data[\"category\"][0][\"records\"][\"nextStart\"]\n",
    "            except KeyError:\n",
    "                start = None\n",
    "            for article in data[\"category\"][0][\"records\"][\"article\"]:\n",
    "                thumb_file = \"thumbs/{}-nla.news-article{}.jpg\".format(\n",
    "                    article[\"date\"], article[\"id\"]\n",
    "                )\n",
    "                if not os.path.exists(thumb_file):\n",
    "                    thumb = get_thumbnail(article, size, font_path, font_size)\n",
    "                    if thumb:\n",
    "                        thumb.save(thumb_file)\n",
    "                pbar.update(1)\n",
    "\n",
    "\n",
    "def create_composite(cols, rows, size):\n",
    "    im = Image.new(\"RGB\", (cols * size, rows * size))\n",
    "    thumbs = [t for t in os.listdir(\"thumbs\") if t[-4:] == \".jpg\"]\n",
    "    # This will sort by date, comment it out if you don't want that\n",
    "    # thumbs = sorted(thumbs)\n",
    "    x = 0\n",
    "    y = 0\n",
    "    for index, thumb_file in tqdm(enumerate(thumbs, 1)):\n",
    "        thumb = Image.open(\"thumbs/{}\".format(thumb_file))\n",
    "        try:\n",
    "            im.paste(thumb, (x, y, x + size, y + size))\n",
    "        except ValueError:\n",
    "            pass\n",
    "        else:\n",
    "            if (index % cols) == 0:\n",
    "                x = 0\n",
    "                y += size\n",
    "            else:\n",
    "                x += size\n",
    "    im.save(\"composite-{}-{}.jpg\".format(cols, rows), quality=90)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create all the thumbnails"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "get_thumbnails(query, size, font_path, font_size)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Turn the thumbnails into one big image"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "create_composite(cols, rows, size)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "source": [
    "----\n",
    "\n",
    "Created by [Tim Sherratt](https://timsherratt.org/) for the [GLAM Workbench](https://glam-workbench.github.io/).  \n",
    "Support this project by becoming a [GitHub sponsor](https://github.com/sponsors/wragge?o=esb)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  },
  "rocrate": {
   "author": [
    {
     "mainEntityOfPage": "https://timsherratt.au",
     "name": "Sherratt, Tim",
     "orcid": "https://orcid.org/0000-0001-7956-4498"
    }
   ],
   "description": "This notebook starts with a search in Trove's newspapers. It uses the Trove API to work its way through the search results. For each article it creates a thumbnail image using the code from this notebook. Once this first stage is finished, you have a directory full of lots of thumbnails. The next stage takes all those thumbnails and pastes them one by one into a BIG image to create a composite, or mosaic.",
   "mainEntityOfPage": "https://glam-workbench.net/trove-newspapers/Composite-thumbnails/",
   "name": "Make composite images from lots of Trove newspaper thumbnails"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}