{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Make composite images from lots of Trove newspaper thumbnails\n",
    "\n",
    "This notebook starts with a search in Trove's newspapers. It uses the Trove API to work it's way through the search results. For each article it creates a thumbnail image using the [code from this notebook](Get-article-thumbnail.ipynb). Once this first stage is finished, you have a directory full of lots of thumbnails.\n",
    "\n",
    "The next stage takes all those thumbnails and pastes them one by one into a BIG image to create a composite, or mosaic.\n",
    "\n",
    "You'll need to think carefully about the number of results in your search, and the size of the image you want to create. Harvesting all the thumbnails can take a long time.\n",
    "\n",
    "Also, you need to be able to set a path to a font file, so it's probably best to run this notebook on your local machine rather than in a cloud service, so you have more control over things like font. You might also need to adjust the font size depending on the font you choose.\n",
    "\n",
    "Some examples:\n",
    "\n",
    "* [White Australia Policy](https://easyzoom.com/image/139535)\n",
    "* [Australian aviators, pilots, flyers, and airmen](https://www.easyzoom.com/imageaccess/9d26953ccdf5475cad9c11f308cd7988)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import ipywidgets as widgets\n",
    "import requests\n",
    "import random\n",
    "import re\n",
    "import os\n",
    "from IPython.display import display, HTML, FileLink, clear_output\n",
    "from bs4 import BeautifulSoup\n",
    "from PIL import Image, ImageDraw, ImageFont\n",
    "from io import BytesIO\n",
    "from requests.adapters import HTTPAdapter\n",
    "from requests.packages.urllib3.util.retry import Retry\n",
    "from tqdm.auto import tqdm\n",
    "\n",
    "s = requests.Session()\n",
    "retries = Retry(total=5, backoff_factor=1, status_forcelist=[ 502, 503, 504 ])\n",
    "s.mount('https://', HTTPAdapter(max_retries=retries))\n",
    "s.mount('http://', HTTPAdapter(max_retries=retries))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Set your parameters\n",
    "\n",
    "Edit the values below as required."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "font_path = '/Library/Fonts/Courier New.ttf'\n",
    "font_size = 12\n",
    "# Insert your search query\n",
    "query = 'title:\"white australia policy\"'\n",
    "# Insert your Trove API key\n",
    "api_key = '6pi5hht0d2umqcro'\n",
    "size = 200 # Size of the thumbnails\n",
    "cols = 90 # The width of the final image will be cols x size\n",
    "rows = 55 # The height of the final image will be cols x size"
   ]
  },
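  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The parameters above determine both the pixel dimensions of the composite and how many thumbnails it can hold. As a rough check before you start harvesting, the cell below works through the arithmetic. The function name `describe_grid` is only for illustration and isn't used elsewhere in this notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def describe_grid(cols, rows, size):\n",
    "    '''\n",
    "    Return the pixel dimensions of the composite and the number of thumbnails it can hold.\n",
    "    '''\n",
    "    return {'width': cols * size, 'height': rows * size, 'capacity': cols * rows}\n",
    "\n",
    "# Compare 'capacity' with the number of results in your search\n",
    "describe_grid(cols, rows, size)"
   ]
  },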
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Define some functions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_article_top(article_url):\n",
    "    '''\n",
    "    Positional information about the article is attached to each line of the OCR output in data attributes.\n",
    "    This function loads the HTML version of the article and scrapes the x, y, and width values for the\n",
    "    top line of text (ie the top of the article).\n",
    "    '''\n",
    "    response = requests.get(article_url)\n",
    "    soup = BeautifulSoup(response.text, 'lxml')\n",
    "    # Lines of OCR are in divs with the class 'zone'\n",
    "    # 'onPage' limits to those on the current page\n",
    "    zones = soup.select('div.zone.onPage')\n",
    "    # Start with the first element, but...\n",
    "    top_element = zones[0]\n",
    "    top_y = int(top_element['data-y'])\n",
    "    # Illustrations might come after text even if they're above them on the page\n",
    "    # So loop through the zones to find the element with the lowest 'y' attribute\n",
    "    for zone in zones:\n",
    "        if int(zone['data-y']) < top_y:\n",
    "            top_y = int(zone['data-y'])\n",
    "            top_element = zone\n",
    "    top_x = int(top_element['data-x'])\n",
    "    top_w = int(top_element['data-w'])\n",
    "    return {'x': top_x, 'y': top_y, 'w': top_w}\n",
    "\n",
    "def get_thumbnail(article, size, font_path, font_size):\n",
    "    buffer = 0\n",
    "    try:\n",
    "        page_id = re.search(r'page\\/(\\d+)', article['trovePageUrl']).group(1)\n",
    "    except (AttributeError, KeyError):\n",
    "        thumb = None\n",
    "    else:\n",
    "        # Get position of top line of article\n",
    "        article_top = get_article_top(article['troveUrl'])\n",
    "        # Construct the url we need to download the image\n",
    "        page_url = 'https://trove.nla.gov.au/ndp/imageservice/nla.news-page{}/level{}'.format(page_id, 7)\n",
    "        # Download the page image\n",
    "        response = s.get(page_url, timeout=120)\n",
    "        # Open download as an image for editing\n",
    "        img = Image.open(BytesIO(response.content))\n",
    "        # Use coordinates of top line to create a square box to crop thumbnail\n",
    "        box = (article_top['x'] - buffer, article_top['y'] - buffer, article_top['x'] + article_top['w'] + buffer, article_top['y'] + article_top['w'] + buffer)\n",
    "        try:\n",
    "            # Crop image to create thumb\n",
    "            thumb = img.crop(box)\n",
    "        except OSError:\n",
    "            thumb = None\n",
    "        else:\n",
    "            # Resize thumb\n",
    "            thumb.thumbnail((size, size), Image.ANTIALIAS)\n",
    "            article_id = 'nla.news-article{}'.format(article['id'])\n",
    "            fnt = ImageFont.truetype(font_path, 12)\n",
    "            draw = ImageDraw.Draw(thumb)\n",
    "            try:\n",
    "                # Check if RGB\n",
    "                draw.rectangle([(0, size-10), (size, size)], fill=(255, 255, 255, 255))\n",
    "                draw.text((0,size-10), article_id, font=fnt, fill=(0, 0, 0, 255))\n",
    "            except TypeError:\n",
    "                # Must be grayscale\n",
    "                draw.rectangle([(0, size-10), (200, 200)], fill=(255))\n",
    "                draw.text((0,size-10), article_id, font=fnt, fill=(0))\n",
    "    return thumb\n",
    "        \n",
    "def get_total_results(params):\n",
    "    '''\n",
    "    Get the total number of results for a search.\n",
    "    '''\n",
    "    these_params = params.copy()\n",
    "    these_params['n'] = 0\n",
    "    response = s.get('https://api.trove.nla.gov.au/v2/result', params=these_params, timeout=60)\n",
    "    # print(response.url)\n",
    "    data = response.json()\n",
    "    return int(data['response']['zone'][0]['records']['total'])\n",
    "        \n",
    "def get_thumbnails(query, api_key, size, font_path, font_size):\n",
    "    #im = Image.new('RGB', (cols*size, rows*size))\n",
    "    params = {\n",
    "        'q': query,\n",
    "        'zone': 'newspaper',\n",
    "        'encoding': 'json',\n",
    "        'bulkHarvest': 'true',\n",
    "        'n': 100,\n",
    "        'key': api_key,\n",
    "        'reclevel': 'full'\n",
    "    }\n",
    "    start = '*'\n",
    "    total = get_total_results(params)\n",
    "    x = 0\n",
    "    y = 0\n",
    "    index = 1\n",
    "    with tqdm(total=total) as pbar:\n",
    "        while start:\n",
    "            params['s'] = start\n",
    "            response = s.get('https://api.trove.nla.gov.au/v2/result', params=params, timeout=60)\n",
    "            data = response.json()\n",
    "            # The nextStart parameter is used to get the next page of results.\n",
    "            # If there's no nextStart then it means we're on the last page of results.\n",
    "            try:\n",
    "                start = data['response']['zone'][0]['records']['nextStart']\n",
    "            except KeyError:\n",
    "                start = None\n",
    "            for article in data['response']['zone'][0]['records']['article']:\n",
    "                thumb_file = 'thumbs/{}-nla.news-article{}.jpg'.format(article['date'], article['id'])\n",
    "                if not os.path.exists(thumb_file):\n",
    "                    try:\n",
    "                        # Get page id\n",
    "                        page_id = re.search(r'page\\/(\\d+)', article['trovePageUrl']).group(1)\n",
    "                    except (AttributeError, KeyError):\n",
    "                         pass\n",
    "                    else:\n",
    "                        thumb = get_thumbnail(article, size, font_path, font_size)\n",
    "                        if thumb:\n",
    "                            thumb.save(thumb_file)\n",
    "                pbar.update(1)\n",
    "    \n",
    "def create_composite(cols, rows, size):\n",
    "    im = Image.new('RGB', (cols*size, rows*size))\n",
    "    thumbs = [t for t in os.listdir('thumbs') if t[-4:] == '.jpg']\n",
    "    # This will sort by date, comment it out if you don't want that\n",
    "    # thumbs = sorted(thumbs)\n",
    "    x = 0\n",
    "    y = 0\n",
    "    for index, thumb_file in tqdm(enumerate(thumbs, 1)):\n",
    "        thumb = Image.open('thumbs/{}'.format(thumb_file))\n",
    "        try:\n",
    "            im.paste(thumb, (x, y, x+size, y+size))\n",
    "        except ValueError:\n",
    "            pass\n",
    "        else:\n",
    "            if (index % cols) == 0:\n",
    "                x = 0\n",
    "                y += size\n",
    "            else:\n",
    "                x += size\n",
    "    im.save('composite-{}-{}.jpg'.format(cols, rows), quality=90)\n",
    "    "
   ]
  },
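  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`create_composite()` fills the image row by row, starting a new row after every `cols` thumbnails. As a rough sketch of that layout, the position of any thumbnail can also be calculated directly from its place in the list. The function `grid_position` is hypothetical and isn't called by the code above. It also assumes that every thumbnail is pasted successfully."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def grid_position(index, cols, size):\n",
    "    '''\n",
    "    Return the (x, y) pixel offset of the thumbnail at a zero-based index.\n",
    "    '''\n",
    "    row, col = divmod(index, cols)\n",
    "    return (col * size, row * size)\n",
    "\n",
    "# The first thumbnail goes in the top-left corner\n",
    "print(grid_position(0, cols, size))\n",
    "# The thumbnail after the end of the first row starts the second row\n",
    "print(grid_position(cols, cols, size))"
   ]
  },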
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create all the thumbnails"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "get_thumbnails(query, api_key, size, font_path, font_size)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Turn the thumbnails into one big image"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "create_composite(cols, rows, size)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "----\n",
    "\n",
    "Created by [Tim Sherratt](https://timsherratt.org/) for the [GLAM Workbench](https://glam-workbench.github.io/)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}