{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Generate a thumbnail image from a Trove newspaper article\n", "\n", "In [another notebook](Save-page-image.ipynb), I showed how to get high-resolution page images from newspapers. But what if you only want a nice square thumbnail for display purposes? This notebook gets the page image and then crops and resizes the top of the article to create a thumbnail.\n", "\n", "Of course, if you're doing this to lots of articles you won't want to feed each one in manually. If you're viewing this notebook in app mode (no code visible), just click on the 'Edit app' button to see what's going on behind the scenes. You should be able to copy and modify the code to suit your purposes.\n", "\n", "Briefly, the steps to generate a thumbnail are:\n", "\n", "* Scrape the article's HTML page to get the page identifier and the coordinates of the article on the page\n", "* Use the page identifier to download a high-res page image\n", "* Crop a square image from the page using the coordinates\n", "* Resize the cropped image" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import base64\n", "import os\n", "import re\n", "from io import BytesIO\n", "\n", "import ipywidgets as widgets\n", "import requests\n", "from bs4 import BeautifulSoup\n", "from IPython.display import HTML, display\n", "from PIL import Image, ImageOps" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%capture\n", "# TESTING\n", "%load_ext dotenv\n", "%dotenv" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "titles = {}\n", "\n", "results = widgets.Output()\n", "\n", "\n", "def get_box(zones):\n", " \"\"\"\n", " Loop through all the zones to find the outer limits of each boundary.\n", " Return a bounding box around the article.\n", " \"\"\"\n", " left = 10000\n", " right = 0\n", " top = 10000\n", " bottom = 0\n", " page_id = zones[0][\"data-page-id\"]\n", " for zone in zones:\n", " if int(zone[\"data-x\"]) < left:\n", " left = int(zone[\"data-x\"])\n", " for zone in zones:\n", " if int(zone[\"data-x\"]) < (left + 200):\n", " if int(zone[\"data-y\"]) < top:\n", " top = int(zone[\"data-y\"])\n", " if (int(zone[\"data-x\"]) + int(zone[\"data-w\"])) > right:\n", " right = int(zone[\"data-x\"]) + int(zone[\"data-w\"])\n", " if (int(zone[\"data-y\"]) + int(zone[\"data-h\"])) > bottom:\n", " bottom = int(zone[\"data-y\"]) + int(zone[\"data-h\"])\n", " # For a square image\n", " if bottom > top + (right - left):\n", " bottom = top + (right - left)\n", " return {\n", " \"page_id\": page_id,\n", " \"left\": left,\n", " \"top\": top,\n", " \"right\": right,\n", " \"bottom\": bottom,\n", " }\n", "\n", "\n", "def get_illustration(zone):\n", " page_id = zone[\"data-page-id\"]\n", " left = int(zone[\"data-x\"])\n", " right = int(zone[\"data-x\"]) + int(zone[\"data-w\"])\n", " top = int(zone[\"data-y\"])\n", " bottom = int(zone[\"data-y\"]) + int(zone[\"data-h\"])\n", " return {\n", " \"page_id\": page_id,\n", " \"left\": left,\n", " \"top\": top,\n", " \"right\": right,\n", " \"bottom\": bottom,\n", " }\n", "\n", "\n", "def get_article_box(article_url, illustrated=False):\n", " \"\"\"\n", " Positional information about the article is attached to each line of the OCR output in data attributes.\n", " This function loads the HTML version of the article and scrapes the x, y, and width values for each line of text\n", " to determine the coordinates of a box around the article.\n", " \"\"\"\n", " response = requests.get(article_url)\n", " soup = BeautifulSoup(response.text, \"lxml\")\n", " # Lines of OCR are in divs with the class 'zone'\n", " # 'onPage' limits to those on the current page\n", " illustrations = soup.select(\"div.illustration.onPage\")\n", " if illustrations and illustrated is True:\n", " zone = illustrations[0].parent\n", " box = get_illustration(zone)\n", " else:\n", " zones = soup.select(\"div.zone.onPage\")\n", " box = get_box(zones)\n", " return box\n", "\n", "\n", "def get_article_thumbnail(b):\n", " \"\"\"\n", " Extract a square thumbnail of the article from the page image.\n", " \"\"\"\n", " results.clear_output(wait=True)\n", " article_id = re.search(r\"article\\/{0,1}(\\d+)\", article_url.value).group(1)\n", " # Get position of article on the page(s)\n", " box = get_article_box(\n", " \"http://nla.gov.au/nla.news-article{}\".format(article_id),\n", " illustrated=illustrated.value,\n", " )\n", " # print(box)\n", " # Construct the url we need to download the page image\n", " page_url = (\n", " \"https://trove.nla.gov.au/ndp/imageservice/nla.news-page{}/level{}\".format(\n", " box[\"page_id\"], 7\n", " )\n", " )\n", " # Download the page image\n", " response = requests.get(page_url)\n", " # Open download as an image for editing\n", " img = Image.open(BytesIO(response.content))\n", " # Use coordinates of top line to create a square box to crop thumbnail\n", " points = (box[\"left\"], box[\"top\"], box[\"right\"], box[\"bottom\"])\n", " # Crop image to article box\n", " thumb = img.crop(points)\n", " # Resize\n", " thumb.thumbnail((size.value, size.value), Image.ANTIALIAS)\n", " new_w, new_h = thumb.size\n", " # Squarify\n", " delta_w = size.value - new_w\n", " delta_h = size.value - new_h\n", " padding = (\n", " delta_w // 2,\n", " delta_h // 2,\n", " delta_w - (delta_w // 2),\n", " delta_h - (delta_h // 2),\n", " )\n", " thumb = ImageOps.expand(thumb, padding, fill=\"white\")\n", " # Create a filename for the thumbnail\n", " thumb_file = \"nla.news-article{}-{}.jpg\".format(article_id, size.value)\n", " # To avoid problems with saving & using local files, we're going to save the image as a file object\n", " # Create a file object to save the image into\n", " image_file = BytesIO()\n", " # Save the image into the file object\n", " thumb.save(image_file, \"JPEG\")\n", " # Go to the start of the file object\n", " image_file.seek(0)\n", " # For the download link we can use a data uri -- a base64 encoded version of the file\n", " # Encode the file\n", " encoded_image = base64.b64encode(image_file.read()).decode()\n", " # Create a data uri string\n", " encoded_string = \"data:image/png;base64,\" + encoded_image\n", " # Reset to the beginning\n", " image_file.seek(0)\n", " with results:\n", " # Create a download link using the data uri\n", " display(\n", " HTML(\n", " 'Download {0}'.format(\n", " thumb_file, encoded_string\n", " )\n", " )\n", " )\n", " # Display the image\n", " display(widgets.Image(value=image_file.read(), format=\"jpg\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Enter an article url...\n", "\n", "You can use the url in your browser's location bar or an article permalink." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "article_url = widgets.Text(\n", " placeholder=\"Enter an article url\", description=\"Article/Page:\", disabled=False\n", ")\n", "display(article_url)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Optional settings\n", "\n", "Generate a square thumbnail with this height and width (in pixels)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "size = widgets.BoundedIntText(\n", " min=100, max=800, value=500, step=50, description=\"Size:\", disabled=False\n", ")\n", "display(size)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If there's an illustration in the article, check this box to use it as the thumbnail. The illustration will not be cropped, so whitespace will be added around the image to make it square." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "illustrated = widgets.Checkbox(\n", " value=False, description=\"Use illustration as thumbnail\", disabled=False\n", ")\n", "\n", "display(illustrated)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Get the thumbnail!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "button = widgets.Button(\n", " description=\"Get thumbnail\",\n", " disabled=False,\n", " button_style=\"primary\",\n", " tooltip=\"Click to download\",\n", " icon=\"\",\n", ")\n", "button.on_click(get_article_thumbnail)\n", "display(button)\n", "display(results)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# TESTING\n", "if os.getenv(\"GW_STATUS\") == \"dev\":\n", " article_url.value = \"https://trove.nla.gov.au/newspaper/article/61389505\"\n", " button.click()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "\n", "Created by [Tim Sherratt](https://timsherratt.org/) for the [GLAM Workbench](https://glam-workbench.github.io/). \n", "Support this project by becoming a [GitHub sponsor](https://github.com/sponsors/wragge?o=esb)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.12" } }, "nbformat": 4, "nbformat_minor": 4 }