{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Create 'scissors and paste' messages from Trove newspaper articles\n", "When you search for a term in Trove's digitised newspapers and click on an individual article, you'll see that your search terms are highlighted. If you look at the page's HTML, you'll see that the highlighted box around each word includes its page coordinates. That means that if we search for a word, we can find where it appears on a page, and by cropping the page image to those coordinates we can create an image of an individual word. By combining these images we can create 'scissors and paste' style messages!\n", "\n", "![Scissors and paste example - help trapped inside Trove cannot escape](images/trapped-trove.jpg)\n", "![Scissors and paste example - optical character recognition has achieved consciousness](images/ocr-consciousness.jpg)\n", "\n", "In this notebook you can create your own 'scissors and paste' messages:\n", "\n", "1. Enter a word in the box below and click **Add word**.\n", "2. Because of OCR errors, the result might not be what you want. If so, click **Retry last word** to make another attempt.\n", "3. When you've finished adding words, click on the **Download** link to save the image.\n", "4. To start again, click **Clear all**.\n", "\n", "Some things to note:\n", "\n", "" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# This notebook is designed to run in Voila\n", "# If you can see the code, just select 'Run all cells' from the menu to create the widgets."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Import what we need\n", "import base64\n", "import os\n", "import random\n", "from io import BytesIO\n", "\n", "import ipywidgets as widgets\n", "import requests\n", "from bs4 import BeautifulSoup\n", "from IPython.display import HTML, display\n", "from PIL import Image" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%capture\n", "# Load env variables\n", "%load_ext dotenv\n", "%dotenv" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Some global variables\n", "words = []\n", "last_kw = \"\"\n", "\n", "# Widgets\n", "results = widgets.Output()\n", "status = widgets.Output()\n", "key = widgets.Text(description=\"Your API key:\")\n", "word_to_add = widgets.Text(description=\"Word to add:\")\n", "button_add = widgets.Button(\n", "    description=\"Add word\",\n", "    button_style=\"primary\",\n", ")\n", "button_retry = widgets.Button(description=\"Retry last word\")\n", "button_clear = widgets.Button(description=\"Clear all\")\n", "\n", "\n",
"def get_word_boxes(article_url):\n", "    \"\"\"\n", "    Get the boxes around highlighted search terms.\n", "    \"\"\"\n", "    boxes = []\n", "    # Get the article page\n", "    response = requests.get(article_url)\n", "    # Load in BS4\n", "    soup = BeautifulSoup(response.text, \"lxml\")\n", "    # Get the id of the newspaper page\n", "    page_id = soup.select(\"div.zone.onPage\")[0][\"data-page-id\"]\n", "    # Find the highlighted terms\n", "    words = soup.select(\"span.highlightedTerm\")\n", "    # Save the box coords\n", "    for word in words:\n", "        box = {\n", "            \"page_id\": page_id,\n", "            \"left\": int(word[\"data-x\"]),\n", "            \"top\": int(word[\"data-y\"]),\n", "            \"width\": int(word[\"data-w\"]),\n", "            \"height\": int(word[\"data-h\"]),\n", "        }\n", "        boxes.append(box)\n", "    return boxes\n", "\n", "\n",
"def crop_word(box):\n", "    \"\"\"\n", "    Crop the box coordinates from the full page image.\n", "    \"\"\"\n", "    global words\n", "    # Construct the url we need to download the page image\n", "    page_url = (\n", "        \"https://trove.nla.gov.au/ndp/imageservice/nla.news-page{}/level{}\".format(\n", "            box[\"page_id\"], 7\n", "        )\n", "    )\n", "    # Download the page image\n", "    response = requests.get(page_url)\n", "    # Open download as an image for editing\n", "    img = Image.open(BytesIO(response.content))\n", "    # Crop to the word's box, with 5 pixels of padding on each side\n", "    word = img.crop(\n", "        (\n", "            box[\"left\"] - 5,\n", "            box[\"top\"] - 5,\n", "            box[\"left\"] + box[\"width\"] + 5,\n", "            box[\"top\"] + box[\"height\"] + 5,\n", "        )\n", "    )\n", "    img.close()\n", "    words.append(word)\n", "    display_words()\n", "\n", "\n",
"def get_word_sizes():\n", "    \"\"\"\n", "    Get the max word height and total length of the saved words.\n", "    These will be used for the height and width of the composite image.\n", "    \"\"\"\n", "    max_height = 0\n", "    total_width = 0\n", "    for word in words:\n", "        max_height = max(max_height, word.size[1])\n", "        total_width += word.size[0] + 10\n", "    return max_height + 20, total_width + 10\n", "\n", "\n",
"def display_words():\n", "    \"\"\"\n", "    Create a composite image from the saved words and display with a Download link.\n", "    \"\"\"\n", "    global words\n", "    results.clear_output()\n", "    word_to_add.value = \"\"\n", "    h, w = get_word_sizes()\n", "    # Create a new composite image\n", "    comp = Image.new(\"RGB\", (w, h), (180, 180, 180))\n", "    left = 10\n", "    # Paste the words into the composite, adjusting the left (start) position as we go\n", "    for word in words:\n", "        # Centre the word in the composite\n", "        top = int(round((h / 2) - (word.size[1] / 2)))\n", "        comp.paste(word, (left, top))\n", "        # Move along the start point\n", "        left += word.size[0] + 10\n", "    # Create a buffer\n", "    image_file = BytesIO()\n", "    # Save the image into the file object\n", "    comp.save(image_file, \"JPEG\")\n", "    # Go to the start of the file object\n", "    image_file.seek(0)\n", "    # For the download link we can use a data uri -- a base64 encoded version of the file\n", "    # Encode the file\n", "    encoded_image = base64.b64encode(image_file.read()).decode()\n", "    # Create a data uri string (the image was saved as JPEG above)\n", "    encoded_string = \"data:image/jpeg;base64,\" + encoded_image\n", "    # Reset to the beginning\n", "    image_file.seek(0)\n", "    status.clear_output()\n", "    with results:\n", "        # Create a download link using the data uri\n", "        display(HTML(f'<a download href=\"{encoded_string}\">Download</a>'))\n", "        # Display the image\n", "        display(widgets.Image(value=image_file.read(), format=\"jpg\"))\n", "\n", "\n",
"def retry_last_word(b):\n", "    \"\"\"\n", "    Discard the most recent word image and search for the same keyword again.\n", "    \"\"\"\n", "    global words\n", "    # Nothing to retry until a word has been added\n", "    if not words or not last_kw:\n", "        return\n", "    words.pop()\n", "    get_article_from_search(last_kw)\n", "\n", "\n",
"def clear_all_words(b):\n", "    \"\"\"\n", "    Remove all the saved words and reset the widgets.\n", "    \"\"\"\n", "    global words, last_kw\n", "    results.clear_output()\n", "    words.clear()\n", "    word_to_add.value = \"\"\n", "    last_kw = \"\"\n", "\n", "\n",
"def get_article_from_search(kw):\n", "    \"\"\"\n", "    Use the Trove API to find articles with the supplied keyword.\n", "\n", "    \"\"\"\n", "    global last_kw\n", "    last_kw = kw\n", "    with status:\n", "        display(HTML(\"Finding word...\"))\n", "    params = {\n", "        \"q\": f'text:\"{kw}\"',\n", "        \"zone\": \"newspaper\",\n", "        \"encoding\": \"json\",\n", "        \"n\": 100,\n", "        \"key\": key.value,\n", "    }\n", "    response = requests.get(\"https://api.trove.nla.gov.au/v2/result\", params=params)\n", "    data = response.json()\n", "    articles = data[\"response\"][\"zone\"][0][\"records\"][\"article\"]\n", "    boxes = []\n", "    # Choose article at random and look for highlight boxes\n", "    # Continue until some boxes are found\n", "    while len(boxes) == 0:\n", "        article = random.choice(articles)\n", "        # print(article['troveUrl'])\n", "        try:\n", "            boxes = get_word_boxes(article[\"troveUrl\"])\n", "        except KeyError:\n", "            pass\n", "    crop_word(random.choice(boxes))\n", "\n", "\n",
"def add_word(b):\n", "    \"\"\"\n", "    Search for the word currently in the text box and add it to the message.\n", "    \"\"\"\n", "    get_article_from_search(word_to_add.value)\n", "\n", "\n", "# Add events to the buttons\n", "button_add.on_click(add_word)\n", "button_retry.on_click(retry_last_word)\n", "button_clear.on_click(clear_all_words)\n", "\n", "# Display the widgets\n", "display(key)\n", "display(word_to_add)\n", "display(widgets.HBox([button_add, button_retry, button_clear]))\n", "display(status)\n", "display(results)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# TESTING\n", "\n", "if os.getenv(\"GW_STATUS\") == \"dev\" and os.getenv(\"TROVE_API_KEY\"):\n", "    key.value = os.getenv(\"TROVE_API_KEY\")\n", "    word_to_add.value = \"testing\"\n", "    button_add.click()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "\n", "Created by [Tim Sherratt](https://timsherratt.org/) for the [GLAM Workbench](https://glam-workbench.github.io/).  \n", "Support this project by becoming a [GitHub sponsor](https://github.com/sponsors/wragge?o=esb).\n", "\n" ] } ],
"metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.12" } }, "nbformat": 4, "nbformat_minor": 4 }