{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Save a Trove newspaper article as an image\n", "\n", "Sometimes you want to save a Trove newspaper article as an image. Unfortunately, the Trove web interface doesn't make this easy. The 'Download JPG' option actually loads an HTML page, and while you could individually save the images embedded in the HTML page, often articles are sliced up in ways that make the whole thing hard to read and use.\n", "\n", "One alternative is to [download the complete page](Save-page-image.ipynb) on which an article is published. I've also created a notebook that [generates a nice-looking thumbnail](Get-article-thumbnail.ipynb) for an article. This notebook takes things one step further – it grabs the page on which an article was published, but then it crops the page image to the boundaries of the article. The result is an image which presents the article as it was originally published.\n", "\n", "This is possible because information about the position of each line of text in an article is embedded in the display of the OCRd text. This notebook gathers all that positional information and uses it to draw a box around the article. The OCRd text display also includes information about any additional parts of the article that are published on other pages. This means we can grab images of the article from every page on which it appears. So an article published across three pages will generate three images.\n", "\n", "Here's an example. This is a [large, illustrated article](https://trove.nla.gov.au/newspaper/article/162833980) that is spread across two pages. If you download the JPG or PDF versions from Trove, you'll see they're a bit of a mess.\n", "\n", "\n", "\n", "Here are the two images of this article extracted by this notebook.\n", "\n", "