{ "cells": [ { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [
 "# Get covers (or any other pages) from a digitised journal in Trove\n",
 "\n",
 "In [another notebook](Get-text-from-a-Trove-journal.ipynb), I showed how to get issue metadata and OCR'd text from a digitised journal in Trove. It's also possible to download page images and PDFs. This notebook shows how to download all the cover images from a specified journal. With some minor modifications you could download any page, or range of pages."
 ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Import what we need" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [
 "# Let's import the libraries we need.\n",
 "import io\n",
 "import os\n",
 "import re\n",
 "import shutil\n",
 "import time\n",
 "import zipfile\n",
 "\n",
 "import pandas as pd\n",
 "import requests_cache\n",
 "from bs4 import BeautifulSoup\n",
 "from IPython.display import HTML, display\n",
 "from requests.adapters import HTTPAdapter\n",
 "from tqdm.auto import tqdm\n",
 "from urllib3.util.retry import Retry\n",
 "\n",
 "# Create a cached session that retries requests on transient server errors\n",
 "s = requests_cache.CachedSession()\n",
 "retries = Retry(total=5, backoff_factor=1, status_forcelist=[502, 503, 504])\n",
 "s.mount(\"https://\", HTTPAdapter(max_retries=retries))\n",
 "s.mount(\"http://\", HTTPAdapter(max_retries=retries))"
 ] }, { "cell_type": "markdown", "metadata": {}, "source": [
 "## What journal do you want?\n",
 "\n",
 "In the cell below, replace the `nla.obj-...` value with the identifier of the journal you want to harvest. You'll find the identifier in the URL of the journal's landing page. 
An easy way to find it is to go to the [Trove Titles app](https://trove-titles.herokuapp.com/) and click on the 'Browse issues' button for the journal you're interested in.\n",
 "\n",
 "For example, if I click on the 'Browse issues' button for the *Angry Penguins broadsheet*, it opens `http://nla.gov.au/nla.obj-320790312`, so the journal identifier is `nla.obj-320790312`."
 ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [
 "# Replace the value in the quotes with the identifier of your chosen journal\n",
 "journal_id = \"nla.obj-320790312\"\n",
 "# Where do you want to save the results?\n",
 "output_dir = \"images\"\n",
 "\n",
 "# Set up the data directory\n",
 "image_dir = os.path.join(output_dir, journal_id)\n",
 "os.makedirs(image_dir, exist_ok=True)"
 ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Define some functions to do the work" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [
 "def harvest_metadata(obj_id):\n",
 "    \"\"\"\n",
 "    This calls an internal API from a journal landing page to extract a list of available issues.\n",
 "    \"\"\"\n",
 "    start_url = \"https://nla.gov.au/{}/browse?startIdx={}&rows=20&op=c\"\n",
 "    # The initial startIdx value\n",
 "    start = 0\n",
 "    # Number of results per page\n",
 "    n = 20\n",
 "    issues = []\n",
 "    with tqdm(desc=\"Issues\", leave=False) as pbar:\n",
 "        # If there aren't 20 results on the page then we've reached the end, so continue harvesting until that happens.\n",
 "        while n == 20:\n",
 "            # Get the browse page\n",
 "            response = s.get(start_url.format(obj_id, start), timeout=60)\n",
 "            # BeautifulSoup turns the HTML into an easily navigable structure\n",
 "            soup = BeautifulSoup(response.text, \"lxml\")\n",
 "            # Find all the divs containing issue details and loop through them\n",
 "            details = soup.find_all(class_=\"l-item-info\")\n",
 "            for detail in details:\n",
 "                issue = {}\n",
 "                title = detail.find(\"h3\")\n",
 "                if title:\n",
 "                    issue[\"title\"] = title.text\n",
 "                    issue[\"id\"] = title.parent[\"href\"].strip(\"/\")\n",
 "                else:\n",
 "                    issue[\"title\"] = \"No title\"\n",
 "                    issue[\"id\"] = detail.find(\"a\")[\"href\"].strip(\"/\")\n",
 "                try:\n",
 "                    # Get the issue details\n",
 "                    issue[\"details\"] = detail.find(\n",
 "                        class_=\"obj-reference content\"\n",
 "                    ).string.strip()\n",
 "                except (AttributeError, IndexError):\n",
 "                    issue[\"details\"] = \"issue\"\n",
 "                # Get the number of pages\n",
 "                try:\n",
 "                    issue[\"pages\"] = int(\n",
 "                        re.search(\n",
 "                            r\"^(\\d+)\",\n",
 "                            detail.find(\"a\", attrs={\"data-pid\": issue[\"id\"]}).text,\n",
 "                            flags=re.MULTILINE,\n",
 "                        ).group(1)\n",
 "                    )\n",
 "                except AttributeError:\n",
 "                    issue[\"pages\"] = 0\n",
 "                issues.append(issue)\n",
 "                # print(issue)\n",
 "            # Only pause if the page came from the server rather than the cache\n",
 "            if not response.from_cache:\n",
 "                time.sleep(0.5)\n",
 "            # Increment the startIdx\n",
 "            start += n\n",
 "            # Set n to the number of results on the current page\n",
 "            n = len(details)\n",
 "            pbar.update(n)\n",
 "    return issues\n",
 "\n",
 "\n",
 "def save_page(issues, output_dir, page_num=1):\n",
 "    \"\"\"\n",
 "    Downloads the specified page from each issue in a list of journal issues and saves it in `output_dir`.\n",
 "    If you want to download a range of pages, change the `lastPage` parameter in the download url to your end point.\n",
 "    But beware, the images are pretty large.\n",
 "    \"\"\"\n",
 "    # Loop through the issue metadata\n",
 "    for issue in tqdm(issues):\n",
 "        # print(issue['id'])\n",
 "        issue_id = issue[\"id\"]\n",
 "        # Check to see if the page of this issue has already been downloaded\n",
 "        if not os.path.exists(\n",
 "            os.path.join(output_dir, \"{}-{}.jpg\".format(issue_id, page_num))\n",
 "        ):\n",
 "            # Page numbers in the url are zero-based. Change lastPage to download a range of pages\n",
 "            url = \"https://nla.gov.au/{0}/download?downloadOption=zip&firstPage={1}&lastPage={1}\".format(\n",
 "                issue_id, page_num - 1\n",
 "            )\n",
 "            # Get the file\n",
 "            r = s.get(url, timeout=60)\n",
 "            # print(r.url, r.status_code)\n",
 "            # The image is in a zip, so we need to extract the contents into the output directory\n",
 "            z = zipfile.ZipFile(io.BytesIO(r.content))\n",
 "            z.extractall(output_dir)\n",
 "            time.sleep(0.5)"
 ] }, { "cell_type": "markdown", "metadata": {}, "source": [
 "## Get a list of issues\n",
 "\n",
 "Run the cell below to extract a list of issues for your selected journal and save them to the `issues` variable."
 ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Issues: 0it [00:00, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "issues = harvest_metadata(journal_id)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Convert the list of issues to a pandas DataFrame and have a look inside." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
|   | title | id | details | pages |\n", "
|---|---|---|---|---|\n", "
| 0 | Angry Penguins broadsheet. | nla.obj-320791009 | Collection No. 1 | 16 |\n", "
| 1 | Angry Penguins broadsheet. | nla.obj-320791023 | Collection No. 2 | 16 |\n", "
| 2 | Angry Penguins broadsheet. | nla.obj-320791046 | Collection No. 3 | 16 |\n", "
| 3 | Angry Penguins broadsheet. | nla.obj-320791067 | Collection No. 4 | 16 |\n", "
| 4 | Angry Penguins broadsheet. | nla.obj-320791128 | Collection No. 5 (May, 1946) | 16 |\n", "