{ "cells": [ { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "# Get covers (or any other pages) from a digitised journal in Trove\n", "\n", "In [another notebook](Get-text-from-a-Trove-journal.ipynb), I showed how to get issue metadata and OCRd texts from a digitised journal in Trove. It's also possible to download page images and PDFs. This notebook shows how to download all the cover images from a specified journal. With some minor modifications you could download any page, or range of pages." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Import what we need" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Let's import the libraries we need.\n", "import io\n", "import os\n", "import re\n", "import shutil\n", "import time\n", "import zipfile\n", "\n", "import pandas as pd\n", "import requests_cache\n", "from bs4 import BeautifulSoup\n", "from IPython.display import HTML, display\n", "from requests.adapters import HTTPAdapter\n", "from requests.packages.urllib3.util.retry import Retry\n", "from tqdm.auto import tqdm\n", "\n", "s = requests_cache.CachedSession()\n", "retries = Retry(total=5, backoff_factor=1, status_forcelist=[502, 503, 504])\n", "s.mount(\"https://\", HTTPAdapter(max_retries=retries))\n", "s.mount(\"http://\", HTTPAdapter(max_retries=retries))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What journal do you want?\n", "\n", "In the cell below, replace the `nla.obj-...` value with the identifier of the journal you want to harvest. You'll find the identifier in the url of the journal's landing page. An easy way to find it is to go to the [Trove Titles app](https://trove-titles.herokuapp.com/) and click on the 'Browse issues' button for the journal you're interested in.\n", "\n", "For example, if I click on the 'Browse issues' button for the *Angry Penguins broadsheet* it opens `http://nla.gov.au/nla.obj-320790312`, so the journal identifier is `nla.obj-320790312`." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Replace the value in the single quotes with the identifier of your chosen journal\n", "journal_id = \"nla.obj-320790312\"\n", "# Where do you want to save the results?\n", "output_dir = \"images\"\n", "\n", "# Set up the data directory\n", "image_dir = os.path.join(output_dir, journal_id)\n", "os.makedirs(image_dir, exist_ok=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Define some functions to do the work" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "def harvest_metadata(obj_id):\n", " \"\"\"\n", " This calls an internal API from a journal landing page to extract a list of available issues.\n", " \"\"\"\n", " start_url = \"https://nla.gov.au/{}/browse?startIdx={}&rows=20&op=c\"\n", " # The initial startIdx value\n", " start = 0\n", " # Number of results per page\n", " n = 20\n", " issues = []\n", " with tqdm(desc=\"Issues\", leave=False) as pbar:\n", " # If there aren't 20 results on the page then we've reached the end, so continue harvesting until that happens.\n", " while n == 20:\n", " # Get the browse page\n", " response = s.get(start_url.format(obj_id, start), timeout=60)\n", " # Beautifulsoup turns the HTML into an easily navigable structure\n", " soup = BeautifulSoup(response.text, \"lxml\")\n", " # Find all the divs containing issue details and loop through them\n", " details = soup.find_all(class_=\"l-item-info\")\n", " for detail in details:\n", " issue = {}\n", " title = detail.find(\"h3\")\n", " if title:\n", " issue[\"title\"] = title.text\n", " issue[\"id\"] = title.parent[\"href\"].strip(\"/\")\n", " else:\n", " issue[\"title\"] = \"No title\"\n", " issue[\"id\"] = detail.find(\"a\")[\"href\"].strip(\"/\")\n", " try:\n", " # Get the issue details\n", " issue[\"details\"] = detail.find(\n", " class_=\"obj-reference content\"\n", " ).string.strip()\n", " except (AttributeError, IndexError):\n", " issue[\"details\"] = \"issue\"\n", " # Get the number of pages\n", " try:\n", " issue[\"pages\"] = int(\n", " re.search(\n", " r\"^(\\d+)\",\n", " detail.find(\"a\", attrs={\"data-pid\": issue[\"id\"]}).text,\n", " flags=re.MULTILINE,\n", " ).group(1)\n", " )\n", " except AttributeError:\n", " issue[\"pages\"] = 0\n", " issues.append(issue)\n", " # print(issue)\n", " if not response.from_cache:\n", " time.sleep(0.5)\n", " # Increment the startIdx\n", " start += n\n", " # Set n to the number of results on the current page\n", " n = len(details)\n", " pbar.update(n)\n", " return issues\n", "\n", "\n", "def save_page(issues, output_dir, page_num=1):\n", " \"\"\"\n", " Downloads the specified page from a list of journal issues.\n", " If you want to download a range of pages you can set the `lastPage` parameter to your end point.\n", " But beware the images are pretty large.\n", " \"\"\"\n", " # Loop through the issue metadata\n", " for issue in tqdm(issues):\n", " # print(issue['id'])\n", " id = issue[\"id\"]\n", " # Check to see if the page of this issue has already been downloaded\n", " if not os.path.exists(\n", " os.path.join(image_dir, \"{}-{}.jpg\".format(id, page_num))\n", " ):\n", " # Change lastPage to download a range of pages\n", " url = \"https://nla.gov.au/{0}/download?downloadOption=zip&firstPage={1}&lastPage={1}\".format(\n", " id, page_num - 1\n", " )\n", " # Get the file\n", " r = s.get(url, timeout=60)\n", " # print(r.url, r.status_code)\n", " # The image is in a zip, so we need to extract the contents 
"            z = zipfile.ZipFile(io.BytesIO(r.content))\n",
"            z.extractall(image_dir)\n",
"            time.sleep(0.5)"
] }, { "cell_type": "markdown", "metadata": {}, "source": [
"## Get a list of issues\n",
"\n",
"Run the cell below to extract a list of issues for your selected journal and save them to the `issues` variable."
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Issues: 0it [00:00, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "issues = harvest_metadata(journal_id)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Convert the list of issues to a Pandas dataframe and have a look inside." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
titleiddetailspages
0Angry Penguins broadsheet.nla.obj-320791009Collection No. 116
1Angry Penguins broadsheet.nla.obj-320791023Collection No. 216
2Angry Penguins broadsheet.nla.obj-320791046Collection No. 316
3Angry Penguins broadsheet.nla.obj-320791067Collection No. 416
4Angry Penguins broadsheet.nla.obj-320791128Collection No. 5 (May, 1946)16
\n", "
" ], "text/plain": [ " title id \\\n", "0 Angry Penguins broadsheet. nla.obj-320791009 \n", "1 Angry Penguins broadsheet. nla.obj-320791023 \n", "2 Angry Penguins broadsheet. nla.obj-320791046 \n", "3 Angry Penguins broadsheet. nla.obj-320791067 \n", "4 Angry Penguins broadsheet. nla.obj-320791128 \n", "\n", " details pages \n", "0 Collection No. 1 16 \n", "1 Collection No. 2 16 \n", "2 Collection No. 3 16 \n", "3 Collection No. 4 16 \n", "4 Collection No. 5 (May, 1946) 16 " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame(issues)\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Save the data to a CSV file." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "df.to_csv(\"{}/issues.csv\".format(image_dir), index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Get the images\n", "\n", "Run the cell below to work through the list of issues, downloading the first page of each, and saving it to the specified directory. Note that the images can be quite large!" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "save_page(issues, image_dir, 1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Download the results\n", "\n", "If you're running this notebook using a cloud service (like Binder), you'll want to download your results. The cell below zips up the journal directory and creates a link for easy download." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "Download results" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "images/nla.obj-320790312.zip" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "shutil.make_archive(image_dir, \"zip\", image_dir)\n", "display(HTML(\"Download results\"))\n", "display(\n", " HTML(f'{image_dir}.zip')\n", ")" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "----\n", "\n", "Created by [Tim Sherratt](https://timsherratt.org/) for the [GLAM Workbench](https://glam-workbench.github.io/).\n", "\n", "Work on this notebook was supported by the [Humanities, Arts and Social Sciences (HASS) Data Enhanced Virtual Lab](https://tinker.edu.au/)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" }, "rocrate": { "author": [ { "mainEntityOfPage": "https://timsherratt.au", "name": "Sherratt, Tim", "orcid": "https://orcid.org/0000-0001-7956-4498" } ], "category": "Harvesting images", "description": "This notebook shows how to download all the cover images from a specified periodical. With some minor modifications you could download any page, or range of pages.", "mainEntityOfPage": "https://glam-workbench.net/trove-journals/get-covers-from-digitised-journal/", "name": "Get covers (or any other pages) from a digitised journal in Trove", "position": 7 } }, "nbformat": 4, "nbformat_minor": 4 }