{ "cells": [ { "cell_type": "markdown", "id": "58101f12-40e3-4d01-a697-db6486a16e2b", "metadata": {}, "source": [ "# Harvest the issues of a newspaper as PDFs\n", "\n", "This notebook harvests issues of a newspaper as PDFs – one PDF per issue. If the newspaper has a long print run, this will consume large amounts of time and disk space, so you might want to limit your harvest by date range.\n", "\n", "The downloaded PDFs are saved in the `data/issues` folder. The PDF file names have the following structure:\n", "\n", "```\n", "[newspaper identifier]-[issue date as YYYYMMDD]-[issue identifier].pdf\n", "```\n", "\n", "For example:\n", "\n", "```\n", "903-19320528-1791051.pdf\n", "```\n", "\n", "* `903` – the newspaper identifier for the [Glen Innes Examiner](https://trove.nla.gov.au/newspaper/title/903)\n", "* `19320528` – the issue date, 28 May 1932\n", "* `1791051` – the issue identifier; to view the issue in Trove, add this to `http://nla.gov.au/nla.news-issue`, eg http://nla.gov.au/nla.news-issue1791051" ] }, { "cell_type": "markdown", "id": "d8bca940-10fb-4a13-992c-2a72f62610fa", "metadata": {}, "source": [ "## Set up what we need\n", "\n", "Make sure you paste in your Trove API key where indicated." 
] }, { "cell_type": "code", "execution_count": null, "id": "95a07c27-6dc7-4dec-ab6e-4a073b5b1e40", "metadata": {}, "outputs": [], "source": [ "import json\n", "import os\n", "import time\n", "from pathlib import Path\n", "\n", "import arrow\n", "import pandas as pd\n", "import requests\n", "from requests.adapters import HTTPAdapter\n", "from requests.exceptions import HTTPError\n", "from requests.packages.urllib3.util.retry import Retry\n", "from tqdm.auto import tqdm\n", "\n", "s = requests.Session()\n", "retries = Retry(total=5, backoff_factor=1, status_forcelist=[500, 502, 503, 504])\n", "s.mount(\"http://\", HTTPAdapter(max_retries=retries))\n", "s.mount(\"https://\", HTTPAdapter(max_retries=retries))" ] }, { "cell_type": "code", "execution_count": null, "id": "3285f3e0-dea8-4333-92a0-6ff782a05cfb", "metadata": {}, "outputs": [], "source": [ "%%capture\n", "# Load variables from the .env file if it exists\n", "# Use %%capture to suppress messages\n", "%load_ext dotenv\n", "%dotenv" ] }, { "cell_type": "code", "execution_count": null, "id": "9bae9e81-a2fe-468a-a09c-92c605ed78ce", "metadata": {}, "outputs": [], "source": [ "# Insert your Trove API key\n", "API_KEY = \"YOUR API KEY\"\n", "\n", "# Use api key value from environment variables if it is available\n", "if os.getenv(\"TROVE_API_KEY\"):\n", " API_KEY = os.getenv(\"TROVE_API_KEY\")\n", "\n", "API_URL = \"https://api.trove.nla.gov.au/v2/newspaper/title/\"\n", "\n", "PARAMS = {\"encoding\": \"json\", \"key\": API_KEY}" ] }, { "cell_type": "markdown", "id": "7bda6cd4-dd89-4d19-bdfd-975f2d765c45", "metadata": {}, "source": [ "## Get information about available issues\n", "\n", "Before we start downloading huge numbers of PDFs, let's have a look at how many issues are available for the newspaper we're interested in. This code comes from [harvest_newspaper_issues.ipynb](harvest_newspaper_issues.ipynb)." 
] }, { "cell_type": "code", "execution_count": null, "id": "cb510caf-197b-4f55-89ad-8ecdb34b63d5", "metadata": {}, "outputs": [], "source": [ "# THIS CODE COMES FROM harvest_newspaper_issues.ipynb\n", "\n", "# These are newspapers where the date ranges are off by more than a year\n", "# In these cases we'll harvest all the issues in one hit, rather than year by year\n", "dodgy_dates = [\"1486\", \"1618\", \"586\"]\n", "\n", "\n", "def get_title_summary(title_id):\n", "    \"\"\"\n", "    Get the details of a single newspaper title.\n", "    \"\"\"\n", "    response = s.get(f\"{API_URL}{title_id}\", params=PARAMS)\n", "    data = response.json()\n", "    return data[\"newspaper\"]\n", "\n", "\n", "def get_issues_in_range(title_id, start_date, end_date):\n", "    \"\"\"\n", "    Get a list of issues available from a particular newspaper within the given date range.\n", "    \"\"\"\n", "    issues = []\n", "    params = PARAMS.copy()\n", "    params[\"include\"] = \"years\"\n", "    params[\"range\"] = f'{start_date.format(\"YYYYMMDD\")}-{end_date.format(\"YYYYMMDD\")}'\n", "    response = s.get(f\"{API_URL}{title_id}\", params=params)\n", "    try:\n", "        data = response.json()\n", "    except json.JSONDecodeError:\n", "        print(response.url)\n", "        print(response.text)\n", "    else:\n", "        for year in data[\"newspaper\"][\"year\"]:\n", "            if \"issue\" in year:\n", "                for issue in year[\"issue\"]:\n", "                    issues.append(\n", "                        {\n", "                            \"title_id\": title_id,\n", "                            \"issue_id\": issue[\"id\"],\n", "                            \"issue_date\": issue[\"date\"],\n", "                        }\n", "                    )\n", "        time.sleep(0.2)\n", "    return issues\n", "\n", "\n", "def get_issues_full_range(title_id):\n", "    \"\"\"\n", "    In most cases we set date ranges to get issue data in friendly chunks. But sometimes the date ranges are missing or wrong.\n", "    In these cases, we ask for everything at once, by setting the range to the limits of Trove.\n", "    \"\"\"\n", "    start_date = arrow.get(\"1803-01-01\")\n", "    range_end = arrow.now()\n", "    issues = get_issues_in_range(title_id, start_date, range_end)\n", "    return issues\n", 
"\n", "\n", "def get_issues_from_title(title_id):\n", "    \"\"\"\n", "    Get a list of all the issues available for a particular newspaper.\n", "\n", "    Params:\n", "      * title_id - a newspaper identifier\n", "    Returns:\n", "      * A list containing details of available issues\n", "    \"\"\"\n", "    issues = []\n", "    title_summary = get_title_summary(title_id)\n", "\n", "    # Date range is off by more than a year, so get everything in one hit\n", "    # Compare as strings, because title_id may be supplied as an integer\n", "    if str(title_id) in dodgy_dates:\n", "        issues += get_issues_full_range(title_id)\n", "    else:\n", "        try:\n", "            # The date ranges are not always reliable, so to make sure we get everything\n", "            # we'll set the range to the beginning and end of the given year\n", "            start_date = arrow.get(title_summary[\"startDate\"]).replace(day=1, month=1)\n", "            end_date = arrow.get(title_summary[\"endDate\"]).replace(day=31, month=12)\n", "        except KeyError:\n", "            # Some records have no start and end dates at all\n", "            # In this case set the range to the full range of Trove's newspapers\n", "            issues += get_issues_full_range(title_id)\n", "        else:\n", "            # If the date range is available, loop through it by year\n", "            while start_date <= end_date:\n", "                range_end = start_date.replace(month=12, day=31)\n", "                issues += get_issues_in_range(title_id, start_date, range_end)\n", "                start_date = start_date.shift(years=+1).replace(month=1, day=1)\n", "    return issues" ] }, { "cell_type": "markdown", "id": "553069d0-b441-44c1-9143-9783937b999b", "metadata": {}, "source": [ "Harvest the issue data." 
] }, { "cell_type": "code", "execution_count": null, "id": "ee1fd696-34c9-4bee-bb25-377faae59d02", "metadata": {}, "outputs": [], "source": [ "# Set the id of the newspaper you want to harvest from\n", "# You can get the newspaper id from the title details page in Trove\n", "trove_newspaper_id = 1646\n", "\n", "# Harvest the issue data\n", "issues = get_issues_from_title(trove_newspaper_id)" ] }, { "cell_type": "markdown", "id": "592e6304-19ee-4688-a406-5f582e912afe", "metadata": {}, "source": [ "Convert to a dataframe for analysis." ] }, { "cell_type": "code", "execution_count": null, "id": "cbca0363-d5b4-4288-b86c-0c9bdd7e2b9e", "metadata": {}, "outputs": [], "source": [ "df = pd.DataFrame(issues)\n", "df.head()" ] }, { "cell_type": "markdown", "id": "8ef91889-a9f9-4427-8321-d6946a957add", "metadata": {}, "source": [ "How many issues are available?" ] }, { "cell_type": "code", "execution_count": null, "id": "2fd46be3-5ea4-4e1d-bd78-d2060c32c5c2", "metadata": {}, "outputs": [], "source": [ "df.shape[0]" ] }, { "cell_type": "markdown", "id": "190a6ba5-29bf-43bf-a9bd-ed59b3ae546d", "metadata": {}, "source": [ "What is the date range of the issues?" ] }, { "cell_type": "code", "execution_count": null, "id": "cf2270df-4a96-483b-a998-59f92a2bd755", "metadata": {}, "outputs": [], "source": [ "df[\"issue_date\"].min()" ] }, { "cell_type": "code", "execution_count": null, "id": "b5c3a66b-0a1f-484a-813c-3caca431602b", "metadata": {}, "outputs": [], "source": [ "df[\"issue_date\"].max()" ] }, { "cell_type": "markdown", "id": "8a1e4a8e-f98c-41f6-a67b-603e14b3fbf2", "metadata": {}, "source": [ "## Harvest the issues as PDFs\n", "\n", "Now that we have the issue data, we can use it to download the PDFs." 
] }, { "cell_type": "code", "execution_count": null, "id": "a9444014-868e-4921-823d-80d3d8b6f68a", "metadata": {}, "outputs": [], "source": [ "# THIS CODE IS A SLIGHTLY MODIFIED VERSION OF WHAT'S IN THE TROVE NEWSPAPER HARVESTER\n", "\n", "\n", "def ping_pdf(ping_url):\n", "    \"\"\"\n", "    Check to see if a PDF is ready for download.\n", "    If a 200 status code is received, return True.\n", "    \"\"\"\n", "    ready = False\n", "    try:\n", "        response = s.get(ping_url, timeout=30)\n", "        response.raise_for_status()\n", "    except HTTPError:\n", "        # A 423 status means the PDF isn't ready yet, anything else is a real error\n", "        if response.status_code == 423:\n", "            ready = False\n", "        else:\n", "            raise\n", "    else:\n", "        ready = True\n", "    return ready\n", "\n", "\n", "def get_pdf_url(issue_id):\n", "    \"\"\"\n", "    Get the url of the PDF version of an issue.\n", "    These can take a while to generate, so we need to ping the server to see if it's ready before we download.\n", "    Returns None if the PDF is not ready after 5 tries.\n", "    \"\"\"\n", "    pdf_url = None\n", "    # Ask for the PDF to be created\n", "    prep_url = (\n", "        f\"https://trove.nla.gov.au/newspaper/rendition/nla.news-issue{issue_id}/prep\"\n", "    )\n", "    response = s.get(prep_url)\n", "    # Get the hash\n", "    prep_id = response.text\n", "    # Url to check if the PDF is ready\n", "    ping_url = f\"https://trove.nla.gov.au/newspaper/rendition/nla.news-issue{issue_id}.ping?followup={prep_id}\"\n", "    tries = 0\n", "    ready = False\n", "    time.sleep(2)  # Give some time to generate pdf\n", "    # Are you ready yet?\n", "    while ready is False and tries < 5:\n", "        ready = ping_pdf(ping_url)\n", "        if not ready:\n", "            tries += 1\n", "            time.sleep(2)\n", "    # Build the download url if ready\n", "    if ready:\n", "        pdf_url = f\"https://trove.nla.gov.au/newspaper/rendition/nla.news-issue{issue_id}.pdf?followup={prep_id}\"\n", "    return pdf_url\n", "\n", "\n", "def harvest_pdfs(issues, start_date=None, end_date=None):\n", "    \"\"\"\n", "    Download all issue pdfs within the given date range.\n", "    \"\"\"\n", "    output_path = Path(\"data\", \"issues\")\n", "    output_path.mkdir(parents=True, exist_ok=True)\n", "    df = pd.DataFrame(issues)\n", "    if start_date and end_date:\n", "        df_range = df.loc[\n", "            (df[\"issue_date\"] >= start_date) & (df[\"issue_date\"] <= end_date)\n", "        ]\n", "    elif start_date:\n", "        df_range = df.loc[(df[\"issue_date\"] >= start_date)]\n", "    elif end_date:\n", "        # Include the end date itself, as in the combined range above\n", "        df_range = df.loc[(df[\"issue_date\"] <= end_date)]\n", "    else:\n", "        df_range = df\n", "    for issue in tqdm(df_range.itertuples(), total=df_range.shape[0]):\n", "        pdf_url = get_pdf_url(issue.issue_id)\n", "        # Skip issues whose PDF wasn't ready after the maximum number of tries\n", "        if pdf_url is None:\n", "            print(f\"PDF not ready, skipping issue {issue.issue_id}\")\n", "            continue\n", "        response = s.get(pdf_url)\n", "        Path(\n", "            output_path,\n", "            f'{issue.title_id}-{issue.issue_date.replace(\"-\", \"\")}-{issue.issue_id}.pdf',\n", "        ).write_bytes(response.content)" 
] }, { "cell_type": "markdown", "id": "1e19ba5d-3cf4-4cbb-afb0-0c707a1ca968", "metadata": {}, "source": [ "In the cell below you can set a date range for your harvest. Adjust the start and end dates as required. If you want to harvest ALL the issues, set the start and end dates to `None`." ] }, { "cell_type": "code", "execution_count": null, "id": "7f23de9a-e43f-4290-a28f-7feefdf66e4f", "metadata": {}, "outputs": [], "source": [ "# Set start and end dates - YYYY-MM-DD, eg:\n", "# start_date = '1932-05-01'\n", "# Adjust these to suit your case, set to None to get everything\n", "start_date = None\n", "end_date = None\n", "\n", "# Start harvesting the PDFs!\n", "harvest_pdfs(issues, start_date=start_date, end_date=end_date)" ] }, { "cell_type": "markdown", "id": "7db58829-9650-43ae-aa3e-8fcad968f673", "metadata": {}, "source": [ "----\n", "\n", "Created by [Tim Sherratt](https://timsherratt.org/) for the [GLAM Workbench](https://glam-workbench.github.io/). \n", "Support this project by becoming a [GitHub sponsor](https://github.com/sponsors/wragge?o=esb)." 
] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.12" } }, "nbformat": 4, "nbformat_minor": 5 }