{
"cells": [
{
"cell_type": "markdown",
"id": "846e5028-ca8c-419e-9239-48979fe4f729",
"metadata": {},
"source": [
"# Harvest information about newspaper issues\n",
"\n",
"When you search Trove's newspapers, you find articles – these articles are grouped by page, and all the pages from a particular date make up an issue. But how do you find out what issues are available? On what dates were newspapers published? This notebook shows how you can get information about issues from the Trove API.\n",
"\n",
"The code below generates two datasets:\n",
"\n",
"* **Total number of issues per year for every newspaper** – 27,615 rows with the fields:\n",
" * `title` – newspaper title\n",
" * `title_id` – newspaper id\n",
" * `state` – place of publication\n",
" * `year` – year published\n",
" * `issues` – number of issues\n",
"* **Complete list of issues for every newspaper** – 2,655,664 rows with the fields:\n",
" * `title` – newspaper title\n",
" * `title_id` – newspaper id\n",
" * `state` – place of publication\n",
" * `issue_id` – issue identifier\n",
" * `issue_date` – date of publication (YYYY-MM-DD)\n",
"\n",
"These were harvested on 18 October 2021. You can download the pre-harvested datasets from CloudStor:\n",
"\n",
"* [newspaper_issues_totals_by_year_20211018.csv](https://cloudstor.aarnet.edu.au/plus/s/e4IdDT8Zbg0A27S) (2.1MB)\n",
"* [newspaper_issues_20211018.csv](https://cloudstor.aarnet.edu.au/plus/s/YjKNqjWqCYkdInI) (222MB)\n",
"\n",
"### Issue URLs\n",
"\n",
"To keep the file size down, I haven't included an `issue_url` field in the issues dataset, but the URLs are easily generated from the `issue_id`. Just add the `issue_id` to the end of `http://nla.gov.au/nla.news-issue`. For example: http://nla.gov.au/nla.news-issue495426. Note that when you follow an issue URL, you're actually redirected to the URL of the first page in the issue."
]
},
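{
"cell_type": "markdown",
"id": "3c9a1f2e-5b7d-4e8a-9c1d-2f6b8a4e0d11",
"metadata": {},
"source": [
"As a quick sketch (reusing the example `issue_id` above):\n",
"\n",
"```python\n",
"issue_id = \"495426\"\n",
"issue_url = f\"http://nla.gov.au/nla.news-issue{issue_id}\"\n",
"# issue_url == 'http://nla.gov.au/nla.news-issue495426'\n",
"```"
]
},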
{
"cell_type": "code",
"execution_count": 1,
"id": "3ede8c11-990b-4373-85a9-75e0fb02b2d9",
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"import os\n",
"import time\n",
"\n",
"import altair as alt\n",
"import arrow\n",
"import pandas as pd\n",
"import requests_cache\n",
"from requests.adapters import HTTPAdapter\n",
"from urllib3.util.retry import Retry\n",
"from tqdm.auto import tqdm\n",
"\n",
"# Create a cached session that will automatically retry on server errors\n",
"s = requests_cache.CachedSession(\"issues\")\n",
"retries = Retry(total=5, backoff_factor=1, status_forcelist=[502, 503, 504])\n",
"s.mount(\"http://\", HTTPAdapter(max_retries=retries))\n",
"s.mount(\"https://\", HTTPAdapter(max_retries=retries))"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "40a18a1e-cd55-47e4-a264-8823691d4de7",
"metadata": {},
"outputs": [],
"source": [
"%%capture\n",
"# Load variables from the .env file if it exists\n",
"# Use %%capture to suppress messages\n",
"%load_ext dotenv\n",
"%dotenv"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "e4b5b5c5-d1cc-4f4c-8c21-a2057bc1c050",
"metadata": {},
"outputs": [],
"source": [
"# Insert your Trove API key\n",
"API_KEY = \"YOUR API KEY\"\n",
"\n",
"# Use api key value from environment variables if it is available\n",
"if os.getenv(\"TROVE_API_KEY\"):\n",
" API_KEY = os.getenv(\"TROVE_API_KEY\")\n",
"\n",
"API_URL = \"https://api.trove.nla.gov.au/v2/newspaper/title/\"\n",
"\n",
"PARAMS = {\"encoding\": \"json\", \"key\": API_KEY}"
]
},
{
"cell_type": "markdown",
"id": "76b3d96b-3740-4d80-8631-ed54e5f8583a",
"metadata": {},
"source": [
"## Total number of issues per year for every newspaper in Trove\n",
"\n",
"To get a list of all the newspapers in Trove you make a request to the `newspaper/titles` endpoint. This provides summary information about each title, but no data about issues.\n",
"\n",
"To get issue data you have to request information about each title separately, using the `newspaper/title/[title id]` endpoint. If you add `include=years` to the request, you get a list of years in which issues were published, and a total number of issues for each year. We can use this to aggregate information about the number of issues by title and year."
]
},
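{
"cell_type": "markdown",
"id": "8d2e4b6a-1c3f-4a5e-b7d9-0e2f4a6c8b22",
"metadata": {},
"source": [
"To illustrate, a request for the per-year issue counts of a single title looks something like this (a sketch only – `166` is the id of the *Canberra Community News*, which appears in the sample output below, and you'd need to supply your own API key):\n",
"\n",
"```python\n",
"import requests\n",
"\n",
"params = {\"encoding\": \"json\", \"key\": \"YOUR API KEY\", \"include\": \"years\"}\n",
"response = requests.get(\"https://api.trove.nla.gov.au/v2/newspaper/title/166\", params=params)\n",
"data = response.json()\n",
"# data[\"newspaper\"][\"year\"] is a list of dicts like {\"date\": \"1925\", \"issuecount\": \"3\"}\n",
"```"
]
},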
{
"cell_type": "code",
"execution_count": 4,
"id": "26450b2d-143b-4ac3-9f37-5abb06835811",
"metadata": {},
"outputs": [],
"source": [
"def get_issues_by_year():\n",
" \"\"\"\n",
" Gets the total number of issues per year for each newspaper.\n",
"\n",
" Returns:\n",
" * A list of dicts, each containing the number of issues available from a newspaper in a particular year\n",
" \"\"\"\n",
" years = []\n",
"\n",
" # First we get a list of all the newspapers (and gazettes) in Trove\n",
" response = s.get(\"https://api.trove.nla.gov.au/v2/newspaper/titles/\", params=PARAMS)\n",
" data = response.json()\n",
" titles = data[\"response\"][\"records\"][\"newspaper\"]\n",
"\n",
" # Then we loop through all the newspapers to retrieve issue data\n",
" for title in tqdm(titles):\n",
" params = PARAMS.copy()\n",
"\n",
" # This parameter adds the number of issues per year to the newspaper data\n",
" params[\"include\"] = \"years\"\n",
" response = s.get(f'{API_URL}{title[\"id\"]}', params=params)\n",
" try:\n",
" data = response.json()\n",
" except json.JSONDecodeError:\n",
" print(response.url)\n",
" print(response.text)\n",
" else:\n",
" # Loop through all the years, saving the totals\n",
" for year in data[\"newspaper\"][\"year\"]:\n",
" years.append(\n",
" {\n",
" \"title\": title[\"title\"],\n",
" \"title_id\": title[\"id\"],\n",
" \"state\": title[\"state\"],\n",
" \"year\": year[\"date\"],\n",
" \"issues\": int(year[\"issuecount\"]),\n",
" }\n",
" )\n",
" if not response.from_cache:\n",
" time.sleep(0.2)\n",
" return years"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5d3f90c8-75a6-48b0-9314-7d0bdb8abab1",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"issue_totals = get_issues_by_year()"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "399c2a40-9877-4806-b9fc-a1e41ba75372",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
" title title_id state year \\\n",
"0 Canberra Community News (ACT : 1925 - 1927) 166 ACT 1925 \n",
"1 Canberra Community News (ACT : 1925 - 1927) 166 ACT 1926 \n",
"2 Canberra Community News (ACT : 1925 - 1927) 166 ACT 1927 \n",
"3 Canberra Illustrated: A Quarterly Magazine (AC... 165 ACT 1925 \n",
"4 Federal Capital Pioneer (Canberra, ACT : 1924 ... 69 ACT 1924 \n",
"\n",
" issues \n",
"0 3 \n",
"1 12 \n",
"2 9 \n",
"3 1 \n",
"4 1 "
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Save results as a dataframe\n",
"df_totals = pd.DataFrame(issue_totals)\n",
"df_totals.head()"
]
},
{
"cell_type": "markdown",
"id": "9542b6ef-2c56-4d71-8f4c-b644db60f44a",
"metadata": {
"tags": []
},
"source": [
"How many issues are there?"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "9971748e-1e62-4571-962c-aa0152cad01d",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"2666287"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_totals[\"issues\"].sum()"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "7b1de749-8d0a-40dc-a563-c21ed5352ace",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"(27757, 5)"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_totals.shape"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "ccf8717c-0a0d-4e1d-accd-1628bcd0caba",
"metadata": {
"tags": [
"nbval-skip"
]
},
"outputs": [],
"source": [
"# Save as a CSV file\n",
"df_totals.to_csv(\n",
" f'newspaper_issues_totals_by_year_{arrow.now().format(\"YYYYMMDD\")}.csv', index=False\n",
")"
]
},
{
"cell_type": "markdown",
"id": "f2ebe633-878f-4d40-9a49-a8fce3e855e3",
"metadata": {},
"source": [
"### Display the total number of issues per year\n",
"\n",
"By grouping the number of issues by year, we can see how the number of issues in Trove changes over time. It's interesting to compare this to the [number of articles over time](https://glam-workbench.net/trove-newspapers/#visualise-the-total-number-of-newspaper-articles-in-trove-by-year-and-state)."
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "87d561d3-ddcc-4a07-9d39-921e6127fcf5",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Group by year and sum the issue totals\n",
"# Selecting the issues column avoids errors from summing non-numeric columns\n",
"df_years = df_totals.groupby(by=\"year\")[\"issues\"].sum().reset_index()"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "52653761-aa53-4894-9359-0016fe89d9a4",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"alt.Chart(...)"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Create a chart\n",
"alt.Chart(df_years).mark_bar().encode(\n",
" x=alt.X(\"year:Q\", axis=alt.Axis(format=\"c\")),\n",
" y=\"issues:Q\",\n",
" tooltip=[\"year:O\", \"issues:Q\"],\n",
").properties(width=800)"
]
},
{
"cell_type": "markdown",
"id": "232bc11a-c93f-479b-b8ff-7ece9ff7c488",
"metadata": {},
"source": [
"## Harvest a complete list of issues\n",
"\n",
"We've found out how many issues were published, but not _when_ they were published. To get a complete list of issue dates and identifiers we have to add another parameter to our title API request. The `range` parameter sets a date range. If we add it to our request, the API will return information about all the issues within that date range.\n",
"\n",
"How do we set the `range`? The summary information for each title includes `startDate` and `endDate` fields. We could simply set the `range` using these; however, that could return a huge amount of data in a single response. It's better to be conservative and request the issue data in manageable chunks, so the code below iterates over the complete date range for each title, requesting a year's worth of issues at a time. Note that the `range` parameter expects a date range in the format `YYYYMMDD-YYYYMMDD`.\n",
"\n",
"It turns out that some titles have no start or end dates at all, and some of the recorded dates are wrong. I've found ways to work around these problems. See below for more information."
]
},
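{
"cell_type": "markdown",
"id": "5f7a9c1b-3d5e-4f6a-8b0c-1d3e5f7a9c33",
"metadata": {},
"source": [
"For example, the `range` value covering a single year can be built like this (a stdlib sketch – the harvesting code below does the same thing with `arrow`):\n",
"\n",
"```python\n",
"from datetime import date\n",
"\n",
"start_date = date(1925, 1, 1)\n",
"end_date = date(1925, 12, 31)\n",
"range_param = f\"{start_date:%Y%m%d}-{end_date:%Y%m%d}\"\n",
"# range_param == '19250101-19251231'\n",
"```"
]
},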
{
"cell_type": "code",
"execution_count": 12,
"id": "34ee3876-1692-4a8d-9958-71b21b3c4c1f",
"metadata": {},
"outputs": [],
"source": [
"# These are newspapers where the date ranges are off by more than a year\n",
"# In these cases we'll harvest all the issues in one hit, rather than year by year\n",
"dodgy_dates = [\"1486\", \"1618\", \"586\"]\n",
"\n",
"\n",
"def get_title_summary(title_id):\n",
" \"\"\"\n",
" Get the details of a single newspaper title.\n",
" \"\"\"\n",
" response = s.get(f\"{API_URL}{title_id}\", params=PARAMS)\n",
" data = response.json()\n",
" return data[\"newspaper\"]\n",
"\n",
"\n",
"def get_issues_in_range(title_id, start_date, end_date):\n",
" \"\"\"\n",
" Get a list of issues available from a particular newspaper within the given date range.\n",
" \"\"\"\n",
" issues = []\n",
" params = PARAMS.copy()\n",
" params[\"include\"] = \"years\"\n",
" params[\"range\"] = f'{start_date.format(\"YYYYMMDD\")}-{end_date.format(\"YYYYMMDD\")}'\n",
" response = s.get(f\"{API_URL}{title_id}\", params=params)\n",
" try:\n",
" data = response.json()\n",
" except json.JSONDecodeError:\n",
" print(response.url)\n",
" print(response.text)\n",
" else:\n",
" for year in data[\"newspaper\"][\"year\"]:\n",
" if \"issue\" in year:\n",
" for issue in year[\"issue\"]:\n",
" issues.append(\n",
" {\n",
" \"title_id\": title_id,\n",
" \"issue_id\": issue[\"id\"],\n",
" \"issue_date\": issue[\"date\"],\n",
" }\n",
" )\n",
" if not response.from_cache:\n",
" time.sleep(0.2)\n",
" return issues\n",
"\n",
"\n",
"def get_issues_full_range(title_id):\n",
" \"\"\"\n",
" In most cases we set date ranges to get issue data in friendly chunks. But sometimes the date ranges are missing or wrong.\n",
" In these cases, we ask for everything at once, by setting the range to the limits of Trove.\n",
" \"\"\"\n",
" start_date = arrow.get(\"1803-01-01\")\n",
" range_end = arrow.now()\n",
" issues = get_issues_in_range(title_id, start_date, range_end)\n",
" return issues\n",
"\n",
"\n",
"def get_issues_from_title(title_id):\n",
" \"\"\"\n",
" Get a list of all the issues available for a particular newspaper.\n",
"\n",
" Params:\n",
" * title_id - a newspaper identifier\n",
" Returns:\n",
" * A list containing details of available issues\n",
" \"\"\"\n",
" issues = []\n",
" title_summary = get_title_summary(title_id)\n",
"\n",
" # Date range is off by more than a year, so get everything in one hit\n",
" if title_id in dodgy_dates:\n",
" issues += get_issues_full_range(title_id)\n",
" else:\n",
" try:\n",
" # The date ranges are not always reliable, so to make sure we get everything\n",
" # we'll set the range to the beginning and end of the given year\n",
" start_date = arrow.get(title_summary[\"startDate\"]).replace(day=1, month=1)\n",
" end_date = arrow.get(title_summary[\"endDate\"]).replace(day=31, month=12)\n",
" except KeyError:\n",
" # Some records have no start and end dates at all\n",
" # In this case set the range to the full range of Trove's newspapers\n",
" issues += get_issues_full_range(title_id)\n",
" else:\n",
" # If the date range is available, loop through it by year\n",
" while start_date <= end_date:\n",
" range_end = start_date.replace(month=12, day=31)\n",
" issues += get_issues_in_range(title_id, start_date, range_end)\n",
" start_date = start_date.shift(years=+1).replace(month=1, day=1)\n",
" return issues\n",
"\n",
"\n",
"def get_all_issues():\n",
" issues = []\n",
" response = s.get(\"https://api.trove.nla.gov.au/v2/newspaper/titles/\", params=PARAMS)\n",
" data = response.json()\n",
" titles = data[\"response\"][\"records\"][\"newspaper\"]\n",
" for title in tqdm(titles):\n",
" title_issues = get_issues_from_title(title[\"id\"])\n",
" issues += [\n",
" dict(i, title=title[\"title\"], state=title[\"state\"]) for i in title_issues\n",
" ]\n",
" return issues"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dccffd4f-2b8e-45c6-8814-146b9c85ceea",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"issues = get_all_issues()"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "fdd1c45e-4d55-4982-91fb-c0b5efb42740",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"2666287"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(issues)"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "7116dda9-ec50-4b99-b7f1-c64cd2db1c9a",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
" title_id issue_id issue_date title \\\n",
"0 166 495445 1925-10-14 Canberra Community News (ACT : 1925 - 1927) \n",
"1 166 495422 1925-11-11 Canberra Community News (ACT : 1925 - 1927) \n",
"2 166 495423 1925-12-11 Canberra Community News (ACT : 1925 - 1927) \n",
"3 166 495424 1926-01-11 Canberra Community News (ACT : 1925 - 1927) \n",
"4 166 495425 1926-02-11 Canberra Community News (ACT : 1925 - 1927) \n",
"\n",
" state \n",
"0 ACT \n",
"1 ACT \n",
"2 ACT \n",
"3 ACT \n",
"4 ACT "
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_issues = pd.DataFrame(issues)\n",
"df_issues.head()"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "b75ee5a6-a438-4b93-9811-e8da812b0ffd",
"metadata": {
"tags": [
"nbval-skip"
]
},
"outputs": [],
"source": [
"df_issues.to_csv(f'newspaper_issues_{arrow.now().format(\"YYYYMMDD\")}.csv', index=False)"
]
},
{
"cell_type": "markdown",
"id": "f64eaf70-330f-409c-9fb2-8d37ea18edef",
"metadata": {},
"source": [
"### Check to see what's missing\n",
"\n",
"I ran the code below a few times in order to identify problems with the harvest. It helped me track down newspapers that had errors in the date ranges. Most of the errors were small and I could pick up any missing issues by expanding the date range to cover a whole year. But in some cases, the date ranges missed multiple years. To get the issues in these missing years, I created a `dodgy_dates` list. Any newspapers in this list are processed differently – the given date range is ignored, and instead the range is set to cover the period from 1803 to the present! Titles that are missing start and end dates are treated the same way. Once these fixes were included, the only missing issues left were from the _Noosa News_ – requesting issues from this newspaper originally caused an error, but this bug has now been fixed by Trove so no issues are missing!"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "3ceb77a0-e561-4b39-9ccd-814cfe603199",
"metadata": {
"tags": [
"nbval-skip"
]
},
"outputs": [],
"source": [
"# Compare the total number of issues reported by the API with the number actually harvested\n",
"# This helps us identify cases where the harvest has failed for some reason.\n",
"missing = 0\n",
"for title, years in df_totals.groupby(by=[\"title_id\", \"title\"]):\n",
" num_issues = df_issues.loc[df_issues[\"title_id\"] == title[0]].shape[0]\n",
" if years[\"issues\"].sum() != num_issues:\n",
" print(title[0], title[1])\n",
" print(f'Year totals: {years[\"issues\"].sum()}')\n",
" print(f\"Issues harvested: {num_issues}\")\n",
" missing += years[\"issues\"].sum() - num_issues"
]
},
{
"cell_type": "markdown",
"id": "7e62ceeb-68fa-4762-8f2e-d62c8ef20b32",
"metadata": {},
"source": [
"----\n",
"\n",
"Created by [Tim Sherratt](https://timsherratt.org/) for the [GLAM Workbench](https://glam-workbench.github.io/). Support this project by becoming a [GitHub sponsor](https://github.com/sponsors/wragge?o=esb)."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}