{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Gathering historical data about the addition of newspaper titles to Trove\n", "\n", "The number of digitised newspapers available through Trove has increased dramatically since 2009. Understanding when newspapers were added is important for historiographical purposes, but there's no data about this available directly from Trove. This notebook uses web archives to extract lists of newspapers in Trove over time, and chart Trove's development.\n", "\n", "Trove has always provided a browseable list of digitised newspaper titles. The URL and format of this list have changed over time, but it's possible to find captures of this page in the Internet Archive and extract the full list of titles. The pages are also captured in the Australian Web Archive, but the Wayback Machine has a more detailed record.\n", "\n", "The pages that I'm looking for are:\n", "\n", "* [http://trove.nla.gov.au/ndp/del/titles](https://web.archive.org/web/*/http://trove.nla.gov.au/ndp/del/titles)\n", "* [https://trove.nla.gov.au/newspaper/about](https://web.archive.org/web/*/https://trove.nla.gov.au/newspaper/about)\n", "\n", "This notebook creates the following data files:\n", "\n", "* [trove_newspaper_titles_2009_2021.csv](https://github.com/GLAM-Workbench/trove-newspapers/blob/master/trove_newspaper_titles_2009_2021.csv) – complete dataset of captures and titles\n", "* [trove_newspaper_titles_first_appearance_2009_2021.csv](https://github.com/GLAM-Workbench/trove-newspapers/blob/master/trove_newspaper_titles_first_appearance_2009_2021.csv) – filtered dataset, showing only the first appearance of each title / place / date range combination\n", "\n", "I've also created a [browseable list of titles](https://gist.github.com/wragge/7d80507c3e7957e271c572b8f664031a), showing when they first appeared in Trove."
] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [
 "import json\n",
 "import re\n",
 "from pathlib import Path\n",
 "\n",
 "import altair as alt\n",
 "import arrow\n",
 "import pandas as pd\n",
 "import requests_cache\n",
 "from bs4 import BeautifulSoup\n",
 "from requests.adapters import HTTPAdapter\n",
 "from requests.packages.urllib3.util.retry import Retry\n",
 "from surt import surt\n",
 "\n",
 "s = requests_cache.CachedSession(\"archived_titles\")\n",
 "retries = Retry(total=5, backoff_factor=1, status_forcelist=[502, 503, 504])\n",
 "s.mount(\"https://\", HTTPAdapter(max_retries=retries))\n",
 "s.mount(\"http://\", HTTPAdapter(max_retries=retries))"
] }, { "cell_type": "markdown", "metadata": {}, "source": [
 "## Code for harvesting web archive captures\n",
 "\n",
 "We're using the Memento protocol to get a list of captures. See the [Web Archives section](https://glam-workbench.net/web-archives/) of the GLAM Workbench for more details."
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [
 "# The code in this cell is copied from notebooks in the Web Archives section of the GLAM Workbench (https://glam-workbench.net/web-archives/)\n",
 "# In particular see: https://glam-workbench.net/web-archives/#find-all-the-archived-versions-of-a-web-page\n",
 "\n",
 "# These are the repositories we'll be using\n",
 "TIMEGATES = {\n",
 "    \"awa\": \"https://web.archive.org.au/awa/\",\n",
 "    \"nzwa\": \"https://ndhadeliver.natlib.govt.nz/webarchive/wayback/\",\n",
 "    \"ukwa\": \"https://www.webarchive.org.uk/wayback/en/archive/\",\n",
 "    \"ia\": \"https://web.archive.org/web/\",\n",
 "}\n",
 "\n",
 "\n",
 "def convert_lists_to_dicts(results):\n",
 "    \"\"\"\n",
 "    Converts IA style timemap (a JSON array of arrays) to a list of dictionaries.\n",
 "    Renames keys to standardise IA with other Timemaps.\n",
 "    \"\"\"\n",
 "    if results:\n",
 "        keys = results[0]\n",
 "        results_as_dicts = [dict(zip(keys, v)) for v in results[1:]]\n",
 "    else:\n",
 "        results_as_dicts = results\n",
 "    for d in results_as_dicts:\n",
 "        d[\"status\"] = d.pop(\"statuscode\")\n",
 "        d[\"mime\"] = d.pop(\"mimetype\")\n",
 "        d[\"url\"] = d.pop(\"original\")\n",
 "    return results_as_dicts\n",
 "\n",
 "\n",
 "def get_capture_data_from_memento(url, request_type=\"head\"):\n",
 "    \"\"\"\n",
 "    For OpenWayback systems this can get some extra capture info to insert into Timemaps.\n",
 "    \"\"\"\n",
 "    if request_type == \"head\":\n",
 "        response = s.head(url)\n",
 "    else:\n",
 "        response = s.get(url)\n",
 "    headers = response.headers\n",
 "    length = headers.get(\"x-archive-orig-content-length\")\n",
 "    status = headers.get(\"x-archive-orig-status\")\n",
 "    status = status.split(\" \")[0] if status else None\n",
 "    mime = headers.get(\"x-archive-orig-content-type\")\n",
 "    mime = mime.split(\";\")[0] if mime else None\n",
 "    return {\"length\": length, \"status\": status, \"mime\": mime}\n",
 "\n",
 "\n",
 "def convert_link_to_json(results, enrich_data=False):\n",
 "    \"\"\"\n",
 "    Converts link formatted Timemap to JSON.\n",
 "    \"\"\"\n",
 "    data = []\n",
 "    for line in results.splitlines():\n",
 "        parts = line.split(\"; \")\n",
 "        if len(parts) > 1:\n",
 "            link_type = re.search(\n",
 "                r'rel=\"(original|self|timegate|first memento|last memento|memento)\"',\n",
 "                parts[1],\n",
 "            ).group(1)\n",
 "            if link_type == \"memento\":\n",
 "                link = parts[0].strip(\"<>\")\n",
 "                timestamp, original = re.search(r\"/(\\d{14})/(.*)$\", link).groups()\n",
 "                capture = {\n",
 "                    \"urlkey\": surt(original),\n",
 "                    \"timestamp\": timestamp,\n",
 "                    \"url\": original,\n",
 "                }\n",
 "                if enrich_data:\n",
 "                    capture.update(get_capture_data_from_memento(link))\n",
 "                    print(capture)\n",
 "                data.append(capture)\n",
 "    return data\n",
 "\n",
 "\n",
 "def get_timemap_as_json(timegate, url, enrich_data=False):\n",
 "    \"\"\"\n",
 "    Get a Timemap then normalise results (if necessary) to return a list of dicts.\n",
 "    \"\"\"\n",
 "    tg_url = f\"{TIMEGATES[timegate]}timemap/json/{url}/\"\n",
 "    response = s.get(tg_url)\n",
 "    response_type = response.headers[\"content-type\"]\n",
 "    if response_type == \"text/x-ndjson\":\n",
 "        data = [json.loads(line) for line in response.text.splitlines()]\n",
 "    elif response_type == \"application/json\":\n",
 "        data = convert_lists_to_dicts(response.json())\n",
 "    elif response_type in [\"application/link-format\", \"text/html;charset=utf-8\"]:\n",
 "        data = convert_link_to_json(response.text, enrich_data=enrich_data)\n",
 "    return data"
] }, { "cell_type": "markdown", "metadata": {}, "source": [
 "## Harvest the title data from the Internet Archive\n",
 "\n",
 "This gets the web page captures from the Internet Archive, scrapes the list of titles from the page, then does a bit of normalisation of the title data."
] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [
 "titles = []\n",
 "\n",
 "# These are the pages that listed available titles.\n",
 "# There was a change in 2016\n",
 "pages = [\n",
 "    {\"url\": \"http://trove.nla.gov.au/ndp/del/titles\", \"path\": \"/ndp/del/title/\"},\n",
 "    {\"url\": \"https://trove.nla.gov.au/newspaper/about\", \"path\": \"/newspaper/title/\"},\n",
 "]\n",
 "\n",
 "for page in pages:\n",
 "    for capture in get_timemap_as_json(\"ia\", page[\"url\"]):\n",
 "        if capture[\"status\"] == \"200\":\n",
 "            url = f'https://web.archive.org/web/{capture[\"timestamp\"]}id_/{capture[\"url\"]}'\n",
 "            # print(url)\n",
 "            capture_date = arrow.get(capture[\"timestamp\"][:8], \"YYYYMMDD\").format(\n",
 "                \"YYYY-MM-DD\"\n",
 "            )\n",
 "            # print(capture_date)\n",
 "            response = s.get(url)\n",
 "            soup = BeautifulSoup(response.content)\n",
 "            title_links = soup.find_all(\"a\", href=re.compile(page[\"path\"]))\n",
 "            for title in title_links:\n",
 "                # Get the title text\n",
 "                full_title = title.get_text().strip()\n",
 "\n",
 "                # Get the title id\n",
 "                title_id = re.search(r\"\\/(\\d+)\\/?$\", title[\"href\"]).group(1)\n",
 "\n",
 "                # Most of the code below is aimed at normalising the publication place and dates values to allow for easy grouping & deduplication\n",
 "                brief_title = re.sub(r\"\\(.+\\)\\s*$\", \"\", full_title).strip()\n",
 "                try:\n",
 "                    details = re.search(r\"\\((.+)\\)\\s*$\", full_title).group(1).split(\":\")\n",
 "                except AttributeError:\n",
 "                    place = \"\"\n",
 "                    dates = \"\"\n",
 "                else:\n",
 "                    try:\n",
 "                        place = details[0].strip()\n",
 "                        # Normalise states\n",
 "                        try:\n",
 "                            place = re.sub(\n",
 "                                r\"(, )?([A-Za-z]+)[\\.\\s]*$\",\n",
 "                                lambda match: f'{match.group(1) if match.group(1) else \"\"}{match.group(2).upper()}',\n",
 "                                place,\n",
 "                            )\n",
 "                        except AttributeError:\n",
 "                            pass\n",
 "                        # Normalise dates\n",
 "                        dates = \" - \".join(\n",
 "                            [d.strip() for d in details[1].strip().split(\"-\")]\n",
 "                        )\n",
 "                    except IndexError:\n",
 "                        place = \"\"\n",
 "                        dates = \" - \".join(\n",
 "                            [d.strip() for d in details[0].strip().split(\"-\")]\n",
 "                        )\n",
 "                titles.append(\n",
 "                    {\n",
 "                        \"title_id\": title_id,\n",
 "                        \"full_title\": full_title,\n",
 "                        \"title\": brief_title,\n",
 "                        \"place\": place,\n",
 "                        \"dates\": dates,\n",
 "                        \"capture_date\": capture_date,\n",
 "                        \"capture_timestamp\": capture[\"timestamp\"],\n",
 "                    }\n",
 "                )"
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Convert the title data to a DataFrame for analysis" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "df = pd.DataFrame(titles)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
 "<table border=\"1\" class=\"dataframe\">\n",
 "  <thead>\n",
 "    <tr style=\"text-align: right;\"><th></th><th>title_id</th><th>full_title</th><th>title</th><th>place</th><th>dates</th><th>capture_date</th><th>capture_timestamp</th></tr>\n",
 "  </thead>\n",
 "  <tbody>\n",
 "    <tr><th>0</th><td>34</td><td>Advertiser (Adelaide, SA : 1889-1931)</td><td>Advertiser</td><td>Adelaide, SA</td><td>1889 - 1931</td><td>2009-11-12</td><td>20091112000713</td></tr>\n",
 "    <tr><th>1</th><td>13</td><td>Argus (Melbourne, Vic. : 1848-1954)</td><td>Argus</td><td>Melbourne, VIC</td><td>1848 - 1954</td><td>2009-11-12</td><td>20091112000713</td></tr>\n",
 "    <tr><th>2</th><td>16</td><td>Brisbane Courier (Qld. : 1864-1933)</td><td>Brisbane Courier</td><td>QLD</td><td>1864 - 1933</td><td>2009-11-12</td><td>20091112000713</td></tr>\n",
 "    <tr><th>3</th><td>11</td><td>Canberra Times (ACT : 1926-1954)</td><td>Canberra Times</td><td>ACT</td><td>1926 - 1954</td><td>2009-11-12</td><td>20091112000713</td></tr>\n",
 "    <tr><th>4</th><td>24</td><td>Colonial Times (Hobart, Tas. : 1828-1857)</td><td>Colonial Times</td><td>Hobart, TAS</td><td>1828 - 1857</td><td>2009-11-12</td><td>20091112000713</td></tr>\n",
 "    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
 "    <tr><th>107017</th><td>1331</td><td>South Australian Record and Australasian and S...</td><td>South Australian Record and Australasian and S...</td><td>London, ENGLAND</td><td>1840 - 1841</td><td>2022-01-16</td><td>20220116142742</td></tr>\n",
 "    <tr><th>107018</th><td>1369</td><td>Territory of Papua Government Gazette (Papua N...</td><td>Territory of Papua Government Gazette</td><td>Papua New GUINEA</td><td>1906 - 1942</td><td>2022-01-16</td><td>20220116142742</td></tr>\n",
 "    <tr><th>107019</th><td>1371</td><td>Territory of Papua and New Guinea Government G...</td><td>Territory of Papua and New Guinea Government G...</td><td></td><td>1949 - 1971</td><td>2022-01-16</td><td>20220116142742</td></tr>\n",
 "    <tr><th>107020</th><td>1370</td><td>Territory of Papua-New Guinea Government Gazet...</td><td>Territory of Papua-New Guinea Government Gazette</td><td></td><td>1945 - 1949</td><td>2022-01-16</td><td>20220116142742</td></tr>\n",
 "    <tr><th>107021</th><td>1391</td><td>Tribune (Philippines : 1932 - 1945)</td><td>Tribune</td><td>PHILIPPINES</td><td>1932 - 1945</td><td>2022-01-16</td><td>20220116142742</td></tr>\n",
 "  </tbody>\n",
 "</table>\n",
 "<p>107022 rows × 7 columns</p>\n",
 "</div>
" ], "text/plain": [ " title_id full_title \\\n", "0 34 Advertiser (Adelaide, SA : 1889-1931) \n", "1 13 Argus (Melbourne, Vic. : 1848-1954) \n", "2 16 Brisbane Courier (Qld. : 1864-1933) \n", "3 11 Canberra Times (ACT : 1926-1954) \n", "4 24 Colonial Times (Hobart, Tas. : 1828-1857) \n", "... ... ... \n", "107017 1331 South Australian Record and Australasian and S... \n", "107018 1369 Territory of Papua Government Gazette (Papua N... \n", "107019 1371 Territory of Papua and New Guinea Government G... \n", "107020 1370 Territory of Papua-New Guinea Government Gazet... \n", "107021 1391 Tribune (Philippines : 1932 - 1945) \n", "\n", " title place \\\n", "0 Advertiser Adelaide, SA \n", "1 Argus Melbourne, VIC \n", "2 Brisbane Courier QLD \n", "3 Canberra Times ACT \n", "4 Colonial Times Hobart, TAS \n", "... ... ... \n", "107017 South Australian Record and Australasian and S... London, ENGLAND \n", "107018 Territory of Papua Government Gazette Papua New GUINEA \n", "107019 Territory of Papua and New Guinea Government G... \n", "107020 Territory of Papua-New Guinea Government Gazette \n", "107021 Tribune PHILIPPINES \n", "\n", " dates capture_date capture_timestamp \n", "0 1889 - 1931 2009-11-12 20091112000713 \n", "1 1848 - 1954 2009-11-12 20091112000713 \n", "2 1864 - 1933 2009-11-12 20091112000713 \n", "3 1926 - 1954 2009-11-12 20091112000713 \n", "4 1828 - 1857 2009-11-12 20091112000713 \n", "... ... ... ... 
\n", "107017 1840 - 1841 2022-01-16 20220116142742 \n", "107018 1906 - 1942 2022-01-16 20220116142742 \n", "107019 1949 - 1971 2022-01-16 20220116142742 \n", "107020 1945 - 1949 2022-01-16 20220116142742 \n", "107021 1932 - 1945 2022-01-16 20220116142742 \n", "\n", "[107022 rows x 7 columns]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "130" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Number of captures\n", "len(df[\"capture_timestamp\"].unique())" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "120" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Number of days on which the pages were captured\n", "len(df[\"capture_date\"].unique())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Save this dataset as a CSV file." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "df.to_csv(\"trove_newspaper_titles_2009_2021.csv\", index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## How did the number of titles change over time?" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
 "<table border=\"1\" class=\"dataframe\">\n",
 "  <thead>\n",
 "    <tr style=\"text-align: right;\"><th></th><th>capture_date</th><th>total</th></tr>\n",
 "  </thead>\n",
 "  <tbody>\n",
 "    <tr><th>0</th><td>2022-01-16</td><td>1700</td></tr>\n",
 "    <tr><th>1</th><td>2022-01-10</td><td>1700</td></tr>\n",
 "    <tr><th>2</th><td>2021-12-13</td><td>1697</td></tr>\n",
 "    <tr><th>3</th><td>2021-11-20</td><td>1690</td></tr>\n",
 "    <tr><th>4</th><td>2021-11-16</td><td>1690</td></tr>\n",
 "    <tr><th>...</th><td>...</td><td>...</td></tr>\n",
 "    <tr><th>115</th><td>2010-05-01</td><td>37</td></tr>\n",
 "    <tr><th>116</th><td>2009-11-24</td><td>34</td></tr>\n",
 "    <tr><th>117</th><td>2009-11-22</td><td>34</td></tr>\n",
 "    <tr><th>118</th><td>2009-12-12</td><td>34</td></tr>\n",
 "    <tr><th>119</th><td>2009-11-12</td><td>34</td></tr>\n",
 "  </tbody>\n",
 "</table>\n",
 "<p>120 rows × 2 columns</p>\n",
 "</div>" ], "text/plain": [ " capture_date total\n", "0 2022-01-16 1700\n", "1 2022-01-10 1700\n", "2 2021-12-13 1697\n", "3 2021-11-20 1690\n", "4 2021-11-16 1690\n", ".. ... ...\n", "115 2010-05-01 37\n", "116 2009-11-24 34\n", "117 2009-11-22 34\n", "118 2009-12-12 34\n", "119 2009-11-12 34\n", "\n", "[120 rows x 2 columns]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Drop duplicates in cases where there were multiple captures on a single day\n", "captures_df = df.drop_duplicates(subset=[\"capture_date\", \"full_title\"])\n", "\n", "# Calculate totals per capture\n", "capture_totals = captures_df[\"capture_date\"].value_counts().to_frame().reset_index()\n", "capture_totals.columns = [\"capture_date\", \"total\"]\n", "capture_totals" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(capture_totals).mark_line(point=True).encode(\n", "    x=alt.X(\"capture_date:T\", title=\"Date captured\"),\n", "    y=alt.Y(\"total:Q\", title=\"Number of newspaper titles\"),\n", "    tooltip=[alt.Tooltip(\"capture_date:T\", format=\"%e %b %Y\"), \"total:Q\"],\n", ").properties(width=700)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## When did titles first appear?\n", "\n", "For historiographical purposes, it's useful to know when a particular title first appeared in Trove. Here we'll only keep the first appearance of each title (or any subsequent changes to its date range / location)." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "first_appearance = df.drop_duplicates(subset=[\"title\", \"place\", \"dates\"])" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
 "<table border=\"1\" class=\"dataframe\">\n",
 "  <thead>\n",
 "    <tr style=\"text-align: right;\"><th></th><th>title_id</th><th>full_title</th><th>title</th><th>place</th><th>dates</th><th>capture_date</th><th>capture_timestamp</th></tr>\n",
 "  </thead>\n",
 "  <tbody>\n",
 "    <tr><th>0</th><td>34</td><td>Advertiser (Adelaide, SA : 1889-1931)</td><td>Advertiser</td><td>Adelaide, SA</td><td>1889 - 1931</td><td>2009-11-12</td><td>20091112000713</td></tr>\n",
 "    <tr><th>1</th><td>13</td><td>Argus (Melbourne, Vic. : 1848-1954)</td><td>Argus</td><td>Melbourne, VIC</td><td>1848 - 1954</td><td>2009-11-12</td><td>20091112000713</td></tr>\n",
 "    <tr><th>2</th><td>16</td><td>Brisbane Courier (Qld. : 1864-1933)</td><td>Brisbane Courier</td><td>QLD</td><td>1864 - 1933</td><td>2009-11-12</td><td>20091112000713</td></tr>\n",
 "    <tr><th>3</th><td>11</td><td>Canberra Times (ACT : 1926-1954)</td><td>Canberra Times</td><td>ACT</td><td>1926 - 1954</td><td>2009-11-12</td><td>20091112000713</td></tr>\n",
 "    <tr><th>4</th><td>24</td><td>Colonial Times (Hobart, Tas. : 1828-1857)</td><td>Colonial Times</td><td>Hobart, TAS</td><td>1828 - 1857</td><td>2009-11-12</td><td>20091112000713</td></tr>\n",
 "    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
 "    <tr><th>105023</th><td>1773</td><td>Dawn Newsletter (Perth, WA : 1952 - 1954)</td><td>Dawn Newsletter</td><td>Perth, WA</td><td>1952 - 1954</td><td>2022-01-10</td><td>20220110214554</td></tr>\n",
 "    <tr><th>105112</th><td>1388</td><td>La Rondine (Perth, WA : 1970 - 1974; 1983 - 1984)</td><td>La Rondine</td><td>Perth, WA</td><td>1970 - 1974; 1983 - 1984</td><td>2022-01-10</td><td>20220110214554</td></tr>\n",
 "    <tr><th>105121</th><td>1537</td><td>Listening Post (Perth, WA : 1921 - 1954)</td><td>Listening Post</td><td>Perth, WA</td><td>1921 - 1954</td><td>2022-01-10</td><td>20220110214554</td></tr>\n",
 "    <tr><th>105274</th><td>99</td><td>Western Argus (Kalgoorlie, WA : 1894 - 1895)</td><td>Western Argus</td><td>Kalgoorlie, WA</td><td>1894 - 1895</td><td>2022-01-10</td><td>20220110214554</td></tr>\n",
 "    <tr><th>106887</th><td>1649</td><td>North Coolgardie Herald and Miners Daily News ...</td><td>North Coolgardie Herald and Miners Daily News</td><td>Menzies, WA</td><td>1899 - 1904</td><td>2022-01-16</td><td>20220116142742</td></tr>\n",
 "  </tbody>\n",
 "</table>\n",
 "<p>2120 rows × 7 columns</p>\n",
 "</div>
" ], "text/plain": [ " title_id full_title \\\n", "0 34 Advertiser (Adelaide, SA : 1889-1931) \n", "1 13 Argus (Melbourne, Vic. : 1848-1954) \n", "2 16 Brisbane Courier (Qld. : 1864-1933) \n", "3 11 Canberra Times (ACT : 1926-1954) \n", "4 24 Colonial Times (Hobart, Tas. : 1828-1857) \n", "... ... ... \n", "105023 1773 Dawn Newsletter (Perth, WA : 1952 - 1954) \n", "105112 1388 La Rondine (Perth, WA : 1970 - 1974; 1983 - 1984) \n", "105121 1537 Listening Post (Perth, WA : 1921 - 1954) \n", "105274 99 Western Argus (Kalgoorlie, WA : 1894 - 1895) \n", "106887 1649 North Coolgardie Herald and Miners Daily News ... \n", "\n", " title place \\\n", "0 Advertiser Adelaide, SA \n", "1 Argus Melbourne, VIC \n", "2 Brisbane Courier QLD \n", "3 Canberra Times ACT \n", "4 Colonial Times Hobart, TAS \n", "... ... ... \n", "105023 Dawn Newsletter Perth, WA \n", "105112 La Rondine Perth, WA \n", "105121 Listening Post Perth, WA \n", "105274 Western Argus Kalgoorlie, WA \n", "106887 North Coolgardie Herald and Miners Daily News Menzies, WA \n", "\n", " dates capture_date capture_timestamp \n", "0 1889 - 1931 2009-11-12 20091112000713 \n", "1 1848 - 1954 2009-11-12 20091112000713 \n", "2 1864 - 1933 2009-11-12 20091112000713 \n", "3 1926 - 1954 2009-11-12 20091112000713 \n", "4 1828 - 1857 2009-11-12 20091112000713 \n", "... ... ... ... \n", "105023 1952 - 1954 2022-01-10 20220110214554 \n", "105112 1970 - 1974; 1983 - 1984 2022-01-10 20220110214554 \n", "105121 1921 - 1954 2022-01-10 20220110214554 \n", "105274 1894 - 1895 2022-01-10 20220110214554 \n", "106887 1899 - 1904 2022-01-16 20220116142742 \n", "\n", "[2120 rows x 7 columns]" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "first_appearance" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Find when a particular newspaper first appeared." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
 "<table border=\"1\" class=\"dataframe\">\n",
 "  <thead>\n",
 "    <tr style=\"text-align: right;\"><th></th><th>title_id</th><th>full_title</th><th>title</th><th>place</th><th>dates</th><th>capture_date</th><th>capture_timestamp</th></tr>\n",
 "  </thead>\n",
 "  <tbody>\n",
 "    <tr><th>3</th><td>11</td><td>Canberra Times (ACT : 1926-1954)</td><td>Canberra Times</td><td>ACT</td><td>1926 - 1954</td><td>2009-11-12</td><td>20091112000713</td></tr>\n",
 "    <tr><th>9395</th><td>11</td><td>Canberra Times (ACT : 1926 - 1995)</td><td>Canberra Times</td><td>ACT</td><td>1926 - 1995</td><td>2012-12-27</td><td>20121227113753</td></tr>\n",
 "  </tbody>\n",
 "</table>\n",
 "</div>" ], "text/plain": [ " title_id full_title title place \\\n", "3 11 Canberra Times (ACT : 1926-1954) Canberra Times ACT \n", "9395 11 Canberra Times (ACT : 1926 - 1995) Canberra Times ACT \n", "\n", " dates capture_date capture_timestamp \n", "3 1926 - 1954 2009-11-12 20091112000713 \n", "9395 1926 - 1995 2012-12-27 20121227113753 " ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "first_appearance.loc[first_appearance[\"title\"] == \"Canberra Times\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Generate an alphabetical list for easy browsing. View the [results as a Gist](https://gist.github.com/wragge/7d80507c3e7957e271c572b8f664031a)." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [
 "with Path(\"titles_list.md\").open(\"w\") as titles_list:\n",
 "    for title, group in first_appearance.groupby([\"title\", \"title_id\"]):\n",
 "        places = \" | \".join(group[\"place\"].unique())\n",
 "        titles_list.write(\n",
 "            f'\\n\\n{title[0]} ({places})\\n\\n'\n",
 "        )\n",
 "        titles_list.write(\n",
 "            group.sort_values(by=\"capture_date\")[\n",
 "                [\"capture_date\", \"dates\", \"place\"]\n",
 "            ].to_html(index=False)\n",
 "        )"
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Save this dataset to CSV." ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "first_appearance.to_csv(\n", "    \"trove_newspaper_titles_first_appearance_2009_2021.csv\", index=False\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "\n", "Created by [Tim Sherratt](https://timsherratt.org/) for the [GLAM Workbench](https://glam-workbench.github.io/). \n", "Support this project by becoming a [GitHub sponsor](https://github.com/sponsors/wragge?o=esb)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.12" } }, "nbformat": 4, "nbformat_minor": 4 }