{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Gathering historical data about the addition of newspaper titles to Trove\n",
    "\n",
    "The number of digitised newspapers available through Trove has increased dramatically since 2009. Understanding when newspapers were added is important for historiographical purposes, but there's no data about this available directly from Trove. This notebook uses web archives to extract lists of newspapers in Trove over time, and chart Trove's development.\n",
    "\n",
    "Trove has always provided a browseable list of digitised newspaper titles. The url and format of this list has changed over time, but it's possible to find captures of this page in the Internet Archive and extract the full list of titles. The pages are also captured in the Australian Web Archive, but the Wayback Machine has a more detailed record.\n",
    "\n",
    "The pages that I'm looking for are:\n",
    "\n",
    "* [http://trove.nla.gov.au/ndp/del/titles](https://web.archive.org/web/*/http://trove.nla.gov.au/ndp/del/titles)\n",
    "* [https://trove.nla.gov.au/newspaper/about](https://web.archive.org/web/*/https://trove.nla.gov.au/newspaper/about)\n",
    "\n",
    "This notebook creates the following data files:\n",
    "\n",
    "* [trove_newspaper_titles_2009_2021.csv](https://github.com/GLAM-Workbench/trove-newspapers/blob/master/trove_newspaper_titles_2009_2021.csv) – complete dataset of captures and titles\n",
    "* [trove_newspaper_titles_first_appearance_2009_2021.csv](https://github.com/GLAM-Workbench/trove-newspapers/blob/master/trove_newspaper_titles_first_appearance_2009_2021.csv) – filtered dataset, showing only the first appearance of each title / place / date range combination\n",
    "\n",
    "I've also created a [browseable list of titles](https://gist.github.com/wragge/7d80507c3e7957e271c572b8f664031a), showing when they first appeared in Trove."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "import re\n",
    "from pathlib import Path\n",
    "\n",
    "import altair as alt\n",
    "import arrow\n",
    "import pandas as pd\n",
    "import requests_cache\n",
    "from bs4 import BeautifulSoup\n",
    "from requests.adapters import HTTPAdapter\n",
    "from requests.packages.urllib3.util.retry import Retry\n",
    "from surt import surt\n",
    "\n",
    "s = requests_cache.CachedSession(\"archived_titles\")\n",
    "retries = Retry(total=5, backoff_factor=1, status_forcelist=[502, 503, 504])\n",
    "s.mount(\"https://\", HTTPAdapter(max_retries=retries))\n",
    "s.mount(\"http://\", HTTPAdapter(max_retries=retries))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Code for harvesting web archive captures\n",
    "\n",
    "We're using the Memento protocol to get a list of captures. See the [Web Archives section](https://glam-workbench.net/web-archives/) of the GLAM Workbench for more details."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "# The code in this cell is copied from notebooks in the Web Archives section of the GLAM Workbench (https://glam-workbench.net/web-archives/)\n",
    "# In particular see: https://glam-workbench.net/web-archives/#find-all-the-archived-versions-of-a-web-page\n",
    "\n",
    "# These are the repositories we'll be using\n",
    "TIMEGATES = {\n",
    "    \"awa\": \"https://web.archive.org.au/awa/\",\n",
    "    \"nzwa\": \"https://ndhadeliver.natlib.govt.nz/webarchive/wayback/\",\n",
    "    \"ukwa\": \"https://www.webarchive.org.uk/wayback/en/archive/\",\n",
    "    \"ia\": \"https://web.archive.org/web/\",\n",
    "}\n",
    "\n",
    "\n",
    "def convert_lists_to_dicts(results):\n",
    "    \"\"\"\n",
    "    Converts IA style timemap (a JSON array of arrays) to a list of dictionaries.\n",
    "    Renames keys to standardise IA with other Timemaps.\n",
    "    \"\"\"\n",
    "    if results:\n",
    "        keys = results[0]\n",
    "        results_as_dicts = [dict(zip(keys, v)) for v in results[1:]]\n",
    "    else:\n",
    "        results_as_dicts = results\n",
    "    for d in results_as_dicts:\n",
    "        d[\"status\"] = d.pop(\"statuscode\")\n",
    "        d[\"mime\"] = d.pop(\"mimetype\")\n",
    "        d[\"url\"] = d.pop(\"original\")\n",
    "    return results_as_dicts\n",
    "\n",
    "\n",
    "def get_capture_data_from_memento(url, request_type=\"head\"):\n",
    "    \"\"\"\n",
    "    For OpenWayback systems this can get some extra capture info to insert into Timemaps.\n",
    "    \"\"\"\n",
    "    if request_type == \"head\":\n",
    "        response = s.head(url)\n",
    "    else:\n",
    "        response = s.get(url)\n",
    "    headers = response.headers\n",
    "    length = headers.get(\"x-archive-orig-content-length\")\n",
    "    status = headers.get(\"x-archive-orig-status\")\n",
    "    status = status.split(\" \")[0] if status else None\n",
    "    mime = headers.get(\"x-archive-orig-content-type\")\n",
    "    mime = mime.split(\";\")[0] if mime else None\n",
    "    return {\"length\": length, \"status\": status, \"mime\": mime}\n",
    "\n",
    "\n",
    "def convert_link_to_json(results, enrich_data=False):\n",
    "    \"\"\"\n",
    "    Converts link formatted Timemap to JSON.\n",
    "    \"\"\"\n",
    "    data = []\n",
    "    for line in results.splitlines():\n",
    "        parts = line.split(\"; \")\n",
    "        if len(parts) > 1:\n",
    "            link_type = re.search(\n",
    "                r'rel=\"(original|self|timegate|first memento|last memento|memento)\"',\n",
    "                parts[1],\n",
    "            ).group(1)\n",
    "            if link_type == \"memento\":\n",
    "                link = parts[0].strip(\"<>\")\n",
    "                timestamp, original = re.search(r\"/(\\d{14})/(.*)$\", link).groups()\n",
    "                capture = {\n",
    "                    \"urlkey\": surt(original),\n",
    "                    \"timestamp\": timestamp,\n",
    "                    \"url\": original,\n",
    "                }\n",
    "                if enrich_data:\n",
    "                    capture.update(get_capture_data_from_memento(link))\n",
    "                    print(capture)\n",
    "                data.append(capture)\n",
    "    return data\n",
    "\n",
    "\n",
    "def get_timemap_as_json(timegate, url, enrich_data=False):\n",
    "    \"\"\"\n",
    "    Get a Timemap then normalise results (if necessary) to return a list of dicts.\n",
    "    \"\"\"\n",
    "    tg_url = f\"{TIMEGATES[timegate]}timemap/json/{url}/\"\n",
    "    response = s.get(tg_url)\n",
    "    response_type = response.headers[\"content-type\"]\n",
    "    if response_type == \"text/x-ndjson\":\n",
    "        data = [json.loads(line) for line in response.text.splitlines()]\n",
    "    elif response_type == \"application/json\":\n",
    "        data = convert_lists_to_dicts(response.json())\n",
    "    elif response_type in [\"application/link-format\", \"text/html;charset=utf-8\"]:\n",
    "        data = convert_link_to_json(response.text, enrich_data=enrich_data)\n",
    "    return data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Harvest the title data from the Internet Archive\n",
    "\n",
    "This gets the web page captures from the Internet Archive, scrapes the list of titles from the page, then does a bit of normalisation of the title data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "titles = []\n",
    "\n",
    "# These are the pages that listed available titles.\n",
    "# There was a change in 2016\n",
    "pages = [\n",
    "    {\"url\": \"http://trove.nla.gov.au/ndp/del/titles\", \"path\": \"/ndp/del/title/\"},\n",
    "    {\"url\": \"https://trove.nla.gov.au/newspaper/about\", \"path\": \"/newspaper/title/\"},\n",
    "]\n",
    "\n",
    "for page in pages:\n",
    "    for capture in get_timemap_as_json(\"ia\", page[\"url\"]):\n",
    "        if capture[\"status\"] == \"200\":\n",
    "            url = f'https://web.archive.org/web/{capture[\"timestamp\"]}id_/{capture[\"url\"]}'\n",
    "            # print(url)\n",
    "            capture_date = arrow.get(capture[\"timestamp\"][:8], \"YYYYMMDD\").format(\n",
    "                \"YYYY-MM-DD\"\n",
    "            )\n",
    "            # print(capture_date)\n",
    "            response = s.get(url)\n",
    "            soup = BeautifulSoup(response.content)\n",
    "            title_links = soup.find_all(\"a\", href=re.compile(page[\"path\"]))\n",
    "            for title in title_links:\n",
    "                # Get the title text\n",
    "                full_title = title.get_text().strip()\n",
    "\n",
    "                # Get the title id\n",
    "                title_id = re.search(r\"\\/(\\d+)\\/?$\", title[\"href\"]).group(1)\n",
    "\n",
    "                # Most of the code below is aimed at normalising the publication place and dates values to allow for easy grouping & deduplication\n",
    "                brief_title = re.sub(r\"\\(.+\\)\\s*$\", \"\", full_title).strip()\n",
    "                try:\n",
    "                    details = re.search(r\"\\((.+)\\)\\s*$\", full_title).group(1).split(\":\")\n",
    "                except AttributeError:\n",
    "                    place = \"\"\n",
    "                    dates = \"\"\n",
    "                else:\n",
    "                    try:\n",
    "                        place = details[0].strip()\n",
    "                        # Normalise states\n",
    "                        try:\n",
    "                            place = re.sub(\n",
    "                                r\"(, )?([A-Za-z]+)[\\.\\s]*$\",\n",
    "                                lambda match: f'{match.group(1) if match.group(1) else \"\"}{match.group(2).upper()}',\n",
    "                                place,\n",
    "                            )\n",
    "                        except AttributeError:\n",
    "                            pass\n",
    "                        # Normalise dates\n",
    "                        dates = \" - \".join(\n",
    "                            [d.strip() for d in details[1].strip().split(\"-\")]\n",
    "                        )\n",
    "                    except IndexError:\n",
    "                        place = \"\"\n",
    "                        dates = \" - \".join(\n",
    "                            [d.strip() for d in details[0].strip().split(\"-\")]\n",
    "                        )\n",
    "                titles.append(\n",
    "                    {\n",
    "                        \"title_id\": title_id,\n",
    "                        \"full_title\": full_title,\n",
    "                        \"title\": brief_title,\n",
    "                        \"place\": place,\n",
    "                        \"dates\": dates,\n",
    "                        \"capture_date\": capture_date,\n",
    "                        \"capture_timestamp\": capture[\"timestamp\"],\n",
    "                    }\n",
    "                )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Convert the title data to a DataFrame for analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.DataFrame(titles)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title_id</th>\n",
       "      <th>full_title</th>\n",
       "      <th>title</th>\n",
       "      <th>place</th>\n",
       "      <th>dates</th>\n",
       "      <th>capture_date</th>\n",
       "      <th>capture_timestamp</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>34</td>\n",
       "      <td>Advertiser (Adelaide, SA : 1889-1931)</td>\n",
       "      <td>Advertiser</td>\n",
       "      <td>Adelaide, SA</td>\n",
       "      <td>1889 - 1931</td>\n",
       "      <td>2009-11-12</td>\n",
       "      <td>20091112000713</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>13</td>\n",
       "      <td>Argus (Melbourne, Vic. : 1848-1954)</td>\n",
       "      <td>Argus</td>\n",
       "      <td>Melbourne, VIC</td>\n",
       "      <td>1848 - 1954</td>\n",
       "      <td>2009-11-12</td>\n",
       "      <td>20091112000713</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>16</td>\n",
       "      <td>Brisbane Courier (Qld. : 1864-1933)</td>\n",
       "      <td>Brisbane Courier</td>\n",
       "      <td>QLD</td>\n",
       "      <td>1864 - 1933</td>\n",
       "      <td>2009-11-12</td>\n",
       "      <td>20091112000713</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>11</td>\n",
       "      <td>Canberra Times (ACT : 1926-1954)</td>\n",
       "      <td>Canberra Times</td>\n",
       "      <td>ACT</td>\n",
       "      <td>1926 - 1954</td>\n",
       "      <td>2009-11-12</td>\n",
       "      <td>20091112000713</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>24</td>\n",
       "      <td>Colonial Times (Hobart, Tas. : 1828-1857)</td>\n",
       "      <td>Colonial Times</td>\n",
       "      <td>Hobart, TAS</td>\n",
       "      <td>1828 - 1857</td>\n",
       "      <td>2009-11-12</td>\n",
       "      <td>20091112000713</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>107017</th>\n",
       "      <td>1331</td>\n",
       "      <td>South Australian Record and Australasian and S...</td>\n",
       "      <td>South Australian Record and Australasian and S...</td>\n",
       "      <td>London, ENGLAND</td>\n",
       "      <td>1840 - 1841</td>\n",
       "      <td>2022-01-16</td>\n",
       "      <td>20220116142742</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>107018</th>\n",
       "      <td>1369</td>\n",
       "      <td>Territory of Papua Government Gazette (Papua N...</td>\n",
       "      <td>Territory of Papua Government Gazette</td>\n",
       "      <td>Papua New GUINEA</td>\n",
       "      <td>1906 - 1942</td>\n",
       "      <td>2022-01-16</td>\n",
       "      <td>20220116142742</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>107019</th>\n",
       "      <td>1371</td>\n",
       "      <td>Territory of Papua and New Guinea Government G...</td>\n",
       "      <td>Territory of Papua and New Guinea Government G...</td>\n",
       "      <td></td>\n",
       "      <td>1949 - 1971</td>\n",
       "      <td>2022-01-16</td>\n",
       "      <td>20220116142742</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>107020</th>\n",
       "      <td>1370</td>\n",
       "      <td>Territory of Papua-New Guinea Government Gazet...</td>\n",
       "      <td>Territory of Papua-New Guinea Government Gazette</td>\n",
       "      <td></td>\n",
       "      <td>1945 - 1949</td>\n",
       "      <td>2022-01-16</td>\n",
       "      <td>20220116142742</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>107021</th>\n",
       "      <td>1391</td>\n",
       "      <td>Tribune (Philippines : 1932 - 1945)</td>\n",
       "      <td>Tribune</td>\n",
       "      <td>PHILIPPINES</td>\n",
       "      <td>1932 - 1945</td>\n",
       "      <td>2022-01-16</td>\n",
       "      <td>20220116142742</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>107022 rows × 7 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "       title_id                                         full_title  \\\n",
       "0            34              Advertiser (Adelaide, SA : 1889-1931)   \n",
       "1            13                Argus (Melbourne, Vic. : 1848-1954)   \n",
       "2            16                Brisbane Courier (Qld. : 1864-1933)   \n",
       "3            11                   Canberra Times (ACT : 1926-1954)   \n",
       "4            24          Colonial Times (Hobart, Tas. : 1828-1857)   \n",
       "...         ...                                                ...   \n",
       "107017     1331  South Australian Record and Australasian and S...   \n",
       "107018     1369  Territory of Papua Government Gazette (Papua N...   \n",
       "107019     1371  Territory of Papua and New Guinea Government G...   \n",
       "107020     1370  Territory of Papua-New Guinea Government Gazet...   \n",
       "107021     1391                Tribune (Philippines : 1932 - 1945)   \n",
       "\n",
       "                                                    title             place  \\\n",
       "0                                              Advertiser      Adelaide, SA   \n",
       "1                                                   Argus    Melbourne, VIC   \n",
       "2                                        Brisbane Courier               QLD   \n",
       "3                                          Canberra Times               ACT   \n",
       "4                                          Colonial Times       Hobart, TAS   \n",
       "...                                                   ...               ...   \n",
       "107017  South Australian Record and Australasian and S...   London, ENGLAND   \n",
       "107018              Territory of Papua Government Gazette  Papua New GUINEA   \n",
       "107019  Territory of Papua and New Guinea Government G...                     \n",
       "107020   Territory of Papua-New Guinea Government Gazette                     \n",
       "107021                                            Tribune       PHILIPPINES   \n",
       "\n",
       "              dates capture_date capture_timestamp  \n",
       "0       1889 - 1931   2009-11-12    20091112000713  \n",
       "1       1848 - 1954   2009-11-12    20091112000713  \n",
       "2       1864 - 1933   2009-11-12    20091112000713  \n",
       "3       1926 - 1954   2009-11-12    20091112000713  \n",
       "4       1828 - 1857   2009-11-12    20091112000713  \n",
       "...             ...          ...               ...  \n",
       "107017  1840 - 1841   2022-01-16    20220116142742  \n",
       "107018  1906 - 1942   2022-01-16    20220116142742  \n",
       "107019  1949 - 1971   2022-01-16    20220116142742  \n",
       "107020  1945 - 1949   2022-01-16    20220116142742  \n",
       "107021  1932 - 1945   2022-01-16    20220116142742  \n",
       "\n",
       "[107022 rows x 7 columns]"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "130"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Number of captures\n",
    "len(df[\"capture_timestamp\"].unique())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "120"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Number of days on which the pages were captured\n",
    "len(df[\"capture_date\"].unique())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Save this dataset as a CSV file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "tags": [
     "nbval-skip"
    ]
   },
   "outputs": [],
   "source": [
    "df.to_csv(\"trove_newspaper_titles_2009_2021.csv\", index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## How did the number of titles change over time?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>capture_date</th>\n",
       "      <th>total</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2022-01-16</td>\n",
       "      <td>1700</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2022-01-10</td>\n",
       "      <td>1700</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2021-12-13</td>\n",
       "      <td>1697</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2021-11-20</td>\n",
       "      <td>1690</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2021-11-16</td>\n",
       "      <td>1690</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>115</th>\n",
       "      <td>2010-05-01</td>\n",
       "      <td>37</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>116</th>\n",
       "      <td>2009-11-24</td>\n",
       "      <td>34</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>117</th>\n",
       "      <td>2009-11-22</td>\n",
       "      <td>34</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>118</th>\n",
       "      <td>2009-12-12</td>\n",
       "      <td>34</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>119</th>\n",
       "      <td>2009-11-12</td>\n",
       "      <td>34</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>120 rows × 2 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "    capture_date  total\n",
       "0     2022-01-16   1700\n",
       "1     2022-01-10   1700\n",
       "2     2021-12-13   1697\n",
       "3     2021-11-20   1690\n",
       "4     2021-11-16   1690\n",
       "..           ...    ...\n",
       "115   2010-05-01     37\n",
       "116   2009-11-24     34\n",
       "117   2009-11-22     34\n",
       "118   2009-12-12     34\n",
       "119   2009-11-12     34\n",
       "\n",
       "[120 rows x 2 columns]"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Drop duplicates in cases where there were mutiple captures on a single day\n",
    "captures_df = df.drop_duplicates(subset=[\"capture_date\", \"full_title\"])\n",
    "\n",
    "# Calculate totals per capture\n",
    "capture_totals = captures_df[\"capture_date\"].value_counts().to_frame().reset_index()\n",
    "capture_totals.columns = [\"capture_date\", \"total\"]\n",
    "capture_totals"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "\n",
       "<div id=\"altair-viz-54e4be5e85414929b094c349b53aa374\"></div>\n",
       "<script type=\"text/javascript\">\n",
       "  var VEGA_DEBUG = (typeof VEGA_DEBUG == \"undefined\") ? {} : VEGA_DEBUG;\n",
       "  (function(spec, embedOpt){\n",
       "    let outputDiv = document.currentScript.previousElementSibling;\n",
       "    if (outputDiv.id !== \"altair-viz-54e4be5e85414929b094c349b53aa374\") {\n",
       "      outputDiv = document.getElementById(\"altair-viz-54e4be5e85414929b094c349b53aa374\");\n",
       "    }\n",
       "    const paths = {\n",
       "      \"vega\": \"https://cdn.jsdelivr.net/npm//vega@5?noext\",\n",
       "      \"vega-lib\": \"https://cdn.jsdelivr.net/npm//vega-lib?noext\",\n",
       "      \"vega-lite\": \"https://cdn.jsdelivr.net/npm//vega-lite@4.17.0?noext\",\n",
       "      \"vega-embed\": \"https://cdn.jsdelivr.net/npm//vega-embed@6?noext\",\n",
       "    };\n",
       "\n",
       "    function maybeLoadScript(lib, version) {\n",
       "      var key = `${lib.replace(\"-\", \"\")}_version`;\n",
       "      return (VEGA_DEBUG[key] == version) ?\n",
       "        Promise.resolve(paths[lib]) :\n",
       "        new Promise(function(resolve, reject) {\n",
       "          var s = document.createElement('script');\n",
       "          document.getElementsByTagName(\"head\")[0].appendChild(s);\n",
       "          s.async = true;\n",
       "          s.onload = () => {\n",
       "            VEGA_DEBUG[key] = version;\n",
       "            return resolve(paths[lib]);\n",
       "          };\n",
       "          s.onerror = () => reject(`Error loading script: ${paths[lib]}`);\n",
       "          s.src = paths[lib];\n",
       "        });\n",
       "    }\n",
       "\n",
       "    function showError(err) {\n",
       "      outputDiv.innerHTML = `<div class=\"error\" style=\"color:red;\">${err}</div>`;\n",
       "      throw err;\n",
       "    }\n",
       "\n",
       "    function displayChart(vegaEmbed) {\n",
       "      vegaEmbed(outputDiv, spec, embedOpt)\n",
       "        .catch(err => showError(`Javascript Error: ${err.message}<br>This usually means there's a typo in your chart specification. See the javascript console for the full traceback.`));\n",
       "    }\n",
       "\n",
       "    if(typeof define === \"function\" && define.amd) {\n",
       "      requirejs.config({paths});\n",
       "      require([\"vega-embed\"], displayChart, err => showError(`Error loading script: ${err.message}`));\n",
       "    } else {\n",
       "      maybeLoadScript(\"vega\", \"5\")\n",
       "        .then(() => maybeLoadScript(\"vega-lite\", \"4.17.0\"))\n",
       "        .then(() => maybeLoadScript(\"vega-embed\", \"6\"))\n",
       "        .catch(showError)\n",
       "        .then(() => displayChart(vegaEmbed));\n",
       "    }\n",
       "  })({\"config\": {\"view\": {\"continuousWidth\": 400, \"continuousHeight\": 300}}, \"data\": {\"name\": \"data-fd12783c87fe973c07b20295ebc52308\"}, \"mark\": {\"type\": \"line\", \"point\": true}, \"encoding\": {\"tooltip\": [{\"field\": \"capture_date\", \"format\": \"%e %b %Y\", \"type\": \"temporal\"}, {\"field\": \"total\", \"type\": \"quantitative\"}], \"x\": {\"field\": \"capture_date\", \"title\": \"Date captured\", \"type\": \"temporal\"}, \"y\": {\"field\": \"total\", \"title\": \"Number of newspaper titles\", \"type\": \"quantitative\"}}, \"width\": 700, \"$schema\": \"https://vega.github.io/schema/vega-lite/v4.17.0.json\", \"datasets\": {\"data-fd12783c87fe973c07b20295ebc52308\": [{\"capture_date\": \"2022-01-16\", \"total\": 1700}, {\"capture_date\": \"2022-01-10\", \"total\": 1700}, {\"capture_date\": \"2021-12-13\", \"total\": 1697}, {\"capture_date\": \"2021-11-20\", \"total\": 1690}, {\"capture_date\": \"2021-11-16\", \"total\": 1690}, {\"capture_date\": \"2021-10-06\", \"total\": 1683}, {\"capture_date\": \"2021-08-11\", \"total\": 1683}, {\"capture_date\": \"2021-06-10\", \"total\": 1672}, {\"capture_date\": \"2021-04-15\", \"total\": 1666}, {\"capture_date\": \"2021-04-10\", \"total\": 1666}, {\"capture_date\": \"2021-03-11\", \"total\": 1658}, {\"capture_date\": \"2021-02-05\", \"total\": 1649}, {\"capture_date\": \"2020-11-12\", \"total\": 1625}, {\"capture_date\": \"2020-05-10\", \"total\": 1553}, {\"capture_date\": \"2020-01-15\", \"total\": 1505}, {\"capture_date\": \"2019-10-28\", \"total\": 1488}, {\"capture_date\": \"2019-06-26\", \"total\": 1454}, {\"capture_date\": \"2019-05-19\", \"total\": 1447}, {\"capture_date\": \"2019-03-22\", \"total\": 1434}, {\"capture_date\": \"2018-11-29\", \"total\": 1399}, {\"capture_date\": \"2018-11-28\", \"total\": 1399}, {\"capture_date\": \"2018-10-15\", \"total\": 1371}, {\"capture_date\": \"2018-09-19\", \"total\": 1367}, {\"capture_date\": \"2018-08-19\", \"total\": 1367}, {\"capture_date\": \"2017-10-22\", \"total\": 1296}, {\"capture_date\": \"2017-09-20\", \"total\": 1291}, {\"capture_date\": \"2017-08-20\", \"total\": 1280}, {\"capture_date\": \"2017-07-18\", \"total\": 1280}, {\"capture_date\": \"2017-08-05\", \"total\": 1280}, {\"capture_date\": \"2017-08-22\", \"total\": 1280}, {\"capture_date\": \"2017-07-01\", \"total\": 1279}, {\"capture_date\": \"2017-06-24\", \"total\": 1274}, {\"capture_date\": \"2017-05-28\", \"total\": 1272}, {\"capture_date\": \"2017-04-28\", \"total\": 1259}, {\"capture_date\": \"2016-11-21\", \"total\": 1199}, {\"capture_date\": \"2016-09-12\", \"total\": 1199}, {\"capture_date\": \"2016-08-18\", \"total\": 1199}, {\"capture_date\": \"2016-08-10\", \"total\": 1199}, {\"capture_date\": \"2016-07-21\", \"total\": 1198}, {\"capture_date\": \"2016-06-29\", \"total\": 1187}, {\"capture_date\": \"2016-06-28\", \"total\": 1186}, {\"capture_date\": \"2016-06-04\", \"total\": 1148}, {\"capture_date\": \"2016-05-05\", \"total\": 1123}, {\"capture_date\": \"2016-04-04\", \"total\": 1112}, {\"capture_date\": \"2016-03-03\", \"total\": 1108}, {\"capture_date\": \"2016-02-29\", \"total\": 1104}, {\"capture_date\": \"2016-01-17\", \"total\": 1086}, {\"capture_date\": \"2016-01-14\", \"total\": 1085}, {\"capture_date\": \"2016-01-08\", \"total\": 1081}, {\"capture_date\": \"2015-11-18\", \"total\": 1046}, {\"capture_date\": \"2015-11-13\", \"total\": 1041}, {\"capture_date\": \"2015-11-10\", \"total\": 1039}, {\"capture_date\": \"2015-11-01\", \"total\": 1031}, {\"capture_date\": \"2015-09-14\", \"total\": 973}, {\"capture_date\": \"2015-09-05\", \"total\": 970}, {\"capture_date\": \"2015-09-01\", \"total\": 969}, {\"capture_date\": \"2015-08-21\", \"total\": 963}, {\"capture_date\": \"2015-08-13\", \"total\": 956}, {\"capture_date\": \"2015-07-02\", \"total\": 921}, {\"capture_date\": \"2015-04-25\", \"total\": 878}, {\"capture_date\": \"2015-04-01\", \"total\": 873}, {\"capture_date\": \"2015-03-01\", \"total\": 852}, {\"capture_date\": \"2015-02-24\", \"total\": 849}, {\"capture_date\": \"2015-02-23\", \"total\": 848}, {\"capture_date\": \"2015-01-28\", \"total\": 842}, {\"capture_date\": \"2014-12-28\", \"total\": 823}, {\"capture_date\": \"2014-10-22\", \"total\": 785}, {\"capture_date\": \"2014-10-14\", \"total\": 782}, {\"capture_date\": \"2014-09-24\", \"total\": 769}, {\"capture_date\": \"2014-07-31\", \"total\": 711}, {\"capture_date\": \"2014-07-22\", \"total\": 711}, {\"capture_date\": \"2014-05-30\", \"total\": 700}, {\"capture_date\": \"2014-04-07\", \"total\": 672}, {\"capture_date\": \"2014-02-11\", \"total\": 663}, {\"capture_date\": \"2013-12-07\", \"total\": 634}, {\"capture_date\": \"2013-10-28\", \"total\": 607}, {\"capture_date\": \"2013-10-12\", \"total\": 601}, {\"capture_date\": \"2013-10-06\", \"total\": 598}, {\"capture_date\": \"2013-05-09\", \"total\": 430}, {\"capture_date\": \"2013-01-17\", \"total\": 349}, {\"capture_date\": \"2012-12-27\", \"total\": 345}, {\"capture_date\": \"2012-10-30\", \"total\": 335}, {\"capture_date\": \"2012-10-27\", \"total\": 334}, {\"capture_date\": \"2012-10-26\", \"total\": 334}, {\"capture_date\": \"2012-09-23\", \"total\": 322}, {\"capture_date\": \"2012-09-09\", \"total\": 322}, {\"capture_date\": \"2012-08-25\", \"total\": 297}, {\"capture_date\": \"2012-08-22\", \"total\": 296}, {\"capture_date\": \"2012-06-27\", \"total\": 284}, {\"capture_date\": \"2012-06-26\", \"total\": 284}, {\"capture_date\": \"2012-05-13\", \"total\": 275}, {\"capture_date\": \"2012-05-10\", \"total\": 274}, {\"capture_date\": \"2012-05-07\", \"total\": 270}, {\"capture_date\": \"2012-05-06\", \"total\": 269}, {\"capture_date\": \"2012-05-04\", \"total\": 269}, {\"capture_date\": \"2012-04-19\", \"total\": 266}, {\"capture_date\": \"2012-02-15\", \"total\": 250}, {\"capture_date\": \"2012-01-20\", \"total\": 242}, {\"capture_date\": \"2012-01-07\", \"total\": 234}, {\"capture_date\": \"2011-12-19\", \"total\": 225}, {\"capture_date\": \"2011-12-06\", \"total\": 221}, {\"capture_date\": \"2011-11-16\", \"total\": 216}, {\"capture_date\": \"2011-10-27\", \"total\": 215}, {\"capture_date\": \"2011-10-19\", \"total\": 214}, {\"capture_date\": \"2011-10-16\", \"total\": 210}, {\"capture_date\": \"2011-09-23\", \"total\": 205}, {\"capture_date\": \"2011-09-04\", \"total\": 194}, {\"capture_date\": \"2011-08-14\", \"total\": 176}, {\"capture_date\": \"2011-05-14\", \"total\": 137}, {\"capture_date\": \"2011-04-06\", \"total\": 118}, {\"capture_date\": \"2011-03-26\", \"total\": 112}, {\"capture_date\": \"2011-03-12\", \"total\": 104}, {\"capture_date\": \"2011-02-27\", \"total\": 99}, {\"capture_date\": \"2010-04-28\", \"total\": 37}, {\"capture_date\": \"2010-04-16\", \"total\": 37}, {\"capture_date\": \"2010-05-01\", \"total\": 37}, {\"capture_date\": \"2009-11-24\", \"total\": 34}, {\"capture_date\": \"2009-11-22\", \"total\": 34}, {\"capture_date\": \"2009-12-12\", \"total\": 34}, {\"capture_date\": \"2009-11-12\", \"total\": 34}]}}, {\"mode\": \"vega-lite\"});\n",
       "</script>"
      ],
      "text/plain": [
       "alt.Chart(...)"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "alt.Chart(capture_totals).mark_line(point=True).encode(\n",
    "    x=alt.X(\"capture_date:T\", title=\"Date captured\"),\n",
    "    y=alt.Y(\"total:Q\", title=\"Number of newspaper titles\"),\n",
    "    tooltip=[alt.Tooltip(\"capture_date:T\", format=\"%e %b %Y\"), \"total:Q\"],\n",
    ").properties(width=700)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## When did titles first appear?\n",
    "\n",
    "For historiographical purposes, its useful to know when a particular title first appeared in Trove. Here we'll only keep the first appearance of each title (or any subsequent changes to its date range / location)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "first_appearance = df.drop_duplicates(subset=[\"title\", \"place\", \"dates\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title_id</th>\n",
       "      <th>full_title</th>\n",
       "      <th>title</th>\n",
       "      <th>place</th>\n",
       "      <th>dates</th>\n",
       "      <th>capture_date</th>\n",
       "      <th>capture_timestamp</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>34</td>\n",
       "      <td>Advertiser (Adelaide, SA : 1889-1931)</td>\n",
       "      <td>Advertiser</td>\n",
       "      <td>Adelaide, SA</td>\n",
       "      <td>1889 - 1931</td>\n",
       "      <td>2009-11-12</td>\n",
       "      <td>20091112000713</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>13</td>\n",
       "      <td>Argus (Melbourne, Vic. : 1848-1954)</td>\n",
       "      <td>Argus</td>\n",
       "      <td>Melbourne, VIC</td>\n",
       "      <td>1848 - 1954</td>\n",
       "      <td>2009-11-12</td>\n",
       "      <td>20091112000713</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>16</td>\n",
       "      <td>Brisbane Courier (Qld. : 1864-1933)</td>\n",
       "      <td>Brisbane Courier</td>\n",
       "      <td>QLD</td>\n",
       "      <td>1864 - 1933</td>\n",
       "      <td>2009-11-12</td>\n",
       "      <td>20091112000713</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>11</td>\n",
       "      <td>Canberra Times (ACT : 1926-1954)</td>\n",
       "      <td>Canberra Times</td>\n",
       "      <td>ACT</td>\n",
       "      <td>1926 - 1954</td>\n",
       "      <td>2009-11-12</td>\n",
       "      <td>20091112000713</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>24</td>\n",
       "      <td>Colonial Times (Hobart, Tas. : 1828-1857)</td>\n",
       "      <td>Colonial Times</td>\n",
       "      <td>Hobart, TAS</td>\n",
       "      <td>1828 - 1857</td>\n",
       "      <td>2009-11-12</td>\n",
       "      <td>20091112000713</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>105023</th>\n",
       "      <td>1773</td>\n",
       "      <td>Dawn Newsletter (Perth, WA : 1952 - 1954)</td>\n",
       "      <td>Dawn Newsletter</td>\n",
       "      <td>Perth, WA</td>\n",
       "      <td>1952 - 1954</td>\n",
       "      <td>2022-01-10</td>\n",
       "      <td>20220110214554</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>105112</th>\n",
       "      <td>1388</td>\n",
       "      <td>La Rondine (Perth, WA : 1970 - 1974; 1983 - 1984)</td>\n",
       "      <td>La Rondine</td>\n",
       "      <td>Perth, WA</td>\n",
       "      <td>1970 - 1974; 1983 - 1984</td>\n",
       "      <td>2022-01-10</td>\n",
       "      <td>20220110214554</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>105121</th>\n",
       "      <td>1537</td>\n",
       "      <td>Listening Post (Perth, WA : 1921 - 1954)</td>\n",
       "      <td>Listening Post</td>\n",
       "      <td>Perth, WA</td>\n",
       "      <td>1921 - 1954</td>\n",
       "      <td>2022-01-10</td>\n",
       "      <td>20220110214554</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>105274</th>\n",
       "      <td>99</td>\n",
       "      <td>Western Argus (Kalgoorlie, WA : 1894 - 1895)</td>\n",
       "      <td>Western Argus</td>\n",
       "      <td>Kalgoorlie, WA</td>\n",
       "      <td>1894 - 1895</td>\n",
       "      <td>2022-01-10</td>\n",
       "      <td>20220110214554</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>106887</th>\n",
       "      <td>1649</td>\n",
       "      <td>North Coolgardie Herald and Miners Daily News ...</td>\n",
       "      <td>North Coolgardie Herald and Miners Daily News</td>\n",
       "      <td>Menzies, WA</td>\n",
       "      <td>1899 - 1904</td>\n",
       "      <td>2022-01-16</td>\n",
       "      <td>20220116142742</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>2120 rows × 7 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "       title_id                                         full_title  \\\n",
       "0            34              Advertiser (Adelaide, SA : 1889-1931)   \n",
       "1            13                Argus (Melbourne, Vic. : 1848-1954)   \n",
       "2            16                Brisbane Courier (Qld. : 1864-1933)   \n",
       "3            11                   Canberra Times (ACT : 1926-1954)   \n",
       "4            24          Colonial Times (Hobart, Tas. : 1828-1857)   \n",
       "...         ...                                                ...   \n",
       "105023     1773          Dawn Newsletter (Perth, WA : 1952 - 1954)   \n",
       "105112     1388  La Rondine (Perth, WA : 1970 - 1974; 1983 - 1984)   \n",
       "105121     1537           Listening Post (Perth, WA : 1921 - 1954)   \n",
       "105274       99       Western Argus (Kalgoorlie, WA : 1894 - 1895)   \n",
       "106887     1649  North Coolgardie Herald and Miners Daily News ...   \n",
       "\n",
       "                                                title           place  \\\n",
       "0                                          Advertiser    Adelaide, SA   \n",
       "1                                               Argus  Melbourne, VIC   \n",
       "2                                    Brisbane Courier             QLD   \n",
       "3                                      Canberra Times             ACT   \n",
       "4                                      Colonial Times     Hobart, TAS   \n",
       "...                                               ...             ...   \n",
       "105023                                Dawn Newsletter       Perth, WA   \n",
       "105112                                     La Rondine       Perth, WA   \n",
       "105121                                 Listening Post       Perth, WA   \n",
       "105274                                  Western Argus  Kalgoorlie, WA   \n",
       "106887  North Coolgardie Herald and Miners Daily News     Menzies, WA   \n",
       "\n",
       "                           dates capture_date capture_timestamp  \n",
       "0                    1889 - 1931   2009-11-12    20091112000713  \n",
       "1                    1848 - 1954   2009-11-12    20091112000713  \n",
       "2                    1864 - 1933   2009-11-12    20091112000713  \n",
       "3                    1926 - 1954   2009-11-12    20091112000713  \n",
       "4                    1828 - 1857   2009-11-12    20091112000713  \n",
       "...                          ...          ...               ...  \n",
       "105023               1952 - 1954   2022-01-10    20220110214554  \n",
       "105112  1970 - 1974; 1983 - 1984   2022-01-10    20220110214554  \n",
       "105121               1921 - 1954   2022-01-10    20220110214554  \n",
       "105274               1894 - 1895   2022-01-10    20220110214554  \n",
       "106887               1899 - 1904   2022-01-16    20220116142742  \n",
       "\n",
       "[2120 rows x 7 columns]"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "first_appearance"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Find when a particular newspaper first appeared."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title_id</th>\n",
       "      <th>full_title</th>\n",
       "      <th>title</th>\n",
       "      <th>place</th>\n",
       "      <th>dates</th>\n",
       "      <th>capture_date</th>\n",
       "      <th>capture_timestamp</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>11</td>\n",
       "      <td>Canberra Times (ACT : 1926-1954)</td>\n",
       "      <td>Canberra Times</td>\n",
       "      <td>ACT</td>\n",
       "      <td>1926 - 1954</td>\n",
       "      <td>2009-11-12</td>\n",
       "      <td>20091112000713</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9395</th>\n",
       "      <td>11</td>\n",
       "      <td>Canberra Times (ACT : 1926 - 1995)</td>\n",
       "      <td>Canberra Times</td>\n",
       "      <td>ACT</td>\n",
       "      <td>1926 - 1995</td>\n",
       "      <td>2012-12-27</td>\n",
       "      <td>20121227113753</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     title_id                          full_title           title place  \\\n",
       "3          11    Canberra Times (ACT : 1926-1954)  Canberra Times   ACT   \n",
       "9395       11  Canberra Times (ACT : 1926 - 1995)  Canberra Times   ACT   \n",
       "\n",
       "            dates capture_date capture_timestamp  \n",
       "3     1926 - 1954   2009-11-12    20091112000713  \n",
       "9395  1926 - 1995   2012-12-27    20121227113753  "
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "first_appearance.loc[first_appearance[\"title\"] == \"Canberra Times\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Generate an alphabetical list for easy browsing. View the [results as a Gist](https://gist.github.com/wragge/7d80507c3e7957e271c572b8f664031a)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "with Path(\"titles_list.md\").open(\"w\") as titles_list:\n",
    "    for title, group in first_appearance.groupby([\"title\", \"title_id\"]):\n",
    "        places = \" | \".join(group[\"place\"].unique())\n",
    "        titles_list.write(\n",
    "            f'<h4><a href=\"http://nla.gov.au/nla.news-title{title[1]}\">{title[0]} ({places})</a></h4>'\n",
    "        )\n",
    "        titles_list.write(\n",
    "            group.sort_values(by=\"capture_date\")[\n",
    "                [\"capture_date\", \"dates\", \"place\"]\n",
    "            ].to_html(index=False)\n",
    "        )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Save this dataset to CSV."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "tags": [
     "nbval-skip"
    ]
   },
   "outputs": [],
   "source": [
    "first_appearance.to_csv(\n",
    "    \"trove_newspaper_titles_first_appearance_2009_2021.csv\", index=False\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "----\n",
    "\n",
    "Created by [Tim Sherratt](https://timsherratt.org/) for the [GLAM Workbench](https://glam-workbench.github.io/).  \n",
    "Support this project by becoming a [GitHub sponsor](https://github.com/sponsors/wragge?o=esb)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}