{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Getting data from web archives using Memento\n",
    "\n",
    "<p class=\"alert alert-info\">New to Jupyter notebooks? Try <a href=\"getting-started/Using_Jupyter_notebooks.ipynb\"><b>Using Jupyter notebooks</b></a> for a quick introduction.</p>\n",
    "\n",
    "Systems supporting the Memento protocol provide machine-readable information about web archive captures, even if other APIs are not available. In this notebook we'll look at the way the Memento protocol is supported across five web archive repositories – the UK Web Archive, the National Library of Australia, the National Library of New Zealand, the Internet Archive, and the UK Government Web Archive. In particular we'll examine:\n",
    "\n",
    "* [Timegates](#Timegates) – request web page captures from (around) a particular date\n",
    "* [Timemaps](#Timemaps) – request a list of web archive captures from a particular url\n",
    "* [Mementos](#Mementos) – use url modifiers to change the way an archived web page is presented\n",
    "\n",
    "Notebooks using Timegates or Timemaps to access capture data include:\n",
    "\n",
    "* [Get the archived version of a page closest to a particular date](get_a_memento.ipynb)\n",
    "* [Find all the archived versions of a web page](find_all_captures.ipynb)\n",
    "* [Harvesting collections of text from archived web pages](getting_text_from_web_pages.ipynb)\n",
    "* [Compare two versions of an archived web page](show_diffs.ipynb)\n",
    "* [Create and compare full page screenshots from archived web pages](save_screenshot.ipynb)\n",
    "* [Using screenshots to visualise change in a page over time](screenshots_over_time_using_timemaps.ipynb)\n",
    "* [Display changes in the text of an archived web page over time](display-text-changes-from-timemap.ipynb)\n",
    "* [Find when a piece of text appears in an archived web page](find-text-in-page-from-timemap.ipynb)\n",
    "\n",
    "## Useful tools and documentation\n",
    "* [Memento Protocol Specification](https://tools.ietf.org/html/rfc7089)\n",
    "* [Pywb Memento implementation](https://pywb.readthedocs.io/en/latest/manual/memento.html)\n",
    "* [Memento support in IA Wayback](https://ws-dl.blogspot.com/2013/07/2013-07-15-wayback-machine-upgrades.html)\n",
    "* [Time Travel APIs](https://timetravel.mementoweb.org/guide/api/)\n",
    "* [Memento Compliance Audit of PyWB](https://ws-dl.blogspot.com/2020/03/2020-03-26-memento-compliance-audit-of.html)\n",
    "* [Memento tools](http://mementoweb.org/tools/)\n",
    "* [Memento client](https://github.com/mementoweb/py-memento-client)\n",
    "* [Memgator](https://github.com/oduwsdl/MemGator) – Memento aggregator"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "import re\n",
    "\n",
    "import arrow\n",
    "import requests\n",
    "\n",
    "# Alternatively use the python Memento client"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# These are the repositories we'll be using\n",
    "TIMEGATES = {\n",
    "    \"awa\": \"https://web.archive.org.au/awa/\",\n",
    "    \"nzwa\": \"https://ndhadeliver.natlib.govt.nz/webarchive/\",\n",
    "    \"ukwa\": \"https://www.webarchive.org.uk/wayback/archive/\",\n",
    "    \"ia\": \"https://web.archive.org/web/\",\n",
    "    \"ukgwa\": \"https://webarchive.nationalarchives.gov.uk/ukgwa/\"\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Timegates\n",
    "\n",
    "Timegates let you query a web archive for the capture closest to a specific date. You do this by supplying your target date as the `Accept-Datetime` value in the headers of your request.\n",
    "\n",
    "For example, if you wanted to query the Australian Web Archive to find the version of `http://nla.gov.au/` that was captured as close as possible to 1 January 2010, you'd set the `Accept-Datetime` header to 'Fri, 01 Jan 2010 01:00:00 GMT' and request the url:\n",
    "\n",
    "```\n",
    "https://web.archive.org.au/awa/http://nla.gov.au/\n",
    "```\n",
    "\n",
    "A `get` request will return the captured page, but if all you want is the url of the archived page you can use a `head` request and extract the information you need from the response headers. Try this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'Server': 'nginx', 'Date': 'Thu, 23 Mar 2023 15:03:12 GMT', 'Content-Length': '0', 'Connection': 'keep-alive', 'Location': 'https://web.archive.org.au/awa/20100205144751/http://www.nla.gov.au/', 'Link': '<http://www.nla.gov.au/>; rel=\"original\", <https://web.archive.org.au/awa/http://www.nla.gov.au/>; rel=\"timegate\", <https://web.archive.org.au/awa/timemap/link/http://www.nla.gov.au/>; rel=\"timemap\"; type=\"application/link-format\", <https://web.archive.org.au/awa/20100205144751mp_/http://www.nla.gov.au/>; rel=\"memento\"; datetime=\"Fri, 05 Feb 2010 14:47:51 GMT\"', 'Vary': 'accept-datetime'}"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "response = requests.head(\n",
    "    \"https://web.archive.org.au/awa/http://nla.gov.au/\",\n",
    "    headers={\"Accept-Datetime\": \"Fri, 01 Jan 2010 01:00:00 GMT\"},\n",
    ")\n",
    "response.headers"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The request above returns headers like the following (this example is from an earlier run, so values such as `Date` and the capture timestamp may differ from the output you see):\n",
    "\n",
    "``` python\n",
    "{\n",
    "    'Server': 'nginx', \n",
    "    'Date': 'Wed, 06 May 2020 04:34:50 GMT', \n",
    "    'Content-Length': '0', 'Connection': 'keep-alive', \n",
    "    'Location': 'https://web.archive.org.au/awa/20100205144227/http://nla.gov.au/', \n",
    "    'Link': '<http://nla.gov.au/>; rel=\"original\", <https://web.archive.org.au/awa/http://nla.gov.au/>; rel=\"timegate\", <https://web.archive.org.au/awa/timemap/link/http://nla.gov.au/>; rel=\"timemap\"; type=\"application/link-format\", <https://web.archive.org.au/awa/20100205144227mp_/http://nla.gov.au/>; rel=\"memento\"; datetime=\"Fri, 05 Feb 2010 14:42:27 GMT\"', \n",
    "    'Vary': 'accept-datetime'\n",
    "}\n",
    "```\n",
    "\n",
    "The `Link` header contains the Memento information. You can see that it's actually providing information on four types of link:\n",
    "\n",
    "* the `original` url (ie the url that was archived) – `<http://nla.gov.au/>`\n",
    "* the `timegate` for the harvested url (which is what we just used) – `<https://web.archive.org.au/awa/http://nla.gov.au/>`\n",
    "* the `timemap` for the harvested url (we'll look at this below) – `<https://web.archive.org.au/awa/timemap/link/http://nla.gov.au/>`\n",
    "* the `memento` – `<https://web.archive.org.au/awa/20100205144227mp_/http://nla.gov.au/>`\n",
    "\n",
    "The `memento` link is the capture closest in time to the date we requested. In this case there's only about a month's difference, but of course this will depend on how frequently a url is captured. Opening the link will display the capture in the web archive. As we'll see below, some systems provide additional links such as `first memento`, `last memento`, `prev memento`, and `next memento`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here are some functions to query a timegate in one of the five systems we're exploring. We'll use them to compare the results we get from each."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "def format_date_for_headers(iso_date, tz):\n",
    "    \"\"\"\n",
    "    Convert an ISO date (YYYY-MM-DD) to a datetime at noon in the specified timezone.\n",
    "    Convert the datetime to UTC and format as required by Accept-Datetime headers:\n",
    "    eg Fri, 23 Mar 2007 01:00:00 GMT\n",
    "    \"\"\"\n",
    "    local = arrow.get(f\"{iso_date} 12:00:00 {tz}\", \"YYYY-MM-DD HH:mm:ss ZZZ\")\n",
    "    gmt = local.to(\"utc\")\n",
    "    return f'{gmt.format(\"ddd, DD MMM YYYY HH:mm:ss\")} GMT'\n",
    "\n",
    "\n",
    "def parse_links_from_headers(response):\n",
    "    \"\"\"\n",
    "    Extract original, timegate, timemap, and memento links from 'Link' header.\n",
    "    \"\"\"\n",
    "    links = response.links\n",
    "    return {k: v[\"url\"] for k, v in links.items()}\n",
    "\n",
    "\n",
    "def format_timestamp(timestamp, date_format=\"YYYY-MM-DD HH:mm:ss\"):\n",
    "    \"\"\"\n",
    "    Convert a 14-digit web archive timestamp (YYYYMMDDHHmmss) to a formatted date string.\n",
    "    \"\"\"\n",
    "    return arrow.get(timestamp, \"YYYYMMDDHHmmss\").format(date_format)\n",
    "\n",
    "\n",
    "def test_timegate(\n",
    "    timegate,\n",
    "    url,\n",
    "    date=None,\n",
    "    tz=\"Australia/Canberra\",\n",
    "    request_type=\"head\",\n",
    "    allow_redirects=True,\n",
    "):\n",
    "    \"\"\"\n",
    "    Query the timegate of a repository (a key in TIMEGATES) for the given url.\n",
    "    If a date (YYYY-MM-DD) is supplied, it's sent as the Accept-Datetime header.\n",
    "    Returns the links from the response's 'Link' header as a dictionary.\n",
    "    \"\"\"\n",
    "    headers = {}\n",
    "    if date:\n",
    "        formatted_date = format_date_for_headers(date, tz)\n",
    "        headers[\"Accept-Datetime\"] = formatted_date\n",
    "    # Note that you don't get a timegate response if you leave off the trailing slash\n",
    "    tg_url = (\n",
    "        f\"{TIMEGATES[timegate]}{url}/\"\n",
    "        if not url.endswith(\"/\")\n",
    "        else f\"{TIMEGATES[timegate]}{url}\"\n",
    "    )\n",
    "    print(tg_url)\n",
    "    if request_type == \"head\":\n",
    "        response = requests.head(\n",
    "            tg_url, headers=headers, allow_redirects=allow_redirects\n",
    "        )\n",
    "    else:\n",
    "        response = requests.get(\n",
    "            tg_url, headers=headers, allow_redirects=allow_redirects\n",
    "        )\n",
    "    response.raise_for_status()\n",
    "    # print(response.headers)\n",
    "    return parse_links_from_headers(response)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "### Australian Web Archive\n",
    "\n",
    "A `HEAD` request that follows redirects returns no Memento information."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "https://web.archive.org.au/awa/http://www.nla.gov.au/\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "{}"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result = test_timegate(\"awa\", \"http://www.nla.gov.au\")\n",
    "\n",
    "# Test for expected result\n",
    "assert result == {}\n",
    "\n",
    "result"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "----\n",
    "A `HEAD` request that doesn't follow redirects returns results as expected."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "https://web.archive.org.au/awa/http://www.nla.gov.au/\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "{'original': 'https://www.nla.gov.au/',\n",
       " 'timegate': 'https://web.archive.org.au/awa/https://www.nla.gov.au/',\n",
       " 'timemap': 'https://web.archive.org.au/awa/timemap/link/https://www.nla.gov.au/',\n",
       " 'memento': 'https://web.archive.org.au/awa/20230303002359mp_/https://www.nla.gov.au/'}"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result = test_timegate(\"awa\", \"http://www.nla.gov.au\", allow_redirects=False)\n",
    "\n",
    "# Test for expected result\n",
    "assert \"memento\" in result\n",
    "\n",
    "result"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "----\n",
    "A query without an `Accept-Datetime` value returns a recent capture."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "https://web.archive.org.au/awa/http://www.nla.gov.au/\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "{'original': 'https://www.nla.gov.au/',\n",
       " 'timegate': 'https://web.archive.org.au/awa/https://www.nla.gov.au/',\n",
       " 'timemap': 'https://web.archive.org.au/awa/timemap/link/https://www.nla.gov.au/',\n",
       " 'memento': 'https://web.archive.org.au/awa/20230303002359mp_/https://www.nla.gov.au/'}"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result = test_timegate(\"awa\", \"http://www.nla.gov.au\", allow_redirects=False)\n",
    "\n",
    "# Test for expected result\n",
    "assert \"memento\" in result\n",
    "\n",
    "result"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "----\n",
    "\n",
    "A query with an `Accept-Datetime` value of 1 January 2002 returns a capture from 20 January 2002."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "https://web.archive.org.au/awa/http://www.education.gov.au/\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "{'original': 'http://www.education.gov.au:80/',\n",
       " 'timegate': 'https://web.archive.org.au/awa/http://www.education.gov.au:80/',\n",
       " 'timemap': 'https://web.archive.org.au/awa/timemap/link/http://www.education.gov.au:80/',\n",
       " 'memento': 'https://web.archive.org.au/awa/20020120171009mp_/http://www.education.gov.au:80/'}"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result = test_timegate(\n",
    "    \"awa\", \"http://www.education.gov.au/\", date=\"2002-01-01\", allow_redirects=False\n",
    ")\n",
    "\n",
    "# Test for expected result\n",
    "assert \"memento\" in result\n",
    "assert \"20020120\" in result[\"memento\"]\n",
    "\n",
    "result"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "----\n",
    "\n",
    "Using a `GET` rather than a `HEAD` request returns no Memento information when redirects are followed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "https://web.archive.org.au/awa/http://www.education.gov.au/\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "{}"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result = test_timegate(\n",
    "    \"awa\", \"http://www.education.gov.au/\", date=\"2002-01-01\", request_type=\"get\"\n",
    ")\n",
    "\n",
    "# Test for expected result\n",
    "assert result == {}\n",
    "\n",
    "result"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "----\n",
    "\n",
    "Using a `GET` rather than a `HEAD` request returns Memento information when redirects are not followed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "https://web.archive.org.au/awa/http://www.education.gov.au/\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "{'original': 'http://www.education.gov.au:80/',\n",
       " 'timegate': 'https://web.archive.org.au/awa/http://www.education.gov.au:80/',\n",
       " 'timemap': 'https://web.archive.org.au/awa/timemap/link/http://www.education.gov.au:80/',\n",
       " 'memento': 'https://web.archive.org.au/awa/20020120171009mp_/http://www.education.gov.au:80/'}"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result = test_timegate(\n",
    "    \"awa\",\n",
    "    \"http://www.education.gov.au/\",\n",
    "    date=\"2002-01-01\",\n",
    "    request_type=\"get\",\n",
    "    allow_redirects=False,\n",
    ")\n",
    "\n",
    "# Test for expected result\n",
    "assert \"memento\" in result\n",
    "\n",
    "result"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### New Zealand Web Archive\n",
    "\n",
    "Changing whether or not redirects are followed has no effect on any of these responses.\n",
    "\n",
    "A query without an `Accept-Datetime` value returns a recent capture."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "result = test_timegate(\"nzwa\", \"http://natlib.govt.nz\")\n",
    "\n",
    "# Test for expected result\n",
    "assert \"memento\" in result\n",
    "\n",
    "result"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "----\n",
    "\n",
    "A query with an `Accept-Datetime` value of 1 January 2005 returns a `memento` from 11 July 2004."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "result = test_timegate(\"nzwa\", \"http://natlib.govt.nz\", date=\"2005-01-01\")\n",
    "\n",
    "# Test for expected result\n",
    "assert \"memento\" in result\n",
    "assert \"20040711\" in result[\"memento\"]\n",
    "\n",
    "result"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "----\n",
    "\n",
    "A `GET` request returns the same results as a `HEAD` request."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "result_head = test_timegate(\"nzwa\", \"http://natlib.govt.nz\", date=\"2005-01-01\")\n",
    "result_get = test_timegate(\n",
    "    \"nzwa\", \"http://natlib.govt.nz\", date=\"2005-01-01\", request_type=\"get\"\n",
    ")\n",
    "\n",
    "# Test for expected result\n",
    "assert result_head == result_get\n",
    "\n",
    "result_get"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Internet Archive\n",
    "\n",
    "Using a `HEAD` request that follows redirects returns results as expected."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "https://web.archive.org/web/http://discontents.com.au/\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "{'original': 'http://discontents.com.au/',\n",
       " 'timemap': 'https://web.archive.org/web/timemap/link/http://discontents.com.au/',\n",
       " 'timegate': 'https://web.archive.org/web/http://discontents.com.au/',\n",
       " 'first memento': 'https://web.archive.org/web/19981206012233/http://www.discontents.com.au:80/',\n",
       " 'prev memento': 'https://web.archive.org/web/20230313181957/https://discontents.com.au/',\n",
       " 'memento': 'https://web.archive.org/web/20230318003745/http://discontents.com.au/',\n",
       " 'last memento': 'https://web.archive.org/web/20230318003745/http://discontents.com.au/'}"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result = test_timegate(\"ia\", \"http://discontents.com.au\")\n",
    "\n",
    "# Test for expected result\n",
    "assert \"memento\" in result\n",
    "# IA responses have additional fields\n",
    "assert \"first memento\" in result\n",
    "\n",
    "result"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "----\n",
    "Using a `HEAD` request returns no Memento information if redirects are not followed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "https://web.archive.org/web/http://discontents.com.au/\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "{}"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result = test_timegate(\"ia\", \"http://discontents.com.au\", allow_redirects=False)\n",
    "\n",
    "# Test for expected result\n",
    "assert result == {}\n",
    "\n",
    "result"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "----\n",
    "\n",
    "A query without an `Accept-Datetime` value returns a `memento` and also includes `first memento`, `prev memento`, and `last memento` links."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "https://web.archive.org/web/http://discontents.com.au/\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "{'original': 'http://discontents.com.au/',\n",
       " 'timemap': 'https://web.archive.org/web/timemap/link/http://discontents.com.au/',\n",
       " 'timegate': 'https://web.archive.org/web/http://discontents.com.au/',\n",
       " 'first memento': 'https://web.archive.org/web/19981206012233/http://www.discontents.com.au:80/',\n",
       " 'prev memento': 'https://web.archive.org/web/20220323201952/http://www.discontents.com.au/',\n",
       " 'memento': 'https://web.archive.org/web/20220331081122/http://discontents.com.au/',\n",
       " 'last memento': 'https://web.archive.org/web/20220331081122/http://discontents.com.au/'}"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result = test_timegate(\"ia\", \"http://discontents.com.au\")\n",
    "\n",
    "# Test for expected result\n",
    "assert \"memento\" in result\n",
    "# IA responses have additional fields\n",
    "assert \"first memento\" in result\n",
    "\n",
    "result"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "----\n",
    "\n",
    "A query with an `Accept-Datetime` value of 1 January 2010 returns a `memento` from 9 February 2010."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "https://web.archive.org/web/http://discontents.com.au/\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "{'original': 'http://discontents.com.au:80/',\n",
       " 'timemap': 'https://web.archive.org/web/timemap/link/http://discontents.com.au:80/',\n",
       " 'timegate': 'https://web.archive.org/web/http://discontents.com.au:80/',\n",
       " 'first memento': 'https://web.archive.org/web/19981206012233/http://www.discontents.com.au:80/',\n",
       " 'prev memento': 'https://web.archive.org/web/20091030053520/http://discontents.com.au/',\n",
       " 'memento': 'https://web.archive.org/web/20100209041537/http://discontents.com.au:80/',\n",
       " 'next memento': 'https://web.archive.org/web/20100523101442/http://discontents.com.au:80/',\n",
       " 'last memento': 'https://web.archive.org/web/20220331081122/http://discontents.com.au/'}"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result = test_timegate(\"ia\", \"http://discontents.com.au\", date=\"2010-01-01\")\n",
    "\n",
    "# Test for expected result\n",
    "assert \"memento\" in result\n",
    "assert \"20100209\" in result[\"memento\"]\n",
    "\n",
    "result"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "----\n",
    "`GET` requests return different results if redirects are not followed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "https://web.archive.org/web/http://discontents.com.au/\n",
      "https://web.archive.org/web/http://discontents.com.au/\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "{'original': 'http://discontents.com.au/',\n",
       " 'memento': 'https://web.archive.org/web/20100209041537/http://discontents.com.au/',\n",
       " 'timemap': 'https://web.archive.org/web/timemap/link/http://discontents.com.au/'}"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result = test_timegate(\n",
    "    \"ia\", \"http://discontents.com.au\", date=\"2010-01-01\", request_type=\"get\"\n",
    ")\n",
    "result_no_redirects = test_timegate(\n",
    "    \"ia\",\n",
    "    \"http://discontents.com.au\",\n",
    "    date=\"2010-01-01\",\n",
    "    request_type=\"get\",\n",
    "    allow_redirects=False,\n",
    ")\n",
    "\n",
    "# Test for expected result\n",
    "assert result != result_no_redirects\n",
    "\n",
    "result_no_redirects"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### UK Web Archive\n",
    "\n",
    "Changing whether or not redirects are followed has no effect on any of these responses.\n",
    "\n",
    "A query without an `Accept-Datetime` value returns a recent capture."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "https://www.webarchive.org.uk/wayback/archive/http://bl.uk/\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "{'original': 'https://www.bl.uk/',\n",
       " 'timegate': 'https://www.webarchive.org.uk/wayback/archive/https://www.bl.uk/',\n",
       " 'timemap': 'https://www.webarchive.org.uk/wayback/archive/timemap/link/https://www.bl.uk/',\n",
       " 'memento': 'https://www.webarchive.org.uk/wayback/archive/20230319105859mp_/https://www.bl.uk/'}"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result = test_timegate(\"ukwa\", \"http://bl.uk\")\n",
    "\n",
    "# Test for expected result\n",
    "assert \"memento\" in result\n",
    "\n",
    "result"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "----\n",
    "\n",
    "A query with an `Accept-Datetime` value of 1 January 2006 returns a `memento` from 4 May 2004."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "https://www.webarchive.org.uk/wayback/archive/http://bl.uk/\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "{'original': 'http://www.bl.uk/',\n",
       " 'timegate': 'https://www.webarchive.org.uk/wayback/archive/http://www.bl.uk/',\n",
       " 'timemap': 'https://www.webarchive.org.uk/wayback/archive/timemap/link/http://www.bl.uk/',\n",
       " 'memento': 'https://www.webarchive.org.uk/wayback/archive/20040504230000mp_/http://www.bl.uk/'}"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result = test_timegate(\"ukwa\", \"http://bl.uk\", date=\"2006-01-01\")\n",
    "\n",
    "# Test for expected result\n",
    "assert \"memento\" in result\n",
    "assert \"20040504\" in result[\"memento\"]\n",
    "\n",
    "result"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "----\n",
    "\n",
    "A `GET` request returns the same results as a `HEAD` request."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "https://www.webarchive.org.uk/wayback/archive/http://bl.uk/\n",
      "https://www.webarchive.org.uk/wayback/archive/http://bl.uk/\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "{'original': 'http://www.bl.uk/',\n",
       " 'timegate': 'https://www.webarchive.org.uk/wayback/archive/http://www.bl.uk/',\n",
       " 'timemap': 'https://www.webarchive.org.uk/wayback/archive/timemap/link/http://www.bl.uk/',\n",
       " 'memento': 'https://www.webarchive.org.uk/wayback/archive/20040504230000mp_/http://www.bl.uk/'}"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result_head = test_timegate(\"ukwa\", \"http://bl.uk\", date=\"2006-01-01\")\n",
    "result_get = test_timegate(\n",
    "    \"ukwa\", \"http://bl.uk\", date=\"2006-01-01\", request_type=\"get\"\n",
    ")\n",
    "\n",
    "# Test for expected result\n",
    "assert result_head == result_get\n",
    "\n",
    "result_get"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### UK Government Web Archive"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Changing whether or not redirects are followed has no effect on any of these responses.\n",
    "\n",
    "A query without an `Accept-Datetime` value returns a recent capture."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "https://webarchive.nationalarchives.gov.uk/ukgwa/https://www.nationalarchives.gov.uk/\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "{'original': 'https://www.nationalarchives.gov.uk/',\n",
       " 'timegate': 'https://webarchive.nationalarchives.gov.uk/ukgwa/https://www.nationalarchives.gov.uk/',\n",
       " 'timemap': 'https://webarchive.nationalarchives.gov.uk/ukgwa/timemap/link/https://www.nationalarchives.gov.uk/',\n",
       " 'memento': 'https://webarchive.nationalarchives.gov.uk/ukgwa/20230311073241mp_/https://www.nationalarchives.gov.uk/'}"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result = test_timegate(\"ukgwa\", \"https://www.nationalarchives.gov.uk/\")\n",
    "\n",
    "# Test for expected result\n",
    "assert \"memento\" in result\n",
    "\n",
    "result"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "----\n",
    "\n",
    "A query with an `Accept-Datetime` value of 1 January 2006 returns a `memento` from 13 February 2006."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "https://webarchive.nationalarchives.gov.uk/ukgwa/https://www.nationalarchives.gov.uk/\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "{'original': 'http://www.nationalarchives.gov.uk/',\n",
       " 'timegate': 'https://webarchive.nationalarchives.gov.uk/ukgwa/http://www.nationalarchives.gov.uk/',\n",
       " 'timemap': 'https://webarchive.nationalarchives.gov.uk/ukgwa/timemap/link/http://www.nationalarchives.gov.uk/',\n",
       " 'memento': 'https://webarchive.nationalarchives.gov.uk/ukgwa/20060213205514mp_/http://www.nationalarchives.gov.uk/'}"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result = test_timegate(\"ukgwa\", \"https://www.nationalarchives.gov.uk/\", date=\"2006-01-01\")\n",
    "\n",
    "# Test for expected result\n",
    "assert \"memento\" in result\n",
    "assert \"20060213\" in result[\"memento\"]\n",
    "\n",
    "result"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "----\n",
    "\n",
    "A `GET` request returns the same results as a `HEAD` request."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "https://webarchive.nationalarchives.gov.uk/ukgwa/https://www.nationalarchives.gov.uk/\n",
      "https://webarchive.nationalarchives.gov.uk/ukgwa/https://www.nationalarchives.gov.uk/\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "{'original': 'http://www.nationalarchives.gov.uk/',\n",
       " 'timegate': 'https://webarchive.nationalarchives.gov.uk/ukgwa/http://www.nationalarchives.gov.uk/',\n",
       " 'timemap': 'https://webarchive.nationalarchives.gov.uk/ukgwa/timemap/link/http://www.nationalarchives.gov.uk/',\n",
       " 'memento': 'https://webarchive.nationalarchives.gov.uk/ukgwa/20060213205514mp_/http://www.nationalarchives.gov.uk/'}"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result_head = test_timegate(\"ukgwa\", \"https://www.nationalarchives.gov.uk/\", date=\"2006-01-01\")\n",
    "result_get = test_timegate(\n",
    "    \"ukgwa\", \"https://www.nationalarchives.gov.uk/\", date=\"2006-01-01\", request_type=\"get\")\n",
    "\n",
    "# Test for expected result\n",
    "assert result_head == result_get\n",
    "\n",
    "result_get"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Summarising the differences\n",
    "\n",
    "As you can see above, there are a couple of significant differences in the way that Timegates behave across the five repositories.\n",
    "\n",
    "* Wayback systems (IA) provide more information than the Pywb systems, adding `first memento`, `last memento`, `prev memento`, and `next memento` links.\n",
    "* You can use either `HEAD` or `GET` with UKWA, NZWA, and UKGWA, but IA and AWA behave differently depending on the type of request and whether redirects are followed. To get results from either a `HEAD` or `GET` request, AWA requests should not follow redirects. To get results from a `HEAD` request, IA requests should follow redirects. `GET` requests to IA return results whether or not redirects are allowed, but those results differ."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Normalising Timegate responses and queries\n",
    "\n",
    "Here's some code to smooth out the differences between systems, and return Memento data as a Python dictionary. Specifically it:\n",
    "\n",
    "* Follows redirects for requests to the IA.\n",
    "* If there is no `memento` value in the response (as sometimes happens with NLNZ), it looks for a `prev memento`, `next memento`, or `last memento` value instead."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [],
   "source": [
    "def query_timegate(timegate, url, date=None, tz=\"Australia/Canberra\"):\n",
    "    \"\"\"\n",
    "    Query the specified repository for a Memento.\n",
    "    Returns the links from the response headers as a dictionary.\n",
    "    \"\"\"\n",
    "    headers = {}\n",
    "    if date:\n",
    "        formatted_date = format_date_for_headers(date, tz)\n",
    "        headers[\"Accept-Datetime\"] = formatted_date\n",
    "\n",
    "    # Note that you don't get a timegate response if you leave off the trailing slash, but extras don't hurt!\n",
    "    tg_url = (\n",
    "        f\"{TIMEGATES[timegate]}{url}/\"\n",
    "        if not url.endswith(\"/\")\n",
    "        else f\"{TIMEGATES[timegate]}{url}\"\n",
    "    )\n",
    "    # IA only works if redirects are followed -- this defaults to False with HEAD requests...\n",
    "    allow_redirects = timegate == \"ia\"\n",
    "    response = requests.head(tg_url, headers=headers, allow_redirects=allow_redirects)\n",
    "    response.raise_for_status()\n",
    "    return parse_links_from_headers(response)\n",
    "\n",
    "\n",
    "def get_memento(timegate, url, date=None, tz=\"Australia/Canberra\"):\n",
    "    \"\"\"\n",
    "    Return the url of a Memento, or None if nothing suitable is found.\n",
    "    If there's no memento in the results, look for an alternative.\n",
    "    \"\"\"\n",
    "    links = query_timegate(timegate, url, date, tz)\n",
    "    # NLNZ doesn't always seem to return a Memento, so we'll build in some fuzziness\n",
    "    # Check the link types in order of preference and return the first one found\n",
    "    for rel in [\"memento\", \"prev memento\", \"next memento\", \"last memento\"]:\n",
    "        if links and rel in links:\n",
    "            return links[rel]\n",
    "    # No suitable link in the response\n",
    "    return None"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can request a Memento from any of the five repositories and get back the results as a Python dictionary. You can see this code in action in the [Get full page screenshots from archived web pages](save_screenshot.ipynb) notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'original': 'http://nationalarchives.gov.uk/',\n",
       " 'timegate': 'https://webarchive.nationalarchives.gov.uk/ukgwa/http://nationalarchives.gov.uk/',\n",
       " 'timemap': 'https://webarchive.nationalarchives.gov.uk/ukgwa/timemap/link/http://nationalarchives.gov.uk/',\n",
       " 'memento': 'https://webarchive.nationalarchives.gov.uk/ukgwa/20141223091614mp_/http://nationalarchives.gov.uk/'}"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result = query_timegate(\"ukgwa\", \"https://www.nationalarchives.gov.uk/\", date=\"2015-01-01\")\n",
    "\n",
    "# Test for expected result\n",
    "assert \"memento\" in result\n",
    "\n",
    "result"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Or, if we just want the url of a Memento, we can use `get_memento()`, which falls back to alternative values if `memento` is missing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'https://ndhadeliver.natlib.govt.nz/webarchive/20220801082654mp_/http://natlib.govt.nz/'"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result = get_memento(\"nzwa\", \"http://natlib.govt.nz\")\n",
    "\n",
    "# Test for expected result\n",
    "assert result.startswith(\"https://ndhadeliver.natlib.govt.nz/webarchive/\")\n",
    "\n",
    "result"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "----\n",
    "\n",
    "## Timemaps\n",
    "\n",
    "Memento Timemaps provide machine-processable lists of web page captures from a particular archive. They are available from both OpenWayback and Pywb systems, though there are some differences. The [Pywb documentation](https://pywb.readthedocs.io/en/latest/manual/memento.html#timemap-api) notes that the following formats are available:\n",
    "\n",
    "* link – returns an application/link-format as required by the Memento spec\n",
    "* cdxj – returns a timemap in the native CDXJ format\n",
    "* json – returns the timemap in newline-delimited JSON (NDJSON) format\n",
    "\n",
    "Timemaps are requested using a url with the following format:\n",
    "\n",
    "```\n",
    "http://[address.of.archive]/[collection]/timemap/[format]/[web page url]\n",
    "```\n",
    "\n",
    "So if you wanted to query the Australian Web Archive to get a list of captures in JSON format from http://nla.gov.au/ you'd use this url:\n",
    "\n",
    "```\n",
    "https://web.archive.org.au/awa/timemap/json/http://nla.gov.au/\n",
    "```\n",
    "\n",
    "The examples below show how the format and behaviour of Timemaps vary slightly across the five repositories we're interested in."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_timemap(timegate, url, format=\"json\"):\n",
    "    \"\"\"\n",
    "    Basic function to get a Timemap for the supplied url.\n",
    "    \"\"\"\n",
    "    tg_url = f\"{TIMEGATES[timegate]}timemap/{format}/{url}/\"\n",
    "    response = requests.get(tg_url)\n",
    "    response.raise_for_status()\n",
    "    # Show the content-type\n",
    "    # print(response.headers['content-type'])\n",
    "    return response.headers[\"content-type\"], response.text"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### National Library of Australia\n",
    "\n",
    "Request a Timemap in `link` format. Note that response headers include `content-type` of `application/link-format`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "application/link-format\n",
      "<https://web.archive.org.au/awa/timemap/link/http://www.gov.au/>; rel=\"self\"; type=\"application/link-format\"; from=\"Wed, 06 Dec 2000 21:15:00 GMT\",\n",
      "<https://web.archive.org.au/awa/http://www.gov.au/>; rel=\"timegate\",\n",
      "<http://www.gov.au/>; rel=\"original\",\n",
      "<https://web.archive.org.au/awa/20001206211500mp_/http://www.gov.au/>; rel=\"memento\"; datetime=\"Wed, 06 Dec 2000 21:15:00 GMT\"; collection=\"awa\",\n",
      "<https://web.archive.org.au/awa/20010118203600mp_/http://www.gov.au/>; rel=\"memento\"; datetime=\"Thu, 18 Jan 2001 20:36:00 GMT\"; collection=\"awa\",\n"
     ]
    }
   ],
   "source": [
    "content_type, timemap = get_timemap(\"awa\", \"http://www.gov.au\", \"link\")\n",
    "\n",
    "print(content_type)\n",
    "# Test content type\n",
    "assert content_type == \"application/link-format\"\n",
    "\n",
    "# Show the first 5 lines\n",
    "print(\"\\n\".join(timemap.splitlines()[:5]))"
   ]
  },
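  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `link` format is easier to work with once it's been parsed. Here's a minimal sketch that pulls the capture urls and datetimes out of a Timemap, assuming that each link sits on its own line as in the output above. The `parse_link_timemap()` helper is hypothetical, written just for this illustration, and the sample lines are copied from the output above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "\n",
    "def parse_link_timemap(text):\n",
    "    \"\"\"\n",
    "    Extract Mementos from a link-format Timemap (hypothetical helper).\n",
    "    Assumes one link per line.\n",
    "    \"\"\"\n",
    "    mementos = []\n",
    "    for line in text.splitlines():\n",
    "        link = re.match(r'<([^>]+)>;\\s*rel=\"([^\"]+)\"', line.strip())\n",
    "        # Keep 'memento', 'first memento' etc, but skip 'self', 'timegate', and 'original'\n",
    "        if not link or \"memento\" not in link.group(2).split():\n",
    "            continue\n",
    "        dt = re.search(r'datetime=\"([^\"]+)\"', line)\n",
    "        mementos.append({\"url\": link.group(1), \"datetime\": dt.group(1) if dt else None})\n",
    "    return mementos\n",
    "\n",
    "\n",
    "# Two lines copied from the Timemap output above\n",
    "sample_links = \"\\n\".join(\n",
    "    [\n",
    "        '<http://www.gov.au/>; rel=\"original\",',\n",
    "        '<https://web.archive.org.au/awa/20001206211500mp_/http://www.gov.au/>; rel=\"memento\"; '\n",
    "        'datetime=\"Wed, 06 Dec 2000 21:15:00 GMT\"; collection=\"awa\",',\n",
    "    ]\n",
    ")\n",
    "\n",
    "parse_link_timemap(sample_links)"
   ]
  },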
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "----\n",
    "\n",
    "Request a Timemap in `json` format. This returns `ndjson` (Newline Delimited JSON) – each capture is a JSON object, separated by a line break. Note that the response headers include `content-type` of `text/x-ndjson`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "text/x-ndjson\n",
      "{\"urlkey\": \"au,gov,aph)/senate/committee/eet_ctte/uni_finances/report/index.htm\", \"timestamp\": \"20031122074837\", \"url\": \"http://www.aph.gov.au/senate/committee/EET_CTTE/uni_finances/report/index.htm\", \"mime\": \"text/html\", \"status\": \"200\", \"digest\": \"3H5Z77RMYKXRFCE6ODTWAKMBPZOGJ4YE\", \"offset\": \"97170362\", \"filename\": \"NLA-EXTRACTION-1996-2004-ARCS-PART-01336-000000.arc.gz\", \"length\": \"3446\", \"source\": \"awa\", \"source-coll\": \"awa\"}\n"
     ]
    }
   ],
   "source": [
    "content_type, timemap = get_timemap(\n",
    "    \"awa\",\n",
    "    \"http://www.aph.gov.au/Senate/committee/eet_ctte/uni_finances/report/index.htm\",\n",
    "    \"json\",\n",
    ")\n",
    "\n",
    "print(content_type)\n",
    "# Test content type\n",
    "assert content_type == \"text/x-ndjson\"\n",
    "\n",
    "# Show the first line\n",
    "print(\"\\n\".join(timemap.splitlines()[:1]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "----\n",
    "\n",
    "Request a Timemap in `cdxj` format. Note that response headers include `content-type` of `text/x-cdxj`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "text/x-cdxj\n",
      "au,gov,aph)/senate/committee/eet_ctte/uni_finances/report/index.htm 20031122074837 {\"url\": \"http://www.aph.gov.au/senate/committee/EET_CTTE/uni_finances/report/index.htm\", \"mime\": \"text/html\", \"status\": \"200\", \"digest\": \"3H5Z77RMYKXRFCE6ODTWAKMBPZOGJ4YE\", \"offset\": \"97170362\", \"filename\": \"NLA-EXTRACTION-1996-2004-ARCS-PART-01336-000000.arc.gz\", \"length\": \"3446\", \"source\": \"awa\", \"source-coll\": \"awa\"}\n"
     ]
    }
   ],
   "source": [
    "content_type, timemap = get_timemap(\n",
    "    \"awa\",\n",
    "    \"http://www.aph.gov.au/Senate/committee/eet_ctte/uni_finances/report/index.htm\",\n",
    "    \"cdxj\",\n",
    ")\n",
    "\n",
    "print(content_type)\n",
    "# Test content type\n",
    "assert content_type == \"text/x-cdxj\"\n",
    "\n",
    "# Show the first line\n",
    "print(\"\\n\".join(timemap.splitlines()[:1]))"
   ]
  },
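  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Each line of the Pywb `cdxj` output above has three parts separated by spaces: the url key, the timestamp, and a JSON object holding the remaining fields. Here's a minimal sketch of how a line could be turned into a Python dictionary, assuming that layout holds and that the url key contains no spaces. The `parse_cdxj_line()` helper is hypothetical, and the sample is a shortened copy of the line shown above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "\n",
    "\n",
    "def parse_cdxj_line(line):\n",
    "    \"\"\"\n",
    "    Split a Pywb CDXJ line into its url key, timestamp, and JSON fields (hypothetical helper).\n",
    "    \"\"\"\n",
    "    # Only split on the first two spaces, as the JSON part contains spaces of its own\n",
    "    urlkey, timestamp, fields = line.split(\" \", 2)\n",
    "    capture = json.loads(fields)\n",
    "    capture[\"urlkey\"] = urlkey\n",
    "    capture[\"timestamp\"] = timestamp\n",
    "    return capture\n",
    "\n",
    "\n",
    "# Shortened copy of the CDXJ line above (some fields omitted)\n",
    "sample_cdxj = (\n",
    "    \"au,gov,aph)/senate/committee/eet_ctte/uni_finances/report/index.htm 20031122074837 \"\n",
    "    '{\"url\": \"http://www.aph.gov.au/senate/committee/EET_CTTE/uni_finances/report/index.htm\", '\n",
    "    '\"mime\": \"text/html\", \"status\": \"200\"}'\n",
    ")\n",
    "\n",
    "parse_cdxj_line(sample_cdxj)"
   ]
  },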
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### UK Web Archive\n",
    "\n",
    "Request a Timemap in `link` format. Note that response headers include `content-type` of `application/link-format`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "application/link-format\n",
      "<https://www.webarchive.org.uk/wayback/archive/timemap/link/http://bl.uk/>; rel=\"self\"; type=\"application/link-format\"; from=\"Tue, 30 Oct 2001 00:00:19 GMT\",\n",
      "<https://www.webarchive.org.uk/wayback/archive/http://bl.uk/>; rel=\"timegate\",\n",
      "<http://bl.uk/>; rel=\"original\",\n",
      "<https://www.webarchive.org.uk/wayback/archive/20011030000019mp_/http://www.bl.uk/>; rel=\"memento\"; datetime=\"Tue, 30 Oct 2001 00:00:19 GMT\"; collection=\"archive\",\n",
      "<https://www.webarchive.org.uk/wayback/archive/20011113000000mp_/http://www.bl.uk/>; rel=\"memento\"; datetime=\"Tue, 13 Nov 2001 00:00:00 GMT\"; collection=\"archive\",\n"
     ]
    }
   ],
   "source": [
    "content_type, timemap = get_timemap(\"ukwa\", \"http://bl.uk\", \"link\")\n",
    "\n",
    "print(content_type)\n",
    "# Test content type\n",
    "assert content_type == \"application/link-format\"\n",
    "\n",
    "print(\"\\n\".join(timemap.splitlines()[:5]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "----\n",
    "\n",
    "Request a Timemap in `json` format. This returns `ndjson` (Newline Delimited JSON) – each capture is a JSON object, separated by a line break. Note that the response headers include `content-type` of `text/x-ndjson`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "text/x-ndjson\n",
      "{\"urlkey\": \"uk,bl)/\", \"timestamp\": \"20011030000019\", \"url\": \"http://www.bl.uk/\", \"mime\": \"text/html\", \"status\": \"200\", \"digest\": \"JN4RHYLNGS7X64HADIX3XHDIMYDBLAAW\", \"redirect\": \"-\", \"robotflags\": \"-\", \"length\": \"0\", \"offset\": \"10813988\", \"filename\": \"/data/102148/31031347/WARCS/BL-31031347.warc.gz\", \"load_url\": \"\", \"source\": \"archive\", \"source-coll\": \"archive\", \"access\": \"allow\"}\n"
     ]
    }
   ],
   "source": [
    "content_type, timemap = get_timemap(\"ukwa\", \"http://bl.uk\", \"json\")\n",
    "\n",
    "print(content_type)\n",
    "# Test content type\n",
    "assert content_type == \"text/x-ndjson\"\n",
    "\n",
    "print(\"\\n\".join(timemap.splitlines()[:1]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "----\n",
    "\n",
    "Request a Timemap in `cdxj` format. Note that response headers include `content-type` of `text/x-cdxj`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "text/x-cdxj\n",
      "uk,bl)/ 20011030000019 {\"url\": \"http://www.bl.uk/\", \"mime\": \"text/html\", \"status\": \"200\", \"digest\": \"JN4RHYLNGS7X64HADIX3XHDIMYDBLAAW\", \"redirect\": \"-\", \"robotflags\": \"-\", \"length\": \"0\", \"offset\": \"10813988\", \"filename\": \"/data/102148/31031347/WARCS/BL-31031347.warc.gz\", \"load_url\": \"\", \"source\": \"archive\", \"source-coll\": \"archive\", \"access\": \"allow\"}\n"
     ]
    }
   ],
   "source": [
    "content_type, timemap = get_timemap(\"ukwa\", \"http://bl.uk\", \"cdxj\")\n",
    "\n",
    "print(content_type)\n",
    "# Test content type\n",
    "assert content_type == \"text/x-cdxj\"\n",
    "\n",
    "print(\"\\n\".join(timemap.splitlines()[:1]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### UK Government Web Archive\n",
    "\n",
    "Request a Timemap in `link` format. Note that response headers include `content-type` of `application/link-format`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "application/link-format\n",
      "<https://webarchive.nationalarchives.gov.uk/ukgwa/timemap/link/https://www.nationalarchives.gov.uk//>; rel=\"self\"; type=\"application/link-format\"; from=\"Mon, 20 Oct 2003 01:04:12 GMT\",\n",
      "<https://webarchive.nationalarchives.gov.uk/ukgwa/https://www.nationalarchives.gov.uk//>; rel=\"timegate\",\n",
      "<https://www.nationalarchives.gov.uk//>; rel=\"original\",\n",
      "<https://webarchive.nationalarchives.gov.uk/ukgwa/20031020010412mp_/http://www.nationalarchives.gov.uk:80/>; rel=\"memento\"; datetime=\"Mon, 20 Oct 2003 01:04:12 GMT\"; collection=\"full_zipnum\",\n",
      "<https://webarchive.nationalarchives.gov.uk/ukgwa/20040104233258mp_/http://www.nationalarchives.gov.uk/>; rel=\"memento\"; datetime=\"Sun, 04 Jan 2004 23:32:58 GMT\"; collection=\"full_zipnum\",\n"
     ]
    }
   ],
   "source": [
    "content_type, timemap = get_timemap(\"ukgwa\", \"https://www.nationalarchives.gov.uk/\", \"link\")\n",
    "\n",
    "print(content_type)\n",
    "# Test content type\n",
    "assert content_type == \"application/link-format\"\n",
    "\n",
    "print(\"\\n\".join(timemap.splitlines()[:5]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "----\n",
    "\n",
    "Request a Timemap in `json` format. This returns `ndjson` (Newline Delimited JSON) – each capture is a JSON object, separated by a line break. Note that the response headers include `content-type` of `text/x-ndjson`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "text/x-ndjson\n",
      "{\"urlkey\": \"uk,gov,nationalarchives)/\", \"timestamp\": \"20031020010412\", \"url\": \"http://www.nationalarchives.gov.uk:80/\", \"mime\": \"text/html\", \"status\": \"200\", \"digest\": \"U2IC276V3AKMWIJGWWJXCVQ2KZ6AMU5J\", \"redirect\": \"-\", \"robotflags\": \"-\", \"length\": \"951\", \"offset\": \"898\", \"filename\": \"UKGOV-WEEKLY-010-031019180412-000.warc.gz\", \"source\": \"full_zipnum\", \"source-coll\": \"full_zipnum\", \"access\": \"allow\"}\n"
     ]
    }
   ],
   "source": [
    "content_type, timemap = get_timemap(\"ukgwa\", \"https://www.nationalarchives.gov.uk/\", \"json\")\n",
    "\n",
    "print(content_type)\n",
    "# Test content type\n",
    "assert content_type == \"text/x-ndjson\"\n",
    "\n",
    "print(\"\\n\".join(timemap.splitlines()[:1]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "----\n",
    "\n",
    "Request a Timemap in `cdxj` format. Note that response headers include `content-type` of `text/x-cdxj`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "text/x-cdxj\n",
      "uk,gov,nationalarchives)/ 20031020010412 {\"url\": \"http://www.nationalarchives.gov.uk:80/\", \"mime\": \"text/html\", \"status\": \"200\", \"digest\": \"U2IC276V3AKMWIJGWWJXCVQ2KZ6AMU5J\", \"redirect\": \"-\", \"robotflags\": \"-\", \"length\": \"951\", \"offset\": \"898\", \"filename\": \"UKGOV-WEEKLY-010-031019180412-000.warc.gz\", \"source\": \"full_zipnum\", \"source-coll\": \"full_zipnum\", \"access\": \"allow\"}\n"
     ]
    }
   ],
   "source": [
    "content_type, timemap = get_timemap(\"ukgwa\", \"https://www.nationalarchives.gov.uk/\", \"cdxj\")\n",
    "\n",
    "print(content_type)\n",
    "# Test content type\n",
    "assert content_type == \"text/x-cdxj\"\n",
    "\n",
    "print(\"\\n\".join(timemap.splitlines()[:1]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### National Library of New Zealand\n",
    "\n",
    "Request a Timemap in `link` format. Note that response headers include `content-type` of `application/link-format`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "application/link-format\n",
      "<https://ndhadeliver.natlib.govt.nz/webarchive/timemap/link/http://natlib.govt.nz/>; rel=\"self\"; type=\"application/link-format\"; from=\"Sun, 11 Jul 2004 21:32:25 GMT\",\n",
      "<https://ndhadeliver.natlib.govt.nz/webarchive/http://natlib.govt.nz/>; rel=\"timegate\",\n",
      "<http://natlib.govt.nz/>; rel=\"original\",\n",
      "<https://ndhadeliver.natlib.govt.nz/webarchive/20040711213225mp_/http://www.natlib.govt.nz/>; rel=\"memento\"; datetime=\"Sun, 11 Jul 2004 21:32:25 GMT\"; collection=\"webarchive\",\n",
      "<https://ndhadeliver.natlib.govt.nz/webarchive/20060704033135mp_/http://www.natlib.govt.nz/>; rel=\"memento\"; datetime=\"Tue, 04 Jul 2006 03:31:35 GMT\"; collection=\"webarchive\",\n"
     ]
    }
   ],
   "source": [
    "content_type, timemap = get_timemap(\"nzwa\", \"http://natlib.govt.nz\", \"link\")\n",
    "\n",
    "print(content_type)\n",
    "# Test content type\n",
    "assert content_type == \"application/link-format\"\n",
    "\n",
    "print(\"\\n\".join(timemap.splitlines()[:5]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "----\n",
    "\n",
    "Request a Timemap in `json` format. This returns `ndjson` (Newline Delimited JSON) – each capture is a JSON object, separated by a line break. Note that the response headers include `content-type` of `text/x-ndjson`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "text/x-ndjson\n",
      "{\"urlkey\": \"nz,govt,natlib)/\", \"timestamp\": \"20040711213225\", \"url\": \"http://www.natlib.govt.nz/\", \"mime\": \"text/html\", \"status\": \"200\", \"digest\": \"JV66FPIIX6IJTB42TNHMQDEU5Z3LFBCK\", \"redirect\": \"-\", \"robotflags\": \"-\", \"length\": \"0\", \"offset\": \"976\", \"filename\": \"V1-FL1645590.arc\", \"load_url\": \"http://10.4.1.66:80/nlnzwebarchive_PROD/ap/20040711213225id_/http://www.natlib.govt.nz/\", \"source\": \"webarchive\", \"source-coll\": \"webarchive\"}\n"
     ]
    }
   ],
   "source": [
    "content_type, timemap = get_timemap(\"nzwa\", \"http://natlib.govt.nz\", \"json\")\n",
    "\n",
    "print(content_type)\n",
    "# Test content type\n",
    "assert content_type == \"text/x-ndjson\"\n",
    "\n",
    "print(\"\\n\".join(timemap.splitlines()[:1]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "----\n",
    "\n",
    "Request a Timemap in `cdxj` format. Note that response headers include `content-type` of `text/x-cdxj`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "text/x-cdxj\n",
      "nz,govt,natlib)/ 20040711213225 {\"url\": \"http://www.natlib.govt.nz/\", \"mime\": \"text/html\", \"status\": \"200\", \"digest\": \"JV66FPIIX6IJTB42TNHMQDEU5Z3LFBCK\", \"redirect\": \"-\", \"robotflags\": \"-\", \"length\": \"0\", \"offset\": \"976\", \"filename\": \"V1-FL1645590.arc\", \"load_url\": \"http://10.4.1.66:80/nlnzwebarchive_PROD/ap/20040711213225id_/http://www.natlib.govt.nz/\", \"source\": \"webarchive\", \"source-coll\": \"webarchive\"}\n"
     ]
    }
   ],
   "source": [
    "content_type, timemap = get_timemap(\"nzwa\", \"http://natlib.govt.nz\", \"cdxj\")\n",
    "\n",
    "print(content_type)\n",
    "# Test content type\n",
    "assert content_type == \"text/x-cdxj\"\n",
    "\n",
    "print(\"\\n\".join(timemap.splitlines()[:1]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Internet Archive\n",
    "\n",
    "Request a Timemap in `link` format. Note that response headers include `content-type` of `application/link-format`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "application/link-format\n",
      "<http://www.discontents.com.au:80/>; rel=\"original\",\n",
      "<https://web.archive.org/web/timemap/link/http://discontents.com.au/>; rel=\"self\"; type=\"application/link-format\"; from=\"Sun, 06 Dec 1998 01:22:33 GMT\",\n",
      "<https://web.archive.org/web/http://discontents.com.au/>; rel=\"timegate\",\n",
      "<https://web.archive.org/web/19981206012233/http://www.discontents.com.au:80/>; rel=\"first memento\"; datetime=\"Sun, 06 Dec 1998 01:22:33 GMT\",\n",
      "<https://web.archive.org/web/19981212024410/http://www.discontents.com.au:80/>; rel=\"memento\"; datetime=\"Sat, 12 Dec 1998 02:44:10 GMT\",\n"
     ]
    }
   ],
   "source": [
    "content_type, timemap = get_timemap(\"ia\", \"http://discontents.com.au\", \"link\")\n",
    "\n",
    "print(content_type)\n",
    "# Test content type\n",
    "assert content_type == \"application/link-format\"\n",
    "\n",
    "print(\"\\n\".join(timemap.splitlines()[:5]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "----\n",
    "\n",
    "A request for a Timemap in `json` format returns results as a JSON array of arrays, where the first row provides the column headings. Response headers include a `content-type` of `application/json`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "application/json\n",
      "[[\"urlkey\",\"timestamp\",\"original\",\"mimetype\",\"statuscode\",\"digest\",\"redirect\",\"robotflags\",\"length\",\"offset\",\"filename\"],\n",
      "[\"au,com,discontents)/\",\"19981206012233\",\"http://www.discontents.com.au:80/\",\"text/html\",\"200\",\"FQJ6JMPIZ7WEKYPQ4SGPVHF57GCV6B36\",\"-\",\"-\",\"1610\",\"43993900\",\"green-0133-19990218235953-919455657-c/green-0141-912907270.arc.gz\"],\n",
      "[\"au,com,discontents)/\",\"19981212024410\",\"http://www.discontents.com.au:80/\",\"text/html\",\"200\",\"FQJ6JMPIZ7WEKYPQ4SGPVHF57GCV6B36\",\"-\",\"-\",\"1613\",\"17792789\",\"slash-913417727-c/slash-913430608.arc.gz\"],\n",
      "[\"au,com,discontents)/\",\"19990125094813\",\"http://www.discontents.com.au:80/\",\"text/html\",\"200\",\"FQJ6JMPIZ7WEKYPQ4SGPVHF57GCV6B36\",\"-\",\"-\",\"1613\",\"11419234\",\"slash-913417727-c/slash_19990124232053-917257670.arc.gz\"],\n",
      "[\"au,com,discontents)/\",\"19990208004052\",\"http://discontents.com.au:80/\",\"text/html\",\"200\",\"FQJ6JMPIZ7WEKYPQ4SGPVHF57GCV6B36\",\"-\",\"-\",\"1612\",\"13269748\",\"slash-913417727-c/slash-918434425.arc.gz\"],\n"
     ]
    }
   ],
   "source": [
    "content_type, timemap = get_timemap(\"ia\", \"http://discontents.com.au\", \"json\")\n",
    "\n",
    "print(content_type)\n",
    "# Test content type\n",
    "assert content_type == \"application/json\"\n",
    "\n",
    "print(\"\\n\".join(timemap.splitlines()[:5]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "----\n",
    "\n",
    "A request for a Timemap in `cdxj` format returns results in plain text, with fields separated by spaces, and captures separated by line breaks. Response headers include a `content-type` of `text/plain`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "text/plain\n",
      "au,com,discontents)/ 19981206012233 http://www.discontents.com.au:80/ text/html 200 FQJ6JMPIZ7WEKYPQ4SGPVHF57GCV6B36 - - 1610 43993900 green-0133-19990218235953-919455657-c/green-0141-912907270.arc.gz\n"
     ]
    }
   ],
   "source": [
    "content_type, timemap = get_timemap(\"ia\", \"http://discontents.com.au\", \"cdxj\")\n",
    "\n",
    "print(content_type)\n",
    "# Test content type\n",
    "assert content_type == \"text/plain\"\n",
    "\n",
    "print(\"\\n\".join(timemap.splitlines()[:1]))"
   ]
  },
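  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The plain text lines don't include field labels, but the values appear to be in the same order as the column headings in the first row of IA's `json` output above. Here's a minimal sketch that pairs them up, on the assumption that the order always matches and that no value contains a space. The `IA_FIELDS` list and the `parse_ia_cdx_line()` helper are hypothetical names used for illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Column headings copied from the first row of IA's `json` Timemap\n",
    "IA_FIELDS = [\n",
    "    \"urlkey\",\n",
    "    \"timestamp\",\n",
    "    \"original\",\n",
    "    \"mimetype\",\n",
    "    \"statuscode\",\n",
    "    \"digest\",\n",
    "    \"redirect\",\n",
    "    \"robotflags\",\n",
    "    \"length\",\n",
    "    \"offset\",\n",
    "    \"filename\",\n",
    "]\n",
    "\n",
    "\n",
    "def parse_ia_cdx_line(line, fields=IA_FIELDS):\n",
    "    \"\"\"\n",
    "    Pair the space-separated values in a line with their assumed labels (hypothetical helper).\n",
    "    \"\"\"\n",
    "    return dict(zip(fields, line.split(\" \")))\n",
    "\n",
    "\n",
    "# The line of plain text output shown above\n",
    "sample_ia = (\n",
    "    \"au,com,discontents)/ 19981206012233 http://www.discontents.com.au:80/ text/html 200 \"\n",
    "    \"FQJ6JMPIZ7WEKYPQ4SGPVHF57GCV6B36 - - 1610 43993900 \"\n",
    "    \"green-0133-19990218235953-919455657-c/green-0141-912907270.arc.gz\"\n",
    ")\n",
    "\n",
    "parse_ia_cdx_line(sample_ia)"
   ]
  },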
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Differences in field labels"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If we compare the Pywb JSON output with the IA Wayback output, we see there are also some differences in the field labels. In particular `original` in IA Wayback is just `url` in Pywb, while `statuscode` and `mimetype` are shortened to `status` and `mime` in Pywb."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['urlkey',\n",
       " 'timestamp',\n",
       " 'original',\n",
       " 'mimetype',\n",
       " 'statuscode',\n",
       " 'digest',\n",
       " 'redirect',\n",
       " 'robotflags',\n",
       " 'length',\n",
       " 'offset',\n",
       " 'filename']"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "_, timemap = get_timemap(\"ia\", \"http://bl.uk\", \"json\")\n",
    "data = json.loads(timemap)\n",
    "\n",
    "# Test for `mimetype` label\n",
    "assert \"mimetype\" in data[0]\n",
    "\n",
    "data[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['urlkey',\n",
       " 'timestamp',\n",
       " 'url',\n",
       " 'mime',\n",
       " 'status',\n",
       " 'digest',\n",
       " 'redirect',\n",
       " 'robotflags',\n",
       " 'length',\n",
       " 'offset',\n",
       " 'filename',\n",
       " 'load_url',\n",
       " 'source',\n",
       " 'source-coll',\n",
       " 'access']"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "_, timemap = get_timemap(\"ukwa\", \"http://bl.uk\", \"json\")\n",
    "data = [json.loads(line) for line in timemap.splitlines()]\n",
    "\n",
    "# Test for `mime` label\n",
    "assert \"mime\" in data[0]\n",
    "\n",
    "list(data[0].keys())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Summarising the differences\n",
    "\n",
    "The good news is that all repositories provide Timemaps in the standard `link` format as required by the Memento specification. However, there's more variation when it comes to other formats.\n",
    "\n",
    "* IA's `json` format is an array of arrays, rather than the NDJSON returned by the Pywb systems (UKWA, UKGWA, NLNZ, and NLA).\n",
    "* IA's `cdxj` format is plain text with space-separated values, rather than CDXJ.\n",
    "* IA uses different labels for some values (`original`, `mimetype`, and `statuscode`, rather than `url`, `mime`, and `status`)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Normalising Timemaps\n",
    "\n",
    "With the information above we can construct some functions to return normalised Timemap results as JSON. To do this we need to:\n",
    "\n",
    "* Restructure the JSON output from IA to match the Pywb format\n",
    "* Change some of the column headings in the IA data to match the Pywb format\n",
    "\n",
    "Because the `link` format provides less information than the `json` format, we could also try to enrich the NLNZ data by requesting more information about individual Mementos."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "def convert_lists_to_dicts(results):\n",
    "    \"\"\"\n",
    "    Converts IA style timemap (a JSON array of arrays) to a list of dictionaries.\n",
    "    Renames keys to standardise IA with other Timemaps.\n",
    "    \"\"\"\n",
    "    if results:\n",
    "        keys = results[0]\n",
    "        results_as_dicts = [dict(zip(keys, v)) for v in results[1:]]\n",
    "    else:\n",
    "        results_as_dicts = results\n",
    "    for d in results_as_dicts:\n",
    "        d[\"status\"] = d.pop(\"statuscode\")\n",
    "        d[\"mime\"] = d.pop(\"mimetype\")\n",
    "        d[\"url\"] = d.pop(\"original\")\n",
    "    return results_as_dicts\n",
    "\n",
    "\n",
    "def get_capture_data_from_memento(url, request_type=\"head\"):\n",
    "    \"\"\"\n",
    "    For OpenWayback systems this can get some extra capture info to insert into Timemaps.\n",
    "    \"\"\"\n",
    "    if request_type == \"head\":\n",
    "        response = requests.head(url)\n",
    "    else:\n",
    "        response = requests.get(url)\n",
    "    headers = response.headers\n",
    "    length = headers.get(\"x-archive-orig-content-length\")\n",
    "    status = headers.get(\"x-archive-orig-status\")\n",
    "    status = status.split(\" \")[0] if status else None\n",
    "    mime = headers.get(\"x-archive-orig-content-type\")\n",
    "    mime = mime.split(\";\")[0] if mime else None\n",
    "    return {\"length\": length, \"status\": status, \"mime\": mime}\n",
    "\n",
    "\n",
    "def convert_link_to_json(results, enrich_data=False):\n",
    "    \"\"\"\n",
    "    Converts link formatted Timemap to JSON.\n",
    "\n",
    "    This was originally needed for NLNZ, but now all five archives\n",
    "    return JSON data.\n",
    "    \"\"\"\n",
    "    data = []\n",
    "    for line in results.splitlines():\n",
    "        parts = line.split(\"; \")\n",
    "        if len(parts) > 1:\n",
    "            link_type = re.search(\n",
    "                r'rel=\"(original|self|timegate|first memento|last memento|memento)\"',\n",
    "                parts[1],\n",
    "            ).group(1)\n",
    "            if link_type == \"memento\":\n",
    "                link = parts[0].strip(\"<>\")\n",
    "                timestamp, original = re.search(r\"/(\\d{12}|\\d{14})/(.*)$\", link).groups()\n",
    "                capture = {\"timestamp\": timestamp, \"url\": original}\n",
    "                if enrich_data:\n",
    "                    capture.update(get_capture_data_from_memento(link))\n",
    "                    # print(capture)\n",
    "                data.append(capture)\n",
    "    return data\n",
    "\n",
    "\n",
    "def get_timemap_as_json(timegate, url):\n",
    "    \"\"\"\n",
    "    Get a Timemap then normalise results (if necessary) to return a list of dicts.\n",
    "    \"\"\"\n",
    "    tg_url = f\"{TIMEGATES[timegate]}timemap/json/{url}/\"\n",
    "    response = requests.get(tg_url)\n",
    "    response.raise_for_status()\n",
    "    response_type = response.headers[\"content-type\"]\n",
    "    # print(response_type)\n",
    "    if response_type == \"text/x-ndjson\":\n",
    "        data = [json.loads(line) for line in response.text.splitlines()]\n",
    "    elif response_type == \"application/json\":\n",
    "        data = convert_lists_to_dicts(response.json())\n",
    "    elif response_type in [\"application/link-format\", \"text/html;charset=utf-8\"]:\n",
    "        data = convert_link_to_json(response.text)\n",
    "    return data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can get information about captures in a standardised JSON format from all five repositories. You can see this in action in the [Display changes in the text of an archived web page over time](display-text-changes-from-timemap.ipynb) notebook"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'urlkey': 'uk,bl)/',\n",
       " 'timestamp': '20011030000019',\n",
       " 'url': 'http://www.bl.uk/',\n",
       " 'mime': 'text/html',\n",
       " 'status': '200',\n",
       " 'digest': 'JN4RHYLNGS7X64HADIX3XHDIMYDBLAAW',\n",
       " 'redirect': '-',\n",
       " 'robotflags': '-',\n",
       " 'length': '0',\n",
       " 'offset': '10813988',\n",
       " 'filename': '/data/102148/31031347/WARCS/BL-31031347.warc.gz',\n",
       " 'load_url': '',\n",
       " 'source': 'archive',\n",
       " 'source-coll': 'archive',\n",
       " 'access': 'allow'}"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "timemap = get_timemap_as_json(\"ukwa\", \"http://bl.uk\")\n",
    "\n",
    "# Test for `mime` label\n",
    "assert \"mime\" in timemap[0]\n",
    "\n",
    "timemap[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'urlkey': 'uk,bl)/',\n",
       " 'timestamp': '19970218190613',\n",
       " 'digest': 'Z42UMUL76GODKO3EMNSLXDTCST66VDAX',\n",
       " 'redirect': '-',\n",
       " 'robotflags': '-',\n",
       " 'length': '1208',\n",
       " 'offset': '19524651',\n",
       " 'filename': 'GR-001114-c/GR-002277.arc.gz',\n",
       " 'status': '200',\n",
       " 'mime': 'text/html',\n",
       " 'url': 'http://www.bl.uk:80/'}"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "timemap = get_timemap_as_json(\"ia\", \"http://bl.uk\")\n",
    "\n",
    "# Test for `mime` label\n",
    "assert \"mime\" in timemap[0]\n",
    "\n",
    "timemap[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "----\n",
    "\n",
    "## Mementos\n",
    "\n",
    "You can also modify the url of a Memento to change the way it's presented. In particular, adding `id_` after the timestamp will tell the server that you want the original harvested version of the webpage, without any rewriting of links, or web archive navigation features. For example:\n",
    "\n",
    "```\n",
    "https://web.archive.org.au/awa/20200302223537id_/http://discontents.com.au/\n",
    "```\n",
    "\n",
    "This works with all five repositories, however, note that for the Australian Web Archive you need to use the `web.archive.org.au` domain, not `webarchive.nla.gov.au`.\n",
    "\n",
    "In addition, IA supports the `if_` option, which provides a view of the archived page without web archive headers navigation inserted, but with links to CSS, JS, and images rewritten to point to archived versions. This is as close as you can get to looking at the original page, and I've used it in the [Get full page screenshots from archived web pages](save_screenshot.ipynb) notebook. Note that if you add `if_` to requests from the UKWA, NLNZ, or the NLA you'll be redirected to the standard view with the original page framed by the web archive navigation.\n",
    "\n",
    "Pywb's page on [url rewriting](https://pywb.readthedocs.io/en/latest/manual/rewriter.html?highlight=id_#url-rewriting) has some useful information about this."
   ]
  },
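  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Adding a modifier by hand is easy enough, but if you're working with lots of Mementos a small helper might be useful. Here's a minimal sketch, assuming the Memento url contains a 12 or 14 digit timestamp as its own path segment. The `add_url_modifier` function is hypothetical (it's not defined elsewhere in this notebook), and it assumes that any two-letter modifier already attached to the timestamp, such as `if_`, should be replaced.\n",
    "\n",
    "```python\n",
    "import re\n",
    "\n",
    "\n",
    "def add_url_modifier(memento_url, modifier=\"id_\"):\n",
    "    \"\"\"\n",
    "    Insert a modifier such as `id_` or `if_` after the timestamp in a Memento url.\n",
    "    Hypothetical helper for illustration only.\n",
    "    \"\"\"\n",
    "    # Match the timestamp segment, along with any two-letter modifier already attached\n",
    "    pattern = r\"/(\\d{14}|\\d{12})(?:[a-z]{2}_)?/\"\n",
    "    # Rebuild the segment as timestamp + new modifier (first match only)\n",
    "    return re.sub(pattern, rf\"/\\g<1>{modifier}/\", memento_url, count=1)\n",
    "\n",
    "\n",
    "add_url_modifier(\n",
    "    \"https://web.archive.org.au/awa/20200302223537/http://discontents.com.au/\"\n",
    ")\n",
    "```\n",
    "\n",
    "Calling it with the Australian Web Archive url above, minus the modifier, should give back the `id_` version. Whether a repository honours a particular modifier is a separate question, as noted above."
   ]
  },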
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "----\n",
    "Created by [Tim Sherratt](https://timsherratt.org) for the [GLAM Workbench](https://glam-workbench.github.io). Support me by becoming a [GitHub sponsor](https://github.com/sponsors/wragge)!\n",
    "\n",
    "Work on this notebook was supported by the [IIPC Discretionary Funding Programme 2019-2020](http://netpreserve.org/projects/).\n",
    "\n",
    "The Web Archives section of the GLAM Workbench is sponsored by the [British Library](https://www.bl.uk/)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.10"
  },
  "widgets": {
   "application/vnd.jupyter.widget-state+json": {
    "state": {},
    "version_major": 2,
    "version_minor": 0
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}