{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Get the archived version of a page closest to a particular date\n",
    "\n",
    "<p class=\"alert alert-info\">New to Jupyter notebooks? Try <a href=\"getting-started/Using_Jupyter_notebooks.ipynb\"><b>Using Jupyter notebooks</b></a> for a quick introduction.</p>\n",
    "\n",
    "To get the archived version of a page closest to a particular date we can use the Memento API. Variations in the way Memento is implemented across repositories are documented in [Getting data from web archives using Memento](memento.ipynb). The functions below smooth out these variations to provide a (mostly) consistent interface to the UK Web Archive, Australian Web Archive, New Zealand Web Archive, and the Internet Archive. They could be easily modified to work with other Memento-compliant repositories.\n",
    "\n",
    "To get information about available Mementos:\n",
    "\n",
    "``` python\n",
    "query_timegate([timegate], [url], [date], [timezone])\n",
    "```\n",
    "\n",
    "To get a single Memento closest to your target date:\n",
    "\n",
    "``` python\n",
    "get_memento([timegate], [url], [date], [timezone])\n",
    "```\n",
    "\n",
    "Parameters:\n",
    "\n",
    "* `timegate` – one of 'ukwa' (UK), 'awa' (Australia), 'nzwa' (New Zealand), or 'ia' (Internet Archive)\n",
    "* `url` – the url you want to look for in the archive\n",
    "* `date` – the target date in ISO format, 'YYYY-MM-DD' (optional, will default to most recent date)\n",
    "* `tz` – a timezone string for your local timezone (optional)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import arrow\n",
    "import requests"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# These are the repositories we'll be using\n",
    "TIMEGATES = {\n",
    "    \"awa\": \"https://web.archive.org.au/awa/\",\n",
    "    \"nzwa\": \"https://ndhadeliver.natlib.govt.nz/webarchive/\",\n",
    "    \"ukwa\": \"https://www.webarchive.org.uk/wayback/en/archive/\",\n",
    "    \"ia\": \"https://web.archive.org/web/\",\n",
    "}\n",
    "\n",
    "\n",
    "def format_date_for_headers(iso_date, tz):\n",
    "    \"\"\"\n",
    "    Convert an ISO date (YYYY-MM-DD) to a datetime at noon in the specified timezone.\n",
    "    Convert the datetime to UTC and format as required by Accet-Datetime headers:\n",
    "    eg Fri, 23 Mar 2007 01:00:00 GMT\n",
    "    \"\"\"\n",
    "    local = arrow.get(f\"{iso_date} 12:00:00 {tz}\", \"YYYY-MM-DD HH:mm:ss ZZZ\")\n",
    "    gmt = local.to(\"utc\")\n",
    "    return f'{gmt.format(\"ddd, DD MMM YYYY HH:mm:ss\")} GMT'\n",
    "\n",
    "\n",
    "def parse_links_from_headers(response):\n",
    "    \"\"\"\n",
    "    Extract Memento links from 'Link' header.\n",
    "    \"\"\"\n",
    "    links = response.links\n",
    "    return {k: v[\"url\"] for k, v in links.items()}\n",
    "\n",
    "\n",
    "def query_timegate(timegate, url, date=None, tz=\"Australia/Canberra\"):\n",
    "    \"\"\"\n",
    "    Query the specified repository for a Memento.\n",
    "    \"\"\"\n",
    "    headers = {}\n",
    "    if date:\n",
    "        formatted_date = format_date_for_headers(date, tz)\n",
    "        headers[\"Accept-Datetime\"] = formatted_date\n",
    "    elif not date:\n",
    "        formatted_date = format_date_for_headers(\n",
    "            arrow.utcnow().format(\"YYYY-MM-DD\"), tz\n",
    "        )\n",
    "        headers[\"Accept-Datetime\"] = formatted_date\n",
    "    # Note that you don't get a timegate response if you leave off the trailing slash, but extras don't hurt!\n",
    "    tg_url = (\n",
    "        f\"{TIMEGATES[timegate]}{url}/\"\n",
    "        if not url.endswith(\"/\")\n",
    "        else f\"{TIMEGATES[timegate]}{url}\"\n",
    "    )\n",
    "    # print(tg_url)\n",
    "    # IA only works if redirects are followed -- this defaults to False with HEAD requests...\n",
    "    if timegate == \"ia\":\n",
    "        allow_redirects = True\n",
    "    else:\n",
    "        allow_redirects = False\n",
    "    response = requests.head(tg_url, headers=headers, allow_redirects=allow_redirects)\n",
    "    return parse_links_from_headers(response)\n",
    "\n",
    "\n",
    "def get_memento(timegate, url, date=None, tz=\"Australia/Canberra\"):\n",
    "    \"\"\"\n",
    "    If there's no memento in the results, look for an alternative.\n",
    "    \"\"\"\n",
    "    links = query_timegate(timegate, url, date, tz)\n",
    "    # NLNZ doesn't always seem to return a Memento, so we'll build in some fuzziness\n",
    "    if links:\n",
    "        if \"memento\" in links:\n",
    "            memento = links[\"memento\"]\n",
    "        elif \"prev memento\" in links:\n",
    "            memento = links[\"prev memento\"]\n",
    "        elif \"next memento\" in links:\n",
    "            memento = links[\"next memento\"]\n",
    "        elif \"last memento\" in links:\n",
    "            memento = links[\"last memento\"]\n",
    "    else:\n",
    "        memento = None\n",
    "    return memento"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Examples\n",
    "\n",
    "Query NZWA Timegate for information about the NLNZ home page."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'original': 'http://natlib.govt.nz/',\n",
       " 'timegate': 'https://ndhadeliver.natlib.govt.nz/webarchive/http://natlib.govt.nz/',\n",
       " 'timemap': 'https://ndhadeliver.natlib.govt.nz/webarchive/timemap/link/http://natlib.govt.nz/',\n",
       " 'memento': 'https://ndhadeliver.natlib.govt.nz/webarchive/20220401095649mp_/http://natlib.govt.nz/'}"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "query_timegate(\"nzwa\", \"http://natlib.govt.nz\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Get a version of my blog from around 2005. First from the AWA:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'https://web.archive.org.au/awa/20041126212006mp_/http://www.discontents.com.au:80/'"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "get_memento(\"awa\", \"http://discontents.com.au\", \"2005-01-01\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then from the IA:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'https://web.archive.org/web/20041126212006/http://www.discontents.com.au:80/'"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "get_memento(\"ia\", \"http://discontents.com.au\", \"2005-01-01\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Get a recent memento of the British Library site:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'https://www.webarchive.org.uk/wayback/en/archive/20220405104602mp_/https://www.bl.uk/'"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "get_memento(\"ukwa\", \"http://bl.uk\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "----\n",
    "Created by [Tim Sherratt](https://timsherratt.org) for the [GLAM Workbench](https://glam-workbench.github.io). Support me by becoming a [GitHub sponsor](https://github.com/sponsors/wragge)!\n",
    "\n",
    "Work on this notebook was supported by the [IIPC Discretionary Funding Programme 2019-2020](http://netpreserve.org/projects/)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}