{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Get the archived version of a page closest to a particular date\n",
    "\n",
    "<p class=\"alert alert-info\">New to Jupyter notebooks? Try <a href=\"getting-started/Using_Jupyter_notebooks.ipynb\"><b>Using Jupyter notebooks</b></a> for a quick introduction.</p>\n",
    "\n",
    "To get the archived version of a page closest to a particular date we can use the Memento API. Variations in the way Memento is implemented across repositories are documented in [Getting data from web archives using Memento](memento.ipynb). The functions below smooth out these variations to provide a (mostly) consistent interface to the UK Web Archive, Australian Web Archive, New Zealand Web Archive, and the Internet Archive. They could be easily modified to work with other Memento-compliant repositories.\n",
    "\n",
    "To get information about available Mementos:\n",
    "\n",
    "``` python\n",
    "query_timegate([timegate], [url], [date], [timezone])\n",
    "```\n",
    "\n",
    "To get a single Memento closest to your target date:\n",
    "\n",
    "``` python\n",
    "get_memento([timegate], [url], [date], [timezone])\n",
    "```\n",
    "\n",
    "Parameters:\n",
    "\n",
    "* `timegate` – one of 'ukwa' (UK), 'awa' (Australia), 'nzwa' (New Zealand), or 'ia' (Internet Archive)\n",
    "* `url` – the url you want to look for in the archive\n",
    "* `date` – the target date in ISO format, 'YYYY-MM-DD' (optional, will default to most recent date)\n",
    "* `tz` – a timezone string for your local timezone (optional)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests\n",
    "import arrow\n",
    "import re\n",
    "import json"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [],
   "source": [
    "# These are the repositories we'll be using\n",
    "TIMEGATES = {\n",
    "    'awa': 'https://web.archive.org.au/awa/',\n",
    "    'nzwa': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/',\n",
    "    'ukwa': 'https://www.webarchive.org.uk/wayback/en/archive/',\n",
    "    'ia': 'https://web.archive.org/web/'\n",
    "}\n",
    "\n",
    "def format_date_for_headers(iso_date, tz):\n",
    "    '''\n",
    "    Convert an ISO date (YYYY-MM-DD) to a datetime at noon in the specified timezone.\n",
    "    Convert the datetime to UTC and format as required by Accet-Datetime headers:\n",
    "    eg Fri, 23 Mar 2007 01:00:00 GMT\n",
    "    '''\n",
    "    local = arrow.get(f'{iso_date} 12:00:00 {tz}', 'YYYY-MM-DD HH:mm:ss ZZZ')\n",
    "    gmt = local.to('utc')\n",
    "    return f'{gmt.format(\"ddd, DD MMM YYYY HH:mm:ss\")} GMT'\n",
    "\n",
    "def parse_links_from_headers(response):\n",
    "    '''\n",
    "    Extract Memento links from 'Link' header.\n",
    "    '''\n",
    "    links = response.links\n",
    "    return {k: v['url'] for k, v in links.items()}\n",
    "\n",
    "def query_timegate(timegate, url, date=None, tz='Australia/Canberra'):\n",
    "    '''\n",
    "    Query the specified repository for a Memento.\n",
    "    '''\n",
    "    headers = {}\n",
    "    if date:\n",
    "        formatted_date = format_date_for_headers(date, tz)\n",
    "        headers['Accept-Datetime'] = formatted_date\n",
    "    # BL & NLNZ don't seem to default to latest date if no date supplied\n",
    "    elif not date and timegate in ['bl', 'nlnz']:\n",
    "        formatted_date = format_date_for_headers(arrow.utcnow().format('YYYY-MM-DD'), tz)\n",
    "        headers['Accept-Datetime'] = formatted_date\n",
    "    # Note that you don't get a timegate response if you leave off the trailing slash, but extras don't hurt!\n",
    "    tg_url = f'{TIMEGATES[timegate]}{url}/' if not url.endswith('/') else f'{TIMEGATES[timegate]}{url}'\n",
    "    # print(tg_url)\n",
    "    # IA only works if redirects are followed -- this defaults to False with HEAD requests...\n",
    "    if timegate == 'ia':\n",
    "        allow_redirects = True\n",
    "    else:\n",
    "        allow_redirects = False\n",
    "    response = requests.head(tg_url, headers=headers, allow_redirects=allow_redirects)\n",
    "    return parse_links_from_headers(response)\n",
    "\n",
    "def get_memento(timegate, url, date=None, tz='Australia/Canberra'):\n",
    "    '''\n",
    "    If there's no memento in the results, look for an alternative.\n",
    "    '''\n",
    "    links = query_timegate(timegate, url, date, tz)\n",
    "    # NLNZ doesn't always seem to return a Memento, so we'll build in some fuzziness\n",
    "    if links:\n",
    "        if 'memento' in links:\n",
    "            memento = links['memento']\n",
    "        elif 'prev memento' in links:\n",
    "            memento = links['prev memento']\n",
    "        elif 'next memento' in links:\n",
    "            memento = links['next memento']\n",
    "        elif 'last memento' in links:\n",
    "            memento = links['last memento']\n",
    "    else:\n",
    "        memento = None\n",
    "    return memento"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Examples\n",
    "\n",
    "Query NZWA Timegate for information about the NLNZ home page."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'original': 'http://natlib.govt.nz/',\n",
       " 'timemap': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/timemap/link/http://natlib.govt.nz/',\n",
       " 'last memento': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/20200130060111/http://natlib.govt.nz/',\n",
       " 'first memento': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/20040711213225/http://natlib.govt.nz/',\n",
       " 'prev memento': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/20200130060106/http://natlib.govt.nz/'}"
      ]
     },
     "execution_count": 39,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "query_timegate('nzwa', 'http://natlib.govt.nz')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Get a version of my blog from around 2005. First from the AWA:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'https://web.archive.org.au/awa/20041126212006mp_/http://www.discontents.com.au:80/'"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "get_memento('awa', 'http://discontents.com.au', '2005-01-01')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then from the IA:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'https://web.archive.org/web/20041126212006/http://www.discontents.com.au:80/'"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "get_memento('ia', 'http://discontents.com.au', '2005-01-01')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "----\n",
    "Created by [Tim Sherratt](https://timsherratt.org) for the [GLAM Workbench](https://glam-workbench.github.io).\n",
    "\n",
    "Work on this notebook was supported by the [IIPC Discretionary Funding Programme 2019-2020](http://netpreserve.org/projects/)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}