{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Get the archived version of a page closest to a particular date\n", "\n", "

New to Jupyter notebooks? Try Using Jupyter notebooks for a quick introduction.

\n", "\n", "To get the archived version of a page closest to a particular date, we can use the Memento API. Variations in the way Memento is implemented across repositories are documented in [Getting data from web archives using Memento](memento.ipynb). The functions below smooth out these variations to provide a (mostly) consistent interface to the UK Web Archive, Australian Web Archive, New Zealand Web Archive, and the Internet Archive. They could easily be modified to work with other Memento-compliant repositories.\n", "\n", "To get information about available Mementos:\n", "\n", "``` python\n", "query_timegate([timegate], [url], [date], [tz])\n", "```\n", "\n", "To get a single Memento closest to your target date:\n", "\n", "``` python\n", "get_memento([timegate], [url], [date], [tz])\n", "```\n", "\n", "Parameters:\n", "\n", "* `timegate` – one of 'ukwa' (UK), 'awa' (Australia), 'nzwa' (New Zealand), or 'ia' (Internet Archive)\n", "* `url` – the URL you want to look for in the archive\n", "* `date` – the target date in ISO format, 'YYYY-MM-DD' (optional, defaults to the most recent date)\n", "* `tz` – a timezone string for your local timezone (optional, defaults to 'Australia/Canberra')" ] },
{ "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "import requests\n", "import arrow\n", "import re\n", "import json" ] },
{ "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "# These are the repositories we'll be using\n", "TIMEGATES = {\n", "    'awa': 'https://web.archive.org.au/awa/',\n", "    'nzwa': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/',\n", "    'ukwa': 'https://www.webarchive.org.uk/wayback/en/archive/',\n", "    'ia': 'https://web.archive.org/web/'\n", "}\n", "\n", "def format_date_for_headers(iso_date, tz):\n", "    '''\n", "    Convert an ISO date (YYYY-MM-DD) to a datetime at noon in the specified timezone.\n", "    Convert the datetime to UTC and format as required by Accept-Datetime headers:\n", "    eg Fri, 23 Mar 2007 01:00:00 GMT\n", "    '''\n", "    local = arrow.get(f'{iso_date} 12:00:00 {tz}', 'YYYY-MM-DD HH:mm:ss ZZZ')\n", "    gmt = local.to('utc')\n", "    return f'{gmt.format(\"ddd, DD MMM YYYY HH:mm:ss\")} GMT'\n", "\n", "def parse_links_from_headers(response):\n", "    '''\n", "    Extract Memento links from 'Link' header.\n", "    Returns a dict mapping each link relation (eg 'memento') to its url.\n", "    '''\n", "    links = response.links\n", "    return {k: v['url'] for k, v in links.items()}\n", "\n", "def query_timegate(timegate, url, date=None, tz='Australia/Canberra'):\n", "    '''\n", "    Query the specified repository for a Memento.\n", "    '''\n", "    headers = {}\n", "    if date:\n", "        formatted_date = format_date_for_headers(date, tz)\n", "        headers['Accept-Datetime'] = formatted_date\n", "    # UKWA & NZWA don't seem to default to latest date if no date supplied,\n", "    # so send today's date instead (keys must match those in TIMEGATES)\n", "    elif timegate in ['ukwa', 'nzwa']:\n", "        formatted_date = format_date_for_headers(arrow.utcnow().format('YYYY-MM-DD'), tz)\n", "        headers['Accept-Datetime'] = formatted_date\n", "    # Note that you don't get a timegate response if you leave off the trailing slash, but extras don't hurt!\n", "    tg_url = f'{TIMEGATES[timegate]}{url}/' if not url.endswith('/') else f'{TIMEGATES[timegate]}{url}'\n", "    # print(tg_url)\n", "    # IA only works if redirects are followed -- this defaults to False with HEAD requests...\n", "    allow_redirects = timegate == 'ia'\n", "    response = requests.head(tg_url, headers=headers, allow_redirects=allow_redirects)\n", "    return parse_links_from_headers(response)\n", "\n", "def get_memento(timegate, url, date=None, tz='Australia/Canberra'):\n", "    '''\n", "    Get the url of the Memento closest to the target date.\n", "    If there's no memento in the results, look for an alternative.\n", "    Returns None if no Memento links are found.\n", "    '''\n", "    links = query_timegate(timegate, url, date, tz)\n", "    # Default to None so there's always a value to return\n", "    memento = None\n", "    # NLNZ doesn't always seem to return a Memento, so we'll build in some fuzziness\n", "    if 'memento' in links:\n", "        memento = links['memento']\n", "    elif 'prev memento' in links:\n", "        memento = links['prev memento']\n", "    elif 'next memento' in links:\n", "        memento = links['next memento']\n", "    elif 'last memento' in links:\n", "        memento = links['last memento']\n", "    return memento" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Examples\n", "\n", "Query the NZWA Timegate for information about the NLNZ home page." ] },
{ "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'original': 'http://natlib.govt.nz/',\n", " 'timemap': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/timemap/link/http://natlib.govt.nz/',\n", " 'last memento': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/20200130060111/http://natlib.govt.nz/',\n", " 'first memento': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/20040711213225/http://natlib.govt.nz/',\n", " 'prev memento': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/20200130060106/http://natlib.govt.nz/'}" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query_timegate('nzwa', 'http://natlib.govt.nz')" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Get a version of my blog from around 2005. 
First from the AWA:" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'https://web.archive.org.au/awa/20041126212006mp_/http://www.discontents.com.au:80/'" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "get_memento('awa', 'http://discontents.com.au', '2005-01-01')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then from the IA:" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'https://web.archive.org/web/20041126212006/http://www.discontents.com.au:80/'" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "get_memento('ia', 'http://discontents.com.au', '2005-01-01')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "Created by [Tim Sherratt](https://timsherratt.org) for the [GLAM Workbench](https://glam-workbench.github.io).\n", "\n", "Work on this notebook was supported by the [IIPC Discretionary Funding Programme 2019-2020](http://netpreserve.org/projects/)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" } }, "nbformat": 4, "nbformat_minor": 4 }