{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Get the archived version of a page closest to a particular date\n", "\n", "

New to Jupyter notebooks? Try Using Jupyter notebooks for a quick introduction.

\n", "\n", "To get the archived version of a page closest to a particular date we can use the Memento API. Variations in the way Memento is implemented across repositories are documented in [Getting data from web archives using Memento](memento.ipynb). The functions below smooth out these variations to provide a (mostly) consistent interface to the UK Web Archive, Australian Web Archive, New Zealand Web Archive, and the Internet Archive. They could be easily modified to work with other Memento-compliant repositories.\n", "\n", "To get information about available Mementos:\n", "\n", "``` python\n", "query_timegate([timegate], [url], [date], [timezone])\n", "```\n", "\n", "To get a single Memento closest to your target date:\n", "\n", "``` python\n", "get_memento([timegate], [url], [date], [timezone])\n", "```\n", "\n", "Parameters:\n", "\n", "* `timegate` – one of 'ukwa' (UK), 'awa' (Australia), 'nzwa' (New Zealand), or 'ia' (Internet Archive)\n", "* `url` – the url you want to look for in the archive\n", "* `date` – the target date in ISO format, 'YYYY-MM-DD' (optional, will default to most recent date)\n", "* `tz` – a timezone string for your local timezone (optional)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import arrow\n", "import requests" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# These are the repositories we'll be using\n", "TIMEGATES = {\n", " \"awa\": \"https://web.archive.org.au/awa/\",\n", " \"nzwa\": \"https://ndhadeliver.natlib.govt.nz/webarchive/\",\n", " \"ukwa\": \"https://www.webarchive.org.uk/wayback/en/archive/\",\n", " \"ia\": \"https://web.archive.org/web/\",\n", "}\n", "\n", "\n", "def format_date_for_headers(iso_date, tz):\n", " \"\"\"\n", " Convert an ISO date (YYYY-MM-DD) to a datetime at noon in the specified timezone.\n", " Convert the datetime to UTC and format as required by Accet-Datetime headers:\n", " eg Fri, 23 Mar 2007 01:00:00 GMT\n", " \"\"\"\n", " local = arrow.get(f\"{iso_date} 12:00:00 {tz}\", \"YYYY-MM-DD HH:mm:ss ZZZ\")\n", " gmt = local.to(\"utc\")\n", " return f'{gmt.format(\"ddd, DD MMM YYYY HH:mm:ss\")} GMT'\n", "\n", "\n", "def parse_links_from_headers(response):\n", " \"\"\"\n", " Extract Memento links from 'Link' header.\n", " \"\"\"\n", " links = response.links\n", " return {k: v[\"url\"] for k, v in links.items()}\n", "\n", "\n", "def query_timegate(timegate, url, date=None, tz=\"Australia/Canberra\"):\n", " \"\"\"\n", " Query the specified repository for a Memento.\n", " \"\"\"\n", " headers = {}\n", " if date:\n", " formatted_date = format_date_for_headers(date, tz)\n", " headers[\"Accept-Datetime\"] = formatted_date\n", " elif not date:\n", " formatted_date = format_date_for_headers(\n", " arrow.utcnow().format(\"YYYY-MM-DD\"), tz\n", " )\n", " headers[\"Accept-Datetime\"] = formatted_date\n", " # Note that you don't get a timegate response if you leave off the trailing slash, but extras don't hurt!\n", " tg_url = (\n", " f\"{TIMEGATES[timegate]}{url}/\"\n", " if not url.endswith(\"/\")\n", " else f\"{TIMEGATES[timegate]}{url}\"\n", " )\n", " # print(tg_url)\n", " # IA only works if redirects are followed -- this defaults to False with HEAD requests...\n", " if timegate == \"ia\":\n", " allow_redirects = True\n", " else:\n", " allow_redirects = False\n", " response = requests.head(tg_url, headers=headers, allow_redirects=allow_redirects)\n", " return parse_links_from_headers(response)\n", "\n", "\n", "def get_memento(timegate, url, date=None, tz=\"Australia/Canberra\"):\n", " \"\"\"\n", " If there's no memento in the results, look for an alternative.\n", " \"\"\"\n", " links = query_timegate(timegate, url, date, tz)\n", " # NLNZ doesn't always seem to return a Memento, so we'll build in some fuzziness\n", " if links:\n", " if \"memento\" in links:\n", " memento = links[\"memento\"]\n", " elif \"prev memento\" in links:\n", " memento = links[\"prev memento\"]\n", " elif \"next memento\" in links:\n", " memento = links[\"next memento\"]\n", " elif \"last memento\" in links:\n", " memento = links[\"last memento\"]\n", " else:\n", " memento = None\n", " return memento" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Examples\n", "\n", "Query NZWA Timegate for information about the NLNZ home page." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'original': 'http://natlib.govt.nz/',\n", " 'timegate': 'https://ndhadeliver.natlib.govt.nz/webarchive/http://natlib.govt.nz/',\n", " 'timemap': 'https://ndhadeliver.natlib.govt.nz/webarchive/timemap/link/http://natlib.govt.nz/',\n", " 'memento': 'https://ndhadeliver.natlib.govt.nz/webarchive/20220401095649mp_/http://natlib.govt.nz/'}" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query_timegate(\"nzwa\", \"http://natlib.govt.nz\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Get a version of my blog from around 2005. First from the AWA:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'https://web.archive.org.au/awa/20041126212006mp_/http://www.discontents.com.au:80/'" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "get_memento(\"awa\", \"http://discontents.com.au\", \"2005-01-01\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then from the IA:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'https://web.archive.org/web/20041126212006/http://www.discontents.com.au:80/'" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "get_memento(\"ia\", \"http://discontents.com.au\", \"2005-01-01\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Get a recent memento of the British Library site:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'https://www.webarchive.org.uk/wayback/en/archive/20220405104602mp_/https://www.bl.uk/'" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "get_memento(\"ukwa\", \"http://bl.uk\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "Created by [Tim Sherratt](https://timsherratt.org) for the [GLAM Workbench](https://glam-workbench.github.io). Support me by becoming a [GitHub sponsor](https://github.com/sponsors/wragge)!\n", "\n", "Work on this notebook was supported by the [IIPC Discretionary Funding Programme 2019-2020](http://netpreserve.org/projects/)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.12" } }, "nbformat": 4, "nbformat_minor": 4 }