{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Get the archived version of a page closest to a particular date\n", "\n", "

New to Jupyter notebooks? Try Using Jupyter notebooks for a quick introduction.

\n", "\n", "To get the archived version of a page closest to a particular date we can use the Memento API. Variations in the way Memento is implemented across repositories are documented in [Getting data from web archives using Memento](memento.ipynb). The functions below smooth out these variations to provide a (mostly) consistent interface to the UK Web Archive, Australian Web Archive, New Zealand Web Archive, Internet Archive, and the UK Government Web Archive. They could be easily modified to work with other Memento-compliant repositories.\n", "\n", "To get information about available Mementos:\n", "\n", "``` python\n", "query_timegate([timegate], [url], [date], [timezone])\n", "```\n", "\n", "To get a single Memento closest to your target date:\n", "\n", "``` python\n", "get_memento([timegate], [url], [date], [timezone])\n", "```\n", "\n", "Parameters:\n", "\n", "* `timegate` – one of 'ukwa' (UK), 'awa' (Australia), 'nzwa' (New Zealand), or 'ia' (Internet Archive)\n", "* `url` – the url you want to look for in the archive\n", "* `date` – the target date in ISO format, 'YYYY-MM-DD' (optional, will default to most recent date)\n", "* `tz` – a timezone string for your local timezone (optional)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import arrow\n", "import requests" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# These are the repositories we'll be using\n", "TIMEGATES = {\n", " \"awa\": \"https://web.archive.org.au/awa/\",\n", " \"nzwa\": \"https://ndhadeliver.natlib.govt.nz/webarchive/\",\n", " \"ukwa\": \"https://www.webarchive.org.uk/wayback/en/archive/\",\n", " \"ia\": \"https://web.archive.org/web/\",\n", " \"ukgwa\": \"https://webarchive.nationalarchives.gov.uk/ukgwa/\"\n", "}\n", "\n", "\n", "def format_date_for_headers(iso_date, tz):\n", " \"\"\"\n", " Convert an ISO date (YYYY-MM-DD) to a datetime at noon in the specified timezone.\n", " Convert the datetime to UTC and format as required by Accet-Datetime headers:\n", " eg Fri, 23 Mar 2007 01:00:00 GMT\n", " \"\"\"\n", " local = arrow.get(f\"{iso_date} 12:00:00 {tz}\", \"YYYY-MM-DD HH:mm:ss ZZZ\")\n", " gmt = local.to(\"utc\")\n", " return f'{gmt.format(\"ddd, DD MMM YYYY HH:mm:ss\")} GMT'\n", "\n", "\n", "def parse_links_from_headers(response):\n", " \"\"\"\n", " Extract Memento links from 'Link' header.\n", " \"\"\"\n", " links = response.links\n", " return {k: v[\"url\"] for k, v in links.items()}\n", "\n", "\n", "def query_timegate(timegate, url, date=None, tz=\"Australia/Canberra\"):\n", " \"\"\"\n", " Query the specified repository for a Memento.\n", " \"\"\"\n", " headers = {}\n", " if date:\n", " formatted_date = format_date_for_headers(date, tz)\n", " headers[\"Accept-Datetime\"] = formatted_date\n", " elif not date:\n", " formatted_date = format_date_for_headers(\n", " arrow.utcnow().format(\"YYYY-MM-DD\"), tz\n", " )\n", " headers[\"Accept-Datetime\"] = formatted_date\n", " # Note that you don't get a timegate response if you leave off the trailing slash, but extras don't hurt!\n", " tg_url = (\n", " f\"{TIMEGATES[timegate]}{url}/\"\n", " if not url.endswith(\"/\")\n", " else f\"{TIMEGATES[timegate]}{url}\"\n", " )\n", " # print(tg_url)\n", " # IA only works if redirects are followed -- this defaults to False with HEAD requests...\n", " if timegate == \"ia\":\n", " allow_redirects = True\n", " else:\n", " allow_redirects = False\n", " response = requests.head(tg_url, headers=headers, allow_redirects=allow_redirects)\n", " return parse_links_from_headers(response)\n", "\n", "\n", "def get_memento(timegate, url, date=None, tz=\"Australia/Canberra\"):\n", " \"\"\"\n", " If there's no memento in the results, look for an alternative.\n", " \"\"\"\n", " links = query_timegate(timegate, url, date, tz)\n", " # NLNZ doesn't always seem to return a Memento, so we'll build in some fuzziness\n", " if links:\n", " if \"memento\" in links:\n", " memento = links[\"memento\"]\n", " elif \"prev memento\" in links:\n", " memento = links[\"prev memento\"]\n", " elif \"next memento\" in links:\n", " memento = links[\"next memento\"]\n", " elif \"last memento\" in links:\n", " memento = links[\"last memento\"]\n", " else:\n", " memento = None\n", " return memento" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Examples\n", "\n", "Query NZWA Timegate for information about the NLNZ home page." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'original': 'http://natlib.govt.nz/',\n", " 'timegate': 'https://ndhadeliver.natlib.govt.nz/webarchive/http://natlib.govt.nz/',\n", " 'timemap': 'https://ndhadeliver.natlib.govt.nz/webarchive/timemap/link/http://natlib.govt.nz/',\n", " 'memento': 'https://ndhadeliver.natlib.govt.nz/webarchive/20220801082654mp_/http://natlib.govt.nz/'}" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query_timegate(\"nzwa\", \"http://natlib.govt.nz\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Get a version of my blog from around 2005. First from the AWA:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'https://web.archive.org.au/awa/20041126212006mp_/http://www.discontents.com.au:80/'" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "get_memento(\"awa\", \"http://discontents.com.au\", \"2005-01-01\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then from the IA:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'https://web.archive.org/web/20041126212006/http://www.discontents.com.au:80/'" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "get_memento(\"ia\", \"http://discontents.com.au\", \"2005-01-01\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Get a recent memento of the British Library site:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'https://www.webarchive.org.uk/wayback/en/archive/20230319105859mp_/https://www.bl.uk/'" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "get_memento(\"ukwa\", \"http://bl.uk\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Get a recent memento of gov.uk from the UK Government Web Archive:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'https://webarchive.nationalarchives.gov.uk/ukgwa/20220524084104mp_/http://www.mod.uk/'" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "get_memento(\"ukgwa\", \"http://mod.uk\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "Created by [Tim Sherratt](https://timsherratt.org) for the [GLAM Workbench](https://glam-workbench.github.io). Support me by becoming a [GitHub sponsor](https://github.com/sponsors/wragge)!\n", "\n", "Work on this notebook was supported by the [IIPC Discretionary Funding Programme 2019-2020](http://netpreserve.org/projects/).\n", "\n", "The Web Archives section of the GLAM Workbench is sponsored by the [British Library](https://www.bl.uk/)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" } }, "nbformat": 4, "nbformat_minor": 4 }