{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Getting data from web archives using Memento\n", "\n", "
New to Jupyter notebooks? Try Using Jupyter notebooks for a quick introduction.
\n", "\n", "Systems supporting the Memento protocol provide machine-readable information about web archive captures, even if other APIs are not available. In this notebook we'll look at the way the Memento protocol is supported across five web archive repositories – the UK Web Archive, the National Library of Australia, the National Library of New Zealand, the Internet Archive, and the UK Government Web Archive. In particular we'll examine:\n", "\n", "* [Timegates](#Timegates) – request web page captures from (around) a particular date\n", "* [Timemaps](#Timemaps) – request a list of web archive captures from a particular url\n", "* [Mementos](#Mementos) – use url modifiers to change the way an archived web page is presented\n", "\n", "Notebooks using Timegates or Timemaps to access capture data include:\n", "\n", "* [Get the archived version of a page closest to a particular date](get_a_memento.ipynb)\n", "* [Find all the archived versions of a web page](find_all_captures.ipynb)\n", "* [Harvesting collections of text from archived web pages](getting_text_from_web_pages.ipynb)\n", "* [Compare two versions of an archived web page](show_diffs.ipynb)\n", "* [Create and compare full page screenshots from archived web pages](save_screenshot.ipynb)\n", "* [Using screenshots to visualise change in a page over time](screenshots_over_time_using_timemaps.ipynb)\n", "* [Display changes in the text of an archived web page over time](display-text-changes-from-timemap.ipynb)\n", "* [Find when a piece of text appears in an archived web page](find-text-in-page-from-timemap.ipynb)\n", "\n", "## Useful tools and documentation\n", "* [Memento Protocol Specification](https://tools.ietf.org/html/rfc7089)\n", "* [Pywb Memento implementation](https://pywb.readthedocs.io/en/latest/manual/memento.html)\n", "* [Memento support in IA Wayback](https://ws-dl.blogspot.com/2013/07/2013-07-15-wayback-machine-upgrades.html)\n", "* [Time Travel 
APIs](https://timetravel.mementoweb.org/guide/api/)\n", "* [Memento Compliance Audit of PyWB](https://ws-dl.blogspot.com/2020/03/2020-03-26-memento-compliance-audit-of.html)\n", "* [Memento tools](http://mementoweb.org/tools/)\n", "* [Memento client](https://github.com/mementoweb/py-memento-client)\n", "* [Memgator](https://github.com/oduwsdl/MemGator) – Memento aggregator" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import json\n", "import re\n", "\n", "import arrow\n", "import requests\n", "\n", "# Alternatively use the python Memento client" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# These are the repositories we'll be using\n", "TIMEGATES = {\n", " \"awa\": \"https://web.archive.org.au/awa/\",\n", " \"nzwa\": \"https://ndhadeliver.natlib.govt.nz/webarchive/\",\n", " \"ukwa\": \"https://www.webarchive.org.uk/wayback/archive/\",\n", " \"ia\": \"https://web.archive.org/web/\",\n", " \"ukgwa\": \"https://webarchive.nationalarchives.gov.uk/ukgwa/\"\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Timegates\n", "\n", "Timegates let you query a web archive for the capture closest to a specific date. You do this by supplying your target date as the `Accept-Datetime` value in the headers of your request.\n", "\n", "For example, if you wanted to query the Australian Web Archive to find the version of `http://nla.gov.au/` that was captured as close as possible to 1 January 2010, you'd set the `Accept-Datetime` header to 'Fri, 01 Jan 2010 01:00:00 GMT' and request the url:\n", "\n", "```\n", "https://web.archive.org.au/awa/http://nla.gov.au/\n", "```\n", "\n", "A `get` request will return the captured page, but if all you want is the url of the archived page you can use a `head` request and extract the information you need from the response headers. 
Try this:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'Server': 'nginx', 'Date': 'Thu, 23 Mar 2023 15:03:12 GMT', 'Content-Length': '0', 'Connection': 'keep-alive', 'Location': 'https://web.archive.org.au/awa/20100205144751/http://www.nla.gov.au/', 'Link': '