{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Getting data from web archives using Memento\n", "\n", "

New to Jupyter notebooks? Try Using Jupyter notebooks for a quick introduction.

\n", "\n", "Systems supporting the Memento protocol provide machine-readable information about web archive captures, even if other APIs are not available. In this notebook we'll look at the way the Memento protocol is supported across four web archive repositories – the UK Web Archive, the National Library of Australia, the National Library of New Zealand, and the Internet Archive. In particular we'll examine:\n", "\n", "* [Timegates](#Timegates) – request web page captures from (around) a particular date\n", "* [Timemaps](#Timemaps) – request a list of web archive captures from a particular url\n", "* [Mementos](#Mementos) – use url modifiers to change the way an archived web page is presented\n", "\n", "Notebooks using Timegates or Timemaps to access capture data include:\n", "\n", "* [Get the archived version of a page closest to a particular date](get_a_memento.ipynb)\n", "* [Find all the archived versions of a web page](find_all_captures.ipynb)\n", "* [Harvesting collections of text from archived web pages](getting_text_from_web_pages.ipynb)\n", "* [Compare two versions of an archived web page](show_diffs.ipynb)\n", "* [Create and compare full page screenshots from archived web pages](save_screenshot.ipynb)\n", "* [Using screenshots to visualise change in a page over time](screenshots_over_time_using_timemaps.ipynb)\n", "* [Display changes in the text of an archived web page over time](display-text-changes-from-timemap.ipynb)\n", "* [Find when a piece of text appears in an archived web page](find-text-in-page-from-timemap.ipynb)\n", "\n", "## Useful tools and documentation\n", "* [Memento Protocol Specification](https://tools.ietf.org/html/rfc7089)\n", "* [Pywb Memento implementation](https://pywb.readthedocs.io/en/latest/manual/memento.html)\n", "* [Memento support in IA Wayback](https://ws-dl.blogspot.com/2013/07/2013-07-15-wayback-machine-upgrades.html)\n", "* [Time Travel APIs](https://timetravel.mementoweb.org/guide/api/)\n", "* [Memento Compliance Audit 
of PyWB](https://ws-dl.blogspot.com/2020/03/2020-03-26-memento-compliance-audit-of.html)\n", "* [Memento tools](http://mementoweb.org/tools/)\n", "* [Memento client](https://github.com/mementoweb/py-memento-client)\n", "* [MemGator](https://github.com/oduwsdl/MemGator) – Memento aggregator" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import requests\n", "import arrow\n", "import re\n", "import json\n", "\n", "# Alternatively use the python Memento client " ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# These are the repositories we'll be using\n", "TIMEGATES = {\n", "    'awa': 'https://web.archive.org.au/awa/',\n", "    'nzwa': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/',\n", "    'ukwa': 'https://www.webarchive.org.uk/wayback/archive/',\n", "    'ia': 'https://web.archive.org/web/'\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Timegates\n", "\n", "Timegates let you query a web archive for the capture closest to a specific date. You do this by supplying your target date as the `Accept-Datetime` value in the headers of your request.\n", "\n", "For example, if you wanted to query the Australian Web Archive to find the version of `http://nla.gov.au/` that was captured as close as possible to 1 January 2010, you'd set the `Accept-Datetime` header to 'Fri, 01 Jan 2010 01:00:00 GMT' and request the url:\n", "\n", "```\n", "https://web.archive.org.au/awa/http://nla.gov.au/\n", "```\n", "\n", "A `get` request will return the captured page, but if all you want is the url of the archived page you can use a `head` request and extract the information you need from the response headers. 
Try this:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'Server': 'nginx', 'Date': 'Fri, 22 May 2020 02:40:23 GMT', 'Content-Length': '0', 'Connection': 'keep-alive', 'Location': 'https://web.archive.org.au/awa/20100205144227/http://nla.gov.au/', 'Link': '; rel=\"original\", ; rel=\"timegate\", ; rel=\"timemap\"; type=\"application/link-format\", ; rel=\"memento\"; datetime=\"Fri, 05 Feb 2010 14:42:27 GMT\"', 'Vary': 'accept-datetime'}" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "response = requests.head('https://web.archive.org.au/awa/http://nla.gov.au/', headers={'Accept-Datetime': 'Fri, 01 Jan 2010 01:00:00 GMT'})\n", "response.headers" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The request above returns the following headers:\n", "\n", "``` python\n", "{\n", " 'Server': 'nginx', \n", " 'Date': 'Wed, 06 May 2020 04:34:50 GMT', \n", " 'Content-Length': '0', 'Connection': 'keep-alive', \n", " 'Location': 'https://web.archive.org.au/awa/20100205144227/http://nla.gov.au/', \n", " 'Link': '; rel=\"original\", ; rel=\"timegate\", ; rel=\"timemap\"; type=\"application/link-format\", ; rel=\"memento\"; datetime=\"Fri, 05 Feb 2010 14:42:27 GMT\"', \n", " 'Vary': 'accept-datetime'\n", "}\n", "```\n", "\n", "The `Link` header contains the Memento information. You can see that it's actually providing information on four types of link:\n", "\n", "* the `original` url (ie the url that was archived) – ``\n", "* the `timegate` for the harvested url (which is what we just used) – ``\n", "* the `timemap` for the harvested url (we'll look at this below) – ``\n", "* the `memento` – ``\n", "\n", "The `memento` link is the capture closest in time to the date we requested. In this case there's only about a month's difference, but of course this will depend on how frequently a url is captured. Opening the link will display the capture in the web archive. 
As we'll see below, some systems provide additional links such as `first memento`, `last memento`, `prev memento`, and `next memento`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here are some functions to query a timegate in one of the four systems we're exploring. We'll use them to compare the results we get from each." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "def format_date_for_headers(iso_date, tz):\n", "    '''\n", "    Convert an ISO date (YYYY-MM-DD) to a datetime at noon in the specified timezone.\n", "    Convert the datetime to UTC and format as required by Accept-Datetime headers:\n", "    eg Fri, 23 Mar 2007 01:00:00 GMT\n", "    '''\n", "    local = arrow.get(f'{iso_date} 12:00:00 {tz}', 'YYYY-MM-DD HH:mm:ss ZZZ')\n", "    gmt = local.to('utc')\n", "    return f'{gmt.format(\"ddd, DD MMM YYYY HH:mm:ss\")} GMT'\n", "\n", "def parse_links_from_headers(response):\n", "    '''\n", "    Extract original, timegate, timemap, and memento links from 'Link' header.\n", "    '''\n", "    links = response.links\n", "    return {k: v['url'] for k, v in links.items()}\n", "\n", "def format_timestamp(timestamp, date_format='YYYY-MM-DD HH:mm:ss'):\n", "    return arrow.get(timestamp, 'YYYYMMDDHHmmss').format(date_format)\n", "\n", "def query_timegate(timegate, url, date=None, tz='Australia/Canberra', request_type='head', allow_redirects=True):\n", "    headers = {}\n", "    if date:\n", "        formatted_date = format_date_for_headers(date, tz)\n", "        headers['Accept-Datetime'] = formatted_date\n", "    # Note that you don't get a timegate response if you leave off the trailing slash\n", "    tg_url = f'{TIMEGATES[timegate]}{url}/' if not url.endswith('/') else f'{TIMEGATES[timegate]}{url}'\n", "    print(tg_url)\n", "    if request_type == 'head':\n", "        response = requests.head(tg_url, headers=headers, allow_redirects=allow_redirects)\n", "    else:\n", "        response = requests.get(tg_url, headers=headers, allow_redirects=allow_redirects)\n", "    # 
print(response.headers)\n", " return parse_links_from_headers(response)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Australian Web Archive\n", "\n", "A `HEAD` request that follows redirects returns no results" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "https://web.archive.org.au/awa/http://www.nla.gov.au/\n" ] }, { "data": { "text/plain": [ "{}" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query_timegate('awa', 'http://www.nla.gov.au')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "A `HEAD` request that doesn't follow redirects returns results as expected" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "https://web.archive.org.au/awa/http://www.nla.gov.au/\n" ] }, { "data": { "text/plain": [ "{'original': 'http://pandora.nla.gov.au/pan/161756/20200306-0200/www.nla.gov.au/index.html',\n", " 'timegate': 'https://web.archive.org.au/awa/http://pandora.nla.gov.au/pan/161756/20200306-0200/www.nla.gov.au/index.html',\n", " 'timemap': 'https://web.archive.org.au/awa/timemap/link/http://pandora.nla.gov.au/pan/161756/20200306-0200/www.nla.gov.au/index.html',\n", " 'memento': 'https://web.archive.org.au/awa/20200305172547mp_/http://pandora.nla.gov.au/pan/161756/20200306-0200/www.nla.gov.au/index.html'}" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query_timegate('awa', 'http://www.nla.gov.au', allow_redirects=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "A query without an `Accept-Datetime` value returns a recent capture." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "https://web.archive.org.au/awa/http://www.nla.gov.au/\n" ] }, { "data": { "text/plain": [ "{'original': 'http://pandora.nla.gov.au/pan/161756/20200306-0200/www.nla.gov.au/index.html',\n", " 'timegate': 'https://web.archive.org.au/awa/http://pandora.nla.gov.au/pan/161756/20200306-0200/www.nla.gov.au/index.html',\n", " 'timemap': 'https://web.archive.org.au/awa/timemap/link/http://pandora.nla.gov.au/pan/161756/20200306-0200/www.nla.gov.au/index.html',\n", " 'memento': 'https://web.archive.org.au/awa/20200305172547mp_/http://pandora.nla.gov.au/pan/161756/20200306-0200/www.nla.gov.au/index.html'}" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query_timegate('awa', 'http://www.nla.gov.au', allow_redirects=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "\n", "A query with an `Accept-Datetime` value of 1 January 2002 returns a capture from 20 January 2002." 
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "https://web.archive.org.au/awa/http://www.education.gov.au/\n" ] }, { "data": { "text/plain": [ "{'original': 'http://www.education.gov.au:80/',\n", " 'timegate': 'https://web.archive.org.au/awa/http://www.education.gov.au:80/',\n", " 'timemap': 'https://web.archive.org.au/awa/timemap/link/http://www.education.gov.au:80/',\n", " 'memento': 'https://web.archive.org.au/awa/20020120171009mp_/http://www.education.gov.au:80/'}" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query_timegate('awa', 'http://www.education.gov.au/', date='2002-01-01', allow_redirects=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "\n", "Using a `GET` rather than a `HEAD` request returns no Memento information when redirects are followed." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "https://web.archive.org.au/awa/http://www.education.gov.au/\n" ] }, { "data": { "text/plain": [ "{}" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query_timegate('awa', 'http://www.education.gov.au/', date='2002-01-01', request_type='get')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "\n", "Using a `GET` rather than a `HEAD` request returns Memento information when redirects are not followed." 
] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "https://web.archive.org.au/awa/http://www.education.gov.au/\n" ] }, { "data": { "text/plain": [ "{'original': 'http://www.education.gov.au:80/',\n", " 'timegate': 'https://web.archive.org.au/awa/http://www.education.gov.au:80/',\n", " 'timemap': 'https://web.archive.org.au/awa/timemap/link/http://www.education.gov.au:80/',\n", " 'memento': 'https://web.archive.org.au/awa/20020120171009mp_/http://www.education.gov.au:80/'}" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query_timegate('awa', 'http://www.education.gov.au/', date='2002-01-01', request_type='get', allow_redirects=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### New Zealand Web Archive\n", "\n", "Changing whether or not redirects are followed has no effect on any of these responses.\n", "\n", "A query without an `Accept-Datetime` value doesn't return a `memento`, but does include `first memento`, `last memento`, and `prev memento`." 
] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "https://ndhadeliver.natlib.govt.nz/webarchive/wayback/http://natlib.govt.nz/\n" ] }, { "data": { "text/plain": [ "{'original': 'https://natlib.govt.nz/',\n", " 'timemap': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/timemap/link/https://natlib.govt.nz/',\n", " 'timegate': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/https://natlib.govt.nz/',\n", " 'last memento': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/20200130060111/https://natlib.govt.nz/',\n", " 'first memento': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/20040711213225/https://natlib.govt.nz/',\n", " 'prev memento': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/20200130060106/https://natlib.govt.nz/'}" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query_timegate('nzwa', 'http://natlib.govt.nz')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "\n", "A query with an `Accept-Datetime` value of 1 January 2005 doesn't return a `memento`, even though there's a capture available from July 2004. I don't know why this is." 
] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "https://ndhadeliver.natlib.govt.nz/webarchive/wayback/http://natlib.govt.nz/\n" ] }, { "data": { "text/plain": [ "{'original': 'http://www.natlib.govt.nz/',\n", " 'timemap': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/timemap/link/http://www.natlib.govt.nz/',\n", " 'timegate': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/http://www.natlib.govt.nz/',\n", " 'first memento': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/20040711213225/http://www.natlib.govt.nz/',\n", " 'next memento': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/20060704033135/http://www.natlib.govt.nz/',\n", " 'last memento': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/20200130060111/http://www.natlib.govt.nz/'}" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query_timegate('nzwa', 'http://natlib.govt.nz', date='2005-01-01')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "\n", "A query with an `Accept-Datetime` value of 1 January 2008 returns a `memento` from 25 February 2008, as well as `first memento`, `last memento`, `prev memento`, and `next memento`." 
] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "https://ndhadeliver.natlib.govt.nz/webarchive/wayback/http://natlib.govt.nz/\n" ] }, { "data": { "text/plain": [ "{'original': 'http://www.natlib.govt.nz/',\n", " 'timemap': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/timemap/link/http://www.natlib.govt.nz/',\n", " 'timegate': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/http://www.natlib.govt.nz/',\n", " 'first memento': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/20040711213225/http://www.natlib.govt.nz/',\n", " 'prev memento': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/20070322041546/http://www.natlib.govt.nz/',\n", " 'memento': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/20080225060238/http://www.natlib.govt.nz/',\n", " 'next memento': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/20081019225343/http://www.natlib.govt.nz/',\n", " 'last memento': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/20200130060111/http://www.natlib.govt.nz/'}" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query_timegate('nzwa', 'http://natlib.govt.nz', date='2008-01-01')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "\n", "A `GET` request returns the same results as a `HEAD` request." 
] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "https://ndhadeliver.natlib.govt.nz/webarchive/wayback/http://natlib.govt.nz/\n" ] }, { "data": { "text/plain": [ "{'original': 'http://www.natlib.govt.nz/',\n", " 'timemap': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/timemap/link/http://www.natlib.govt.nz/',\n", " 'timegate': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/http://www.natlib.govt.nz/',\n", " 'first memento': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/20040711213225/http://www.natlib.govt.nz/',\n", " 'prev memento': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/20070322041546/http://www.natlib.govt.nz/',\n", " 'memento': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/20080225060238/http://www.natlib.govt.nz/',\n", " 'next memento': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/20081019225343/http://www.natlib.govt.nz/',\n", " 'last memento': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/20200130060111/http://www.natlib.govt.nz/'}" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query_timegate('nzwa', 'http://natlib.govt.nz', date='2008-01-01', request_type='get')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Internet Archive\n", "\n", "Using a `HEAD` request that follows redirects returns results as expected." 
] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "https://web.archive.org/web/http://discontents.com.au/\n" ] }, { "data": { "text/plain": [ "{'original': 'https://discontents.com.au/',\n", " 'timemap': 'https://web.archive.org/web/timemap/link/https://discontents.com.au/',\n", " 'timegate': 'https://web.archive.org/web/https://discontents.com.au/',\n", " 'first memento': 'https://web.archive.org/web/19981206012233/http://www.discontents.com.au:80/',\n", " 'prev memento': 'https://web.archive.org/web/20200418035854/http://www.discontents.com.au/',\n", " 'memento': 'https://web.archive.org/web/20200510215639/https://discontents.com.au/',\n", " 'last memento': 'https://web.archive.org/web/20200510215639/https://discontents.com.au/'}" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query_timegate('ia', 'http://discontents.com.au')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "Using a `HEAD` request returns no Memento information if redirects are not followed." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "https://web.archive.org/web/http://discontents.com.au/\n" ] }, { "data": { "text/plain": [ "{}" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query_timegate('ia', 'http://discontents.com.au', allow_redirects=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "\n", "A query without an `Accept-Datetime` value returns a `memento` and also includes a `first memento`, `prev memento`, and `last memento`. It seems that the `memento` returned is the second last capture."
] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "https://web.archive.org/web/http://discontents.com.au/\n" ] }, { "data": { "text/plain": [ "{'original': 'https://discontents.com.au/',\n", " 'timemap': 'https://web.archive.org/web/timemap/link/https://discontents.com.au/',\n", " 'timegate': 'https://web.archive.org/web/https://discontents.com.au/',\n", " 'first memento': 'https://web.archive.org/web/19981206012233/http://www.discontents.com.au:80/',\n", " 'prev memento': 'https://web.archive.org/web/20200418035854/http://www.discontents.com.au/',\n", " 'memento': 'https://web.archive.org/web/20200510215639/https://discontents.com.au/',\n", " 'last memento': 'https://web.archive.org/web/20200510215639/https://discontents.com.au/'}" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query_timegate('ia', 'http://discontents.com.au')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "\n", "A query with an `Accept-Datetime` value of 1 January 2010 returns a `memento` from 9 February 2010, which is closer to the requested date than the `prev memento` from 30 October 2009. It also includes `first memento`, `next memento`, and `last memento`."
] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "https://web.archive.org/web/http://discontents.com.au/\n" ] }, { "data": { "text/plain": [ "{'original': 'http://discontents.com.au:80/',\n", " 'timemap': 'https://web.archive.org/web/timemap/link/http://discontents.com.au:80/',\n", " 'timegate': 'https://web.archive.org/web/http://discontents.com.au:80/',\n", " 'first memento': 'https://web.archive.org/web/19981206012233/http://www.discontents.com.au:80/',\n", " 'prev memento': 'https://web.archive.org/web/20091030053520/http://discontents.com.au/',\n", " 'memento': 'https://web.archive.org/web/20100209041537/http://discontents.com.au:80/',\n", " 'next memento': 'https://web.archive.org/web/20100523101442/http://discontents.com.au:80/',\n", " 'last memento': 'https://web.archive.org/web/20200510215639/https://discontents.com.au/'}" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query_timegate('ia', 'http://discontents.com.au', date='2010-01-01')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "`GET` requests return different results if redirects are not followed." 
] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "https://web.archive.org/web/http://discontents.com.au/\n" ] }, { "data": { "text/plain": [ "{'original': 'http://discontents.com.au:80/',\n", " 'timemap': 'https://web.archive.org/web/timemap/link/http://discontents.com.au:80/',\n", " 'timegate': 'https://web.archive.org/web/http://discontents.com.au:80/',\n", " 'first memento': 'https://web.archive.org/web/19981206012233/http://www.discontents.com.au:80/',\n", " 'prev memento': 'https://web.archive.org/web/20091030053520/http://discontents.com.au/',\n", " 'memento': 'https://web.archive.org/web/20100209041537/http://discontents.com.au:80/',\n", " 'next memento': 'https://web.archive.org/web/20100523101442/http://discontents.com.au:80/',\n", " 'last memento': 'https://web.archive.org/web/20200510215639/https://discontents.com.au/'}" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query_timegate('ia', 'http://discontents.com.au', date='2010-01-01', request_type='get')" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "https://web.archive.org/web/http://discontents.com.au/\n" ] }, { "data": { "text/plain": [ "{'original': 'http://discontents.com.au/',\n", " 'memento': 'https://web.archive.org/web/20100209041537/http://discontents.com.au/',\n", " 'timemap': 'https://web.archive.org/web/timemap/link/http://discontents.com.au/'}" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query_timegate('ia', 'http://discontents.com.au', date='2010-01-01', request_type='get', allow_redirects=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### UK Web Archive\n", "\n", "Changing whether or not redirects are followed has no effect on any of these responses.\n", "\n", "A query without an `Accept-Datetime` 
value doesn't return a `memento`." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "https://www.webarchive.org.uk/wayback/archive/http://bl.uk/\n" ] }, { "data": { "text/plain": [ "{'original': 'http://bl.uk/',\n", " 'timegate': 'https://www.webarchive.org.uk/wayback/archive/http://bl.uk/',\n", " 'timemap': 'https://www.webarchive.org.uk/wayback/archive/timemap/link/http://bl.uk/'}" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query_timegate('ukwa', 'http://bl.uk')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "\n", "A query with an `Accept-Datetime` value of 1 January 2006 returns a `memento` from 1 January 2006. However, this date doesn't seem to represent an actual capture. There seems to be a problem with the Timegate." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "https://www.webarchive.org.uk/wayback/archive/http://bl.uk/\n" ] }, { "data": { "text/plain": [ "{'original': 'http://bl.uk/',\n", " 'timegate': 'https://www.webarchive.org.uk/wayback/archive/http://bl.uk/',\n", " 'timemap': 'https://www.webarchive.org.uk/wayback/archive/timemap/link/http://bl.uk/',\n", " 'memento': 'https://www.webarchive.org.uk/wayback/archive/20060101010000mp_/http://bl.uk/'}" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query_timegate('ukwa', 'http://bl.uk', date='2006-01-01')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "\n", "A `GET` request returns the same results as a `HEAD` request." 
] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "https://www.webarchive.org.uk/wayback/archive/http://bl.uk/\n" ] }, { "data": { "text/plain": [ "{'original': 'http://bl.uk/',\n", " 'timegate': 'https://www.webarchive.org.uk/wayback/archive/http://bl.uk/',\n", " 'timemap': 'https://www.webarchive.org.uk/wayback/archive/timemap/link/http://bl.uk/',\n", " 'memento': 'https://www.webarchive.org.uk/wayback/archive/20060101010000mp_/http://bl.uk/'}" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query_timegate('ukwa', 'http://bl.uk', date='2006-01-01', request_type='get')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Summarising the differences\n", "\n", "As you can see above, there are a couple of significant differences in the way that Timegates behave across the four repositories.\n", "\n", "* The Wayback systems (IA and NZWA) provide more information than the Pywb systems: they also return `first memento`, `last memento`, `prev memento`, and `next memento` links.\n", "* The UKWA and NZWA don't return a `memento` unless you include a date in the `Accept-Datetime` header. The AWA and IA return a recently captured `memento` as a default. (Though not necessarily the *most* recent?)\n", "* You can use either `HEAD` or `GET` with UKWA and NZWA, but IA and AWA behave differently depending on the type of request and whether redirects are followed. To get results from either a `HEAD` or `GET` request, AWA requests should not follow redirects. To get results from a `HEAD` request, IA requests should follow redirects. `GET` requests to IA will return results whether or not redirects are followed, but those results differ."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Normalising Timegate responses and queries\n", "\n", "Here's some code to smooth out the differences between systems, and return Memento data as a Python dictionary. Specifically it:\n", "\n", "* Inserts the current date into requests to the UKWA or NZWA if no date is specified. This means they behave like the other repositories that return a recent Memento.\n", "* Follows redirects for requests to the IA.\n", "* Falls back to a `prev memento`, `next memento`, or `last memento` value if there is no `memento` value in the response (as sometimes happens with NZWA)." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "def query_timegate(timegate, url, date=None, tz='Australia/Canberra'):\n", "    '''\n", "    Query the specified repository for a Memento.\n", "    '''\n", "    headers = {}\n", "    if date:\n", "        formatted_date = format_date_for_headers(date, tz)\n", "        headers['Accept-Datetime'] = formatted_date\n", "    # UKWA & NZWA don't seem to default to latest date if no date supplied\n", "    # (the keys here must match the keys used in TIMEGATES)\n", "    elif timegate in ['ukwa', 'nzwa']:\n", "        formatted_date = format_date_for_headers(arrow.utcnow().format('YYYY-MM-DD'), tz)\n", "        headers['Accept-Datetime'] = formatted_date\n", "    # Note that you don't get a timegate response if you leave off the trailing slash, but extras don't hurt!\n", "    tg_url = f'{TIMEGATES[timegate]}{url}/' if not url.endswith('/') else f'{TIMEGATES[timegate]}{url}'\n", "    # print(tg_url)\n", "    # IA only works if redirects are followed -- this defaults to False with HEAD requests...\n", "    allow_redirects = timegate == 'ia'\n", "    response = requests.head(tg_url, headers=headers, allow_redirects=allow_redirects)\n", "    return parse_links_from_headers(response)\n", "\n", "def get_memento(timegate, url, date=None, tz='Australia/Canberra'):\n", "    '''\n", "    If there's no memento in the results, look for an 
alternative.\n", "    '''\n", "    links = query_timegate(timegate, url, date, tz)\n", "    # Default to None so there's always a value to return, even if no links come back\n", "    memento = None\n", "    # NZWA doesn't always seem to return a Memento, so we'll build in some fuzziness\n", "    if links:\n", "        if 'memento' in links:\n", "            memento = links['memento']\n", "        elif 'prev memento' in links:\n", "            memento = links['prev memento']\n", "        elif 'next memento' in links:\n", "            memento = links['next memento']\n", "        elif 'last memento' in links:\n", "            memento = links['last memento']\n", "    return memento" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can request a Memento from any of the four repositories and get back the results as a Python dictionary. You can see this code in action in the [Get full page screenshots from archived web pages](save_screenshot.ipynb) notebook." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'original': 'http://bl.uk/',\n", " 'timegate': 'https://www.webarchive.org.uk/wayback/archive/http://bl.uk/',\n", " 'timemap': 'https://www.webarchive.org.uk/wayback/archive/timemap/link/http://bl.uk/',\n", " 'memento': 'https://www.webarchive.org.uk/wayback/archive/20150101010000mp_/http://bl.uk/'}" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query_timegate('ukwa', 'http://bl.uk', date='2015-01-01')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or, if we just want the url of a Memento, we can use `get_memento()`, which falls back to alternative values if `memento` is missing."
] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/20200130060106/http://natlib.govt.nz/'" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "get_memento('nzwa', 'http://natlib.govt.nz')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "\n", "## Timemaps\n", "\n", "Memento Timemaps provide machine-processable lists of web page captures from a particular archive. They are available from both OpenWayback and Pywb systems, though there are some differences. The [Pywb documentation](https://pywb.readthedocs.io/en/latest/manual/memento.html#timemap-api) notes that the following formats are available:\n", "\n", "* link – returns an application/link-format as required by the Memento spec\n", "* cdxj – returns a timemap in the native CDXJ format\n", "* json – returns the timemap as newline-delimited JSON lines (NDJSON) format\n", "\n", "Timemaps are requested using a url with the following format:\n", "\n", "```\n", "http://[address.of.archive]/[collection]/timemap/[format]/[web page url]\n", "```\n", "\n", "So if you wanted to query the Australian Web Archive to get a list of captures in JSON format from http://nla.gov.au/ you'd use this url:\n", "\n", "```\n", "https://web.archive.org.au/awa/timemap/json/http://nla.gov.au/\n", "```\n", "\n", "The examples below show how the format and behaviour of Timemaps vary slightly across the four repositories we're interested in." 
] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "def get_timemap(timegate, url, format='json'):\n", "    '''\n", "    Basic function to get a Timemap for the supplied url.\n", "    '''\n", "    tg_url = f'{TIMEGATES[timegate]}timemap/{format}/{url}/'\n", "    response = requests.get(tg_url)\n", "    # Show the content-type\n", "    print(response.headers['content-type'])\n", "    return response.text" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### National Library of Australia\n", "\n", "Request a Timemap in `link` format. Note that response headers include `content-type` of `application/link-format`." ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "application/link-format\n", "; rel=\"self\"; type=\"application/link-format\"; from=\"Wed, 06 Dec 2000 21:15:00 GMT\",\n", "; rel=\"timegate\",\n", "; rel=\"original\",\n", "; rel=\"memento\"; datetime=\"Wed, 06 Dec 2000 21:15:00 GMT\"; collection=\"awa\",\n", "; rel=\"memento\"; datetime=\"Thu, 18 Jan 2001 20:36:00 GMT\"; collection=\"awa\",\n" ] } ], "source": [ "timemap = get_timemap('awa', 'http://www.gov.au', 'link')\n", "# Show the first 5 lines\n", "print('\\n'.join(timemap.splitlines()[:5]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "\n", "Request a Timemap in `json` format. This returns `ndjson` (Newline Delimited JSON) – each capture is a JSON object, separated by a line break. Note that the response headers include `content-type` of `text/x-ndjson`." 
] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "text/x-ndjson\n", "{\"urlkey\": \"au,gov,aph)/senate/committee/eet_ctte/uni_finances/report/index.htm\", \"timestamp\": \"20031122074837\", \"url\": \"http://www.aph.gov.au/senate/committee/EET_CTTE/uni_finances/report/index.htm\", \"mime\": \"text/html\", \"status\": \"200\", \"digest\": \"3H5Z77RMYKXRFCE6ODTWAKMBPZOGJ4YE\", \"offset\": \"97170362\", \"filename\": \"NLA-EXTRACTION-1996-2004-ARCS-PART-01336-000000.arc.gz\", \"source\": \"awa\", \"source-coll\": \"awa\"}\n" ] } ], "source": [ "timemap = get_timemap('awa', 'http://www.aph.gov.au/Senate/committee/eet_ctte/uni_finances/report/index.htm', 'json')\n", "# Show the first line\n", "print('\\n'.join(timemap.splitlines()[:1]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "\n", "Request a Timemap in `cdxj` format. Note that response headers include `content-type` of `text/x-cdxj`." ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "text/x-cdxj\n", "au,gov,aph)/senate/committee/eet_ctte/uni_finances/report/index.htm 20031122074837 {\"url\": \"http://www.aph.gov.au/senate/committee/EET_CTTE/uni_finances/report/index.htm\", \"mime\": \"text/html\", \"status\": \"200\", \"digest\": \"3H5Z77RMYKXRFCE6ODTWAKMBPZOGJ4YE\", \"offset\": \"97170362\", \"filename\": \"NLA-EXTRACTION-1996-2004-ARCS-PART-01336-000000.arc.gz\", \"source\": \"awa\", \"source-coll\": \"awa\"}\n" ] } ], "source": [ "timemap = get_timemap('awa', 'http://www.aph.gov.au/Senate/committee/eet_ctte/uni_finances/report/index.htm', 'cdxj')\n", "# Show the first line\n", "print('\\n'.join(timemap.splitlines()[:1]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### UK Web Archive\n", "\n", "Request a Timemap in `link` format. 
Note that response headers include `content-type` of `application/link-format`." ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "application/link-format\n", "; rel=\"self\"; type=\"application/link-format\"; from=\"Tue, 30 Oct 2001 00:00:19 GMT\",\n", "; rel=\"timegate\",\n", "; rel=\"original\",\n", "; rel=\"memento\"; datetime=\"Tue, 30 Oct 2001 00:00:19 GMT\"; collection=\"archive\",\n", "; rel=\"memento\"; datetime=\"Tue, 13 Nov 2001 00:00:00 GMT\"; collection=\"archive\",\n" ] } ], "source": [ "timemap = get_timemap('ukwa', 'http://bl.uk', 'link')\n", "print('\\n'.join(timemap.splitlines()[:5]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "\n", "Request a Timemap in `json` format. This returns `ndjson` (Newline Delimited JSON) – each capture is a JSON object, separated by a line break. Note that the response headers include `content-type` of `text/x-ndjson`." ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "text/x-ndjson\n", "{\"urlkey\": \"uk,bl)/\", \"timestamp\": \"20011030000019\", \"url\": \"http://www.bl.uk/\", \"mime\": \"text/html\", \"status\": \"200\", \"digest\": \"JN4RHYLNGS7X64HADIX3XHDIMYDBLAAW\", \"redirect\": \"-\", \"robotflags\": \"-\", \"length\": \"0\", \"offset\": \"10813988\", \"filename\": \"/data/102148/31031347/WARCS/BL-31031347.warc.gz\", \"load_url\": \"https://www.webarchive.org.uk/wayback/archive/20011030000019id_/http://www.bl.uk/\", \"source\": \"archive\", \"source-coll\": \"archive\", \"access\": \"allow\"}\n" ] } ], "source": [ "timemap = get_timemap('ukwa', 'http://bl.uk', 'json')\n", "print('\\n'.join(timemap.splitlines()[:1]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "\n", "Request a Timemap in `cdxj` format. Note that response headers include `content-type` of `text/x-cdxj`." 
] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "text/x-cdxj\n", "uk,bl)/ 20011030000019 {\"url\": \"http://www.bl.uk/\", \"mime\": \"text/html\", \"status\": \"200\", \"digest\": \"JN4RHYLNGS7X64HADIX3XHDIMYDBLAAW\", \"redirect\": \"-\", \"robotflags\": \"-\", \"length\": \"0\", \"offset\": \"10813988\", \"filename\": \"/data/102148/31031347/WARCS/BL-31031347.warc.gz\", \"load_url\": \"https://www.webarchive.org.uk/wayback/archive/20011030000019id_/http://www.bl.uk/\", \"source\": \"archive\", \"source-coll\": \"archive\", \"access\": \"allow\"}\n" ] } ], "source": [ "timemap = get_timemap('ukwa', 'http://bl.uk', 'cdxj')\n", "print('\\n'.join(timemap.splitlines()[:1]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### National Library of New Zealand\n", "\n", "Request a Timemap in `link` format. Note that response headers include `content-type` of `application/link-format`." ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "application/link-format\n", "; rel=\"original\",\n", "; rel=\"self\"; type=\"application/link-format\"; from=\"Sun, 11 Jul 2004 21:32:25 GMT\"; until=\"Thu, 30 Jan 2020 06:01:11 GMT\",\n", "; rel=\"timegate\",\n", "; rel=\"first memento\"; datetime=\"Sun, 11 Jul 2004 21:32:25 GMT\",\n", "; rel=\"memento\"; datetime=\"Tue, 04 Jul 2006 03:31:35 GMT\",\n" ] } ], "source": [ "timemap = get_timemap('nzwa', 'http://natlib.govt.nz', 'link')\n", "print('\\n'.join(timemap.splitlines()[:5]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "\n", "A request for a Timemap in `json` returns results in `link` format. OpenWayback only supports the `link` format." 
] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "application/link-format\n", "; rel=\"original\",\n", "; rel=\"self\"; type=\"application/link-format\"; from=\"Sun, 11 Jul 2004 21:32:25 GMT\"; until=\"Thu, 30 Jan 2020 06:01:11 GMT\",\n", "; rel=\"timegate\",\n", "; rel=\"first memento\"; datetime=\"Sun, 11 Jul 2004 21:32:25 GMT\",\n", "; rel=\"memento\"; datetime=\"Tue, 04 Jul 2006 03:31:35 GMT\",\n" ] } ], "source": [ "timemap = get_timemap('nzwa', 'http://natlib.govt.nz', 'json')\n", "print('\\n'.join(timemap.splitlines()[:5]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Internet Archive\n", "\n", "Request a Timemap in `link` format. Note that response headers include `content-type` of `application/link-format`." ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "application/link-format\n", "; rel=\"original\",\n", "; rel=\"self\"; type=\"application/link-format\"; from=\"Sun, 06 Dec 1998 01:22:33 GMT\",\n", "; rel=\"timegate\",\n", "; rel=\"first memento\"; datetime=\"Sun, 06 Dec 1998 01:22:33 GMT\",\n", "; rel=\"memento\"; datetime=\"Sat, 12 Dec 1998 02:44:10 GMT\",\n" ] } ], "source": [ "timemap = get_timemap('ia', 'http://discontents.com.au', 'link')\n", "print('\\n'.join(timemap.splitlines()[:5]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "\n", "Request for timemap in `json` format returns results in JSON as an array of arrays, where the first row provides the column headings. Response headers include `content-type` of `application/json`." 
] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "application/json\n", "[[\"urlkey\",\"timestamp\",\"original\",\"mimetype\",\"statuscode\",\"digest\",\"redirect\",\"robotflags\",\"length\",\"offset\",\"filename\"],\n", "[\"au,com,discontents)/\",\"19981206012233\",\"http://www.discontents.com.au:80/\",\"text/html\",\"200\",\"FQJ6JMPIZ7WEKYPQ4SGPVHF57GCV6B36\",\"-\",\"-\",\"1610\",\"43993900\",\"green-0133-19990218235953-919455657-c/green-0141-912907270.arc.gz\"],\n", "[\"au,com,discontents)/\",\"19981212024410\",\"http://www.discontents.com.au:80/\",\"text/html\",\"200\",\"FQJ6JMPIZ7WEKYPQ4SGPVHF57GCV6B36\",\"-\",\"-\",\"1613\",\"17792789\",\"slash-913417727-c/slash-913430608.arc.gz\"],\n", "[\"au,com,discontents)/\",\"19990125094813\",\"http://www.discontents.com.au:80/\",\"text/html\",\"200\",\"FQJ6JMPIZ7WEKYPQ4SGPVHF57GCV6B36\",\"-\",\"-\",\"1613\",\"11419234\",\"slash-913417727-c/slash_19990124232053-917257670.arc.gz\"],\n", "[\"au,com,discontents)/\",\"19990208004052\",\"http://discontents.com.au:80/\",\"text/html\",\"200\",\"FQJ6JMPIZ7WEKYPQ4SGPVHF57GCV6B36\",\"-\",\"-\",\"1612\",\"13269748\",\"slash-913417727-c/slash-918434425.arc.gz\"],\n" ] } ], "source": [ "timemap = get_timemap('ia', 'http://discontents.com.au', 'json')\n", "print('\\n'.join(timemap.splitlines()[:5]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "\n", "Request for timemap in `cdxj` returns results in plain text, with fields separated by spaces, and captures separated by line breaks. Response headers include `content-type` of `text/plain`." 
] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "text/plain\n", "au,com,discontents)/ 19981206012233 http://www.discontents.com.au:80/ text/html 200 FQJ6JMPIZ7WEKYPQ4SGPVHF57GCV6B36 - - 1610 43993900 green-0133-19990218235953-919455657-c/green-0141-912907270.arc.gz\n", "au,com,discontents)/ 19981212024410 http://www.discontents.com.au:80/ text/html 200 FQJ6JMPIZ7WEKYPQ4SGPVHF57GCV6B36 - - 1613 17792789 slash-913417727-c/slash-913430608.arc.gz\n", "au,com,discontents)/ 19990125094813 http://www.discontents.com.au:80/ text/html 200 FQJ6JMPIZ7WEKYPQ4SGPVHF57GCV6B36 - - 1613 11419234 slash-913417727-c/slash_19990124232053-917257670.arc.gz\n", "au,com,discontents)/ 19990208004052 http://discontents.com.au:80/ text/html 200 FQJ6JMPIZ7WEKYPQ4SGPVHF57GCV6B36 - - 1612 13269748 slash-913417727-c/slash-918434425.arc.gz\n", "au,com,discontents)/ 19990208012714 http://www.discontents.com.au:80/ text/html 200 FQJ6JMPIZ7WEKYPQ4SGPVHF57GCV6B36 - - 1613 17395194 slash-913417727-c/slash-918437200.arc.gz\n" ] } ], "source": [ "timemap = get_timemap('ia', 'http://discontents.com.au', 'cdxj')\n", "print('\\n'.join(timemap.splitlines()[:5]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Differences in field labels" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we compare the Pywb JSON output with the IA Wayback output, we see there are also some differences in the field labels. In particular `original` in IA Wayback is just `url` in Pywb, while `statuscode` and `mimetype` are shortened to `status` and `mime` in Pywb." 
] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "application/json\n" ] }, { "data": { "text/plain": [ "['urlkey',\n", " 'timestamp',\n", " 'original',\n", " 'mimetype',\n", " 'statuscode',\n", " 'digest',\n", " 'redirect',\n", " 'robotflags',\n", " 'length',\n", " 'offset',\n", " 'filename']" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "timemap = get_timemap('ia', 'http://bl.uk', 'json')\n", "data = json.loads(timemap)\n", "data[0]" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "text/x-ndjson\n" ] }, { "data": { "text/plain": [ "['urlkey',\n", " 'timestamp',\n", " 'url',\n", " 'mime',\n", " 'status',\n", " 'digest',\n", " 'redirect',\n", " 'robotflags',\n", " 'length',\n", " 'offset',\n", " 'filename',\n", " 'load_url',\n", " 'source',\n", " 'source-coll',\n", " 'access']" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "timemap = get_timemap('ukwa', 'http://bl.uk', 'json')\n", "data = [json.loads(line) for line in timemap.splitlines()]\n", "list(data[0].keys())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Summarising the differences\n", "\n", "The good news is that all repositories provide Timemaps in the standard `link` format as required by the Memento specification. However, there's more variation when it comes to other formats.\n", "\n", "* NLNZ only provides the `link` format.\n", "* IA's `json` format is different to the Pywb format from UKWA and NLA.\n", "* IA uses different labels for some values." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Normalising Timemaps\n", "\n", "With the information above we can construct some functions to return normalised Timemap results as JSON. 
To do this we need to:\n", "\n", "* Convert the `link` format from NLNZ to JSON\n", "* Restructure the JSON output from IA to match the Pywb format\n", "* Change some of the column headings in the IA data to match the Pywb format\n", "\n", "Because the `link` format provides less information than the `json` format, we could also try to enrich the NLNZ data by requesting more information about individual Mementos." ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [], "source": [ "def convert_lists_to_dicts(results):\n", "    '''\n", "    Converts IA style timemap (a JSON array of arrays) to a list of dictionaries.\n", "    Renames keys to standardise IA with other Timemaps.\n", "    '''\n", "    if results:\n", "        keys = results[0]\n", "        results_as_dicts = [dict(zip(keys, v)) for v in results[1:]]\n", "    else:\n", "        results_as_dicts = results\n", "    for d in results_as_dicts:\n", "        d['status'] = d.pop('statuscode')\n", "        d['mime'] = d.pop('mimetype')\n", "        d['url'] = d.pop('original')\n", "    return results_as_dicts\n", "\n", "def get_capture_data_from_memento(url, request_type='head'):\n", "    '''\n", "    For OpenWayback systems this can get some extra capture info to insert into Timemaps.\n", "    '''\n", "    if request_type == 'head':\n", "        response = requests.head(url)\n", "    else:\n", "        response = requests.get(url)\n", "    headers = response.headers\n", "    length = headers.get('x-archive-orig-content-length')\n", "    status = headers.get('x-archive-orig-status')\n", "    status = status.split(' ')[0] if status else None\n", "    mime = headers.get('x-archive-orig-content-type')\n", "    mime = mime.split(';')[0] if mime else None\n", "    return {'length': length, 'status': status, 'mime': mime}\n", "\n", "def convert_link_to_json(results, enrich_data=False):\n", "    '''\n", "    Converts link formatted Timemap to JSON.\n", "    '''\n", "    data = []\n", "    for line in results.splitlines():\n", "        parts = line.split('; ')\n", "        if len(parts) > 1:\n", "            rel = re.search(r'rel=\"([^\"]*)\"', parts[1])\n", "            # The first and last captures are labelled 'first memento' and 'last memento',\n", "            # so keep any relation that ends with 'memento', not just plain 'memento'\n", "            if rel and rel.group(1).endswith('memento'):\n", "                link = parts[0].strip('<>')\n", "                timestamp, original = re.search(r'/(\\d{14})/(.*)$', link).groups()\n", "                capture = {'timestamp': timestamp, 'url': original}\n", "                if enrich_data:\n", "                    capture.update(get_capture_data_from_memento(link))\n", "                data.append(capture)\n", "    return data\n", "\n", "def get_timemap_as_json(timegate, url):\n", "    '''\n", "    Get a Timemap then normalise results (if necessary) to return a list of dicts.\n", "    '''\n", "    tg_url = f'{TIMEGATES[timegate]}timemap/json/{url}/'\n", "    response = requests.get(tg_url)\n", "    response_type = response.headers['content-type']\n", "    print(response_type)\n", "    # Default to an empty list in case the content type isn't one we recognise\n", "    data = []\n", "    if response_type == 'text/x-ndjson':\n", "        data = [json.loads(line) for line in response.text.splitlines()]\n", "    elif response_type == 'application/json':\n", "        data = convert_lists_to_dicts(response.json())\n", "    elif response_type in ['application/link-format', 'text/html;charset=utf-8']:\n", "        data = convert_link_to_json(response.text)\n", "    return data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Now we can get information about captures in a standardised JSON format from all four repositories. However, we can't rely on NLNZ data having anything more than `timestamp` and `url` for each capture. 
You can see this in action in the [Display changes in the text of an archived web page over time](display-text-changes-from-timemap.ipynb) notebook." ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "text/x-ndjson\n" ] }, { "data": { "text/plain": [ "{'urlkey': 'uk,bl)/',\n", " 'timestamp': '20011030000019',\n", " 'url': 'http://www.bl.uk/',\n", " 'mime': 'text/html',\n", " 'status': '200',\n", " 'digest': 'JN4RHYLNGS7X64HADIX3XHDIMYDBLAAW',\n", " 'redirect': '-',\n", " 'robotflags': '-',\n", " 'length': '0',\n", " 'offset': '10813988',\n", " 'filename': '/data/102148/31031347/WARCS/BL-31031347.warc.gz',\n", " 'load_url': 'https://www.webarchive.org.uk/wayback/archive/20011030000019id_/http://www.bl.uk/',\n", " 'source': 'archive',\n", " 'source-coll': 'archive',\n", " 'access': 'allow'}" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "timemap = get_timemap_as_json('ukwa', 'http://bl.uk')\n", "timemap[0]" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "application/json\n" ] }, { "data": { "text/plain": [ "{'urlkey': 'uk,bl)/',\n", " 'timestamp': '19970218190613',\n", " 'digest': 'Z42UMUL76GODKO3EMNSLXDTCST66VDAX',\n", " 'redirect': '-',\n", " 'robotflags': '-',\n", " 'length': '1208',\n", " 'offset': '19524651',\n", " 'filename': 'GR-001114-c/GR-002277.arc.gz',\n", " 'status': '200',\n", " 'mime': 'text/html',\n", " 'url': 'http://www.bl.uk:80/'}" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "timemap = get_timemap_as_json('ia', 'http://bl.uk')\n", "timemap[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "\n", "## Mementos\n", "\n", "You can also modify the url of a Memento to change the way it's presented. 
In particular, adding `id_` after the timestamp will tell the server that you want the original harvested version of the webpage, without any rewriting of links or web archive navigation features. For example:\n", "\n", "```\n", "https://web.archive.org.au/awa/20200302223537id_/http://discontents.com.au/\n", "```\n", "\n", "This works with all four repositories. However, note that for the Australian Web Archive you need to use the `web.archive.org.au` domain, not `webarchive.nla.gov.au`.\n", "\n", "In addition, NLNZ and IA both support the `if_` option, which provides a view of the archived page without web archive headers or navigation inserted, but with links to CSS, JS, and images rewritten to point to archived versions. This is as close as you can get to looking at the original page, and I've used it in the [Get full page screenshots from archived web pages](save_screenshot.ipynb) notebook. Note that if you add `if_` to requests to the UKWA or the NLA you'll be redirected to the standard view, with the original page framed by the web archive navigation.\n", "\n", "Pywb's page on [url rewriting](https://pywb.readthedocs.io/en/latest/manual/rewriter.html?highlight=id_#url-rewriting) has some useful information about this.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "Created by [Tim Sherratt](https://timsherratt.org) for the [GLAM Workbench](https://glam-workbench.github.io).\n", "\n", "Work on this notebook was supported by the [IIPC Discretionary Funding Programme 2019-2020](http://netpreserve.org/projects/)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, 
"version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 4 }