# Getting data from web archives using Memento

<p class="alert alert-info">New to Jupyter notebooks? Try <a href="getting-started/Using_Jupyter_notebooks.ipynb"><b>Using Jupyter notebooks</b></a> for a quick introduction.</p>

Systems supporting the Memento protocol provide machine-readable information about web archive captures, even if other APIs are not available. In this notebook we'll look at the way the Memento protocol is supported across five web archive repositories – the UK Web Archive, the National Library of Australia, the National Library of New Zealand, the Internet Archive, and the UK Government Web Archive. In particular we'll examine:

* [Timegates](#Timegates) – request web page captures from (around) a particular date
* [Timemaps](#Timemaps) – request a list of web archive captures from a particular url
* [Mementos](#Mementos) – use url modifiers to change the way an archived web page is presented

Notebooks using Timegates or Timemaps to access capture data include:

* [Get the archived version of a page closest to a particular date](get_a_memento.ipynb)
* [Find all the archived versions of a web page](find_all_captures.ipynb)
* [Harvesting collections of text from archived web pages](getting_text_from_web_pages.ipynb)
* [Compare two versions of an archived web page](show_diffs.ipynb)
* [Create and compare full page screenshots from archived web pages](save_screenshot.ipynb)
* [Using screenshots to visualise change in a page over time](screenshots_over_time_using_timemaps.ipynb)
* [Display changes in the text of an archived web page over time](display-text-changes-from-timemap.ipynb)
* [Find when a piece of text appears in an archived web page](find-text-in-page-from-timemap.ipynb)

## Useful tools and documentation
* [Memento Protocol Specification](https://tools.ietf.org/html/rfc7089)
* [Pywb Memento implementation](https://pywb.readthedocs.io/en/latest/manual/memento.html)
* [Memento support in IA Wayback](https://ws-dl.blogspot.com/2013/07/2013-07-15-wayback-machine-upgrades.html)
* [Time Travel APIs](https://timetravel.mementoweb.org/guide/api/)
* [Memento Compliance Audit of PyWB](https://ws-dl.blogspot.com/2020/03/2020-03-26-memento-compliance-audit-of.html)
* [Memento tools](http://mementoweb.org/tools/)
* [Memento client](https://github.com/mementoweb/py-memento-client)
* [Memgator](https://github.com/oduwsdl/MemGator) – Memento aggregator

In [1]:
import json
import re

import arrow
import requests

# Alternatively use the python Memento client

In [2]:
# These are the repositories we'll be using
TIMEGATES = {
    "awa": "https://web.archive.org.au/awa/",
    "nzwa": "https://ndhadeliver.natlib.govt.nz/webarchive/",
    "ukwa": "https://www.webarchive.org.uk/wayback/archive/",
    "ia": "https://web.archive.org/web/",
    "ukgwa": "https://webarchive.nationalarchives.gov.uk/ukgwa/"
}

## Timegates

Timegates let you query a web archive for the capture closest to a specific date. You do this by supplying your target date as the `Accept-Datetime` value in the headers of your request.
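
As a minimal sketch of what that header value looks like, the standard library can format a UTC datetime in the HTTP date style used by `Accept-Datetime`. The helper name `accept_datetime_value` is made up for this illustration and is not part of any Memento library.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def accept_datetime_value(dt):
    """Format a timezone-aware datetime as an HTTP date string in GMT."""
    # usegmt=True only accepts datetimes whose tzinfo is UTC, so convert first
    return format_datetime(dt.astimezone(timezone.utc), usegmt=True)


# Build a headers dictionary that could be passed to an HTTP client
headers = {
    "Accept-Datetime": accept_datetime_value(
        datetime(2010, 1, 1, 1, 0, 0, tzinfo=timezone.utc)
    )
}
print(headers)
```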

For example, if you wanted to query the Australian Web Archive to find the version of `http://nla.gov.au/` that was captured as close as possible to 1 January 2010, you'd set the `Accept-Datetime` header to 'Fri, 01 Jan 2010 01:00:00 GMT' and request the url:

```
https://web.archive.org.au/awa/http://nla.gov.au/
```

A `get` request will return the captured page, but if all you want is the url of the archived page you can use a `head` request and extract the information you need from the response headers. Try this:

In [3]:
response = requests.head(
    "https://web.archive.org.au/awa/http://nla.gov.au/",
    headers={"Accept-Datetime": "Fri, 01 Jan 2010 01:00:00 GMT"},
)
response.headers

{'Server': 'nginx', 'Date': 'Thu, 23 Mar 2023 15:03:12 GMT', 'Content-Length': '0', 'Connection': 'keep-alive', 'Location': 'https://web.archive.org.au/awa/20100205144751/http://www.nla.gov.au/', 'Link': '<http://www.nla.gov.au/>; rel="original", <https://web.archive.org.au/awa/http://www.nla.gov.au/>; rel="timegate", <https://web.archive.org.au/awa/timemap/link/http://www.nla.gov.au/>; rel="timemap"; type="application/link-format", <https://web.archive.org.au/awa/20100205144751mp_/http://www.nla.gov.au/>; rel="memento"; datetime="Fri, 05 Feb 2010 14:47:51 GMT"', 'Vary': 'accept-datetime'}

The request above returns the following headers:

``` python
{
    'Server': 'nginx', 
    'Date': 'Wed, 06 May 2020 04:34:50 GMT', 
    'Content-Length': '0', 'Connection': 'keep-alive', 
    'Location': 'https://web.archive.org.au/awa/20100205144227/http://nla.gov.au/', 
    'Link': '<http://nla.gov.au/>; rel="original", <https://web.archive.org.au/awa/http://nla.gov.au/>; rel="timegate", <https://web.archive.org.au/awa/timemap/link/http://nla.gov.au/>; rel="timemap"; type="application/link-format", <https://web.archive.org.au/awa/20100205144227mp_/http://nla.gov.au/>; rel="memento"; datetime="Fri, 05 Feb 2010 14:42:27 GMT"', 
    'Vary': 'accept-datetime'
}
```

The `Link` header contains the Memento information. You can see that it's actually providing information on four types of link:

* the `original` url (ie the url that was archived) – `<http://nla.gov.au/>`
* the `timegate` for the harvested url (which is what we just used) – `<https://web.archive.org.au/awa/http://nla.gov.au/>`
* the `timemap` for the harvested url (we'll look at this below) – `<https://web.archive.org.au/awa/timemap/link/http://nla.gov.au/>`
* the `memento` – `<https://web.archive.org.au/awa/20100205144227mp_/http://nla.gov.au/>`

The `memento` link is the capture closest in time to the date we requested. In this case there's only about a month's difference, but of course this will depend on how frequently a url is captured. Opening the link will display the capture in the web archive. As we'll see below, some systems provide additional links such as `first memento`, `last memento`, `prev memento`, and `next memento`.
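
To make the structure of the `Link` header concrete, here is a rough standard-library sketch that splits a header value into a dictionary keyed by `rel`. The function name `parse_link_header` is hypothetical, and the regular expression assumes every parameter value is wrapped in double quotes, as in the example response above. The functions later in this notebook rely on the `links` attribute of the `requests` response instead.

```python
import re


def parse_link_header(value):
    """Return a dict mapping each rel value to its url and other parameters."""
    links = {}
    # Each link looks like: <url>; key="value"; key="value"
    for match in re.finditer(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)', value):
        url, raw_params = match.groups()
        params = dict(re.findall(r'([\w-]+)="([^"]*)"', raw_params))
        rel = params.pop("rel", None)
        if rel:
            links[rel] = {"url": url, **params}
    return links


# The Link header value from the example response above
link_header = (
    '<http://nla.gov.au/>; rel="original", '
    '<https://web.archive.org.au/awa/http://nla.gov.au/>; rel="timegate", '
    '<https://web.archive.org.au/awa/timemap/link/http://nla.gov.au/>; '
    'rel="timemap"; type="application/link-format", '
    '<https://web.archive.org.au/awa/20100205144227mp_/http://nla.gov.au/>; '
    'rel="memento"; datetime="Fri, 05 Feb 2010 14:42:27 GMT"'
)

links = parse_link_header(link_header)
for rel, details in links.items():
    print(rel, details["url"])
```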

Here are some functions to query a timegate in one of the five systems we're exploring. We'll use them to compare the results we get from each.

In [4]:
def format_date_for_headers(iso_date, tz):
    """
    Convert an ISO date (YYYY-MM-DD) to a datetime at noon in the specified timezone.
    Convert the datetime to UTC and format as required by Accept-Datetime headers:
    eg Fri, 23 Mar 2007 01:00:00 GMT
    """
    local = arrow.get(f"{iso_date} 12:00:00 {tz}", "YYYY-MM-DD HH:mm:ss ZZZ")
    gmt = local.to("utc")
    return f'{gmt.format("ddd, DD MMM YYYY HH:mm:ss")} GMT'


def parse_links_from_headers(response):
    """
    Extract original, timegate, timemap, and memento links from 'Link' header.
    """
    links = response.links
    return {k: v["url"] for k, v in links.items()}


def format_timestamp(timestamp, date_format="YYYY-MM-DD HH:mm:ss"):
    return arrow.get(timestamp, "YYYYMMDDHHmmss").format(date_format)


def test_timegate(
    timegate,
    url,
    date=None,
    tz="Australia/Canberra",
    request_type="head",
    allow_redirects=True,
):
    headers = {}
    if date:
        formatted_date = format_date_for_headers(date, tz)
        headers["Accept-Datetime"] = formatted_date
    # Note that you don't get a timegate response if you leave off the trailing slash
    tg_url = (
        f"{TIMEGATES[timegate]}{url}/"
        if not url.endswith("/")
        else f"{TIMEGATES[timegate]}{url}"
    )
    print(tg_url)
    if request_type == "head":
        response = requests.head(
            tg_url, headers=headers, allow_redirects=allow_redirects
        )
    else:
        response = requests.get(
            tg_url, headers=headers, allow_redirects=allow_redirects
        )
    response.raise_for_status()
    # print(response.headers)
    return parse_links_from_headers(response)

### Australian Web Archive

A `HEAD` request that follows redirects returns no results.

In [5]:
result = test_timegate("awa", "http://www.nla.gov.au")

# Test for expected result
assert result == {}

result

https://web.archive.org.au/awa/http://www.nla.gov.au/


{}

----
A `HEAD` request that doesn't follow redirects returns results as expected

In [6]:
result = test_timegate("awa", "http://www.nla.gov.au", allow_redirects=False)

# Test for expected result
assert "memento" in result

result

https://web.archive.org.au/awa/http://www.nla.gov.au/


{'original': 'https://www.nla.gov.au/',
 'timegate': 'https://web.archive.org.au/awa/https://www.nla.gov.au/',
 'timemap': 'https://web.archive.org.au/awa/timemap/link/https://www.nla.gov.au/',
 'memento': 'https://web.archive.org.au/awa/20230303002359mp_/https://www.nla.gov.au/'}

----
A query without an `Accept-Datetime` value returns a recent capture.

In [7]:
result = test_timegate("awa", "http://www.nla.gov.au", allow_redirects=False)

# Test for expected result
assert "memento" in result

result

https://web.archive.org.au/awa/http://www.nla.gov.au/


{'original': 'https://www.nla.gov.au/',
 'timegate': 'https://web.archive.org.au/awa/https://www.nla.gov.au/',
 'timemap': 'https://web.archive.org.au/awa/timemap/link/https://www.nla.gov.au/',
 'memento': 'https://web.archive.org.au/awa/20230303002359mp_/https://www.nla.gov.au/'}

----

A query with an `Accept-Datetime` value of 1 January 2002 returns a capture from 20 January 2002.

In [8]:
result = test_timegate(
    "awa", "http://www.education.gov.au/", date="2002-01-01", allow_redirects=False
)

# Test for expected result
assert "memento" in result
assert "20020120" in result["memento"]

result

https://web.archive.org.au/awa/http://www.education.gov.au/


{'original': 'http://www.education.gov.au:80/',
 'timegate': 'https://web.archive.org.au/awa/http://www.education.gov.au:80/',
 'timemap': 'https://web.archive.org.au/awa/timemap/link/http://www.education.gov.au:80/',
 'memento': 'https://web.archive.org.au/awa/20020120171009mp_/http://www.education.gov.au:80/'}
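
To check how far a returned capture is from the requested date, you can pull the 14-digit timestamp out of the `memento` url. This is a sketch only: `capture_datetime` is a hypothetical helper, it assumes the timestamp is in UTC and is followed by an optional modifier such as `mp_`, and the requested time is set to 01:00 UTC, which is what noon on 1 January works out to in Canberra during daylight saving.

```python
import re
from datetime import datetime, timezone


def capture_datetime(memento_url):
    """Extract the 14-digit capture timestamp from a Memento url."""
    # The timestamp sits between slashes, optionally followed by a modifier like 'mp_'
    match = re.search(r"/(\d{14})[a-z_]*/", memento_url)
    if match is None:
        return None
    parsed = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    return parsed.replace(tzinfo=timezone.utc)


memento = "https://web.archive.org.au/awa/20020120171009mp_/http://www.education.gov.au:80/"
requested = datetime(2002, 1, 1, 1, 0, 0, tzinfo=timezone.utc)
captured = capture_datetime(memento)
gap = captured - requested
print(captured.isoformat(), gap.days)
```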

----

Using a `GET` rather than a `HEAD` request returns no Memento information when redirects are followed.

In [9]:
result = test_timegate(
    "awa", "http://www.education.gov.au/", date="2002-01-01", request_type="get"
)

# Test for expected result
assert result == {}

result

https://web.archive.org.au/awa/http://www.education.gov.au/


{}

----

Using a `GET` rather than a `HEAD` request returns Memento information when redirects are not followed.

In [10]:
result = test_timegate(
    "awa",
    "http://www.education.gov.au/",
    date="2002-01-01",
    request_type="get",
    allow_redirects=False,
)

# Test for expected result
assert "memento" in result

result

https://web.archive.org.au/awa/http://www.education.gov.au/


{'original': 'http://www.education.gov.au:80/',
 'timegate': 'https://web.archive.org.au/awa/http://www.education.gov.au:80/',
 'timemap': 'https://web.archive.org.au/awa/timemap/link/http://www.education.gov.au:80/',
 'memento': 'https://web.archive.org.au/awa/20020120171009mp_/http://www.education.gov.au:80/'}

### New Zealand Web Archive

Changing whether or not redirects are followed has no effect on any of these responses.

A query without an `Accept-Datetime` returns a recent capture.

In [None]:
result = test_timegate("nzwa", "http://natlib.govt.nz")

# Test for expected result
assert "memento" in result

result

----

A query with an `Accept-Datetime` value of 1 January 2005 returns a `memento` from July 2004.

In [None]:
result = test_timegate("nzwa", "http://natlib.govt.nz", date="2005-01-01")

# Test for expected result
assert "memento" in result
assert "20040711" in result["memento"]

result

----

A `GET` request returns the same results as a `HEAD` request.

In [None]:
result_head = test_timegate("nzwa", "http://natlib.govt.nz", date="2005-01-01")
result_get = test_timegate(
    "nzwa", "http://natlib.govt.nz", date="2005-01-01", request_type="get"
)

# Test for expected result
assert result_head == result_get

result_get

### Internet Archive

Using a `HEAD` request that follows redirects returns results as expected.

In [19]:
result = test_timegate("ia", "http://discontents.com.au")

# Test for expected result
assert "memento" in result
# IA responses have additional fields
assert "first memento" in result

result

https://web.archive.org/web/http://discontents.com.au/


{'original': 'http://discontents.com.au/',
 'timemap': 'https://web.archive.org/web/timemap/link/http://discontents.com.au/',
 'timegate': 'https://web.archive.org/web/http://discontents.com.au/',
 'first memento': 'https://web.archive.org/web/19981206012233/http://www.discontents.com.au:80/',
 'prev memento': 'https://web.archive.org/web/20230313181957/https://discontents.com.au/',
 'memento': 'https://web.archive.org/web/20230318003745/http://discontents.com.au/',
 'last memento': 'https://web.archive.org/web/20230318003745/http://discontents.com.au/'}

----
Using a `HEAD` request returns no Memento information if redirects are not followed.

In [16]:
result = test_timegate("ia", "http://discontents.com.au", allow_redirects=False)

# Test for expected result
assert result == {}

result

https://web.archive.org/web/http://discontents.com.au/


{}

----

A query without an `Accept-Datetime` value returns a `memento` and also includes a `first memento`, `prev memento`, and `last memento`.

In [17]:
result = test_timegate("ia", "http://discontents.com.au")

# Test for expected result
assert "memento" in result
# IA responses have additional fields
assert "first memento" in result

result

https://web.archive.org/web/http://discontents.com.au/


{'original': 'http://discontents.com.au/',
 'timemap': 'https://web.archive.org/web/timemap/link/http://discontents.com.au/',
 'timegate': 'https://web.archive.org/web/http://discontents.com.au/',
 'first memento': 'https://web.archive.org/web/19981206012233/http://www.discontents.com.au:80/',
 'prev memento': 'https://web.archive.org/web/20220323201952/http://www.discontents.com.au/',
 'memento': 'https://web.archive.org/web/20220331081122/http://discontents.com.au/',
 'last memento': 'https://web.archive.org/web/20220331081122/http://discontents.com.au/'}

----

A query with an `Accept-Datetime` value of 1 January 2010 returns a `memento` from 9 February 2010.

In [18]:
result = test_timegate("ia", "http://discontents.com.au", date="2010-01-01")

# Test for expected result
assert "memento" in result
assert "20100209" in result["memento"]

result

https://web.archive.org/web/http://discontents.com.au/


{'original': 'http://discontents.com.au:80/',
 'timemap': 'https://web.archive.org/web/timemap/link/http://discontents.com.au:80/',
 'timegate': 'https://web.archive.org/web/http://discontents.com.au:80/',
 'first memento': 'https://web.archive.org/web/19981206012233/http://www.discontents.com.au:80/',
 'prev memento': 'https://web.archive.org/web/20091030053520/http://discontents.com.au/',
 'memento': 'https://web.archive.org/web/20100209041537/http://discontents.com.au:80/',
 'next memento': 'https://web.archive.org/web/20100523101442/http://discontents.com.au:80/',
 'last memento': 'https://web.archive.org/web/20220331081122/http://discontents.com.au/'}

----
`GET` requests return different results if redirects are not followed.

In [19]:
result = test_timegate(
    "ia", "http://discontents.com.au", date="2010-01-01", request_type="get"
)
result_no_redirects = test_timegate(
    "ia",
    "http://discontents.com.au",
    date="2010-01-01",
    request_type="get",
    allow_redirects=False,
)

# Test for expected result
assert result != result_no_redirects

result_no_redirects

https://web.archive.org/web/http://discontents.com.au/
https://web.archive.org/web/http://discontents.com.au/


{'original': 'http://discontents.com.au/',
 'memento': 'https://web.archive.org/web/20100209041537/http://discontents.com.au/',
 'timemap': 'https://web.archive.org/web/timemap/link/http://discontents.com.au/'}

### UK Web Archive

Changing whether or not redirects are followed has no effect on any of these responses.

A query without an `Accept-Datetime` value returns a recent capture.

In [14]:
result = test_timegate("ukwa", "http://bl.uk")

# Test for expected result
assert "memento" in result

result

https://www.webarchive.org.uk/wayback/archive/http://bl.uk/


{'original': 'https://www.bl.uk/',
 'timegate': 'https://www.webarchive.org.uk/wayback/archive/https://www.bl.uk/',
 'timemap': 'https://www.webarchive.org.uk/wayback/archive/timemap/link/https://www.bl.uk/',
 'memento': 'https://www.webarchive.org.uk/wayback/archive/20230319105859mp_/https://www.bl.uk/'}

----

A query with an `Accept-Datetime` value of 1 January 2006 returns a `memento` from 4 May 2004.

In [21]:
result = test_timegate("ukwa", "http://bl.uk", date="2006-01-01")

# Test for expected result
assert "memento" in result
assert "20040504" in result["memento"]

result

https://www.webarchive.org.uk/wayback/archive/http://bl.uk/


{'original': 'http://www.bl.uk/',
 'timegate': 'https://www.webarchive.org.uk/wayback/archive/http://www.bl.uk/',
 'timemap': 'https://www.webarchive.org.uk/wayback/archive/timemap/link/http://www.bl.uk/',
 'memento': 'https://www.webarchive.org.uk/wayback/archive/20040504230000mp_/http://www.bl.uk/'}

----

A `GET` request returns the same results as a `HEAD` request.

In [22]:
result_head = test_timegate("ukwa", "http://bl.uk", date="2006-01-01")
result_get = test_timegate(
    "ukwa", "http://bl.uk", date="2006-01-01", request_type="get"
)

# Test for expected result
assert result_head == result_get

result_get

https://www.webarchive.org.uk/wayback/archive/http://bl.uk/
https://www.webarchive.org.uk/wayback/archive/http://bl.uk/


{'original': 'http://www.bl.uk/',
 'timegate': 'https://www.webarchive.org.uk/wayback/archive/http://www.bl.uk/',
 'timemap': 'https://www.webarchive.org.uk/wayback/archive/timemap/link/http://www.bl.uk/',
 'memento': 'https://www.webarchive.org.uk/wayback/archive/20040504230000mp_/http://www.bl.uk/'}

### UK Government Web Archive

Changing whether or not redirects are followed has no effect on any of these responses.

A query without an `Accept-Datetime` value returns a recent capture.

In [15]:
result = test_timegate("ukgwa", "https://www.nationalarchives.gov.uk/")

# Test for expected result
assert "memento" in result

result

https://webarchive.nationalarchives.gov.uk/ukgwa/https://www.nationalarchives.gov.uk/


{'original': 'https://www.nationalarchives.gov.uk/',
 'timegate': 'https://webarchive.nationalarchives.gov.uk/ukgwa/https://www.nationalarchives.gov.uk/',
 'timemap': 'https://webarchive.nationalarchives.gov.uk/ukgwa/timemap/link/https://www.nationalarchives.gov.uk/',
 'memento': 'https://webarchive.nationalarchives.gov.uk/ukgwa/20230311073241mp_/https://www.nationalarchives.gov.uk/'}

----

A query with an `Accept-Datetime` value of 1 January 2006 returns a `memento` from 13 February 2006.

In [20]:
result = test_timegate("ukgwa", "https://www.nationalarchives.gov.uk/", date="2006-01-01")

# Test for expected result
assert "memento" in result
assert "20060213" in result["memento"]

result

https://webarchive.nationalarchives.gov.uk/ukgwa/https://www.nationalarchives.gov.uk/


{'original': 'http://www.nationalarchives.gov.uk/',
 'timegate': 'https://webarchive.nationalarchives.gov.uk/ukgwa/http://www.nationalarchives.gov.uk/',
 'timemap': 'https://webarchive.nationalarchives.gov.uk/ukgwa/timemap/link/http://www.nationalarchives.gov.uk/',
 'memento': 'https://webarchive.nationalarchives.gov.uk/ukgwa/20060213205514mp_/http://www.nationalarchives.gov.uk/'}

----

A `GET` request returns the same results as a `HEAD` request.

In [21]:
result_head = test_timegate("ukgwa", "https://www.nationalarchives.gov.uk/", date="2006-01-01")
result_get = test_timegate(
    "ukgwa", "https://www.nationalarchives.gov.uk/", date="2006-01-01", request_type="get")

# Test for expected result
assert result_head == result_get

result_get

https://webarchive.nationalarchives.gov.uk/ukgwa/https://www.nationalarchives.gov.uk/
https://webarchive.nationalarchives.gov.uk/ukgwa/https://www.nationalarchives.gov.uk/


{'original': 'http://www.nationalarchives.gov.uk/',
 'timegate': 'https://webarchive.nationalarchives.gov.uk/ukgwa/http://www.nationalarchives.gov.uk/',
 'timemap': 'https://webarchive.nationalarchives.gov.uk/ukgwa/timemap/link/http://www.nationalarchives.gov.uk/',
 'memento': 'https://webarchive.nationalarchives.gov.uk/ukgwa/20060213205514mp_/http://www.nationalarchives.gov.uk/'}

### Summarising the differences

As you can see above, there are a couple of significant differences in the way that Timegates behave across the five repositories.

* Wayback systems (IA) provide more information than the Pywb systems (`first memento`, `last memento`, `prev memento`, and `next memento`).
* You can use either `HEAD` or `GET` with UKWA, NZWA, and UKGWA, but IA and AWA behave differently depending on the type of request and whether redirects are followed. To get results from either a `HEAD` or `GET` request, AWA requests should not follow redirects. To get results from a `HEAD` request, IA requests should follow redirects. `GET` requests to IA will return results whether or not redirects are allowed; however, those results differ.

### Normalising Timegate responses and queries

Here's some code to smooth out the differences between systems, and return Memento data as a Python dictionary. Specifically it:

* Follows redirects for requests to the IA.
* If there is no `memento` value in the response (as sometimes happens with NLNZ), it looks for a `prev memento`, `next memento`, or `last memento` value instead.

In [22]:
def query_timegate(timegate, url, date=None, tz="Australia/Canberra"):
    """
    Query the specified repository for a Memento.
    """
    headers = {}
    if date:
        formatted_date = format_date_for_headers(date, tz)
        headers["Accept-Datetime"] = formatted_date
    
    # Note that you don't get a timegate response if you leave off the trailing slash, but extras don't hurt!
    tg_url = (
        f"{TIMEGATES[timegate]}{url}/"
        if not url.endswith("/")
        else f"{TIMEGATES[timegate]}{url}"
    )
    # print(tg_url)
    # IA only works if redirects are followed -- this defaults to False with HEAD requests...
    if timegate == "ia":
        allow_redirects = True
    else:
        allow_redirects = False
    response = requests.head(tg_url, headers=headers, allow_redirects=allow_redirects)
    response.raise_for_status()
    return parse_links_from_headers(response)


def get_memento(timegate, url, date=None, tz="Australia/Canberra"):
    """
    If there's no memento in the results, look for an alternative.
    Returns None if none of the expected link types are present.
    """
    links = query_timegate(timegate, url, date, tz)
    # NLNZ doesn't always seem to return a Memento, so we'll build in some fuzziness
    # Check the link types in order of preference
    for rel in ["memento", "prev memento", "next memento", "last memento"]:
        if rel in links:
            return links[rel]
    # Nothing usable in the response (this also covers an empty set of links)
    return None

Now we can request a Memento from any of the five repositories and get back the results as a Python dictionary. You can see this code in action in the [Get full page screenshots from archived web pages](save_screenshot.ipynb) notebook.

In [22]:
result = query_timegate("ukgwa", "https://www.nationalarchives.gov.uk/", date="2015-01-01")

# Test for expected result
assert "memento" in result

result

{'original': 'http://nationalarchives.gov.uk/',
 'timegate': 'https://webarchive.nationalarchives.gov.uk/ukgwa/http://nationalarchives.gov.uk/',
 'timemap': 'https://webarchive.nationalarchives.gov.uk/ukgwa/timemap/link/http://nationalarchives.gov.uk/',
 'memento': 'https://webarchive.nationalarchives.gov.uk/ukgwa/20141223091614mp_/http://nationalarchives.gov.uk/'}

Or, if we just want the url of a Memento (falling back to alternative values if `memento` is missing), we can use `get_memento()`:

In [23]:
result = get_memento("nzwa", "http://natlib.govt.nz")

# Test for expected result
assert result.startswith("https://ndhadeliver.natlib.govt.nz/webarchive/")

result

'https://ndhadeliver.natlib.govt.nz/webarchive/20220801082654mp_/http://natlib.govt.nz/'

----

## Timemaps

Memento Timemaps provide machine-processable lists of web page captures from a particular archive. They are available from both OpenWayback and Pywb systems, though there are some differences. The [Pywb documentation](https://pywb.readthedocs.io/en/latest/manual/memento.html#timemap-api) notes that the following formats are available:

* link – returns an application/link-format as required by the Memento spec
* cdxj – returns a timemap in the native CDXJ format
* json – returns the timemap in newline-delimited JSON (NDJSON) format

Timemaps are requested using a url with the following format:

```
http://[address.of.archive]/[collection]/timemap/[format]/[web page url]
```

So if you wanted to query the Australian Web Archive to get a list of captures in JSON format from http://nla.gov.au/ you'd use this url:

```
https://web.archive.org.au/awa/timemap/json/http://nla.gov.au/
```
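
The url pattern can be expressed as a small helper. This is a sketch under the assumption that the archive's collection base url (such as the values in `TIMEGATES`) is passed in directly; the name `build_timemap_url` is not from any library.

```python
def build_timemap_url(archive_base, url, fmt="link"):
    """Combine a collection base url, a Timemap format, and a page url."""
    if fmt not in {"link", "cdxj", "json"}:
        raise ValueError(f"Unsupported Timemap format: {fmt}")
    # Make sure there is exactly one slash between the base and 'timemap'
    return f"{archive_base.rstrip('/')}/timemap/{fmt}/{url}"


timemap_url = build_timemap_url(
    "https://web.archive.org.au/awa/", "http://nla.gov.au/", "json"
)
print(timemap_url)
```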

The examples below show how the format and behaviour of Timemaps vary slightly across the five repositories we're interested in.

In [14]:
def get_timemap(timegate, url, format="json"):
    """
    Basic function to get a Timemap for the supplied url.
    """
    tg_url = f"{TIMEGATES[timegate]}timemap/{format}/{url}/"
    response = requests.get(tg_url)
    response.raise_for_status()
    # Show the content-type
    # print(response.headers['content-type'])
    return response.headers["content-type"], response.text

### National Library of Australia

Request a Timemap in `link` format. Note that response headers include `content-type` of `application/link-format`.

In [23]:
content_type, timemap = get_timemap("awa", "http://www.gov.au", "link")

print(content_type)
# Test content type
assert content_type == "application/link-format"

# Show the first 5 lines
print("\n".join(timemap.splitlines()[:5]))

application/link-format
<https://web.archive.org.au/awa/timemap/link/http://www.gov.au/>; rel="self"; type="application/link-format"; from="Wed, 06 Dec 2000 21:15:00 GMT",
<https://web.archive.org.au/awa/http://www.gov.au/>; rel="timegate",
<http://www.gov.au/>; rel="original",
<https://web.archive.org.au/awa/20001206211500mp_/http://www.gov.au/>; rel="memento"; datetime="Wed, 06 Dec 2000 21:15:00 GMT"; collection="awa",
<https://web.archive.org.au/awa/20010118203600mp_/http://www.gov.au/>; rel="memento"; datetime="Thu, 18 Jan 2001 20:36:00 GMT"; collection="awa",


----
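
The `link` format output above can be reduced to a list of captures with a few regular expressions. The sketch below assumes one link per line with double-quoted parameter values, as in the sample; `mementos_from_link_timemap` is a hypothetical name used for illustration.

```python
import re


def mementos_from_link_timemap(timemap_text):
    """Return (datetime string, url) pairs for each memento line."""
    captures = []
    for line in timemap_text.splitlines():
        url_match = re.match(r"\s*<([^>]+)>", line)
        rel_match = re.search(r'rel="([^"]+)"', line)
        date_match = re.search(r'datetime="([^"]+)"', line)
        # rel can hold several words (eg 'first memento'), so test each word
        if url_match and rel_match and "memento" in rel_match.group(1).split():
            captures.append(
                (date_match.group(1) if date_match else None, url_match.group(1))
            )
    return captures


# The first five lines of the AWA Timemap shown above
sample = """<https://web.archive.org.au/awa/timemap/link/http://www.gov.au/>; rel="self"; type="application/link-format"; from="Wed, 06 Dec 2000 21:15:00 GMT",
<https://web.archive.org.au/awa/http://www.gov.au/>; rel="timegate",
<http://www.gov.au/>; rel="original",
<https://web.archive.org.au/awa/20001206211500mp_/http://www.gov.au/>; rel="memento"; datetime="Wed, 06 Dec 2000 21:15:00 GMT"; collection="awa",
<https://web.archive.org.au/awa/20010118203600mp_/http://www.gov.au/>; rel="memento"; datetime="Thu, 18 Jan 2001 20:36:00 GMT"; collection="awa","""

captures = mementos_from_link_timemap(sample)
for captured_at, url in captures:
    print(captured_at, url)
```

----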

Request a Timemap in `json` format. This returns `ndjson` (Newline Delimited JSON) – each capture is a JSON object, separated by a line break. Note that the response headers include `content-type` of `text/x-ndjson`.

In [15]:
content_type, timemap = get_timemap(
    "awa",
    "http://www.aph.gov.au/Senate/committee/eet_ctte/uni_finances/report/index.htm",
    "json",
)

print(content_type)
# Test content type
assert content_type == "text/x-ndjson"

# Show the first line
print("\n".join(timemap.splitlines()[:1]))

text/x-ndjson
{"urlkey": "au,gov,aph)/senate/committee/eet_ctte/uni_finances/report/index.htm", "timestamp": "20031122074837", "url": "http://www.aph.gov.au/senate/committee/EET_CTTE/uni_finances/report/index.htm", "mime": "text/html", "status": "200", "digest": "3H5Z77RMYKXRFCE6ODTWAKMBPZOGJ4YE", "offset": "97170362", "filename": "NLA-EXTRACTION-1996-2004-ARCS-PART-01336-000000.arc.gz", "length": "3446", "source": "awa", "source-coll": "awa"}


----
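
Because each line of the `json` Timemap is a separate JSON object, it can be parsed line by line. A minimal sketch, using the record shown above trimmed to a few fields; `parse_ndjson_timemap` is a hypothetical helper name.

```python
import json


def parse_ndjson_timemap(timemap_text):
    """Parse a newline-delimited JSON Timemap into a list of dicts."""
    # Skip blank lines so a trailing newline doesn't cause an error
    return [json.loads(line) for line in timemap_text.splitlines() if line.strip()]


# One record from the AWA response above, trimmed to a few fields
sample = (
    '{"urlkey": "au,gov,aph)/senate/committee/eet_ctte/uni_finances/report/index.htm", '
    '"timestamp": "20031122074837", '
    '"url": "http://www.aph.gov.au/senate/committee/EET_CTTE/uni_finances/report/index.htm", '
    '"mime": "text/html", "status": "200"}\n'
)

captures = parse_ndjson_timemap(sample)
print(captures[0]["timestamp"], captures[0]["status"])
```

----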

Request a Timemap in `cdxj` format. Note that response headers include `content-type` of `text/x-cdxj`.

In [26]:
content_type, timemap = get_timemap(
    "awa",
    "http://www.aph.gov.au/Senate/committee/eet_ctte/uni_finances/report/index.htm",
    "cdxj",
)

print(content_type)
# Test content type
assert content_type == "text/x-cdxj"

# Show the first line
print("\n".join(timemap.splitlines()[:1]))

text/x-cdxj
au,gov,aph)/senate/committee/eet_ctte/uni_finances/report/index.htm 20031122074837 {"url": "http://www.aph.gov.au/senate/committee/EET_CTTE/uni_finances/report/index.htm", "mime": "text/html", "status": "200", "digest": "3H5Z77RMYKXRFCE6ODTWAKMBPZOGJ4YE", "offset": "97170362", "filename": "NLA-EXTRACTION-1996-2004-ARCS-PART-01336-000000.arc.gz", "length": "3446", "source": "awa", "source-coll": "awa"}
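
Each `cdxj` line is made up of a url key, a timestamp, and a JSON block, separated by spaces. The sketch below splits a line into those three parts, assuming the first two fields never contain spaces, as in the sample above (trimmed here to a few fields). `parse_cdxj_line` is a hypothetical helper name.

```python
import json


def parse_cdxj_line(line):
    """Split a CDXJ line into its url key, timestamp, and JSON fields."""
    # Only split on the first two spaces; the JSON block contains spaces of its own
    urlkey, timestamp, json_block = line.strip().split(" ", 2)
    record = json.loads(json_block)
    record["urlkey"] = urlkey
    record["timestamp"] = timestamp
    return record


sample = (
    "au,gov,aph)/senate/committee/eet_ctte/uni_finances/report/index.htm 20031122074837 "
    '{"url": "http://www.aph.gov.au/senate/committee/EET_CTTE/uni_finances/report/index.htm", '
    '"mime": "text/html", "status": "200"}'
)

record = parse_cdxj_line(sample)
print(record["timestamp"], record["url"])
```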


### UK Web Archive

Request a Timemap in `link` format. Note that response headers include `content-type` of `application/link-format`.

In [27]:
content_type, timemap = get_timemap("ukwa", "http://bl.uk", "link")

print(content_type)
# Test content type
assert content_type == "application/link-format"

print("\n".join(timemap.splitlines()[:5]))

application/link-format
<https://www.webarchive.org.uk/wayback/archive/timemap/link/http://bl.uk/>; rel="self"; type="application/link-format"; from="Tue, 30 Oct 2001 00:00:19 GMT",
<https://www.webarchive.org.uk/wayback/archive/http://bl.uk/>; rel="timegate",
<http://bl.uk/>; rel="original",
<https://www.webarchive.org.uk/wayback/archive/20011030000019mp_/http://www.bl.uk/>; rel="memento"; datetime="Tue, 30 Oct 2001 00:00:19 GMT"; collection="archive",
<https://www.webarchive.org.uk/wayback/archive/20011113000000mp_/http://www.bl.uk/>; rel="memento"; datetime="Tue, 13 Nov 2001 00:00:00 GMT"; collection="archive",


----

Request a Timemap in `json` format. This returns `ndjson` (Newline Delimited JSON) – each capture is a JSON object, separated by a line break. Note that the response headers include `content-type` of `text/x-ndjson`.

In [28]:
content_type, timemap = get_timemap("ukwa", "http://bl.uk", "json")

print(content_type)
# Test content type
assert content_type == "text/x-ndjson"

print("\n".join(timemap.splitlines()[:1]))

text/x-ndjson
{"urlkey": "uk,bl)/", "timestamp": "20011030000019", "url": "http://www.bl.uk/", "mime": "text/html", "status": "200", "digest": "JN4RHYLNGS7X64HADIX3XHDIMYDBLAAW", "redirect": "-", "robotflags": "-", "length": "0", "offset": "10813988", "filename": "/data/102148/31031347/WARCS/BL-31031347.warc.gz", "load_url": "", "source": "archive", "source-coll": "archive", "access": "allow"}


----

Request a Timemap in `cdxj` format. Note that response headers include `content-type` of `text/x-cdxj`.

In [29]:
content_type, timemap = get_timemap("ukwa", "http://bl.uk", "cdxj")

print(content_type)
# Test content type
assert content_type == "text/x-cdxj"

print("\n".join(timemap.splitlines()[:1]))

text/x-cdxj
uk,bl)/ 20011030000019 {"url": "http://www.bl.uk/", "mime": "text/html", "status": "200", "digest": "JN4RHYLNGS7X64HADIX3XHDIMYDBLAAW", "redirect": "-", "robotflags": "-", "length": "0", "offset": "10813988", "filename": "/data/102148/31031347/WARCS/BL-31031347.warc.gz", "load_url": "", "source": "archive", "source-coll": "archive", "access": "allow"}


### UK Government Web Archive

Request a Timemap in `link` format. Note that response headers include `content-type` of `application/link-format`.

In [25]:
content_type, timemap = get_timemap("ukgwa", "https://www.nationalarchives.gov.uk/", "link")

print(content_type)
# Test content type
assert content_type == "application/link-format"

print("\n".join(timemap.splitlines()[:5]))

application/link-format
<https://webarchive.nationalarchives.gov.uk/ukgwa/timemap/link/https://www.nationalarchives.gov.uk//>; rel="self"; type="application/link-format"; from="Mon, 20 Oct 2003 01:04:12 GMT",
<https://webarchive.nationalarchives.gov.uk/ukgwa/https://www.nationalarchives.gov.uk//>; rel="timegate",
<https://www.nationalarchives.gov.uk//>; rel="original",
<https://webarchive.nationalarchives.gov.uk/ukgwa/20031020010412mp_/http://www.nationalarchives.gov.uk:80/>; rel="memento"; datetime="Mon, 20 Oct 2003 01:04:12 GMT"; collection="full_zipnum",
<https://webarchive.nationalarchives.gov.uk/ukgwa/20040104233258mp_/http://www.nationalarchives.gov.uk/>; rel="memento"; datetime="Sun, 04 Jan 2004 23:32:58 GMT"; collection="full_zipnum",
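
Each line of a `link` Timemap contains a url in angle brackets followed by a series of `key="value"` attributes. Here's a minimal sketch of how a line might be broken into its parts using regular expressions. The `parse_link_line` function and both patterns are hypothetical, and they assume that urls don't contain `>` and that attribute values don't contain double quotes.

```python
import re
from email.utils import parsedate_to_datetime

# Hypothetical patterns for illustration: a url in angle brackets,
# followed by `key="value"` attributes and an optional trailing comma
LINK_PATTERN = re.compile(r"^<(?P<url>[^>]+)>(?P<params>.*?),?\s*$")
PARAM_PATTERN = re.compile(r'(\w+)="([^"]*)"')


def parse_link_line(line):
    """
    Parse one line of a link-format Timemap into a dict containing
    the url and its attributes. Returns None if the line doesn't match.
    """
    match = LINK_PATTERN.match(line.strip())
    if not match:
        return None
    entry = {"url": match.group("url")}
    entry.update(PARAM_PATTERN.findall(match.group("params")))
    # Convert the HTTP-style date string into a datetime object
    if "datetime" in entry:
        entry["datetime"] = parsedate_to_datetime(entry["datetime"])
    return entry


sample = (
    "<https://webarchive.nationalarchives.gov.uk/ukgwa/20031020010412mp_/"
    'http://www.nationalarchives.gov.uk:80/>; rel="memento"; '
    'datetime="Mon, 20 Oct 2003 01:04:12 GMT"; collection="full_zipnum",'
)
entry = parse_link_line(sample)
print(entry["rel"], entry["datetime"].isoformat())
```

Lines describing the `self`, `timegate`, and `original` links are parsed in the same way, so you can filter the results on the `rel` value.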


----

Request a Timemap in `json` format. This returns `ndjson` (Newline Delimited JSON) – each capture is a JSON object, separated by a line break. Note that the response headers include `content-type` of `text/x-ndjson`.

In [26]:
content_type, timemap = get_timemap("ukgwa", "https://www.nationalarchives.gov.uk/", "json")

print(content_type)
# Test content type
assert content_type == "text/x-ndjson"

print("\n".join(timemap.splitlines()[:1]))

text/x-ndjson
{"urlkey": "uk,gov,nationalarchives)/", "timestamp": "20031020010412", "url": "http://www.nationalarchives.gov.uk:80/", "mime": "text/html", "status": "200", "digest": "U2IC276V3AKMWIJGWWJXCVQ2KZ6AMU5J", "redirect": "-", "robotflags": "-", "length": "951", "offset": "898", "filename": "UKGOV-WEEKLY-010-031019180412-000.warc.gz", "source": "full_zipnum", "source-coll": "full_zipnum", "access": "allow"}


----

Request a Timemap in `cdxj` format. Note that response headers include `content-type` of `text/x-cdxj`.

In [27]:
content_type, timemap = get_timemap("ukgwa", "https://www.nationalarchives.gov.uk/", "cdxj")

print(content_type)
# Test content type
assert content_type == "text/x-cdxj"

print("\n".join(timemap.splitlines()[:1]))

text/x-cdxj
uk,gov,nationalarchives)/ 20031020010412 {"url": "http://www.nationalarchives.gov.uk:80/", "mime": "text/html", "status": "200", "digest": "U2IC276V3AKMWIJGWWJXCVQ2KZ6AMU5J", "redirect": "-", "robotflags": "-", "length": "951", "offset": "898", "filename": "UKGOV-WEEKLY-010-031019180412-000.warc.gz", "source": "full_zipnum", "source-coll": "full_zipnum", "access": "allow"}


### National Library of New Zealand

Request a Timemap in `link` format. Note that response headers include `content-type` of `application/link-format`.

In [32]:
content_type, timemap = get_timemap("nzwa", "http://natlib.govt.nz", "link")

print(content_type)
# Test content type
assert content_type == "application/link-format"

print("\n".join(timemap.splitlines()[:5]))

application/link-format
<https://ndhadeliver.natlib.govt.nz/webarchive/timemap/link/http://natlib.govt.nz/>; rel="self"; type="application/link-format"; from="Sun, 11 Jul 2004 21:32:25 GMT",
<https://ndhadeliver.natlib.govt.nz/webarchive/http://natlib.govt.nz/>; rel="timegate",
<http://natlib.govt.nz/>; rel="original",
<https://ndhadeliver.natlib.govt.nz/webarchive/20040711213225mp_/http://www.natlib.govt.nz/>; rel="memento"; datetime="Sun, 11 Jul 2004 21:32:25 GMT"; collection="webarchive",
<https://ndhadeliver.natlib.govt.nz/webarchive/20060704033135mp_/http://www.natlib.govt.nz/>; rel="memento"; datetime="Tue, 04 Jul 2006 03:31:35 GMT"; collection="webarchive",


----

Request a Timemap in `json` format. This returns `ndjson` (Newline Delimited JSON) – each capture is a JSON object, separated by a line break. Note that the response headers include `content-type` of `text/x-ndjson`.

In [33]:
content_type, timemap = get_timemap("nzwa", "http://natlib.govt.nz", "json")

print(content_type)
# Test content type
assert content_type == "text/x-ndjson"

print("\n".join(timemap.splitlines()[:1]))

text/x-ndjson
{"urlkey": "nz,govt,natlib)/", "timestamp": "20040711213225", "url": "http://www.natlib.govt.nz/", "mime": "text/html", "status": "200", "digest": "JV66FPIIX6IJTB42TNHMQDEU5Z3LFBCK", "redirect": "-", "robotflags": "-", "length": "0", "offset": "976", "filename": "V1-FL1645590.arc", "load_url": "http://10.4.1.66:80/nlnzwebarchive_PROD/ap/20040711213225id_/http://www.natlib.govt.nz/", "source": "webarchive", "source-coll": "webarchive"}


----

Request a Timemap in `cdxj` format. Note that response headers include `content-type` of `text/x-cdxj`.

In [34]:
content_type, timemap = get_timemap("nzwa", "http://natlib.govt.nz", "cdxj")

print(content_type)
# Test content type
assert content_type == "text/x-cdxj"

print("\n".join(timemap.splitlines()[:1]))

text/x-cdxj
nz,govt,natlib)/ 20040711213225 {"url": "http://www.natlib.govt.nz/", "mime": "text/html", "status": "200", "digest": "JV66FPIIX6IJTB42TNHMQDEU5Z3LFBCK", "redirect": "-", "robotflags": "-", "length": "0", "offset": "976", "filename": "V1-FL1645590.arc", "load_url": "http://10.4.1.66:80/nlnzwebarchive_PROD/ap/20040711213225id_/http://www.natlib.govt.nz/", "source": "webarchive", "source-coll": "webarchive"}


### Internet Archive

Request a Timemap in `link` format. Note that response headers include `content-type` of `application/link-format`.

In [35]:
content_type, timemap = get_timemap("ia", "http://discontents.com.au", "link")

print(content_type)
# Test content type
assert content_type == "application/link-format"

print("\n".join(timemap.splitlines()[:5]))

application/link-format
<http://www.discontents.com.au:80/>; rel="original",
<https://web.archive.org/web/timemap/link/http://discontents.com.au/>; rel="self"; type="application/link-format"; from="Sun, 06 Dec 1998 01:22:33 GMT",
<https://web.archive.org/web/http://discontents.com.au/>; rel="timegate",
<https://web.archive.org/web/19981206012233/http://www.discontents.com.au:80/>; rel="first memento"; datetime="Sun, 06 Dec 1998 01:22:33 GMT",
<https://web.archive.org/web/19981212024410/http://www.discontents.com.au:80/>; rel="memento"; datetime="Sat, 12 Dec 1998 02:44:10 GMT",


----

Request a Timemap in `json` format. This returns JSON as an array of arrays, where the first row provides the column headings. Note that the response headers include `content-type` of `application/json`.

In [36]:
content_type, timemap = get_timemap("ia", "http://discontents.com.au", "json")

print(content_type)
# Test content type
assert content_type == "application/json"

print("\n".join(timemap.splitlines()[:5]))

application/json
[["urlkey","timestamp","original","mimetype","statuscode","digest","redirect","robotflags","length","offset","filename"],
["au,com,discontents)/","19981206012233","http://www.discontents.com.au:80/","text/html","200","FQJ6JMPIZ7WEKYPQ4SGPVHF57GCV6B36","-","-","1610","43993900","green-0133-19990218235953-919455657-c/green-0141-912907270.arc.gz"],
["au,com,discontents)/","19981212024410","http://www.discontents.com.au:80/","text/html","200","FQJ6JMPIZ7WEKYPQ4SGPVHF57GCV6B36","-","-","1613","17792789","slash-913417727-c/slash-913430608.arc.gz"],
["au,com,discontents)/","19990125094813","http://www.discontents.com.au:80/","text/html","200","FQJ6JMPIZ7WEKYPQ4SGPVHF57GCV6B36","-","-","1613","11419234","slash-913417727-c/slash_19990124232053-917257670.arc.gz"],
["au,com,discontents)/","19990208004052","http://discontents.com.au:80/","text/html","200","FQJ6JMPIZ7WEKYPQ4SGPVHF57GCV6B36","-","-","1612","13269748","slash-913417727-c/slash-918434425.arc.gz"],


----

Request a Timemap in `cdxj` format. This returns plain text rather than CDXJ, with fields separated by spaces and captures separated by line breaks. Note that the response headers include `content-type` of `text/plain`.

In [37]:
content_type, timemap = get_timemap("ia", "http://discontents.com.au", "cdxj")

print(content_type)
# Test content type
assert content_type == "text/plain"

print("\n".join(timemap.splitlines()[:1]))

text/plain
au,com,discontents)/ 19981206012233 http://www.discontents.com.au:80/ text/html 200 FQJ6JMPIZ7WEKYPQ4SGPVHF57GCV6B36 - - 1610 43993900 green-0133-19990218235953-919455657-c/green-0141-912907270.arc.gz
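
The plain text returned by IA has no labels, but the fields seem to be in the same order as the column headings in IA's `json` output. Assuming that order holds, a line could be converted into a dictionary like this. The `IA_FIELDS` list and `parse_ia_cdx_line` function are hypothetical names used for illustration.

```python
# Column headings copied from the first row of IA's `json` Timemap.
# Assumes the plain text fields are always in the same order.
IA_FIELDS = [
    "urlkey",
    "timestamp",
    "original",
    "mimetype",
    "statuscode",
    "digest",
    "redirect",
    "robotflags",
    "length",
    "offset",
    "filename",
]


def parse_ia_cdx_line(line):
    """
    Convert a space-separated line from IA's plain text Timemap
    into a dict keyed by the IA column headings.
    """
    return dict(zip(IA_FIELDS, line.split(" ")))


sample = (
    "au,com,discontents)/ 19981206012233 http://www.discontents.com.au:80/ "
    "text/html 200 FQJ6JMPIZ7WEKYPQ4SGPVHF57GCV6B36 - - 1610 43993900 "
    "green-0133-19990218235953-919455657-c/green-0141-912907270.arc.gz"
)
capture = parse_ia_cdx_line(sample)
print(capture["timestamp"], capture["original"], capture["statuscode"])
```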


### Differences in field labels

If we compare the Pywb JSON output with the IA Wayback output, we see there are also some differences in the field labels. In particular `original` in IA Wayback is just `url` in Pywb, while `statuscode` and `mimetype` are shortened to `status` and `mime` in Pywb.

In [28]:
_, timemap = get_timemap("ia", "http://bl.uk", "json")
data = json.loads(timemap)

# Test for `mimetype` label
assert "mimetype" in data[0]

data[0]

['urlkey',
 'timestamp',
 'original',
 'mimetype',
 'statuscode',
 'digest',
 'redirect',
 'robotflags',
 'length',
 'offset',
 'filename']

In [16]:
_, timemap = get_timemap("ukwa", "http://bl.uk", "json")
data = [json.loads(line) for line in timemap.splitlines()]

# Test for `mime` label
assert "mime" in data[0]

list(data[0].keys())

['urlkey',
 'timestamp',
 'url',
 'mime',
 'status',
 'digest',
 'redirect',
 'robotflags',
 'length',
 'offset',
 'filename',
 'load_url',
 'source',
 'source-coll',
 'access']

### Summarising the differences

The good news is that all repositories provide Timemaps in the standard `link` format as required by the Memento specification. However, there's more variation when it comes to other formats.

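* IA's `json` format is an array of arrays with a header row, rather than the ndjson returned by the Pywb-based UKWA, UKGWA, NLNZ, and NLA.
* IA returns space-separated plain text (`text/plain`), rather than CDXJ, when `cdxj` is requested.
* IA uses different labels for some values (`original`, `mimetype`, and `statuscode` instead of `url`, `mime`, and `status`).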

### Normalising Timemaps

With the information above we can construct some functions to return normalised Timemap results as JSON. To do this we need to:

* Restructure the JSON output from IA to match the Pywb format
* Change some of the column headings in the IA data to match the Pywb format

Because the `link` format provides less information than the `json` format, we could also try to enrich data converted from `link` Timemaps (originally needed for NLNZ) by requesting more information about individual Mementos.

In [12]:
def convert_lists_to_dicts(results):
    """
    Converts IA style timemap (a JSON array of arrays) to a list of dictionaries.
    Renames keys to standardise IA with other Timemaps.
    """
    if results:
        keys = results[0]
        results_as_dicts = [dict(zip(keys, v)) for v in results[1:]]
    else:
        results_as_dicts = results
    for d in results_as_dicts:
        d["status"] = d.pop("statuscode")
        d["mime"] = d.pop("mimetype")
        d["url"] = d.pop("original")
    return results_as_dicts


def get_capture_data_from_memento(url, request_type="head"):
    """
    For OpenWayback systems this can get some extra capture info to insert into Timemaps.
    """
    if request_type == "head":
        response = requests.head(url)
    else:
        response = requests.get(url)
    headers = response.headers
    length = headers.get("x-archive-orig-content-length")
    status = headers.get("x-archive-orig-status")
    status = status.split(" ")[0] if status else None
    mime = headers.get("x-archive-orig-content-type")
    mime = mime.split(";")[0] if mime else None
    return {"length": length, "status": status, "mime": mime}


def convert_link_to_json(results, enrich_data=False):
    """
    Converts link formatted Timemap to JSON.

    This was originally needed for NLNZ, but now all five archives
    return JSON data.
    """
    data = []
    for line in results.splitlines():
        parts = line.split("; ")
        if len(parts) > 1:
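            link_type = re.search(
                r'rel="(original|self|timegate|first memento|last memento|memento)"',
                parts[1],
            )
            # Include the first and last captures as well as plain mementos
            if link_type and link_type.group(1).endswith("memento"):
                link = parts[0].strip("<>")
                # Allow for an optional modifier (eg `mp_`) after the timestamp
                match = re.search(r"/(\d{12}|\d{14})(?:[a-z]{2}_)?/(.*)$", link)
                if not match:
                    continue
                timestamp, original = match.groups()
                capture = {"timestamp": timestamp, "url": original}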
                if enrich_data:
                    capture.update(get_capture_data_from_memento(link))
                    # print(capture)
                data.append(capture)
    return data


def get_timemap_as_json(timegate, url):
    """
    Get a Timemap then normalise results (if necessary) to return a list of dicts.
    """
    tg_url = f"{TIMEGATES[timegate]}timemap/json/{url}/"
    response = requests.get(tg_url)
    response.raise_for_status()
    response_type = response.headers["content-type"]
    # print(response_type)
    if response_type == "text/x-ndjson":
        data = [json.loads(line) for line in response.text.splitlines()]
    elif response_type == "application/json":
        data = convert_lists_to_dicts(response.json())
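    elif response_type in ["application/link-format", "text/html;charset=utf-8"]:
        data = convert_link_to_json(response.text)
    else:
        # Fail with a clear message rather than an unbound `data` variable
        raise ValueError(f"Unexpected content type: {response_type}")
    return data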

Now we can get information about captures in a standardised JSON format from all five repositories. You can see this in action in the [Display changes in the text of an archived web page over time](display-text-changes-from-timemap.ipynb) notebook.

In [13]:
timemap = get_timemap_as_json("ukwa", "http://bl.uk")

# Test for `mime` label
assert "mime" in timemap[0]

timemap[0]

{'urlkey': 'uk,bl)/',
 'timestamp': '20011030000019',
 'url': 'http://www.bl.uk/',
 'mime': 'text/html',
 'status': '200',
 'digest': 'JN4RHYLNGS7X64HADIX3XHDIMYDBLAAW',
 'redirect': '-',
 'robotflags': '-',
 'length': '0',
 'offset': '10813988',
 'filename': '/data/102148/31031347/WARCS/BL-31031347.warc.gz',
 'load_url': '',
 'source': 'archive',
 'source-coll': 'archive',
 'access': 'allow'}

In [14]:
timemap = get_timemap_as_json("ia", "http://bl.uk")

# Test for `mime` label
assert "mime" in timemap[0]

timemap[0]

{'urlkey': 'uk,bl)/',
 'timestamp': '19970218190613',
 'digest': 'Z42UMUL76GODKO3EMNSLXDTCST66VDAX',
 'redirect': '-',
 'robotflags': '-',
 'length': '1208',
 'offset': '19524651',
 'filename': 'GR-001114-c/GR-002277.arc.gz',
 'status': '200',
 'mime': 'text/html',
 'url': 'http://www.bl.uk:80/'}
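
The normalised results store each `timestamp` as a 14 digit string. If you want to sort or filter captures by date, it can help to convert these strings into datetime objects. Here's a minimal sketch, assuming that timestamps are in UTC, as the GMT dates in the `link` Timemaps suggest. The `timestamp_to_datetime` function is a hypothetical helper, and the sample captures are trimmed versions of the results above.

```python
from datetime import datetime, timezone


def timestamp_to_datetime(timestamp):
    """
    Convert a 14 digit web archive timestamp (YYYYMMDDhhmmss)
    into a timezone-aware datetime. Assumes the timestamp is in UTC.
    """
    parsed = datetime.strptime(timestamp, "%Y%m%d%H%M%S")
    return parsed.replace(tzinfo=timezone.utc)


# Trimmed versions of the first captures returned by UKWA and IA
captures = [
    {"timestamp": "20011030000019", "url": "http://www.bl.uk/"},
    {"timestamp": "19970218190613", "url": "http://www.bl.uk:80/"},
]

# Find the earliest capture across both repositories
earliest = min(captures, key=lambda c: timestamp_to_datetime(c["timestamp"]))
print(earliest["url"], timestamp_to_datetime(earliest["timestamp"]).isoformat())
```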

----

## Mementos

You can also modify the url of a Memento to change the way it's presented. In particular, adding `id_` after the timestamp will tell the server that you want the original harvested version of the webpage, without any rewriting of links, or web archive navigation features. For example:

```
https://web.archive.org.au/awa/20200302223537id_/http://discontents.com.au/
```

This works with all five repositories. However, note that for the Australian Web Archive you need to use the `web.archive.org.au` domain, not `webarchive.nla.gov.au`.

In addition, IA supports the `if_` option, which provides a view of the archived page without the web archive's navigation header inserted, but with links to CSS, JS, and images rewritten to point to archived versions. This is as close as you can get to looking at the original page, and I've used it in the [Get full page screenshots from archived web pages](save_screenshot.ipynb) notebook. Note that if you add `if_` to requests from the UKWA, NLNZ, or the NLA you'll be redirected to the standard view with the original page framed by the web archive navigation.

Pywb's page on [url rewriting](https://pywb.readthedocs.io/en/latest/manual/rewriter.html?highlight=id_#url-rewriting) has some useful information about this.
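
If you're working with lots of Memento urls, you might want to add the modifier programmatically. The sketch below inserts a modifier after the timestamp, replacing any existing modifier such as the `mp_` that appears in the Pywb `link` Timemaps. The `add_modifier` function is a hypothetical helper, and it assumes that the url contains a 14 digit timestamp and that modifiers are two lowercase letters followed by an underscore.

```python
import re


def add_modifier(memento_url, modifier="id_"):
    """
    Insert a modifier after the 14 digit timestamp in a Memento url,
    replacing an existing modifier (eg `mp_`) if there is one.
    """
    return re.sub(
        r"/(\d{14})(?:[a-z]{2}_)?/",
        rf"/\g<1>{modifier}/",
        memento_url,
        count=1,
    )


print(add_modifier("https://web.archive.org.au/awa/20200302223537/http://discontents.com.au/"))
print(add_modifier("https://web.archive.org/web/19981206012233/http://www.discontents.com.au:80/", "if_"))
```

Passing an empty string as the modifier removes any existing modifier from the url.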

----
Created by [Tim Sherratt](https://timsherratt.org) for the [GLAM Workbench](https://glam-workbench.github.io). Support me by becoming a [GitHub sponsor](https://github.com/sponsors/wragge)!

Work on this notebook was supported by the [IIPC Discretionary Funding Programme 2019-2020](http://netpreserve.org/projects/).

The Web Archives section of the GLAM Workbench is sponsored by the [British Library](https://www.bl.uk/).