{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Find all the archived versions of a web page\n", "\n", "

New to Jupyter notebooks? Try *Using Jupyter notebooks* for a quick introduction.

\n", "\n", "You can find all the archived versions of a web page by requesting a Timemap from a Memento-compliant repository. If the repository has a CDX API, you can get much the same data by doing an exact url search." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import json\n", "import re\n", "\n", "import requests\n", "from surt import surt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using Timemaps\n", "\n", "**Works with AWA, IA, NZWA, UKWA & UKGWA**\n", "\n", "Variations in the way Memento is implemented across repositories are documented in [Getting data from web archives using Memento](memento.ipynb). The functions below smooth out these variations to provide a (mostly) consistent interface to the UK Web Archive, UK Government Web Archive, Australian Web Archive, New Zealand Web Archive, and the Internet Archive. They could be easily modified to work with other Memento-compliant repositories.\n", "\n", "To get all captures of a url in JSON format:\n", "\n", "``` python\n", "get_timemap_as_json([timegate], [url], enrich_data=[True or False])\n", "```\n", "\n", "Parameters:\n", "\n", "* `timegate` – one of 'ukwa' (UK Web Archive), 'ukgwa' (UK Government Web Archive) 'awa' (Australia), 'nzwa' (New Zealand), or 'ia' (Internet Archive)\n", "* `url` – the url you want to look for in the archive\n", "* `enrich_data` – NZWA Timemaps include less information, if you set this to `True` the script will query each memento in turn to try and find more capture information (such as `mime` and `status`). This will slow things down quite a bit, and isn't always successful, so leave it as `False` unless you have a good reason.\n", "\n", "The data is returned in JSON format. The number of fields returned varies, but these will always be present:\n", "\n", "* `urlkey` – SURT formatted url (in the case of NZWA this is generated by the script rather than the archive)\n", "* `timestamp` – the date and time when the page was captured by the archive, in `YYYYMMDDHHmmss` format\n", "* `url` – the url of the page that was captured\n", "\n", "The AWA, IA, and UKWA Timemaps also include:\n", "\n", "* `status` – HTTP status code returned by the capture request\n", "* `mime` – the mimetype of the captured resource\n", "* `digest` – algorithmically generated string that uniquely identifies the contents of the captured reource\n", "\n", "For more information on the contents of these fields, see [Exploring the Internet Archive's CDX API](exploring_cdx_api.ipynb).\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# These are the repositories we'll be using\n", "TIMEGATES = {\n", " \"awa\": \"https://web.archive.org.au/awa/\",\n", " \"nzwa\": \"https://ndhadeliver.natlib.govt.nz/webarchive/\",\n", " \"ukwa\": \"https://www.webarchive.org.uk/wayback/en/archive/\",\n", " \"ia\": \"https://web.archive.org/web/\",\n", " \"ukgwa\": \"https://webarchive.nationalarchives.gov.uk/ukgwa/\"\n", "}\n", "\n", "\n", "def convert_lists_to_dicts(results):\n", " \"\"\"\n", " Converts IA style timemap (a JSON array of arrays) to a list of dictionaries.\n", " Renames keys to standardise IA with other Timemaps.\n", " \"\"\"\n", " if results:\n", " keys = results[0]\n", " results_as_dicts = [dict(zip(keys, v)) for v in results[1:]]\n", " else:\n", " results_as_dicts = results\n", " for d in results_as_dicts:\n", " d[\"status\"] = d.pop(\"statuscode\")\n", " d[\"mime\"] = d.pop(\"mimetype\")\n", " d[\"url\"] = 
d.pop(\"original\")\n", " return results_as_dicts\n", "\n", "\n", "def get_capture_data_from_memento(url, request_type=\"head\"):\n", " \"\"\"\n", " For OpenWayback systems this can get some extra capture info to insert into Timemaps.\n", " \"\"\"\n", " if request_type == \"head\":\n", " response = requests.head(url)\n", " else:\n", " response = requests.get(url)\n", " headers = response.headers\n", " length = headers.get(\"x-archive-orig-content-length\")\n", " status = headers.get(\"x-archive-orig-status\")\n", " status = status.split(\" \")[0] if status else None\n", " mime = headers.get(\"x-archive-orig-content-type\")\n", " mime = mime.split(\";\")[0] if mime else None\n", " return {\"length\": length, \"status\": status, \"mime\": mime}\n", "\n", "\n", "def convert_link_to_json(results, enrich_data=False):\n", " \"\"\"\n", " Converts link formatted Timemap to JSON.\n", " \"\"\"\n", " data = []\n", " for line in results.splitlines():\n", " parts = line.split(\"; \")\n", " if len(parts) > 1:\n", " link_type = re.search(\n", " r'rel=\"(original|self|timegate|first memento|last memento|memento)\"',\n", " parts[1],\n", " ).group(1)\n", " if link_type == \"memento\":\n", " link = parts[0].strip(\"<>\")\n", " timestamp, original = re.search(r\"/(\\d{12}|\\d{14})/(.*)$\", link).groups()\n", " capture = {\n", " \"urlkey\": surt(original),\n", " \"timestamp\": timestamp,\n", " \"url\": original,\n", " }\n", " if enrich_data:\n", " capture.update(get_capture_data_from_memento(link))\n", " print(capture)\n", " data.append(capture)\n", " return data\n", "\n", "\n", "def get_timemap_as_json(timegate, url, enrich_data=False):\n", " \"\"\"\n", " Get a Timemap then normalise results (if necessary) to return a list of dicts.\n", " \"\"\"\n", " tg_url = f\"{TIMEGATES[timegate]}timemap/json/{url}/\"\n", " response = requests.get(tg_url)\n", " response_type = response.headers[\"content-type\"]\n", " # print(response_type)\n", " if response_type == \"text/x-ndjson\":\n", " data = [json.loads(line) for line in response.text.splitlines()]\n", " elif response_type == \"application/json\":\n", " data = convert_lists_to_dicts(response.json())\n", " elif response_type in [\n", " \"application/link-format\",\n", " \"application/link-format;charset=ISO-8859-1\",\n", " \"text/html;charset=utf-8\",\n", " ]:\n", " data = convert_link_to_json(response.text, enrich_data=enrich_data)\n", " return data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Examples" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "382" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "t1 = get_timemap_as_json(\"ia\", \"http://discontents.com.au\")\n", "len(t1)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'urlkey': 'au,com,discontents)/',\n", " 'timestamp': '19981206012233',\n", " 'digest': 'FQJ6JMPIZ7WEKYPQ4SGPVHF57GCV6B36',\n", " 'redirect': '-',\n", " 'robotflags': '-',\n", " 'length': '1610',\n", " 'offset': '43993900',\n", " 'filename': 'green-0133-19990218235953-919455657-c/green-0141-912907270.arc.gz',\n", " 'status': '200',\n", " 'mime': 'text/html',\n", " 'url': 'http://www.discontents.com.au:80/'}" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# First -- results in date order\n", "t1[0]" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'urlkey': 
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Using the CDX API\n", "\n", "**Works with AWA, IA, NZWA, UKWA & UKGWA**\n", "\n", "The CDX APIs of the Internet Archive and PyWb-based systems such as the AWA, UKWA, UKGWA, and NZWA behave slightly differently. These differences are documented in [Comparing CDX APIs](comparing_cdx_apis.ipynb). The functions below smooth out some of these bumps and should return consistently formatted results from all five repositories.\n", "\n", "To get all the captures of a url in JSON format:\n", "\n", "``` python\n", "query_cdx([timegate], [url], [other optional parameters])\n", "```\n", "\n", "Required parameters:\n", "\n", "* `timegate` – one of 'ukwa' (UK Web Archive), 'ukgwa' (UK Government Web Archive), 'awa' (Australia), 'nzwa' (New Zealand), or 'ia' (Internet Archive)\n", "* `url` – the url you want to look for in the archive\n", "\n", "Supplying only these parameters is essentially the equivalent of asking for a Timemap (though when I [compared results](getting_all_snapshots_timemap_vs_cdx.ipynb), I found the CDX API included more duplicates). One advantage of the CDX API is that you can filter results by supplying additional parameters. These optional parameters can be anything the CDX APIs support, such as `from`, `to`, and `filter`. However, note that `from` is a reserved keyword in Python, so use `from_` instead. See below for some examples.\n", "\n", "The data is returned in JSON format. The number of fields returned varies, but these will always be present:\n", "\n", "* `urlkey` – SURT formatted url (in the case of NZWA this is generated by the script rather than the archive)\n", "* `timestamp` – the date and time when the page was captured by the archive, in `YYYYMMDDHHmmss` format\n", "* `url` – the url of the page that was captured\n", "* `status` – HTTP status code returned by the capture request\n", "* `mime` – the mimetype of the captured resource\n", "* `digest` – algorithmically generated string that uniquely identifies the contents of the captured resource\n" ] },
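{ "cell_type": "markdown", "metadata": {}, "source": [ "Under the hood, `query_cdx()` (defined below) just sends an HTTP GET request with a few query parameters. If you want to see what a raw request looks like, the cell below queries the IA's CDX endpoint directly. I've added a `limit` parameter to keep the response small -- this is a standard option in the IA's CDX API, but, as noted above, parameter support can vary between systems." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A raw CDX query -- query_cdx() below wraps requests like this\n", "params = {\"url\": \"http://discontents.com.au\", \"output\": \"json\", \"limit\": 5}\n", "response = requests.get(\"http://web.archive.org/cdx/search/cdx\", params=params)\n", "response.raise_for_status()\n", "\n", "# The IA returns a JSON array of arrays -- the first row contains the field names\n", "response.json()" ] },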
{ "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "APIS = {\n", "    \"ia\": {\"url\": \"http://web.archive.org/cdx/search/cdx\", \"type\": \"wb\"},\n", "    \"awa\": {\"url\": \"https://web.archive.org.au/awa/cdx\", \"type\": \"pywb\"},\n", "    \"nzwa\": {\n", "        \"url\": \"https://ndhadeliver.natlib.govt.nz/webarchive/cdx\",\n", "        \"type\": \"pywb\",\n", "    },\n", "    \"ukwa\": {\n", "        \"url\": \"https://www.webarchive.org.uk/wayback/archive/cdx\",\n", "        \"type\": \"pywb\",\n", "    },\n", "    \"ukgwa\": {\n", "        \"url\": \"https://webarchive.nationalarchives.gov.uk/ukgwa/cdx\",\n", "        \"type\": \"pywb\",\n", "    },\n", "}\n", "\n", "\n", "def normalise_filter(api, f):\n", "    \"\"\"\n", "    Normalise parameter names and regexp formatting across CDX systems.\n", "    \"\"\"\n", "    sys_type = APIS[api][\"type\"]\n", "    if sys_type == \"pywb\":\n", "        f = f.replace(\"mimetype:\", \"mime:\")\n", "        f = f.replace(\"statuscode:\", \"status:\")\n", "        f = f.replace(\"original:\", \"url:\")\n", "        f = re.sub(r\"^(!{0,1})(\\w)\", r\"\\1~\\2\", f)\n", "    elif sys_type == \"wb\":\n", "        f = f.replace(\"mime:\", \"mimetype:\")\n", "        f = f.replace(\"status:\", \"statuscode:\")\n", "        f = f.replace(\"url:\", \"original:\")\n", "    return f\n", "\n", "\n", "def normalise_filters(api, filters):\n", "    if isinstance(filters, list):\n", "        normalised = []\n", "        for f in filters:\n", "            normalised.append(normalise_filter(api, f))\n", "    else:\n", "        normalised = normalise_filter(api, filters)\n", "    return normalised\n", "\n", "\n", "def query_cdx(api, url, **kwargs):\n", "    params = kwargs\n", "    if \"filter\" in params:\n", "        params[\"filter\"] = normalise_filters(api, params[\"filter\"])\n", "    # CDX APIs accept a 'from' parameter, but this is a reserved word in Python.\n", "    # Use 'from_' to pass the value to the function & here we'll change it back to 'from'.\n", "    if \"from_\" in params:\n", "        params[\"from\"] = params[\"from_\"]\n", "        del params[\"from_\"]\n", "    params[\"url\"] = url\n", "    params[\"output\"] = \"json\"\n", "    response = requests.get(APIS[api][\"url\"], params=params)\n", "    response.raise_for_status()\n", "    response_type = response.headers[\"content-type\"].split(\";\")[0]\n", "    if response_type == \"text/x-ndjson\":\n", "        data = [json.loads(line) for line in response.text.splitlines()]\n", "    elif response_type == \"application/json\":\n", "        data = convert_lists_to_dicts(response.json())\n", "    else:\n", "        raise ValueError(f\"Unexpected CDX content type: {response_type}\")\n", "    return data" ] },
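{ "cell_type": "markdown", "metadata": {}, "source": [ "If you're wondering what `normalise_filters()` actually does to your filters, the cell below shows its output for both system types. The expected values in the comments are just what the function produces -- see [Comparing CDX APIs](comparing_cdx_apis.ipynb) for more on the underlying differences in filter syntax." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# PyWb systems (eg the AWA) use 'mime'/'status' and expect a '~' prefix\n", "# Expected output: ['~status:200', '~mime:text/html']\n", "print(normalise_filters(\"awa\", [\"statuscode:200\", \"mimetype:text/html\"]))\n", "\n", "# The IA's Wayback system uses 'mimetype'/'statuscode' with no prefix\n", "# Expected output: ['statuscode:200', 'mimetype:text/html']\n", "print(normalise_filters(\"ia\", [\"status:200\", \"mime:text/html\"]))" ] },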
"{'urlkey': 'au,com,discontents)/',\n", " 'timestamp': '19981206012233',\n", " 'digest': 'FQJ6JMPIZ7WEKYPQ4SGPVHF57GCV6B36',\n", " 'length': '1610',\n", " 'status': '200',\n", " 'mime': 'text/html',\n", " 'url': 'http://www.discontents.com.au:80/'}" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# First result\n", "d1[0]" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'urlkey': 'au,com,discontents)/',\n", " 'timestamp': '20230318003745',\n", " 'digest': 'LK7AWVZ7UN745CBGJNEVA3QJMKLJ4N4V',\n", " 'length': '652',\n", " 'status': '-',\n", " 'mime': 'warc/revisit',\n", " 'url': 'http://discontents.com.au/'}" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Last result -- note that the results are in date order, so this is the most recent\n", "d1[-1]" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "330" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Filter by status code - note the number of results decreases\n", "d2 = query_cdx(\"ia\", \"http://discontents.com.au\", filter=\"status:200\")\n", "len(d2)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "25" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Filter by date range using from_ and to\n", "d3 = query_cdx(\"ia\", \"http://discontents.com.au\", from_=\"2005\", to=\"2006\")\n", "len(d3)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'urlkey': 'au,com,discontents)/',\n", " 'timestamp': '20050209204432',\n", " 'digest': 'IWLJRLZLB7WBQNHYTVXJGD7TTARRGAXM',\n", " 'length': '1024',\n", " 'status': '200',\n", " 'mime': 'text/html',\n", " 'url': 'http://www.discontents.com.au:80/'}" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# First result should be from 2005\n", "d3[0]" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'urlkey': 'au,com,discontents)/',\n", " 'timestamp': '20061205043957',\n", " 'digest': 'QGCDU54UYAOMFBTZKGOV27NGYAFE27HZ',\n", " 'length': '1122',\n", " 'status': '200',\n", " 'mime': 'text/html',\n", " 'url': 'http://discontents.com.au:80/'}" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Last result should be from 2006\n", "d3[-1]" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "157" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Same as d1, except from AWA\n", "d4 = query_cdx(\"awa\", \"http://discontents.com.au\")\n", "len(d4)" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "720" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# And with the UKGWA\n", "d5 = query_cdx(\"ukgwa\", \"http://www.mod.uk/\")\n", "len(d5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "Created by [Tim Sherratt](https://timsherratt.org) for the [GLAM Workbench](https://glam-workbench.github.io). 
Support me by becoming a [GitHub sponsor](https://github.com/sponsors/wragge)!\n", "\n", "Work on this notebook was supported by the [IIPC Discretionary Funding Programme 2019-2020](http://netpreserve.org/projects/).\n", "\n", "The Web Archives section of the GLAM Workbench is sponsored by the [British Library](https://www.bl.uk/)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" } }, "nbformat": 4, "nbformat_minor": 4 }