{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Comparing CDX APIs\n", "\n", "
New to Jupyter notebooks? Try Using Jupyter notebooks for a quick introduction.
\n", "\n", "This notebook documents differences between the Internet Archive's CDX API and the CDX API available from PyWb systems such as the UK Web Archive and the National Library of Australia.\n", "\n", "For more details on the data available from the CDX APIs see [Exploring the Internet Archive's CDX API](exploring_cdx_api.ipynb).\n", "\n", "For examples using CDX APIs to harvest capture data see:\n", "\n", "* [Find all the archived versions of a web page](find_all_captures.ipynb)\n", "* [Harvesting data about a domain using the IA CDX API](harvesting_domain_data.ipynb)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Documentation\n", "\n", "* [Wayback CDX API](https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server)\n", "* [PyWb CDXJ Server API](https://pywb.readthedocs.io/en/latest/manual/cdxserver_api.html)\n", "* [PyWb indexes](https://pywb.readthedocs.io/en/latest/manual/indexing.html)\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import json\n", "import re\n", "\n", "import pandas as pd\n", "import pytest\n", "import requests" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "APIS = {\n", " \"ia\": {\"url\": \"http://web.archive.org/cdx/search/cdx\", \"type\": \"wb\"},\n", " \"nla\": {\"url\": \"https://web.archive.org.au/awa/cdx\", \"type\": \"pywb\"},\n", " \"bl\": {\"url\": \"https://www.webarchive.org.uk/wayback/archive/cdx\", \"type\": \"pywb\"},\n", " \"nlnz\": {\n", " \"url\": \"https://ndhadeliver.natlib.govt.nz/webarchive/cdx\",\n", " \"type\": \"pywb\",\n", " },\n", "}\n", "\n", "\n", "def raw_cdx_query(api, url, **kwargs):\n", " params = kwargs\n", " params[\"url\"] = url\n", " params[\"output\"] = \"json\"\n", " response = requests.get(APIS[api][\"url\"], params=params, timeout=60)\n", " response.raise_for_status()\n", " return response" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Differences 
between PyWb and IA Wayback" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### JSON results format\n", "\n", "As with Timemaps, requesting `json` formatted results from IA and PyWb CDX servers returns different data structures. IA results are an array of arrays, with the field labels in the first array. PyWb results are formatted as NDJSON (Newline Delimited JSON) – each capture is a JSON object, separated by a line break." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Internet Archive (Wayback)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[['urlkey',\n", " 'timestamp',\n", " 'original',\n", " 'mimetype',\n", " 'statuscode',\n", " 'digest',\n", " 'length'],\n", " ['au,com,discontents)/',\n", " '19981206012233',\n", " 'http://www.discontents.com.au:80/',\n", " 'text/html',\n", " '200',\n", " 'FQJ6JMPIZ7WEKYPQ4SGPVHF57GCV6B36',\n", " '1610']]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "raw_cdx_query(\"ia\", \"discontents.com.au\", limit=1, format=\"json\").json()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### NLA (PyWb)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'urlkey': 'au,com,discontents)/',\n", " 'timestamp': '19981206012233',\n", " 'url': 'http://www.discontents.com.au/',\n", " 'mime': 'text/html',\n", " 'status': '200',\n", " 'digest': 'FQJ6JMPIZ7WEKYPQ4SGPVHF57GCV6B36',\n", " 'offset': '59442416',\n", " 'filename': 'NLA-EXTRACTION-1996-2004-ARCS-PART-00309-000001.arc.gz',\n", " 'length': '1610',\n", " 'source': 'awa',\n", " 'source-coll': 'awa'}" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "json.loads(raw_cdx_query(\"nla\", \"discontents.com.au\", limit=1, format=\"json\").text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "### Field labels\n", "\n", "As 
with Timemaps, some of the field labels are different between the two systems:\n", "\n", "|IA|PyWb|\n", "|---|---|\n", "|`original`|`url`|\n", "|`statuscode`|`status`|\n", "|`mimetype`|`mime`|" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Internet Archive (Wayback)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['urlkey',\n", " 'timestamp',\n", " 'original',\n", " 'mimetype',\n", " 'statuscode',\n", " 'digest',\n", " 'length']" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "raw_cdx_query(\"ia\", \"discontents.com.au\", limit=1, format=\"json\").json()[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### NLA (PyWb)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['urlkey',\n", " 'timestamp',\n", " 'url',\n", " 'mime',\n", " 'status',\n", " 'digest',\n", " 'offset',\n", " 'filename',\n", " 'length',\n", " 'source',\n", " 'source-coll']" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(\n", " json.loads(\n", " raw_cdx_query(\"nla\", \"discontents.com.au\", limit=1, format=\"json\").text\n", " ).keys()\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### NLNZ (PyWb)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['urlkey',\n", " 'timestamp',\n", " 'url',\n", " 'mime',\n", " 'status',\n", " 'digest',\n", " 'redirect',\n", " 'robotflags',\n", " 'length',\n", " 'offset',\n", " 'filename',\n", " 'load_url',\n", " 'source',\n", " 'source-coll']" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(\n", " json.loads(\n", " raw_cdx_query(\"nlnz\", \"http://digitalnz.org\", limit=1, format=\"json\").text\n", " ).keys()\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### UKWA 
(PyWb)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['urlkey',\n", " 'timestamp',\n", " 'url',\n", " 'mime',\n", " 'status',\n", " 'digest',\n", " 'redirect',\n", " 'robotflags',\n", " 'length',\n", " 'offset',\n", " 'filename',\n", " 'load_url',\n", " 'source',\n", " 'source-coll',\n", " 'access']" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(\n", " json.loads(\n", " raw_cdx_query(\n", " \"bl\", \"anjackson.net\", filter=\"status:200\", limit=1, format=\"json\"\n", " ).text\n", " ).keys()\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "### Match types\n", "\n", "From the documentation it seems that you should be able to supply a `matchType` or use url wildcards on both systems. But there seem to be some inconsistencies. In summary, based on the tests below:\n", "\n", "* prefix queries work with NLA, UKWA, and NLNZ, using either the url wildcard or the `matchType` parameter\n", "* domain queries work with UKWA and NLNZ, but do not work with NLA" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### NLA (PyWb)\n", "\n", "Prefix queries work as expected; domain queries do not work." 
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "125" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Look for an exact url\n", "exact = len(\n", " raw_cdx_query(\n", " \"nla\", \"http://discontents.com.au\", filter=\"status:200\", format=\"json\"\n", " ).text.splitlines()\n", ")\n", "exact" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "39052" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Prefix query using url wildcard works as expected\n", "prefix_url = len(\n", " raw_cdx_query(\n", " \"nla\", \"http://discontents.com.au*\", filter=\"status:200\", format=\"json\"\n", " ).text.splitlines()\n", ")\n", "prefix_url" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "39052" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Prefix query using matchType=prefix works as expected\n", "prefix_match = len(\n", " raw_cdx_query(\n", " \"nla\",\n", " \"http://discontents.com.au\",\n", " filter=\"status:200\",\n", " format=\"json\",\n", " matchType=\"prefix\",\n", " ).text.splitlines()\n", ")\n", "prefix_match" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "# Domain query using url wildcard causes exception\n", "# This test passes if there is an HTTPError exception\n", "with pytest.raises(requests.exceptions.HTTPError):\n", " raw_cdx_query(\n", " \"nla\", \"*.discontents.com.au\", filter=\"status:200\", format=\"json\"\n", " ).text.splitlines()" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "# Domain query using matchType parameter causes exception\n", "# This test passes if there is an HTTPError exception\n", "with pytest.raises(requests.exceptions.HTTPError):\n", " 
raw_cdx_query(\n", " \"nla\", \"discontents.com.au\", filter=\"status:200\", format=\"json\", matchType=\"domain\"\n", " ).text.splitlines()" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "# Test the results\n", "assert isinstance(exact, int) is True\n", "assert isinstance(prefix_url, int) is True\n", "assert isinstance(prefix_match, int) is True\n", "assert prefix_url > exact\n", "assert prefix_url == prefix_match" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### UKWA (PyWb)\n", "\n", "Domain and prefix queries work as expected." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "40" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Look for an exact url\n", "exact = len(\n", " raw_cdx_query(\n", " \"bl\", \"anjackson.net\", filter=\"status:200\", format=\"json\"\n", " ).text.splitlines()\n", ")\n", "exact" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "24072" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Prefix query using url wildcard works as expected\n", "prefix_url = len(\n", " raw_cdx_query(\n", " \"bl\", \"http://anjackson.net/*\", filter=\"status:200\", format=\"json\"\n", " ).text.splitlines()\n", ")\n", "prefix_url" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "24072" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Prefix query using matchType prefix works as expected\n", "prefix_match = len(\n", " raw_cdx_query(\n", " \"bl\",\n", " \"http://anjackson.net\",\n", " filter=\"status:200\",\n", " format=\"json\",\n", " matchType=\"prefix\",\n", " ).text.splitlines()\n", ")\n", "prefix_match" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { 
"data": { "text/plain": [ "37117" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Domain query using url wildcard works as expected\n", "domain_url = len(\n", " raw_cdx_query(\n", " \"bl\", \"*.anjackson.net\", filter=\"status:200\", format=\"json\"\n", " ).text.splitlines()\n", ")\n", "domain_url" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "37117" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Domain query using matchType parameter works as expected\n", "domain_match = len(\n", " raw_cdx_query(\n", " \"bl\", \"anjackson.net\", filter=\"status:200\", format=\"json\", matchType=\"domain\"\n", " ).text.splitlines()\n", ")\n", "domain_match" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "# Test the results\n", "assert isinstance(exact, int) is True\n", "assert isinstance(prefix_url, int) is True\n", "assert isinstance(prefix_match, int) is True\n", "assert isinstance(domain_url, int) is True\n", "assert isinstance(domain_match, int) is True\n", "assert prefix_url > exact\n", "assert prefix_url == prefix_match\n", "assert domain_url > exact\n", "assert domain_url > prefix_url\n", "assert domain_url == domain_match" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### NLNZ (pywb)\n", "\n", "Domain and prefix queries work as expected." 
] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "56" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Look for an exact url\n", "exact = len(\n", " raw_cdx_query(\n", " \"nlnz\", \"http://digitalnz.org/\", filter=\"status:200\", format=\"json\"\n", " ).text.splitlines()\n", ")\n", "exact" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "16718" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Prefix query using url wildcard works as expected\n", "prefix_url = len(\n", " raw_cdx_query(\n", " \"nlnz\", \"http://digitalnz.org/*\", filter=\"status:200\", format=\"json\"\n", " ).text.splitlines()\n", ")\n", "prefix_url" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "16718" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Prefix query using matchType prefix works as expected\n", "prefix_match = len(\n", " raw_cdx_query(\n", " \"nlnz\",\n", " \"http://digitalnz.org/\",\n", " filter=\"status:200\",\n", " format=\"json\",\n", " matchType=\"prefix\",\n", " ).text.splitlines()\n", ")\n", "prefix_match" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "72178" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Domain query using url wildcard works as expected\n", "domain_url = len(\n", " raw_cdx_query(\n", " \"nlnz\", \"*.digitalnz.org\", filter=\"status:200\", format=\"json\"\n", " ).text.splitlines()\n", ")\n", "domain_url" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "72178" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Domain query using 
matchType parameter works as expected\n", "domain_match = len(\n", " raw_cdx_query(\n", " \"nlnz\", \"digitalnz.org\", filter=\"status:200\", format=\"json\", matchType=\"domain\"\n", " ).text.splitlines()\n", ")\n", "domain_match" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "# Test the results\n", "assert isinstance(exact, int) is True\n", "assert isinstance(prefix_url, int) is True\n", "assert isinstance(prefix_match, int) is True\n", "assert isinstance(domain_url, int) is True\n", "assert isinstance(domain_match, int) is True\n", "assert prefix_url > exact\n", "assert prefix_url == prefix_match\n", "assert domain_url > exact\n", "assert domain_url > prefix_url\n", "assert domain_url == domain_match" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "### Collapse\n", "\n", "PyWb doesn't support the `collapse` parameter. So if you want to remove duplicates, you'll need to use something like Pandas `.drop_duplicates()` after the results have arrived. 
However, `collapse` only works on adjacent index entries, so if you need strictly unique values, you'll probably want to run `.drop_duplicates()` on the IA results anyway." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Internet Archive (Wayback)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "351" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Without collapse -- total number of results (subtract one for the header row)\n", "complete = len(raw_cdx_query(\"ia\", \"discontents.com.au\", format=\"json\").json()) - 1\n", "complete" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# With collapse -- should only be one result as we're collapsing on urlkey and searching for an exact url\n", "collapsed = (\n", " len(\n", " raw_cdx_query(\n", " \"ia\", \"discontents.com.au\", format=\"json\", collapse=\"urlkey\"\n", " ).json()\n", " )\n", " - 1\n", ")\n", "collapsed" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "# Test expected results\n", "assert complete > collapsed\n", "assert collapsed == 1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### UKWA (PyWb)" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "85" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Without collapse\n", "complete = len(raw_cdx_query(\"bl\", \"anjackson.net\", format=\"json\").text.splitlines())\n", "complete" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "85" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# With collapse\n", "collapsed = len(\n", 
" raw_cdx_query(\n", " \"bl\", \"anjackson.net\", collapse=\"urlkey\", format=\"json\"\n", " ).text.splitlines()\n", ")\n", "collapsed" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "# Test expected results\n", "# Collapse has done nothing\n", "assert complete == collapsed" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "De-duplicate results using Pandas." ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = [\n", " json.loads(line)\n", " for line in raw_cdx_query(\n", " \"bl\", \"anjackson.net\", fields=\"urlkey\", format=\"json\"\n", " ).text.splitlines()\n", "]\n", "df = pd.DataFrame(data).drop_duplicates(subset=[\"urlkey\"])\n", "deduped = df.shape[0]\n", "deduped" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": [ "# Test expected results\n", "assert deduped == 1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "### Sort and Closest\n", "\n", "IA doesn't support `sort` or the `closest` parameter. To implement something similar, I suppose you could use `from` and `to` to set a window around a date, and then process the results to calculate time deltas and sort by 'closeness'." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "### Limiting fields\n", "\n", "The parameter used for limiting the fields returned from a query used to be different, but this has changed in recent PyWb releases. The IA server expects `fl`, while PyWb expects either `fields` or `fl`. So for cross-compatibility, use `fl`." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### NLA (PyWb)" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'urlkey': 'au,com,discontents)/'}" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "use_fl = json.loads(\n", " raw_cdx_query(\"nla\", \"discontents.com.au\", limit=1, fl=\"urlkey\", format=\"json\").text\n", ")\n", "use_fl" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'urlkey': 'au,com,discontents)/'}" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "use_fields = json.loads(\n", " raw_cdx_query(\n", " \"nla\", \"discontents.com.au\", limit=1, fields=\"urlkey\", format=\"json\"\n", " ).text\n", ")\n", "use_fields" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [], "source": [ "# Test expected results\n", "assert use_fl == use_fields" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### IA (Wayback)" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[['urlkey'], ['au,com,discontents)/']]" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "use_fl_ia = json.loads(\n", " raw_cdx_query(\"ia\", \"discontents.com.au\", limit=1, fl=\"urlkey\", format=\"json\").text\n", ")\n", "use_fl_ia" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [], "source": [ "# Test expected results\n", "assert use_fl_ia[1][0] == \"au,com,discontents)/\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "### Comparison operators in filters\n", "\n", "This seems to create the most potential for confusion. In PyWb, the `filter` parameter uses a number of different operators to indicate the type of match required. IA only uses `!`. 
There's no way of indicating a query should be treated as a regular expression in IA, therefore, all queries are treated as regular expressions.\n", "\n", "| Operator | Example | Result |\n", "|---|---|---|\n", "| no operator | `filter=mime:html` | `mime` field contains 'html'|\n", "| `=` | `filter==mime:text/html` | `mime` field matches 'text/html' exactly |\n", "| `~` | `filter=~status:30\\d{1}` | `status` field matches any 3 digit code starting with 30|\n", "| `!` | `filter=!mime:html` | `mime` field doesn't contain 'html' |\n", "| `!=` | `filter=!=mime:text/html` | `mime` field doesn't match 'text/html' exactly |\n", "| `!~` | `filter=!~status:30\\d{1}` | `status` field doesn't match any 3 digit codes starting with 30 |\n", "\n", "IA filter queries look for an exact match (which could be a regular expression) by default. This can be negated by using the `!` operator.\n", "\n", "| Operator | Example | Result |\n", "|---|---|---|\n", "| no operator | `filter=mimetype:text/html` | `mimetype` field matches 'text/html'|\n", "| `!` | `filter=!mimetype:text/html` | `mimetype` field doesn't match 'text/html' exactly |\n", "\n", "In IA you need to use a regular expression to find a field containing a particular value. So these two expressions should result in the same matching behaviour:\n", "\n", "| PyWb | IA |\n", "|---|---|\n", "|`filter=mime:powerpoint`|`filter=mimetype:.*powerpoint.*`|\n", "\n", "For interoperability, it seems easiest to always use regular expressions, inserting the `~` operator for PyWb systems. 
So: \n", "\n", "| PyWb | IA |\n", "|---|---|\n", "|`filter=~mime:.*powerpoint.*`|`filter=mimetype:.*powerpoint.*`|\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Internet Archive (Wayback)" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Filters are treated as exact matches by default\n", "ia_exact = len(\n", " raw_cdx_query(\n", " \"ia\",\n", " \"defence.gov.au/*\",\n", " filter=\"mimetype:powerpoint\",\n", " format=\"json\",\n", " collapse=\"urlkey\",\n", " ).json()\n", ")\n", "ia_exact" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "231" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Using regex finds results including 'powerpoint' in mimetype\n", "ia_regex = (\n", " len(\n", " raw_cdx_query(\n", " \"ia\",\n", " \"defence.gov.au/*\",\n", " filter=\"mimetype:.*powerpoint.*\",\n", " format=\"json\",\n", " collapse=\"urlkey\",\n", " ).json()\n", " )\n", " - 1\n", ")\n", "ia_regex" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [], "source": [ "# Test expected results\n", "assert ia_regex > ia_exact" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### NLA (PyWb)" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "177" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Filter values are treated as regex by default\n", "nla_exact = len(\n", " raw_cdx_query(\n", " \"nla\", \"defence.gov.au/*\", filter=\"mime:powerpoint\", format=\"json\"\n", " ).text.splitlines()\n", ")\n", "nla_exact" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "177" ] }, "execution_count": 44, "metadata": 
{}, "output_type": "execute_result" } ], "source": [ "# Explicitly use regex\n", "nla_regex = len(\n", " raw_cdx_query(\n", " \"nla\", \"defence.gov.au/*\", filter=\"~mime:.*powerpoint.*\", format=\"json\"\n", " ).text.splitlines()\n", ")\n", "nla_regex" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [], "source": [ "# Test expected results\n", "assert nla_exact == nla_regex" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "### Pagination\n", "\n", "Both IA and PyWb *can* support pagination of results; however, it's not available by default in PyWb. It's only available if repositories are [using ZipNum indexes](https://pywb.readthedocs.io/en/latest/manual/cdxserver_api.html#pagination-api). None of the UKWA, National Library of Australia, or National Library of New Zealand CDX APIs support pagination. This means that queries to these systems will return **all** matching results in one hit (unless there is a system-defined limit). This is something to bear in mind, as large requests might be slow and prone to breakage." 
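, "\n", "One possible workaround, sketched here as an assumption rather than a documented feature of these repositories, is to split a large harvest into smaller date windows with the `from` and `to` parameters and request each window separately. The `yearly_windows` helper is hypothetical, and whether a given server honours `from` and `to` should be checked before relying on it.

```python
def yearly_windows(start_year, end_year):
    '''Return one (from, to) pair of 14-digit timestamps per calendar year.'''
    return [
        (f'{year}0101000000', f'{year}1231235959')
        for year in range(start_year, end_year + 1)
    ]


# Hypothetical example: three windows covering 1998 to 2000
windows = yearly_windows(1998, 2000)
for start, end in windows:
    print(start, end)
```

Because `from` is a reserved word in Python, the values would need to be passed to `raw_cdx_query` as an unpacked dictionary, for example `**{'from': start, 'to': end}`."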
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Internet Archive (Wayback)" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'1\\n'" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ia_pages = raw_cdx_query(\n", " \"ia\", \"discontents.com.au\", showNumPages=\"true\", format=\"json\"\n", ").text\n", "ia_pages" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### NLA (PyWb)" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'urlkey': 'au,com,discontents)/',\n", " 'timestamp': '19981206012233',\n", " 'url': 'http://www.discontents.com.au/',\n", " 'mime': 'text/html',\n", " 'status': '200',\n", " 'digest': 'FQJ6JMPIZ7WEKYPQ4SGPVHF57GCV6B36',\n", " 'offset': '59442416',\n", " 'filename': 'NLA-EXTRACTION-1996-2004-ARCS-PART-00309-000001.arc.gz',\n", " 'length': '1610',\n", " 'source': 'awa',\n", " 'source-coll': 'awa'}" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# NLA CDX server just ignores the showNumPages parameter and performs the query as normal\n", "nla_pages = json.loads(\n", " raw_cdx_query(\n", " \"nla\", \"discontents.com.au\", showNumPages=\"true\", format=\"json\"\n", " ).text.splitlines()[0]\n", ")\n", "nla_pages" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### NLNZ (PyWb)" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'urlkey': 'org,digitalnz)/',\n", " 'timestamp': '20090129060149',\n", " 'url': 'http://www.digitalnz.org/',\n", " 'mime': 'text/html',\n", " 'status': '200',\n", " 'digest': '3CTAFWGHTJMGYCHECAFS4HKHPXIZOMWO',\n", " 'redirect': '-',\n", " 'robotflags': '-',\n", " 'length': '0',\n", " 'offset': '6208429',\n", " 'filename': 'V1-FL994870.arc',\n", " 'load_url': 
'http://10.4.1.66:80/nlnzwebarchive_PROD/ap/20090129060149id_/http://www.digitalnz.org/',\n", " 'source': 'webarchive',\n", " 'source-coll': 'webarchive'}" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# NLNZ CDX server just ignores the showNumPages parameter and performs the query as normal\n", "nlnz_pages = json.loads(\n", " raw_cdx_query(\n", " \"nlnz\", \"digitalnz.org\", showNumPages=\"true\", format=\"json\"\n", " ).text.splitlines()[0]\n", ")\n", "nlnz_pages" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [], "source": [ "# Test expected results\n", "assert ia_pages.strip().isnumeric()\n", "assert isinstance(nla_pages, dict)\n", "assert isinstance(nlnz_pages, dict)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "### Fuzzy matching\n", "\n", "If your query to a PyWb CDX API returns no matches, the system will use regular expressions to broaden your search and return a set of 'fuzzy' matches. These results will include an `is_fuzzy` field set to a value of `1`. This is not supported in IA.\n", "\n", "While fuzzy matching is useful for discovery, it might not be what you want if you're assembling a specific dataset. In this case you'd need to filter the results to remove the `is_fuzzy` matches." 
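, "\n", "A minimal sketch of that filtering step, assuming the captures have already been parsed from NDJSON into a list of dictionaries. The `drop_fuzzy` helper and the sample records are hypothetical and only illustrate the idea.

```python
def drop_fuzzy(captures):
    '''Return only the captures that are not flagged as fuzzy matches.'''
    # Fuzzy results carry is_fuzzy set to '1'; records without the flag are kept
    return [c for c in captures if str(c.get('is_fuzzy', '0')) != '1']


# Made-up sample records, shaped like parsed PyWb results
sample = [
    {'urlkey': 'au,com,discontents)/', 'status': '200', 'is_fuzzy': '1'},
    {'urlkey': 'au,com,discontents)/', 'status': '200'},
]
exact_only = drop_fuzzy(sample)
print(len(exact_only))
```

Filtering after the results arrive keeps the query parameters identical across repositories, at the cost of downloading rows that are then discarded."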
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Internet Archive (Wayback)" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[]" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# This should return no results\n", "ia_not_fuzzy = raw_cdx_query(\n", " \"ia\", \"discontents.com.au\", limit=1, filter=\"statuscode:666\", format=\"json\"\n", ").json()\n", "\n", "# Test expected result\n", "assert ia_not_fuzzy == []\n", "ia_not_fuzzy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### NLA (PyWb)" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'urlkey': 'au,com,discontents)/',\n", " 'timestamp': '19981206012233',\n", " 'url': 'http://www.discontents.com.au/',\n", " 'mime': 'text/html',\n", " 'status': '200',\n", " 'digest': 'FQJ6JMPIZ7WEKYPQ4SGPVHF57GCV6B36',\n", " 'offset': '59442416',\n", " 'filename': 'NLA-EXTRACTION-1996-2004-ARCS-PART-00309-000001.arc.gz',\n", " 'length': '1610',\n", " 'source': 'awa',\n", " 'source-coll': 'awa',\n", " 'is_fuzzy': '1'}" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# This would return no results except for fuzzy matching\n", "# Note the status value in the result and the 'is_fuzzy' field\n", "nla_fuzzy = json.loads(\n", " raw_cdx_query(\n", " \"nla\", \"discontents.com.au\", limit=1, filter=\"status:666\", format=\"json\"\n", " ).text\n", ")\n", "\n", "# Test expected result\n", "assert nla_fuzzy[\"is_fuzzy\"] == \"1\"\n", "nla_fuzzy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Normalising queries\n", "\n", "It would be possible to wrap some code around queries that simulated `collapse` and `closest` across the two systems, but for the moment I'll just focus on some basic normalisation of query parameters and results. 
The functions below:\n", "\n", "* Normalise field names in queries and results\n", "* Convert results into a list of dictionaries" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [], "source": [ "def normalise_filter(api, f):\n", " \"\"\"\n", " Standardise field names in filters.\n", " \"\"\"\n", " sys_type = APIS[api][\"type\"]\n", " if sys_type == \"pywb\":\n", " f = f.replace(\"mimetype:\", \"mime:\")\n", " f = f.replace(\"statuscode:\", \"status:\")\n", " f = f.replace(\"original:\", \"url:\")\n", " f = re.sub(r\"^(!{0,1})(\\w)\", r\"\\1~\\2\", f)\n", " elif sys_type == \"wb\":\n", " f = f.replace(\"mime:\", \"mimetype:\")\n", " f = f.replace(\"status:\", \"statuscode:\")\n", " f = f.replace(\"url:\", \"original:\")\n", " return f\n", "\n", "\n", "def normalise_filters(api, filters):\n", " \"\"\"\n", " Standardise field names in filters.\n", " \"\"\"\n", " if isinstance(filters, list):\n", " normalised = []\n", " for f in filters:\n", " normalised.append(normalise_filter(api, f))\n", " else:\n", " normalised = normalise_filter(api, filters)\n", " return normalised\n", "\n", "\n", "def convert_lists_to_dicts(results):\n", " \"\"\"\n", " Converts IA style timemap (a JSON array of arrays) to a list of dictionaries.\n", " Renames keys to standardise IA with other Timemaps.\n", " \"\"\"\n", " if results:\n", " keys = results[0]\n", " results_as_dicts = [dict(zip(keys, v)) for v in results[1:]]\n", " else:\n", " results_as_dicts = results\n", " for d in results_as_dicts:\n", " d[\"status\"] = d.pop(\"statuscode\")\n", " d[\"mime\"] = d.pop(\"mimetype\")\n", " d[\"url\"] = d.pop(\"original\")\n", " return results_as_dicts\n", "\n", "\n", "def query_cdx(api, url, **kwargs):\n", " \"\"\"\n", " Make a request to a CDX API, normalising filters and responses across Wayback & PyWb systems.\n", " \"\"\"\n", " params = kwargs\n", " if \"filter\" in params:\n", " params[\"filter\"] = normalise_filters(api, params[\"filter\"])\n", " params[\"url\"] 
= url\n", " params[\"output\"] = \"json\"\n", " response = requests.get(APIS[api][\"url\"], params=params, timeout=60)\n", " # print(response.url)\n", " response.raise_for_status()\n", " response_type = response.headers[\"content-type\"].split(\";\")[0]\n", " # print(response_type)\n", " if response_type == \"text/x-ndjson\":\n", " data = [json.loads(line) for line in response.text.splitlines()]\n", " elif response_type == \"application/json\":\n", " data = convert_lists_to_dicts(response.json())\n", " else:\n", " raise ValueError(f\"Unexpected content type: {response_type}\")\n", " return data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here are some examples. Note that the parameters and their values are unchanged; you can just switch repositories." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Internet Archive (Wayback)" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'urlkey': 'au,gov,defence)/28sqn/ad097.pdf',\n", " 'timestamp': '20140304175138',\n", " 'digest': 'AQBSAVSJJYOYKKLW7GM36PDCYDREFQXA',\n", " 'length': '141731',\n", " 'status': '200',\n", " 'mime': 'application/pdf',\n", " 'url': 'http://www.defence.gov.au/28sqn/AD097.pdf'}]" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ia_normalised1 = query_cdx(\n", " \"ia\", \"defence.gov.au/*\", filter=[\"mime:.*pdf\", \"status:200\"], limit=1\n", ")\n", "ia_normalised1" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'urlkey': 'au,gov,defence)/28sqn/ad097.pdf',\n", " 'timestamp': '20140304175138',\n", " 'digest': 'AQBSAVSJJYOYKKLW7GM36PDCYDREFQXA',\n", " 'length': '141731',\n", " 'status': '200',\n", " 'mime': 'application/pdf',\n", " 'url': 'http://www.defence.gov.au/28sqn/AD097.pdf'}]" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ia_normalised2 = query_cdx(\n", " \"ia\", \"defence.gov.au/*\", filter=[\"mimetype:.*pdf\", \"status:200\"], 
limit=1\n", ")\n", "ia_normalised2" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [], "source": [ "# Test expected results\n", "assert ia_normalised1 == ia_normalised2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### NLA (PyWb)" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'urlkey': 'au,gov,defence)/',\n", " 'timestamp': '19981202111842',\n", " 'url': 'http://www.defence.gov.au/',\n", " 'mime': 'text/html',\n", " 'status': '200',\n", " 'digest': 'ERQQ3XVKGL4VFGI4KXIPE24QI7YMW4Z6',\n", " 'offset': '8871025',\n", " 'filename': 'NLA-EXTRACTION-1996-2004-ARCS-PART-00307-000001.arc.gz',\n", " 'length': '4038',\n", " 'source': 'awa',\n", " 'source-coll': 'awa',\n", " 'is_fuzzy': '1'}]" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nla_norm = query_cdx(\n", " \"nla\",\n", " \"defence.gov.au\",\n", " filter=[\"mimetype:.*pdf\", \"status:200\"],\n", " matchType=\"prefix\",\n", " limit=1,\n", ")\n", "nla_norm" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [], "source": [ "# Test expected results\n", "assert \"mime\" in nla_norm[0].keys()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "Created by [Tim Sherratt](https://timsherratt.org) for the [GLAM Workbench](https://glam-workbench.github.io). 
Support me by becoming a [GitHub sponsor](https://github.com/sponsors/wragge)!\n", "\n", "Work on this notebook was supported by the [IIPC Discretionary Funding Programme 2019-2020](http://netpreserve.org/projects/)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.12" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 4 }