{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Getting all available snapshots of a particular page from the Internet Archive – Timemap or CDX?\n", "\n", "
New to Jupyter notebooks? Try Using Jupyter notebooks for a quick introduction.
\n", "\n", "There are a couple of ways of getting a list of the available snapshots for a particular url. In this notebook, we'll compare the Internet Archive's CDX index API with their Memento Timemap API. Do they give us the same data?\n", "\n", "See [Exploring the Internet Archive's CDX API](exploring_cdx_api.ipynb) for more information about the CDX API." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import requests" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Get the data for comparison" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "def query_timemap(url):\n", "    \"\"\"\n", "    Get a Timemap in JSON format for the specified url.\n", "    \"\"\"\n", "    response = requests.get(\n", "        f\"https://web.archive.org/web/timemap/json/{url}\", headers={\"User-Agent\": \"\"}\n", "    )\n", "    response.raise_for_status()\n", "    return response.json()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "def query_cdx(url, **kwargs):\n", "    \"\"\"\n", "    Query the IA CDX API for the supplied url.\n", "    You can optionally provide any of the parameters accepted by the API.\n", "    \"\"\"\n", "    params = kwargs\n", "    params[\"url\"] = url\n", "    params[\"output\"] = \"json\"\n", "    # User-Agent value is necessary or else IA gives an error\n", "    response = requests.get(\n", "        \"http://web.archive.org/cdx/search/cdx\",\n", "        params=params,\n", "        headers={\"User-Agent\": \"\"},\n", "    )\n", "    response.raise_for_status()\n", "    return response.json()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "url = \"http://nla.gov.au\"\n", "tm_data = query_timemap(url)\n", "tm_df = pd.DataFrame(tm_data[1:], columns=tm_data[0])\n", "cdx_data = query_cdx(url)\n", "cdx_df = pd.DataFrame(cdx_data[1:], columns=cdx_data[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Are 
the columns the same?" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['urlkey',\n", " 'timestamp',\n", " 'original',\n", " 'mimetype',\n", " 'statuscode',\n", " 'digest',\n", " 'length']" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(cdx_df.columns)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['urlkey',\n", " 'timestamp',\n", " 'original',\n", " 'mimetype',\n", " 'statuscode',\n", " 'digest',\n", " 'redirect',\n", " 'robotflags',\n", " 'length',\n", " 'offset',\n", " 'filename']" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(tm_df.columns)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The Timemap data includes four extra columns: `redirect`, `robotflags`, `offset`, and `filename`. The `offset` and `filename` columns tell you where to find the snapshot, but I'm not sure what `robotflags` is for (it's not in the [specification](http://iipc.github.io/warc-specifications/specifications/cdx-format/cdx-2015/)). Let's have a look at what sort of values it has." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "- 4404\n", "Name: robotflags, dtype: int64" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tm_df[\"robotflags\"].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Every one of the 4,404 values is `-`, so there's nothing in it – at least for this particular url.\n", "\n", "For my purposes, it doesn't look like the Timemap adds anything useful." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Do they provide the same number of snapshots?" 
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(4404, 11)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tm_df.shape" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(4405, 7)" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cdx_df.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So the CDX results include one more snapshot than the Timemap (4,405 rows versus 4,404). Can we find out where the difference comes from?" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " | urlkey | \n", "timestamp | \n", "original | \n", "mimetype | \n", "statuscode | \n", "digest | \n", "length | \n", "redirect | \n", "robotflags | \n", "offset | \n", "filename | \n", "
---|
\n", "| | urlkey | timestamp | original | mimetype | statuscode | digest | length |\n", "
|---|---|---|---|---|---|---|---|\n", "
| 878 | au,gov,nla)/ | 20090327043759 | http://www.nla.gov.au/ | text/html | 200 | 537C3S5FANRHGLW3A6WPE6A57LULWNOF | 6306 |\n", "
| 879 | au,gov,nla)/ | 20090327043759 | http://www.nla.gov.au/ | text/html | 200 | 537C3S5FANRHGLW3A6WPE6A57LULWNOF | 6473 |\n", "
| 880 | au,gov,nla)/ | 20090515004007 | http://www.nla.gov.au/ | text/html | 200 | CC747V3CYGCYQZELL37KNOW5DRPEMFEW | 6614 |\n", "
| 881 | au,gov,nla)/ | 20090515004007 | http://www.nla.gov.au/ | text/html | 200 | CC747V3CYGCYQZELL37KNOW5DRPEMFEW | 6614 |\n", "
| 883 | au,gov,nla)/ | 20090521102300 | http://www.nla.gov.au/ | text/html | 200 | 25VWCDZDMMC57PLHGKIJ6XUBG566EW33 | 6619 |\n", "
| 884 | au,gov,nla)/ | 20090521102300 | http://www.nla.gov.au/ | text/html | 200 | 25VWCDZDMMC57PLHGKIJ6XUBG566EW33 | 6619 |\n", "
| 885 | au,gov,nla)/ | 20090521230410 | http://nla.gov.au/ | warc/revisit | - | BDOBBSVBWA4WL3PLC7TSVIA5PE2RZKRD | 469 |\n", "
| 886 | au,gov,nla)/ | 20090521230410 | http://nla.gov.au/ | warc/revisit | - | BDOBBSVBWA4WL3PLC7TSVIA5PE2RZKRD | 469 |\n", "
| 887 | au,gov,nla)/ | 20090528133919 | http://www.nla.gov.au/ | text/html | 200 | IBVDKIMFCMXC3HU6RFHJOXEOGRKASMTM | 6755 |\n", "
| 888 | au,gov,nla)/ | 20090528133919 | http://www.nla.gov.au/ | text/html | 200 | IBVDKIMFCMXC3HU6RFHJOXEOGRKASMTM | 6755 |\n", "