{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Getting all available snapshots of a particular page from the Internet Archive – Timemap or CDX?\n", "\n", "
\n", "\n", "There are a couple of ways of getting a list of the available snapshots for a particular URL. In this notebook, we'll compare the Internet Archive's CDX index API with their Memento Timemap API. Do they give us the same data?\n", "\n", "See [Exploring the Internet Archive's CDX API](exploring_cdx_api.ipynb) for more information about the CDX API." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import requests\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Get the data for comparison" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "def query_timemap(url):\n", "    '''\n", "    Get a Timemap in JSON format for the specified url.\n", "    '''\n", "    response = requests.get(f'https://web.archive.org/web/timemap/json/{url}', headers={'User-Agent': ''})\n", "    response.raise_for_status()\n", "    return response.json()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "def query_cdx(url, **kwargs):\n", "    '''\n", "    Query the IA CDX API for the supplied url.\n", "    You can optionally provide any of the parameters accepted by the API.\n", "    '''\n", "    params = kwargs\n", "    params['url'] = url\n", "    params['output'] = 'json'\n", "    # User-Agent value is necessary or else IA gives an error\n", "    response = requests.get('http://web.archive.org/cdx/search/cdx', params=params, headers={'User-Agent': ''})\n", "    response.raise_for_status()\n", "    return response.json()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "url = 'http://nla.gov.au'\n", "tm_data = query_timemap(url)\n", "tm_df = pd.DataFrame(tm_data[1:], columns=tm_data[0])\n", "cdx_data = query_cdx(url)\n", "cdx_df = pd.DataFrame(cdx_data[1:], columns=cdx_data[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Are the columns the same?" 
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['urlkey',\n", " 'timestamp',\n", " 'original',\n", " 'mimetype',\n", " 'statuscode',\n", " 'digest',\n", " 'length']" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(cdx_df.columns)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['urlkey',\n", " 'timestamp',\n", " 'original',\n", " 'mimetype',\n", " 'statuscode',\n", " 'digest',\n", " 'redirect',\n", " 'robotflags',\n", " 'length',\n", " 'offset',\n", " 'filename']" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(tm_df.columns)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The Timemap data includes four extra columns: `redirect`, `robotflags`, `offset`, and `filename`. The `offset` and `filename` columns tell you where to find the snapshot, but I'm not sure what `robotflags` is for (it's not in the [specification](http://iipc.github.io/warc-specifications/specifications/cdx-format/cdx-2015/)). Let's have a look at what sort of values it has." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "- 2863\n", "Name: robotflags, dtype: int64" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tm_df['robotflags'].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There's nothing in it – at least for this particular url.\n", "\n", "For my purposes, it doesn't look like the Timemap adds anything useful." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Do they provide the same number of snapshots?" 
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(2863, 11)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tm_df.shape" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(2886, 7)" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cdx_df.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So the CDX results contain 23 more snapshots than the Timemap (2886 versus 2863). Can we find out what they are?" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " | urlkey | \n", "timestamp | \n", "original | \n", "mimetype | \n", "statuscode | \n", "digest | \n", "length | \n", "redirect | \n", "robotflags | \n", "offset | \n", "filename | \n", "
---|
\n", "
|  | urlkey | timestamp | original | mimetype | statuscode | digest | length |
|---|---|---|---|---|---|---|---|
| 29 | au,gov,nla)/ | 19990508095540 | http://www2.nla.gov.au:80/ | text/html | 200 | 6XT5KK4SHMW2N3CAHPACCPJ2J5ZUIMES | 1995 |
| 30 | au,gov,nla)/ | 19990508095540 | http://www2.nla.gov.au:80/ | text/html | 200 | 6XT5KK4SHMW2N3CAHPACCPJ2J5ZUIMES | 1995 |
| 31 | au,gov,nla)/ | 20000229171639 | http://www.nla.gov.au:80/ | text/html | 200 | D6ZAUEL66NTHWCW3UP6K7WBT3CJLRYTT | 2155 |
| 32 | au,gov,nla)/ | 20000229171639 | http://www.nla.gov.au:80/ | text/html | 200 | D6ZAUEL66NTHWCW3UP6K7WBT3CJLRYTT | 2155 |
| 34 | au,gov,nla)/ | 20000302132843 | http://www.nla.gov.au:80/ | text/html | 200 | D6ZAUEL66NTHWCW3UP6K7WBT3CJLRYTT | 2158 |
| 35 | au,gov,nla)/ | 20000302132843 | http://www.nla.gov.au:80/ | text/html | 200 | D6ZAUEL66NTHWCW3UP6K7WBT3CJLRYTT | 2158 |
| 61 | au,gov,nla)/ | 20010301210012 | http://www.nla.gov.au:80/ | text/html | 200 | BQJYH5FYDY6DEAXDZMSZWKLZXA5GGJNC | 2390 |
| 62 | au,gov,nla)/ | 20010301210012 | http://www.nla.gov.au:80/ | text/html | 200 | BQJYH5FYDY6DEAXDZMSZWKLZXA5GGJNC | 2390 |
| 882 | au,gov,nla)/ | 20090327043759 | http://www.nla.gov.au/ | text/html | 200 | 537C3S5FANRHGLW3A6WPE6A57LULWNOF | 6306 |
| 883 | au,gov,nla)/ | 20090327043759 | http://www.nla.gov.au/ | text/html | 200 | 537C3S5FANRHGLW3A6WPE6A57LULWNOF | 6473 |