{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Getting all available snapshots of a particular page from the Internet Archive – Timemap or CDX?\n", "\n", "

New to Jupyter notebooks? Try Using Jupyter notebooks for a quick introduction.

\n", "\n", "There are a couple of ways of getting a list of the available snapshots for a particular url. In this notebook, we'll compare the Internet Archive's CDX index API with their Memento Timemap API. Do they give us the same data?\n", "\n", "See [Exploring the Internet Archive's CDX API](exploring_cdx_api.ipynb) for more information about the CDX API." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import requests\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Get the data for comparison" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "def query_timemap(url):\n", " '''\n", " Get a Timemap in JSON format for the specified url.\n", " '''\n", " response = requests.get(f'https://web.archive.org/web/timemap/json/{url}', headers={'User-Agent': ''})\n", " response.raise_for_status()\n", " return response.json()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "def query_cdx(url, **kwargs):\n", " '''\n", " Query the IA CDX API for the supplied url.\n", " You can optionally provide any of the parameters accepted by the API.\n", " '''\n", " params = kwargs\n", " params['url'] = url\n", " params['output'] = 'json'\n", " # User-Agent value is necessary or else IA gives an error\n", " response = requests.get('http://web.archive.org/cdx/search/cdx', params=params, headers={'User-Agent': ''})\n", " response.raise_for_status()\n", " return response.json()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "url = 'http://nla.gov.au'\n", "tm_data = query_timemap(url)\n", "tm_df = pd.DataFrame(tm_data[1:], columns=tm_data[0])\n", "cdx_data = query_cdx(url)\n", "cdx_df = pd.DataFrame(cdx_data[1:], columns=cdx_data[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Are the columns the same?" 
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['urlkey',\n", " 'timestamp',\n", " 'original',\n", " 'mimetype',\n", " 'statuscode',\n", " 'digest',\n", " 'length']" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(cdx_df.columns)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['urlkey',\n", " 'timestamp',\n", " 'original',\n", " 'mimetype',\n", " 'statuscode',\n", " 'digest',\n", " 'redirect',\n", " 'robotflags',\n", " 'length',\n", " 'offset',\n", " 'filename']" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(tm_df.columns)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The Timemap data includes four extra columns: `redirect`, `robotflags`, `offset`, and `filename`. The `offset` and `filename` columns tell you where to find the snapshot, but I'm not sure what `robotflags` is for (it's not in the [specification](http://iipc.github.io/warc-specifications/specifications/cdx-format/cdx-2015/)). Let's have a look at what sort of values it has." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "- 2863\n", "Name: robotflags, dtype: int64" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tm_df['robotflags'].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There's nothing in it – at least for this particular url.\n", "\n", "For my purposes, it doesn't look like the Timemap adds anything useful." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Do they provide the same number of snapshots?" 
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(2863, 11)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tm_df.shape" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(2886, 7)" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cdx_df.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So there are more snapshots in the CDX results than the Timemap. Can we find out what they are?" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
urlkey  timestamp  original  mimetype  statuscode  digest  length  redirect  robotflags  offset  filename
\n", "
" ], "text/plain": [ "Empty DataFrame\n", "Columns: [urlkey, timestamp, original, mimetype, statuscode, digest, length, redirect, robotflags, offset, filename]\n", "Index: []" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Combine the two dataframes, then only keep rows that aren't duplicated based on timestamp, original, digest, and statuscode\n", "pd.concat([cdx_df, tm_df]).drop_duplicates(subset=['timestamp', 'original', 'digest', 'statuscode'], keep=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Hmm, if there were rows in the `cdx_df` that weren't in the `tm_df` I'd expect them to show up, but there are **no** rows that aren't duplicated based on the `timestamp`, `original`, `digest`, and `statuscode` columns. That suggests every snapshot appears in both sets, so the extra CDX rows must be repeats.\n", "\n", "Let's try this another way, by finding the number of unique snapshots in each df." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(2788, 7)" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Remove duplicate rows \n", "cdx_df.drop_duplicates(subset=['timestamp', 'digest', 'statuscode', 'original'], keep='first').shape" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(2788, 11)" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Remove duplicate rows \n", "tm_df.drop_duplicates(subset=['timestamp', 'digest', 'statuscode', 'original'], keep='first').shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ah, so both sets of data contain duplicates, and there are really only **2,788 unique snapshots**. Let's look at some of the duplicates in the CDX data." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
     urlkey        timestamp       original                    mimetype   statuscode  digest                            length
29   au,gov,nla)/  19990508095540  http://www2.nla.gov.au:80/  text/html  200         6XT5KK4SHMW2N3CAHPACCPJ2J5ZUIMES  1995
30   au,gov,nla)/  19990508095540  http://www2.nla.gov.au:80/  text/html  200         6XT5KK4SHMW2N3CAHPACCPJ2J5ZUIMES  1995
31   au,gov,nla)/  20000229171639  http://www.nla.gov.au:80/   text/html  200         D6ZAUEL66NTHWCW3UP6K7WBT3CJLRYTT  2155
32   au,gov,nla)/  20000229171639  http://www.nla.gov.au:80/   text/html  200         D6ZAUEL66NTHWCW3UP6K7WBT3CJLRYTT  2155
34   au,gov,nla)/  20000302132843  http://www.nla.gov.au:80/   text/html  200         D6ZAUEL66NTHWCW3UP6K7WBT3CJLRYTT  2158
35   au,gov,nla)/  20000302132843  http://www.nla.gov.au:80/   text/html  200         D6ZAUEL66NTHWCW3UP6K7WBT3CJLRYTT  2158
61   au,gov,nla)/  20010301210012  http://www.nla.gov.au:80/   text/html  200         BQJYH5FYDY6DEAXDZMSZWKLZXA5GGJNC  2390
62   au,gov,nla)/  20010301210012  http://www.nla.gov.au:80/   text/html  200         BQJYH5FYDY6DEAXDZMSZWKLZXA5GGJNC  2390
882  au,gov,nla)/  20090327043759  http://www.nla.gov.au/      text/html  200         537C3S5FANRHGLW3A6WPE6A57LULWNOF  6306
883  au,gov,nla)/  20090327043759  http://www.nla.gov.au/      text/html  200         537C3S5FANRHGLW3A6WPE6A57LULWNOF  6473
\n", "
" ], "text/plain": [ " urlkey timestamp original mimetype \\\n", "29 au,gov,nla)/ 19990508095540 http://www2.nla.gov.au:80/ text/html \n", "30 au,gov,nla)/ 19990508095540 http://www2.nla.gov.au:80/ text/html \n", "31 au,gov,nla)/ 20000229171639 http://www.nla.gov.au:80/ text/html \n", "32 au,gov,nla)/ 20000229171639 http://www.nla.gov.au:80/ text/html \n", "34 au,gov,nla)/ 20000302132843 http://www.nla.gov.au:80/ text/html \n", "35 au,gov,nla)/ 20000302132843 http://www.nla.gov.au:80/ text/html \n", "61 au,gov,nla)/ 20010301210012 http://www.nla.gov.au:80/ text/html \n", "62 au,gov,nla)/ 20010301210012 http://www.nla.gov.au:80/ text/html \n", "882 au,gov,nla)/ 20090327043759 http://www.nla.gov.au/ text/html \n", "883 au,gov,nla)/ 20090327043759 http://www.nla.gov.au/ text/html \n", "\n", " statuscode digest length \n", "29 200 6XT5KK4SHMW2N3CAHPACCPJ2J5ZUIMES 1995 \n", "30 200 6XT5KK4SHMW2N3CAHPACCPJ2J5ZUIMES 1995 \n", "31 200 D6ZAUEL66NTHWCW3UP6K7WBT3CJLRYTT 2155 \n", "32 200 D6ZAUEL66NTHWCW3UP6K7WBT3CJLRYTT 2155 \n", "34 200 D6ZAUEL66NTHWCW3UP6K7WBT3CJLRYTT 2158 \n", "35 200 D6ZAUEL66NTHWCW3UP6K7WBT3CJLRYTT 2158 \n", "61 200 BQJYH5FYDY6DEAXDZMSZWKLZXA5GGJNC 2390 \n", "62 200 BQJYH5FYDY6DEAXDZMSZWKLZXA5GGJNC 2390 \n", "882 200 537C3S5FANRHGLW3A6WPE6A57LULWNOF 6306 \n", "883 200 537C3S5FANRHGLW3A6WPE6A57LULWNOF 6473 " ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dupes = cdx_df.loc[cdx_df.duplicated(subset=['timestamp', 'digest'], keep=False)].sort_values(by='timestamp')\n", "dupes.head(10)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Date range of duplicates: 19990508095540 to 20200517065916\n" ] } ], "source": [ "print(f'Date range of duplicates: {dupes[\"timestamp\"].min()} to {dupes[\"timestamp\"].max()}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So it seems they provide the same number of unique 
snapshots, but the CDX index returns more duplicate rows (2,886 rows in total, against 2,863 from the Timemap)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Is there a difference in speed?" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "4.22 s ± 453 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" ] } ], "source": [ "%%timeit\n", "tm_data = query_timemap(url)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The slowest run took 4.86 times longer than the fastest. This could mean that an intermediate result is being cached.\n", "5.06 s ± 2.2 s per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" ] } ], "source": [ "%%timeit\n", "cdx_data = query_cdx(url)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Both methods provide much the same data, so it just comes down to convenience and performance. In the timings above, the Timemap request averaged 4.22 s and the CDX request 5.06 s, but the CDX runs varied widely, so neither is clearly faster." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "Created by [Tim Sherratt](https://timsherratt.org) for the [GLAM Workbench](https://glam-workbench.github.io).\n", "\n", "Work on this notebook was supported by the [IIPC Discretionary Funding Programme 2019-2020](http://netpreserve.org/projects/).\n", "\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 4 }