{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Exploring the Internet Archive's CDX API" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

New to Jupyter notebooks? Try *Using Jupyter notebooks* for a quick introduction.

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "CDX APIs provide access to an index of resources captured by web archives. The results can be filtered in a number of ways, making them a convenient way of harvesting and exploring the holdings of web archives. This notebook focuses on the data you can obtain from the Internet Archives' CDX API. For more information on the differences between this and other CDX APIs see [Comparing CDX APIs](comparing_cdx_apis.ipynb). To examine differences between CDX data and Timemaps see [Timemaps vs CDX APIs](getting_all_snapshots_timemap_vs_cdx.ipynb).\n", "\n", "Notebooks demonstrating ways of getting and using CDX data include:\n", "\n", "* [Find all the archived versions of a web page](find_all_captures.ipynb)\n", "* [Harvesting collections of text from archived web pages](getting_text_from_web_pages.ipynb)\n", "* [Harvesting data about a domain using the IA CDX API](harvesting_domain_data.ipynb)\n", "* [Find and explore Powerpoint presentations from a specific domain](explore_presentations.ipynb)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "vegafusion.enable(mimetype='html', row_limit=30000, embed_options=None)" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import re\n", "from base64 import b32encode\n", "from hashlib import sha1\n", "\n", "import altair as alt\n", "import arrow\n", "import pandas as pd\n", "import requests\n", "import vegafusion as vf\n", "from tqdm.auto import tqdm\n", "\n", "vf.enable(row_limit=30000)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Useful resources\n", "\n", "* [Wayback Machine APIs](https://archive.org/help/wayback_api.php)\n", "* [Wayback CDX API](https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server)\n", "* [Archive-it's CDX/C API](https://support.archive-it.org/hc/en-us/articles/115001790023-Access-Archive-It-s-Wayback-index-with-the-CDX-C-API) – includes useful general documentation of CDX format\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Your first CDX request\n", "\n", "Let's have a look at the sort of data the CDX server gives us. At the very least, we have to provide a `url` parameter to point to a particular page (or domain as we'll see below). To avoid flinging too much data about, we'll also add a `limit` parameter that tells the CDX server how many rows of data to give us." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "au,gov,nla)/ 19961019064223 http://www.nla.gov.au:80/ text/html 200 M5ORM4XQ5QCEZEDRNZRGSWXPCOGUVASI 1135\n", "au,gov,nla)/ 19961221102755 http://www.nla.gov.au:80/ text/html 200 TM4WSQIGWXAXMB36G4GVOY7MVPTO6CSE 1138\n", "au,gov,nla)/ 19961221132358 http://nla.gov.au:80/ text/html 200 65SH4ZQ7ZYTTPYSVFQUSKZXJPZKSI6XA 603\n", "au,gov,nla)/ 19961223031839 http://www2.nla.gov.au:80/ text/html 200 6XHDP66AXEPMVKVROHHDN6CPZYHZICEX 457\n", "au,gov,nla)/ 19970212053405 http://www.nla.gov.au:80/ text/html 200 TM4WSQIGWXAXMB36G4GVOY7MVPTO6CSE 1141\n", "au,gov,nla)/ 19970215222554 http://nla.gov.au:80/ text/html 200 65SH4ZQ7ZYTTPYSVFQUSKZXJPZKSI6XA 603\n", "au,gov,nla)/ 19970315230640 http://www.nla.gov.au:80/ text/html 200 NOUNS3AYAIAOO4LRFD23MQWW3QIGDMFB 1126\n", "au,gov,nla)/ 19970315230640 http://www.nla.gov.au:80/ text/html 200 TM4WSQIGWXAXMB36G4GVOY7MVPTO6CSE 1140\n", "au,gov,nla)/ 19970413005246 http://nla.gov.au:80/ text/html 200 65SH4ZQ7ZYTTPYSVFQUSKZXJPZKSI6XA 603\n", "au,gov,nla)/ 19970418074154 http://www.nla.gov.au:80/ text/html 200 NOUNS3AYAIAOO4LRFD23MQWW3QIGDMFB 1123\n", "\n" ] } ], "source": [ "# 8 April 2020 -- without the 'User-Agent' header parameter I get a 445 error\n", "# 27 April 2020 - now seems ok without changing User-Agent\n", "\n", "# Feel free to change these values\n", "params1 = {\"url\": \"http://nla.gov.au\", \"limit\": 10}\n", "\n", "# Get the data and print the results\n", "response = requests.get(\"https://web.archive.org/cdx/search/cdx\", params=params1)\n", "print(response.text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By default, the results are returned in a simple text format – fields are separated by spaces, and each result is on a separate line. It's a bit hard to read in this format, so let's add the `output` parameter to get the results in JSON format. We'll then use Pandas to display the results in a table." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
urlkeytimestamporiginalmimetypestatuscodedigestlength
0au,gov,nla)/19961019064223http://www.nla.gov.au:80/text/html200M5ORM4XQ5QCEZEDRNZRGSWXPCOGUVASI1135
1au,gov,nla)/19961221102755http://www.nla.gov.au:80/text/html200TM4WSQIGWXAXMB36G4GVOY7MVPTO6CSE1138
2au,gov,nla)/19961221132358http://nla.gov.au:80/text/html20065SH4ZQ7ZYTTPYSVFQUSKZXJPZKSI6XA603
3au,gov,nla)/19961223031839http://www2.nla.gov.au:80/text/html2006XHDP66AXEPMVKVROHHDN6CPZYHZICEX457
4au,gov,nla)/19970212053405http://www.nla.gov.au:80/text/html200TM4WSQIGWXAXMB36G4GVOY7MVPTO6CSE1141
5au,gov,nla)/19970215222554http://nla.gov.au:80/text/html20065SH4ZQ7ZYTTPYSVFQUSKZXJPZKSI6XA603
6au,gov,nla)/19970315230640http://www.nla.gov.au:80/text/html200NOUNS3AYAIAOO4LRFD23MQWW3QIGDMFB1126
7au,gov,nla)/19970315230640http://www.nla.gov.au:80/text/html200TM4WSQIGWXAXMB36G4GVOY7MVPTO6CSE1140
8au,gov,nla)/19970413005246http://nla.gov.au:80/text/html20065SH4ZQ7ZYTTPYSVFQUSKZXJPZKSI6XA603
9au,gov,nla)/19970418074154http://www.nla.gov.au:80/text/html200NOUNS3AYAIAOO4LRFD23MQWW3QIGDMFB1123
\n", "
" ], "text/plain": [ " urlkey timestamp original mimetype \\\n", "0 au,gov,nla)/ 19961019064223 http://www.nla.gov.au:80/ text/html \n", "1 au,gov,nla)/ 19961221102755 http://www.nla.gov.au:80/ text/html \n", "2 au,gov,nla)/ 19961221132358 http://nla.gov.au:80/ text/html \n", "3 au,gov,nla)/ 19961223031839 http://www2.nla.gov.au:80/ text/html \n", "4 au,gov,nla)/ 19970212053405 http://www.nla.gov.au:80/ text/html \n", "5 au,gov,nla)/ 19970215222554 http://nla.gov.au:80/ text/html \n", "6 au,gov,nla)/ 19970315230640 http://www.nla.gov.au:80/ text/html \n", "7 au,gov,nla)/ 19970315230640 http://www.nla.gov.au:80/ text/html \n", "8 au,gov,nla)/ 19970413005246 http://nla.gov.au:80/ text/html \n", "9 au,gov,nla)/ 19970418074154 http://www.nla.gov.au:80/ text/html \n", "\n", " statuscode digest length \n", "0 200 M5ORM4XQ5QCEZEDRNZRGSWXPCOGUVASI 1135 \n", "1 200 TM4WSQIGWXAXMB36G4GVOY7MVPTO6CSE 1138 \n", "2 200 65SH4ZQ7ZYTTPYSVFQUSKZXJPZKSI6XA 603 \n", "3 200 6XHDP66AXEPMVKVROHHDN6CPZYHZICEX 457 \n", "4 200 TM4WSQIGWXAXMB36G4GVOY7MVPTO6CSE 1141 \n", "5 200 65SH4ZQ7ZYTTPYSVFQUSKZXJPZKSI6XA 603 \n", "6 200 NOUNS3AYAIAOO4LRFD23MQWW3QIGDMFB 1126 \n", "7 200 TM4WSQIGWXAXMB36G4GVOY7MVPTO6CSE 1140 \n", "8 200 65SH4ZQ7ZYTTPYSVFQUSKZXJPZKSI6XA 603 \n", "9 200 NOUNS3AYAIAOO4LRFD23MQWW3QIGDMFB 1123 " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "params2 = {\"url\": \"http://nla.gov.au\", \"limit\": 10, \"output\": \"json\"}\n", "\n", "# Get the data and print the results\n", "response = requests.get(\"http://web.archive.org/cdx/search/cdx\", params=params2)\n", "results = response.json()\n", "\n", "# Use Pandas to turn the results into a DataFrame then display\n", "pd.DataFrame(results[1:], columns=results[0]).head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The JSON results are, in Python terms, a list of lists, rather than a list of dictionaries. The first of these lists contains the field names. If you look at the line below, you'll see that we use the first list (`results[0]`) to set the column names in the dataframe, while the rest of the data (`results[1:]`) makes up the rows.\n", "\n", "``` python\n", "pd.DataFrame(results[1:], columns=results[0]).head(10)\n", "```\n", "\n", "Let's have a look at the fields.\n", "\n", "* `urlkey` – the page url expressed as a [SURT](http://crawler.archive.org/articles/user_manual/glossary.html#surt) (Sort-friendly URI Reordering Transform)\n", "* `timestamp` – the date and time of the capture in a `YYYYMMDDhhmmss` format\n", "* `original` – the url that was captured\n", "* `mimetype` – the type of file captured, expressed in a [standard format](https://en.wikipedia.org/wiki/Media_type)\n", "* `statuscode` – a [standard code](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes) provided by the web server that reports on the result of the capture request\n", "* `digest` – also known as a 'checksum' or 'fingerprint', the digest provides an [algorithmically generated](https://en.wikipedia.org/wiki/Cryptographic_hash_function) string that uniquely identifies the content of the captured url\n", "* `length` – the size of the captured content in bytes (compressed on disk)\n", "\n", "All makes perfect sense right? Hmmm, we'll dig a little deeper below, but first..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Requesting a particular capture\n", "\n", "We can use the `timestamp` value to retrieve the contents of a particular capture. 
A url like this will open the captured resource in the Wayback Machine:\n", "```\n", "https://web.archive.org/web/[timestamp]/[url]\n", "```\n", "\n", "For example: https://web.archive.org/web/20130201130329/http://www.nla.gov.au/\n", "\n", "If you want the original contents, without the modifications and navigation added by the Wayback Machine, just add `id_` after the `timestamp`:\n", "\n", "```\n", "https://web.archive.org/web/[timestamp]id_/[url]\n", "```\n", "\n", "For example: https://web.archive.org/web/20130201130329id_/http://www.nla.gov.au/\n", "\n", "You'll probably notice that the original version doesn't look very pretty because links to CSS or Javascript files are still pointing to their old, broken, addresses. If you want a version without the Wayback Machine Navigation, but *with* urls to any linked files rewritten to point to archived versions, then add `if_` after the timestamp. \n", "\n", "``` \n", "https://web.archive.org/web/[timestamp]if_/[url]\n", "```\n", "\n", "For example: https://web.archive.org/web/20130201130329if_/http://www.nla.gov.au/" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Getting all the captures of a particular page\n", "\n", "If you want to get all the captures of a particular page, you can just leave out the `limit` parameter. However, there is (supposedly) a limit on the number of results returned in a single request. The API documentation says the current limit is 150,000, but it seems much larger – if you ask for `cnn.com` without using `limit` you get more than 290,000 results! To try and make sure that you're getting everything, there's a couple of ways you can break up the results set into chunks. The first is to set the `showResumeKey` parameter to `true`. Then, if there are more results available than are returned in your initial request, a couple of extra rows of data will be added to your results. The last row will include a resumption key, while the second last row will be empty, for example:\n", "\n", "``` json\n", " [], \n", " ['com%2Ccnn%29%2F+20000621011732']\n", "```\n", "\n", "You then set the `resumeKey` parameter to the value of the resumption key, and add it to your next requests. You can combine the use of the resumption key with the `limit` paramater to break a large collection of captures into manageable chunks.\n", "\n", "The other way is to add a `page` parameter, starting at `0` then incrementing the `page` value by one until you've worked through the complete set of results. But how do you know the total number of pages? If you add `showNumPages=true` to your query, the server will return a single number representing the total pages. But the pages themselves come from a special index and can contain different numbers of results depending on your query, so there's no obvious way to calculate the number of captures from the number of pages. Also, the maximum size of a page seems quite large and this sometimes causes errors. You can control this by adding a `pageSize` parameter. The meaning of this value seems a bit mysterious, but I've found that a `pageSize` of `5` seems to be a reasonable balance between the amount of data returned by each requests, and the number of requests you have to make.\n", "\n", "Let's put all this together in a few functions that will help us construct CDX queries of any size or complexity." 
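, "\n", "\n", "Before we do, here's the raw resumption key flow in a few lines (just a sketch – the key you get back, if any, will differ):\n", "\n", "``` python\n", "params = {'url': 'http://nla.gov.au', 'output': 'json', 'limit': 100, 'showResumeKey': 'true'}\n", "results = requests.get('https://web.archive.org/cdx/search/cdx', params=params).json()\n", "if not results[-2]:  # an empty row signals that a resumption key follows\n", "    params['resumeKey'] = results[-1][0]  # add the key to the next request\n", "```"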
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "def check_for_resumption_key(results):\n", " \"\"\"\n", " Checks to see if the second-last row is an empty list,\n", " if it is, return the last value as the resumption key.\n", " \"\"\"\n", " if not results[-2]:\n", " return results[-1][0]\n", "\n", "\n", "def get_total_pages(params):\n", " \"\"\"\n", " Gets the total number of pages in a set of results.\n", " \"\"\"\n", " these_params = params.copy()\n", " these_params[\"showNumPages\"] = \"true\"\n", " response = requests.get(\n", " \"http://web.archive.org/cdx/search/cdx\",\n", " params=these_params,\n", " headers={\"User-Agent\": \"\"},\n", " )\n", " return int(response.text)\n", "\n", "\n", "def prepare_params(url, use_resume_key=False, **kwargs):\n", " \"\"\"\n", " Prepare the parameters for a CDX API requests.\n", " Adds all supplied keyword arguments as parameters (changing from_ to from).\n", " Adds in a few necessary parameters and showResumeKey if requested.\n", " \"\"\"\n", " params = kwargs\n", " params[\"url\"] = url\n", " params[\"output\"] = \"json\"\n", " if use_resume_key:\n", " params[\"showResumeKey\"] = \"true\"\n", " # CDX accepts a 'from' parameter, but this is a reserved word in Python\n", " # Use 'from_' to pass the value to the function & here we'll change it back to 'from'.\n", " if \"from_\" in params:\n", " params[\"from\"] = params[\"from_\"]\n", " del params[\"from_\"]\n", " return params\n", "\n", "\n", "def get_cdx_data(params):\n", " \"\"\"\n", " Make a request to the CDX API using the supplied parameters.\n", " Check the results for a resumption key, and return the key (if any) and the results.\n", " \"\"\"\n", " response = requests.get(\n", " \"http://web.archive.org/cdx/search/cdx\",\n", " params=params,\n", " headers={\"User-Agent\": \"\"},\n", " )\n", " response.raise_for_status()\n", " results = response.json()\n", " resumption_key = check_for_resumption_key(results)\n", " # Remove the resumption key from the results\n", " if resumption_key:\n", " results = results[:-2]\n", " return resumption_key, results\n", "\n", "\n", "def query_cdx_by_page(url, **kwargs):\n", " all_results = []\n", " page = 0\n", " params = prepare_params(url, **kwargs)\n", " total_pages = get_total_pages(params)\n", " with tqdm(total=total_pages - page) as pbar1:\n", " with tqdm() as pbar2:\n", " while page < total_pages:\n", " params[\"page\"] = page\n", " _, results = get_cdx_data(params)\n", " if page == 0:\n", " all_results += results\n", " else:\n", " all_results += results[1:]\n", " page += 1\n", " pbar1.update(1)\n", " pbar2.update(len(results) - 1)\n", " return all_results\n", "\n", "\n", "def query_cdx_with_key(url, **kwargs):\n", " \"\"\"\n", " Harvest results from the CDX API using the supplied parameters.\n", " Uses showResumeKey to check if there are more than one page of results,\n", " and if so loops through pages until all results are downloaded.\n", " \"\"\"\n", " params = prepare_params(url, use_resume_key=True, **kwargs)\n", " with tqdm() as pbar:\n", " # This will include the header row\n", " resumption_key, all_results = get_cdx_data(params)\n", " pbar.update(len(all_results) - 1)\n", " while resumption_key is not None:\n", " params[\"resumeKey\"] = resumption_key\n", " resumption_key, results = get_cdx_data(params)\n", " # Remove the header row and add\n", " all_results += results[1:]\n", " pbar.update(len(results) - 1)\n", " return all_results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To 
harvest all of the captures of 'http://www.nla.gov.au', you can just call:\n", "\n", "``` python\n", "results = query_cdx_with_key('http://www.nla.gov.au')\n", "```\n", "\n", "To break the harvest down into chunks of 1,000 results at a time, you'd call:\n", "\n", "``` python\n", "results = query_cdx_with_key('http://www.nla.gov.au', limit=1000)\n", "```\n", "\n", "There are a number of other parameters you can use to filter results from the CDX API, you can supply any of these as well. We'll see some examples below.\n", "\n", "So let's get all the captures of 'http://www.nla.gov.au'." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "95be38d65f1745d185b6faca71fd4ac5", "version_major": 2, "version_minor": 0 }, "text/plain": [ "0it [00:00, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "results = query_cdx_with_key(\"http://www.nla.gov.au\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And convert them into a dataframe." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "df = pd.DataFrame(results[1:], columns=results[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How many captures are there?" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(29691, 7)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ok, now we've got a dataset, let's look at the structure of the data in a little more detail." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## CDX data in depth" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### SURTs, urlkeys, & urls\n", "\n", "As noted above, the `urlkey` field contains things that are technically known as SURTs (Sort-friendly URI Reordering Transform). Basically, the order of components in the url's domain are reversed to make captures easier to sort and group. So instead of `nla.gov.au` we have `au,gov,nla`. The path component of the url, the bit that points to a specific file within the domain, is tacked on the end of the `urlkey` after a closing bracket. Here are some examples:\n", "\n", "`http://www.nla.gov.au` becomes `au,gov,nla` plus the path `/`, so the urlkey is: \n", "\n", "```\n", "au,gov,nla)/\n", "```\n", " \n", "`http://www.dsto.defence.gov.au/attachments/9%20LEWG%20Oct%2008%20DEU.ppt` becomes `au,gov,defence,dsto` plus the path `/attachments/9%20lewg%20oct%2008%20deu.ppt`, so the urlkey is:\n", "\n", "```\n", "au,gov,defence,dsto)/attachments/9%20lewg%20oct%2008%20deu.ppt\n", "```\n", "\n", "From the examples above, you'll notice there's a bit of extra normalisation going on. For example, the url components are all converted to lowercase. You might also be wondering what happened to the `www` subdomain. By convention these are aliases that just point to the underlying domain – `www.nla.gov.au` ends up at the same place as `nla.gov.au` – so they're removed from the SURT. We can explore this a bit further by comping the `original` urls in our dataset to the `urlkeys`.\n", "\n", "How many unique `urlkey`s are there? Hopefully just one, as we're gathering captures from a single url!" 
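, "\n", "\n", "(If you'd like to play with SURTs yourself, here's a very rough sketch of the conversion – it skips most of the real normalisation rules, but it reproduces the examples above:)\n", "\n", "``` python\n", "from urllib.parse import urlparse\n", "\n", "def simple_surt(url):\n", "    parsed = urlparse(url.lower())\n", "    host = re.sub(r'^www\\d*\\.', '', parsed.netloc.split(':')[0])  # drop www/www2 and any port\n", "    return ','.join(reversed(host.split('.'))) + ')' + (parsed.path or '/')\n", "\n", "simple_surt('http://www.nla.gov.au')  # returns 'au,gov,nla)/'\n", "```"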
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"urlkey\"].unique().shape[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But how many different `original` urls were captured?" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "21" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"original\"].unique().shape[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's have a look at them." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "http://nla.gov.au/ 25001\n", "https://www.nla.gov.au/ 2005\n", "http://www.nla.gov.au/ 1458\n", "http://www.nla.gov.au:80/ 868\n", "https://nla.gov.au/ 187\n", "http://nla.gov.au:80/ 77\n", "https://www.nla.gov.au 30\n", "http://www.nla.gov.au// 23\n", "http://www.nla.gov.au 11\n", "http://www2.nla.gov.au:80/ 10\n", "http://Trove@nla.gov.au/ 6\n", "http://www.nla.gov.au./ 4\n", "http://www.nla.gov.au:80/? 2\n", "http://nla.gov.au 2\n", "http://www.nla.gov.au/? 1\n", "http://cmccarthy@nla.gov.au/ 1\n", "http://mailto:media@nla.gov.au/ 1\n", "http://mailto:development@nla.gov.au/ 1\n", "http://mailto:www@nla.gov.au/ 1\n", "http://www.nla.gov.au:80// 1\n", "https://www.nla.gov.au// 1\n", "Name: original, dtype: int64" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"original\"].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So we can see that as well as removing `www`, the normalisation process removes `www2` and port numbers, and groups together the `http` and `https` protocols. There's also some odd things that look like email addresses and were probably harvested by mistake from `mailto` links." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But wait a minute, our original query was just for the url `http://nla.gov.au`, why did we get all these other urls? When we request a particular `url` from the CDX API, it matches results based on the url's SURT, not on the original url. This ensures that we get all the variations in the way the url might be expressed. If we want to limit results to a specific form of the url, we can do that by filtering on the `original` field, as we'll see below." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because the `urlkey` is essentially a normalised identifier for an individual url, you can use it to group together all the captures of individual pages across a whole domain. For example, if wanted to know how many urls have been captured from the `nla.gov.au` domain, we can call our query function like this:\n", "\n", "``` python\n", "results = query_cdx('nla.gov.au/*', collapse='urlkey', limit=1000)\n", "```\n", "\n", "Note that the `url` parameter includes a `*` to indicate that we want everything under the `nla.gov.au` domain. The `collapse='urlkey'` parameter says that we only want unique `urlkey` values – so we'll get just one capture for each individual url within the `nla.gov.au` domain. 
This can be a useful way of gathering a domain-level summary.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Timestamps\n", "\n", "The `timestamp` field is pretty straightforward, it contains the date and time of the capture expressed in the format `YYYYMMDDhhmmss`. Once we have the harvested results in a dataframe, we can easily convert the timestamps into a datetime object." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "df[\"date\"] = pd.to_datetime(df[\"timestamp\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alternatively, we can extract the year and calculate the number of captures per year." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "df[\"year\"] = df[\"timestamp\"].str.slice(0, 4)\n", "df_years = df[\"year\"].value_counts().to_frame().reset_index()\n", "df_years.columns = [\"year\", \"count\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This makes it possible to plot the number of captures over time." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(df_years).mark_bar().encode(x=\"year\", y=\"count:Q\").properties(height=200)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can use the `timestamp` field to filter results by date using the `from` and `to` parameters. For example, to get results from the year 2000 you'd use `from=20000101` and `to=20001231`. However, if you're using my functions above, you'll need to use `from_` rather than `from` as `from` is a reserved word in Python. The function will change it back before sending to the CDX server.\n", "\n", "The `timestamp` field can also be used with the `collapse` parameter. If you include `collapse=timestamp:4`, the server will look at the first four digits of the `timestamp` – ie the year – and only include the first capture from that year. Similarly, `collapse=timestamp:8` should give you a maximum of one capture per hour. In reality, `collapse` is dependent on the order of results and doesn't work perfectly – so you probably want to check your results for duplicates (Pandas `.drop_duplicates()` makes this easy).\n", "\n", "Let's test it out – if it works we should end up with a very boring bar chart showing one result per year..." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "018ae14805174d2cb18119e843059f10", "version_major": 2, "version_minor": 0 }, "text/plain": [ "0it [00:00, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Get the data - note the `collapse` parameter\n", "results_ts = query_cdx_with_key(\"http://www.nla.gov.au\", collapse=\"timestamp:4\")\n", "\n", "# Convert to dataframe\n", "df_ts = pd.DataFrame(results_ts[1:], columns=results_ts[0])\n", "\n", "# Convert timestamp to date\n", "df_ts[\"date\"] = pd.to_datetime(df_ts[\"timestamp\"])\n", "\n", "# Chart number of results per year\n", "alt.Chart(df_ts).mark_bar().encode(x=\"year(date):T\", y=\"count()\").properties(\n", " width=700, height=200\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Original\n", "\n", "As noted above, the `original` field includes the actual url that was captured. You can use the `filter` parameter and regular expressions to limit your results using the `original` value. For example, to only get urls including `www` you could use `filter=original:https*://www.*`. Let's give it a try. Compare the results produced here to those above." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "68e2de6bebe64ade93a21a98bb8b5e15", "version_major": 2, "version_minor": 0 }, "text/plain": [ "0it [00:00, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "https://www.nla.gov.au/ 2005\n", "http://www.nla.gov.au/ 1458\n", "http://www.nla.gov.au:80/ 868\n", "https://www.nla.gov.au 30\n", "http://www.nla.gov.au// 23\n", "http://www.nla.gov.au 11\n", "http://www2.nla.gov.au:80/ 10\n", "http://www.nla.gov.au./ 4\n", "http://www.nla.gov.au:80/? 2\n", "http://www.nla.gov.au:80// 1\n", "http://www.nla.gov.au/? 1\n", "https://www.nla.gov.au// 1\n", "Name: original, dtype: int64" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "results_o = query_cdx_with_key(\n", " \"http://www.nla.gov.au\", filter=\"original:https*://www.*\"\n", ")\n", "\n", "# Convert to dataframe\n", "df_o = pd.DataFrame(results_o[1:], columns=results_o[0])\n", "\n", "df_o[\"original\"].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Mimetypes\n", "\n", "The `mimetype` indicates the type of file captured. There's a long list of [recognised media types](https://www.iana.org/assignments/media-types/media-types.xhtml), but you're only likely to meet a small subset of these in a web archive. The most common, of course, will be `text/html`, but there will also be the various image formats, CSS and Javascript files, and other formats shared via the web like PDFs and Powerpoint files.\n", "\n", "If you're no interested in all the extra bits and pieces, like CSS and Javascript, that make up a web page, you might want to use the `filter` parameter to limit your query results to `text/html`. You can also use regular expressions with `filter`, so if you can't be bothered entering all the possible mimtypes for Powerpoint presentations, you could try something like `filter=['mimetype:.*(powerpoint|presentation).*']`. This uses a regular expression to look for mimetype values that contain either 'powerpoint' or 'presentation'. 
Let's give it a try:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "e9b4365c6ec8425f884c76e2270780d8", "version_major": 2, "version_minor": 0 }, "text/plain": [ "0it [00:00, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "application/vnd.openxmlformats-officedocument.presentationml.presentation 123\n", "application/vnd.ms-powerpoint.presentation.12 82\n", "application/vnd.openxmlformats-officedocument.presentationml.slideshow 3\n", "application/vnd.ms-powerpoint.show.12 3\n", "Name: mimetype, dtype: int64" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "results_m = query_cdx_with_key(\n", " \"*.education.gov.au\", filter=[\"mimetype:.*(powerpoint|presentation).*\"]\n", ")\n", "df_m = pd.DataFrame(results_m[1:], columns=results_m[0])\n", "df_m[\"mimetype\"].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One thing you might notice is that sometimes the `mimetype` value doesn't seem to match the file extension. Let's try looking for captures with a `text/html` mimetype, where the `original` value ends in 'pdf'. We can do this by combining the filters `mimetype:text/html` and `original:.*\\.pdf$`. Note that we're using a regular expression to find the '.pdf' extension." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "6d063c6f8d514bbf92b255585d8f936f", "version_major": 2, "version_minor": 0 }, "text/plain": [ "0it [00:00, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
urlkeytimestamporiginalmimetypestatuscodedigestlength
0au,gov,naa)/%20recordkeeping/dirks/dirksman/di...20230316090540https://www.naa.gov.au/%20recordkeeping/dirks/...text/html404QEBKDGBNMPKFQIDHH36SSN2ESXBDRCOB919
1au,gov,naa)/.../an-approach-green-paper_tcm16-...20140924080305http://www.naa.gov.au/.../An-approach-Green-Pa...text/html302RBAUTMMEDESHYHSQ5PCUWUILGZSLFOIR902
2au,gov,naa)/.../digital-preservation-software-...20141011134912http://www.naa.gov.au/.../Digital-Preservation...text/html3022EKIQ2YLXTDK5CS4VPFJMEZGNMATFEDG913
3au,gov,naa)/.../holt.pdf20141010120041http://www.naa.gov.au/.../holt.pdftext/html302GDGHFNKCSTNJENDMGLHQWNVXUIAIUZYN880
4au,gov,naa)/.../horrie_tcm16-36799.pdf20141011001020http://www.naa.gov.au/.../horrie_tcm16-36799.pdftext/html302IMGTBT5B33WIEHQ7E6MDMIHBW3UMI2MQ889
\n", "
" ], "text/plain": [ " urlkey timestamp \\\n", "0 au,gov,naa)/%20recordkeeping/dirks/dirksman/di... 20230316090540 \n", "1 au,gov,naa)/.../an-approach-green-paper_tcm16-... 20140924080305 \n", "2 au,gov,naa)/.../digital-preservation-software-... 20141011134912 \n", "3 au,gov,naa)/.../holt.pdf 20141010120041 \n", "4 au,gov,naa)/.../horrie_tcm16-36799.pdf 20141011001020 \n", "\n", " original mimetype statuscode \\\n", "0 https://www.naa.gov.au/%20recordkeeping/dirks/... text/html 404 \n", "1 http://www.naa.gov.au/.../An-approach-Green-Pa... text/html 302 \n", "2 http://www.naa.gov.au/.../Digital-Preservation... text/html 302 \n", "3 http://www.naa.gov.au/.../holt.pdf text/html 302 \n", "4 http://www.naa.gov.au/.../horrie_tcm16-36799.pdf text/html 302 \n", "\n", " digest length \n", "0 QEBKDGBNMPKFQIDHH36SSN2ESXBDRCOB 919 \n", "1 RBAUTMMEDESHYHSQ5PCUWUILGZSLFOIR 902 \n", "2 2EKIQ2YLXTDK5CS4VPFJMEZGNMATFEDG 913 \n", "3 GDGHFNKCSTNJENDMGLHQWNVXUIAIUZYN 880 \n", "4 IMGTBT5B33WIEHQ7E6MDMIHBW3UMI2MQ 889 " ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "results_m2 = query_cdx_with_key(\n", " \"naa.gov.au/*\", filter=[\"mimetype:text/html\", r\"original:.*\\.pdf$\"]\n", ")\n", "df_m2 = pd.DataFrame(results_m2[1:], columns=results_m2[0])\n", "df_m2.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It certainly looks a bit weird, but if we look at the status codes we see that most of these captures are actually redirections or errors, so the server's response is HTML even though file requested was a PDF. We'll look more at status codes below." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "404 938\n", "302 620\n", "200 7\n", "Name: statuscode, dtype: int64" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_m2[\"statuscode\"].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Status code\n", "\n", "This is a standard code used by web servers to indicate the result of a file request. A code of `200` indicates everything was ok and the requested file was delivered. A code of `404` means the requested file couldn't be found. Let's look at all the status codes received when attempting to capture `nla.gov.au`." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "301 19256\n", "- 6558\n", "200 3459\n", "302 415\n", "503 3\n", "Name: statuscode, dtype: int64" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"statuscode\"].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we'd expect, most were ok (`200`), but there were a couple of server errors (`503`). The `-` is not a standard status code, it's used in the archiving process to indicate that a duplicate of the file already exists in the archive – these captures also have a `mimetype` of `warc/revisit`. The `301` and `302` codes indicate that the original request was redirected. I look at this in more detail in another notebook, but it's worth thinking for a minute about what redirects are, how they are captured, and how they are played back by the Wayback Machine.\n", "\n", "Sometimes files get moved around on web servers. To avoid a lot of 'not found' errors, servers can be configured to respond to requests for the old addresses with a `301` or `302` response that includes the new address. 
Browsers can then load the new address automatically without you even knowing that the page has moved. It's these exchanges between the server and browser (or web archiving bot) that are being captured and presented through the CDX archive.\n", "\n", "When you try to look at one of these captures in the Wayback Machine, the captured redirect does what redirects are supposed to do and sends you off to the new address. However, in this case you're redirected to an archived version of the file at the new address from *about the same time* as the redirect was captured. The Wayback Machine does this by looking for the capture from the new address that is closest in date to the date of the redirect. There's no guarantee that the new address was captured immediately after the redirect was received, as happens in a normal web browser. As a result, the redirect might take you back or forward in time. Let's try an experiment. Here we take to first twenty `302` responses from `nla.gov.au` and compare the `timestamp` of the captured redirect with the `timestamp` of the page we're actually redirected to by the Wayback Machine." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "57e36e4c796c407894a54ba61e21ed03", "version_major": 2, "version_minor": 0 }, "text/plain": [ "0it [00:00, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "12 hours 13 minutes and 8 seconds later\n", "0 hours 0 minutes and 0 seconds earlier\n", "98 hours 16 minutes and 30 seconds later\n", "15 hours 12 minutes and 8 seconds later\n", "2 hours 6 minutes and 35 seconds earlier\n", "2 hours 2 minutes and 31 seconds earlier\n", "8 hours 7 minutes and 21 seconds earlier\n", "3 hours 43 minutes and 9 seconds later\n", "an hour 6 minutes and 56 seconds later\n", "7 hours 54 minutes and 47 seconds earlier\n" ] } ], "source": [ "results_s = query_cdx_with_key(\"nla.gov.au\", filter=\"statuscode:302\")\n", "for capture in results_s[1:11]:\n", " timestamp = capture[1]\n", " redirect_date = arrow.get(timestamp, \"YYYYMMDDHHmmss\")\n", " response = requests.get(f\"https://web.archive.org/web/{timestamp}id_/{capture[2]}\")\n", " capture_timestamp = re.search(r\"web\\/(\\d{14})\", response.url).group(1)\n", " capture_date = arrow.get(capture_timestamp, \"YYYYMMDDHHmmss\")\n", " direction = \"later\" if capture_date > redirect_date else \"earlier\"\n", " print(\n", " f'{redirect_date.humanize(other=capture_date, granularity=[\"hour\", \"minute\", \"second\"], only_distance=True)} {direction}'\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Does this matter? Probably not, but it's something to be aware of. When we're using something like the Wayback Machine it can seem like we're accessing the live web, but we're not – what we're seeing is an attempt to reconstruct a version of the live web from available captures." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Digest\n", "\n", "The `digest` is an algorithmically generated string that uniquely identifies the contents of the captured url. It's like the file's fingerprint, and it helps us to see when things change. It seems weird that you can represent the complete contents of a file in a short string, but there's nothing too mysterious about it. 
To create the digests, files are first hashed using the [SHA-1 hash function](https://en.wikipedia.org/wiki/SHA-1), and the resulting hashes are then encoded as [Base 32](https://en.wikipedia.org/wiki/Base32). Try it!" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "3VQDI552JQRW5ROPWTSKINAWFWGWQ6CQ\n" ] } ], "source": [ "print(b32encode(sha1(\"This is a string.\".encode()).digest()).decode())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One interesting thing about digests is that small changes to a page can result in very different digests. Let's try adding an exclamation mark to the string above." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MWTI7PY7WJDIBYQKZ2P2Y5UA75UWOSYR\n" ] } ], "source": [ "print(b32encode(sha1(\"This is a string!\".encode()).digest()).decode())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Completely changed! So while digests can tell you two files are different, they can't tell you *how* different." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can use the `digest` field with the `collapse` parameter to filter out identical captures, but this only works if the captures are next to each other in the index. As noted above, if you wanted to remove all duplicates, you'd probably need to use Pandas to process the harvested results.\n", "\n", "If we look again at our initial harvest you might notice something odd." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
urlkeytimestamporiginalmimetypestatuscodedigestlengthdateyear
0au,gov,nla)/19961019064223http://www.nla.gov.au:80/text/html200M5ORM4XQ5QCEZEDRNZRGSWXPCOGUVASI11351996-10-19 06:42:231996
1au,gov,nla)/19961221102755http://www.nla.gov.au:80/text/html200TM4WSQIGWXAXMB36G4GVOY7MVPTO6CSE11381996-12-21 10:27:551996
2au,gov,nla)/19961221132358http://nla.gov.au:80/text/html20065SH4ZQ7ZYTTPYSVFQUSKZXJPZKSI6XA6031996-12-21 13:23:581996
3au,gov,nla)/19961223031839http://www2.nla.gov.au:80/text/html2006XHDP66AXEPMVKVROHHDN6CPZYHZICEX4571996-12-23 03:18:391996
4au,gov,nla)/19970212053405http://www.nla.gov.au:80/text/html200TM4WSQIGWXAXMB36G4GVOY7MVPTO6CSE11411997-02-12 05:34:051997
5au,gov,nla)/19970215222554http://nla.gov.au:80/text/html20065SH4ZQ7ZYTTPYSVFQUSKZXJPZKSI6XA6031997-02-15 22:25:541997
6au,gov,nla)/19970315230640http://www.nla.gov.au:80/text/html200NOUNS3AYAIAOO4LRFD23MQWW3QIGDMFB11261997-03-15 23:06:401997
7au,gov,nla)/19970315230640http://www.nla.gov.au:80/text/html200TM4WSQIGWXAXMB36G4GVOY7MVPTO6CSE11401997-03-15 23:06:401997
8au,gov,nla)/19970413005246http://nla.gov.au:80/text/html20065SH4ZQ7ZYTTPYSVFQUSKZXJPZKSI6XA6031997-04-13 00:52:461997
9au,gov,nla)/19970418074154http://www.nla.gov.au:80/text/html200NOUNS3AYAIAOO4LRFD23MQWW3QIGDMFB11231997-04-18 07:41:541997
\n", "
" ], "text/plain": [ " urlkey timestamp original mimetype \\\n", "0 au,gov,nla)/ 19961019064223 http://www.nla.gov.au:80/ text/html \n", "1 au,gov,nla)/ 19961221102755 http://www.nla.gov.au:80/ text/html \n", "2 au,gov,nla)/ 19961221132358 http://nla.gov.au:80/ text/html \n", "3 au,gov,nla)/ 19961223031839 http://www2.nla.gov.au:80/ text/html \n", "4 au,gov,nla)/ 19970212053405 http://www.nla.gov.au:80/ text/html \n", "5 au,gov,nla)/ 19970215222554 http://nla.gov.au:80/ text/html \n", "6 au,gov,nla)/ 19970315230640 http://www.nla.gov.au:80/ text/html \n", "7 au,gov,nla)/ 19970315230640 http://www.nla.gov.au:80/ text/html \n", "8 au,gov,nla)/ 19970413005246 http://nla.gov.au:80/ text/html \n", "9 au,gov,nla)/ 19970418074154 http://www.nla.gov.au:80/ text/html \n", "\n", " statuscode digest length date \\\n", "0 200 M5ORM4XQ5QCEZEDRNZRGSWXPCOGUVASI 1135 1996-10-19 06:42:23 \n", "1 200 TM4WSQIGWXAXMB36G4GVOY7MVPTO6CSE 1138 1996-12-21 10:27:55 \n", "2 200 65SH4ZQ7ZYTTPYSVFQUSKZXJPZKSI6XA 603 1996-12-21 13:23:58 \n", "3 200 6XHDP66AXEPMVKVROHHDN6CPZYHZICEX 457 1996-12-23 03:18:39 \n", "4 200 TM4WSQIGWXAXMB36G4GVOY7MVPTO6CSE 1141 1997-02-12 05:34:05 \n", "5 200 65SH4ZQ7ZYTTPYSVFQUSKZXJPZKSI6XA 603 1997-02-15 22:25:54 \n", "6 200 NOUNS3AYAIAOO4LRFD23MQWW3QIGDMFB 1126 1997-03-15 23:06:40 \n", "7 200 TM4WSQIGWXAXMB36G4GVOY7MVPTO6CSE 1140 1997-03-15 23:06:40 \n", "8 200 65SH4ZQ7ZYTTPYSVFQUSKZXJPZKSI6XA 603 1997-04-13 00:52:46 \n", "9 200 NOUNS3AYAIAOO4LRFD23MQWW3QIGDMFB 1123 1997-04-18 07:41:54 \n", "\n", " year \n", "0 1996 \n", "1 1996 \n", "2 1996 \n", "3 1996 \n", "4 1997 \n", "5 1997 \n", "6 1997 \n", "7 1997 \n", "8 1997 \n", "9 1997 " ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Rows 1, 4, and 7 all have the same `digest`, but the `length` value is different. How can the files be longer, but the same? We'll look at `length` next, but answer is that the `length` includes the response headers sent by the web server *as well as* the actual content of the file. The length of the headers might change depending on the context in which the file was requested, even though the file itself remains the same.\n", "\n", "Using the `digest` field we can find out how many of the captures in the `nla.gov.au` result set are unique." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "7.97% unique\n" ] } ], "source": [ "print(f'{len(df[\"digest\"].unique()) / df.shape[0]:.2%} unique')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In theory, we should be able to use the `digest` value to check that the file that was originally captured is the same as the file we can access now through the Wayback Machine. Let's give it a try!" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Digests match? True\n", "Digests match? True\n", "Digests match? True\n", "Digests match? True\n", "Digests match? True\n", "Digests match? True\n", "Digests match? True\n", "Digests match? False\n", "Digests match? True\n", "Digests match? True\n" ] } ], "source": [ "for row in results[1:11]:\n", " snapshot_url = f\"https://web.archive.org/web/{row[1]}id_/http://www.nla.gov.au/\"\n", " response = requests.get(snapshot_url)\n", " checksum = b32encode(sha1(response.content).digest())\n", " print(f\"Digests match? 
{checksum.decode() == row[5]}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Hmm, so it seems we can't assume that pages preserved in the web archive will remain unchanged from the moment of capture, but the `digest` does at least give us a way of checking." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Length\n", "\n", "You'd think `length` would be pretty straightforward, but as noted above it includes the headers as well as the file. Also, it's the size of the file and headers stored in compressed form on disk. As a result the length might vary according to the technology used to store the capture. So `length` gives us an indication of the original file size, but not an exact measure." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To use the `length` field in calculations using Pandas, you'll need to make sure it's being stored as an integer." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "df[\"length\"] = df[\"length\"].astype(\"int\")" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
urlkeytimestamporiginalmimetypestatuscodedigestlengthdateyear
0au,gov,nla)/19961019064223http://www.nla.gov.au:80/text/html200M5ORM4XQ5QCEZEDRNZRGSWXPCOGUVASI11351996-10-19 06:42:231996
1au,gov,nla)/19961221102755http://www.nla.gov.au:80/text/html200TM4WSQIGWXAXMB36G4GVOY7MVPTO6CSE11381996-12-21 10:27:551996
2au,gov,nla)/19961221132358http://nla.gov.au:80/text/html20065SH4ZQ7ZYTTPYSVFQUSKZXJPZKSI6XA6031996-12-21 13:23:581996
3au,gov,nla)/19961223031839http://www2.nla.gov.au:80/text/html2006XHDP66AXEPMVKVROHHDN6CPZYHZICEX4571996-12-23 03:18:391996
4au,gov,nla)/19970212053405http://www.nla.gov.au:80/text/html200TM4WSQIGWXAXMB36G4GVOY7MVPTO6CSE11411997-02-12 05:34:051997
..............................
29686au,gov,nla)/20230501080449http://nla.gov.au/text/html301HLNR6AWVWYCU3YAENY3HYHLIPNWN66X73252023-05-01 08:04:492023
29687au,gov,nla)/20230501080546http://nla.gov.au/text/html301HLNR6AWVWYCU3YAENY3HYHLIPNWN66X73272023-05-01 08:05:462023
29688au,gov,nla)/20230501084252http://nla.gov.au/text/html301HLNR6AWVWYCU3YAENY3HYHLIPNWN66X73252023-05-01 08:42:522023
29689au,gov,nla)/20230501084448http://nla.gov.au/text/html301HLNR6AWVWYCU3YAENY3HYHLIPNWN66X73242023-05-01 08:44:482023
29690au,gov,nla)/20230501085234http://nla.gov.au/text/html301HLNR6AWVWYCU3YAENY3HYHLIPNWN66X73252023-05-01 08:52:342023
\n", "

29691 rows × 9 columns

\n", "
" ], "text/plain": [ " urlkey timestamp original mimetype \\\n", "0 au,gov,nla)/ 19961019064223 http://www.nla.gov.au:80/ text/html \n", "1 au,gov,nla)/ 19961221102755 http://www.nla.gov.au:80/ text/html \n", "2 au,gov,nla)/ 19961221132358 http://nla.gov.au:80/ text/html \n", "3 au,gov,nla)/ 19961223031839 http://www2.nla.gov.au:80/ text/html \n", "4 au,gov,nla)/ 19970212053405 http://www.nla.gov.au:80/ text/html \n", "... ... ... ... ... \n", "29686 au,gov,nla)/ 20230501080449 http://nla.gov.au/ text/html \n", "29687 au,gov,nla)/ 20230501080546 http://nla.gov.au/ text/html \n", "29688 au,gov,nla)/ 20230501084252 http://nla.gov.au/ text/html \n", "29689 au,gov,nla)/ 20230501084448 http://nla.gov.au/ text/html \n", "29690 au,gov,nla)/ 20230501085234 http://nla.gov.au/ text/html \n", "\n", " statuscode digest length \\\n", "0 200 M5ORM4XQ5QCEZEDRNZRGSWXPCOGUVASI 1135 \n", "1 200 TM4WSQIGWXAXMB36G4GVOY7MVPTO6CSE 1138 \n", "2 200 65SH4ZQ7ZYTTPYSVFQUSKZXJPZKSI6XA 603 \n", "3 200 6XHDP66AXEPMVKVROHHDN6CPZYHZICEX 457 \n", "4 200 TM4WSQIGWXAXMB36G4GVOY7MVPTO6CSE 1141 \n", "... ... ... ... \n", "29686 301 HLNR6AWVWYCU3YAENY3HYHLIPNWN66X7 325 \n", "29687 301 HLNR6AWVWYCU3YAENY3HYHLIPNWN66X7 327 \n", "29688 301 HLNR6AWVWYCU3YAENY3HYHLIPNWN66X7 325 \n", "29689 301 HLNR6AWVWYCU3YAENY3HYHLIPNWN66X7 324 \n", "29690 301 HLNR6AWVWYCU3YAENY3HYHLIPNWN66X7 325 \n", "\n", " date year \n", "0 1996-10-19 06:42:23 1996 \n", "1 1996-12-21 10:27:55 1996 \n", "2 1996-12-21 13:23:58 1996 \n", "3 1996-12-23 03:18:39 1996 \n", "4 1997-02-12 05:34:05 1997 \n", "... ... ... \n", "29686 2023-05-01 08:04:49 2023 \n", "29687 2023-05-01 08:05:46 2023 \n", "29688 2023-05-01 08:42:52 2023 \n", "29689 2023-05-01 08:44:48 2023 \n", "29690 2023-05-01 08:52:34 2023 \n", "\n", "[29691 rows x 9 columns]" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Putting it all together\n", "\n", "Let's use the `timestamp`, `length`, and `statuscode` fields to look at all the captures of `http://nla.gov.au`." ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(df).mark_point().encode(\n", " x=\"date:T\",\n", " y=\"length:Q\",\n", " color=\"statuscode\",\n", ").properties(width=700)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There's a few interesting things to note. The first is how `statuscode` correlates with `length`. As we noted above, only the header data is captured from redirects, so we'd expect them to be small and fairly consistent in size. But why are there both `200` and `302` responses for the same page? And why the sudden increase in `301` responses in mid-2018? It's also interesting to see how the size of the page has increased over time. But why are there sudden jumps in the `length`?\n", "\n", "To explore all these questions and more, head to the [change in a page over time](change_in_a_page_over_time.ipynb) notebook!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "Created by [Tim Sherratt](https://timsherratt.org) for the [GLAM Workbench](https://glam-workbench.github.io). Support me by becoming a [GitHub sponsor](https://github.com/sponsors/wragge)!\n", "\n", "Work on this notebook was supported by the [IIPC Discretionary Funding Programme 2019-2020](http://netpreserve.org/projects/).\n", "\n", "The Web Archives section of the GLAM Workbench is sponsored by the [British Library](https://www.bl.uk/)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.12" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 4 }