{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Exploring the Internet Archive's CDX API" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
New to Jupyter notebooks? Try Using Jupyter notebooks for a quick introduction.
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "CDX APIs provide access to an index of resources captured by web archives. The results can be filtered in a number of ways, making them a convenient way of harvesting and exploring the holdings of web archives. This notebook focuses on the data you can obtain from the Internet Archive's CDX API. For more information on the differences between this and other CDX APIs see [Comparing CDX APIs](comparing_cdx_apis.ipynb). To examine differences between CDX data and Timemaps see [Timemaps vs CDX APIs](getting_all_snapshots_timemap_vs_cdx.ipynb).\n", "\n", "Notebooks demonstrating ways of getting and using CDX data include:\n", "\n", "* [Find all the archived versions of a web page](find_all_captures.ipynb)\n", "* [Harvesting collections of text from archived web pages](getting_text_from_web_pages.ipynb)\n", "* [Harvesting data about a domain using the IA CDX API](harvesting_domain_data.ipynb)\n", "* [Find and explore PowerPoint presentations from a specific domain](explore_presentations.ipynb)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "vegafusion.enable(mimetype='html', row_limit=30000, embed_options=None)" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import re\n", "from base64 import b32encode\n", "from hashlib import sha1\n", "\n", "import altair as alt\n", "import arrow\n", "import pandas as pd\n", "import requests\n", "import vegafusion as vf\n", "from tqdm.auto import tqdm\n", "\n", "vf.enable(row_limit=30000)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Useful resources\n", "\n", "* [Wayback Machine APIs](https://archive.org/help/wayback_api.php)\n", "* [Wayback CDX API](https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server)\n", "* [Archive-It's CDX/C API](https://support.archive-it.org/hc/en-us/articles/115001790023-Access-Archive-It-s-Wayback-index-with-the-CDX-C-API) – includes useful general documentation of CDX format\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Your first CDX request\n", "\n", "Let's have a look at the sort of data the CDX server gives us. At the very least, we have to provide a `url` parameter to point to a particular page (or domain as we'll see below). To avoid flinging too much data about, we'll also add a `limit` parameter that tells the CDX server how many rows of data to give us." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "au,gov,nla)/ 19961019064223 http://www.nla.gov.au:80/ text/html 200 M5ORM4XQ5QCEZEDRNZRGSWXPCOGUVASI 1135\n", "au,gov,nla)/ 19961221102755 http://www.nla.gov.au:80/ text/html 200 TM4WSQIGWXAXMB36G4GVOY7MVPTO6CSE 1138\n", "au,gov,nla)/ 19961221132358 http://nla.gov.au:80/ text/html 200 65SH4ZQ7ZYTTPYSVFQUSKZXJPZKSI6XA 603\n", "au,gov,nla)/ 19961223031839 http://www2.nla.gov.au:80/ text/html 200 6XHDP66AXEPMVKVROHHDN6CPZYHZICEX 457\n", "au,gov,nla)/ 19970212053405 http://www.nla.gov.au:80/ text/html 200 TM4WSQIGWXAXMB36G4GVOY7MVPTO6CSE 1141\n", "au,gov,nla)/ 19970215222554 http://nla.gov.au:80/ text/html 200 65SH4ZQ7ZYTTPYSVFQUSKZXJPZKSI6XA 603\n", "au,gov,nla)/ 19970315230640 http://www.nla.gov.au:80/ text/html 200 NOUNS3AYAIAOO4LRFD23MQWW3QIGDMFB 1126\n", "au,gov,nla)/ 19970315230640 http://www.nla.gov.au:80/ text/html 200 TM4WSQIGWXAXMB36G4GVOY7MVPTO6CSE 1140\n", "au,gov,nla)/ 19970413005246 http://nla.gov.au:80/ text/html 200 65SH4ZQ7ZYTTPYSVFQUSKZXJPZKSI6XA 603\n", "au,gov,nla)/ 19970418074154 http://www.nla.gov.au:80/ text/html 200 NOUNS3AYAIAOO4LRFD23MQWW3QIGDMFB 1123\n", "\n" ] } ], "source": [ "# 8 April 2020 -- without the 'User-Agent' header parameter I get a 445 error\n", "# 27 April 2020 - now seems ok without changing User-Agent\n", "\n", 
"# Feel free to change these values\n", "params1 = {\"url\": \"http://nla.gov.au\", \"limit\": 10}\n", "\n", "# Get the data and print the results\n", "response = requests.get(\"https://web.archive.org/cdx/search/cdx\", params=params1)\n", "print(response.text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By default, the results are returned in a simple text format – fields are separated by spaces, and each result is on a separate line. It's a bit hard to read in this format, so let's add the `output` parameter to get the results in JSON format. We'll then use Pandas to display the results in a table." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| | urlkey | timestamp | original | mimetype | statuscode | digest | length |
|---|---|---|---|---|---|---|---|
| 0 | au,gov,nla)/ | 19961019064223 | http://www.nla.gov.au:80/ | text/html | 200 | M5ORM4XQ5QCEZEDRNZRGSWXPCOGUVASI | 1135 |
| 1 | au,gov,nla)/ | 19961221102755 | http://www.nla.gov.au:80/ | text/html | 200 | TM4WSQIGWXAXMB36G4GVOY7MVPTO6CSE | 1138 |
| 2 | au,gov,nla)/ | 19961221132358 | http://nla.gov.au:80/ | text/html | 200 | 65SH4ZQ7ZYTTPYSVFQUSKZXJPZKSI6XA | 603 |
| 3 | au,gov,nla)/ | 19961223031839 | http://www2.nla.gov.au:80/ | text/html | 200 | 6XHDP66AXEPMVKVROHHDN6CPZYHZICEX | 457 |
| 4 | au,gov,nla)/ | 19970212053405 | http://www.nla.gov.au:80/ | text/html | 200 | TM4WSQIGWXAXMB36G4GVOY7MVPTO6CSE | 1141 |
| 5 | au,gov,nla)/ | 19970215222554 | http://nla.gov.au:80/ | text/html | 200 | 65SH4ZQ7ZYTTPYSVFQUSKZXJPZKSI6XA | 603 |
| 6 | au,gov,nla)/ | 19970315230640 | http://www.nla.gov.au:80/ | text/html | 200 | NOUNS3AYAIAOO4LRFD23MQWW3QIGDMFB | 1126 |
| 7 | au,gov,nla)/ | 19970315230640 | http://www.nla.gov.au:80/ | text/html | 200 | TM4WSQIGWXAXMB36G4GVOY7MVPTO6CSE | 1140 |
| 8 | au,gov,nla)/ | 19970413005246 | http://nla.gov.au:80/ | text/html | 200 | 65SH4ZQ7ZYTTPYSVFQUSKZXJPZKSI6XA | 603 |
| 9 | au,gov,nla)/ | 19970418074154 | http://www.nla.gov.au:80/ | text/html | 200 | NOUNS3AYAIAOO4LRFD23MQWW3QIGDMFB | 1123 |

| | urlkey | timestamp | original | mimetype | statuscode | digest | length |
|---|---|---|---|---|---|---|---|
| 0 | au,gov,naa)/%20recordkeeping/dirks/dirksman/di... | 20230316090540 | https://www.naa.gov.au/%20recordkeeping/dirks/... | text/html | 404 | QEBKDGBNMPKFQIDHH36SSN2ESXBDRCOB | 919 |
| 1 | au,gov,naa)/.../an-approach-green-paper_tcm16-... | 20140924080305 | http://www.naa.gov.au/.../An-approach-Green-Pa... | text/html | 302 | RBAUTMMEDESHYHSQ5PCUWUILGZSLFOIR | 902 |
| 2 | au,gov,naa)/.../digital-preservation-software-... | 20141011134912 | http://www.naa.gov.au/.../Digital-Preservation... | text/html | 302 | 2EKIQ2YLXTDK5CS4VPFJMEZGNMATFEDG | 913 |
| 3 | au,gov,naa)/.../holt.pdf | 20141010120041 | http://www.naa.gov.au/.../holt.pdf | text/html | 302 | GDGHFNKCSTNJENDMGLHQWNVXUIAIUZYN | 880 |
| 4 | au,gov,naa)/.../horrie_tcm16-36799.pdf | 20141011001020 | http://www.naa.gov.au/.../horrie_tcm16-36799.pdf | text/html | 302 | IMGTBT5B33WIEHQ7E6MDMIHBW3UMI2MQ | 889 |

| | urlkey | timestamp | original | mimetype | statuscode | digest | length | date | year |
|---|---|---|---|---|---|---|---|---|---|
| 0 | au,gov,nla)/ | 19961019064223 | http://www.nla.gov.au:80/ | text/html | 200 | M5ORM4XQ5QCEZEDRNZRGSWXPCOGUVASI | 1135 | 1996-10-19 06:42:23 | 1996 |
| 1 | au,gov,nla)/ | 19961221102755 | http://www.nla.gov.au:80/ | text/html | 200 | TM4WSQIGWXAXMB36G4GVOY7MVPTO6CSE | 1138 | 1996-12-21 10:27:55 | 1996 |
| 2 | au,gov,nla)/ | 19961221132358 | http://nla.gov.au:80/ | text/html | 200 | 65SH4ZQ7ZYTTPYSVFQUSKZXJPZKSI6XA | 603 | 1996-12-21 13:23:58 | 1996 |
| 3 | au,gov,nla)/ | 19961223031839 | http://www2.nla.gov.au:80/ | text/html | 200 | 6XHDP66AXEPMVKVROHHDN6CPZYHZICEX | 457 | 1996-12-23 03:18:39 | 1996 |
| 4 | au,gov,nla)/ | 19970212053405 | http://www.nla.gov.au:80/ | text/html | 200 | TM4WSQIGWXAXMB36G4GVOY7MVPTO6CSE | 1141 | 1997-02-12 05:34:05 | 1997 |
| 5 | au,gov,nla)/ | 19970215222554 | http://nla.gov.au:80/ | text/html | 200 | 65SH4ZQ7ZYTTPYSVFQUSKZXJPZKSI6XA | 603 | 1997-02-15 22:25:54 | 1997 |
| 6 | au,gov,nla)/ | 19970315230640 | http://www.nla.gov.au:80/ | text/html | 200 | NOUNS3AYAIAOO4LRFD23MQWW3QIGDMFB | 1126 | 1997-03-15 23:06:40 | 1997 |
| 7 | au,gov,nla)/ | 19970315230640 | http://www.nla.gov.au:80/ | text/html | 200 | TM4WSQIGWXAXMB36G4GVOY7MVPTO6CSE | 1140 | 1997-03-15 23:06:40 | 1997 |
| 8 | au,gov,nla)/ | 19970413005246 | http://nla.gov.au:80/ | text/html | 200 | 65SH4ZQ7ZYTTPYSVFQUSKZXJPZKSI6XA | 603 | 1997-04-13 00:52:46 | 1997 |
| 9 | au,gov,nla)/ | 19970418074154 | http://www.nla.gov.au:80/ | text/html | 200 | NOUNS3AYAIAOO4LRFD23MQWW3QIGDMFB | 1123 | 1997-04-18 07:41:54 | 1997 |

| | urlkey | timestamp | original | mimetype | statuscode | digest | length | date | year |
|---|---|---|---|---|---|---|---|---|---|
| 0 | au,gov,nla)/ | 19961019064223 | http://www.nla.gov.au:80/ | text/html | 200 | M5ORM4XQ5QCEZEDRNZRGSWXPCOGUVASI | 1135 | 1996-10-19 06:42:23 | 1996 |
| 1 | au,gov,nla)/ | 19961221102755 | http://www.nla.gov.au:80/ | text/html | 200 | TM4WSQIGWXAXMB36G4GVOY7MVPTO6CSE | 1138 | 1996-12-21 10:27:55 | 1996 |
| 2 | au,gov,nla)/ | 19961221132358 | http://nla.gov.au:80/ | text/html | 200 | 65SH4ZQ7ZYTTPYSVFQUSKZXJPZKSI6XA | 603 | 1996-12-21 13:23:58 | 1996 |
| 3 | au,gov,nla)/ | 19961223031839 | http://www2.nla.gov.au:80/ | text/html | 200 | 6XHDP66AXEPMVKVROHHDN6CPZYHZICEX | 457 | 1996-12-23 03:18:39 | 1996 |
| 4 | au,gov,nla)/ | 19970212053405 | http://www.nla.gov.au:80/ | text/html | 200 | TM4WSQIGWXAXMB36G4GVOY7MVPTO6CSE | 1141 | 1997-02-12 05:34:05 | 1997 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 29686 | au,gov,nla)/ | 20230501080449 | http://nla.gov.au/ | text/html | 301 | HLNR6AWVWYCU3YAENY3HYHLIPNWN66X7 | 325 | 2023-05-01 08:04:49 | 2023 |
| 29687 | au,gov,nla)/ | 20230501080546 | http://nla.gov.au/ | text/html | 301 | HLNR6AWVWYCU3YAENY3HYHLIPNWN66X7 | 327 | 2023-05-01 08:05:46 | 2023 |
| 29688 | au,gov,nla)/ | 20230501084252 | http://nla.gov.au/ | text/html | 301 | HLNR6AWVWYCU3YAENY3HYHLIPNWN66X7 | 325 | 2023-05-01 08:42:52 | 2023 |
| 29689 | au,gov,nla)/ | 20230501084448 | http://nla.gov.au/ | text/html | 301 | HLNR6AWVWYCU3YAENY3HYHLIPNWN66X7 | 324 | 2023-05-01 08:44:48 | 2023 |
| 29690 | au,gov,nla)/ | 20230501085234 | http://nla.gov.au/ | text/html | 301 | HLNR6AWVWYCU3YAENY3HYHLIPNWN66X7 | 325 | 2023-05-01 08:52:34 | 2023 |

29691 rows × 9 columns
\n", "