{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Exploring subdomains in the whole of gov.au\n", "\n", "
New to Jupyter notebooks? Try *Using Jupyter notebooks* for a quick introduction.
\n", "\n", "Most of the notebooks in this repository work with small slices of web archive data. In this notebook we'll scale things up a bit to try and find all of the subdomains that have existed in the `gov.au` domain. As in other notebooks, we'll obtain the data by querying the Internet Archive's CDX API. The only real difference is that it will take some hours to harvest all the data.\n", "\n", "All we're interested in this time are unique domain names, so to minimise the amount of data we'll be harvesting we can make use of the CDX API's `collapse` parameter. By setting `collapse=urlkey` we can tell the CDX API to drop records with duplicate `urlkey` values – this should mean we only get one capture per page. However, this only works if the capture records are in adjacent rows, so some duplicates will probably remain. We'll also use the `fl` parameter to limit the fields returned, and the `filter` parameter to limit results by `statuscode` and `mimetype`. So the parameters we'll use are:\n", "\n", "* `url=*.gov.au` – all of the pages in all of the subdomains under `gov.au`\n", "* `collapse=urlkey` – as few captures per page as possible\n", "* `filter=statuscode:200,mimetype:text/html` – only successful captures of HTML pages\n", "* `fl=urlkey,timestamp,original` – only these fields\n", "\n", "Even with these limits, the query will retrieve a LOT of data. To make the harvesting process easier to manage and more robust, I'm going to make use of the `requests-cache` module. This will capture the results of all requests, so that if things get interrupted and we have to restart, we can retrieve already harvested requests from the cache without downloading them again. We'll also write the harvested results directly to disk rather than consuming all our computer's memory. 
The file format will be NDJSON (Newline Delimited JSON) – because each line is a separate JSON object we can just write it a line at a time as the data is received.\n", "\n", "For a general approach to harvesting domain-level information from the IA CDX API see [Harvesting data about a domain using the IA CDX API](harvesting_domain_data.ipynb)\n" ] }, { "cell_type": "code", "execution_count": 142, "metadata": {}, "outputs": [], "source": [ "import requests\n", "from requests.adapters import HTTPAdapter\n", "from requests.packages.urllib3.util.retry import Retry\n", "from tqdm.auto import tqdm\n", "import pandas as pd\n", "import time\n", "from requests_cache import CachedSession\n", "import ndjson\n", "from pathlib import Path\n", "from slugify import slugify\n", "import arrow\n", "import json\n", "import re\n", "from newick import Node\n", "import newick\n", "from ete3 import Tree, TreeStyle\n", "import ipywidgets as widgets\n", "from IPython.display import display, HTML, FileLink\n", "\n", "s = CachedSession()\n", "retries = Retry(total=10, backoff_factor=1, status_forcelist=[ 502, 503, 504 ])\n", "s.mount('https://', HTTPAdapter(max_retries=retries))\n", "s.mount('http://', HTTPAdapter(max_retries=retries))" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "domain = 'gov.au'" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "def get_total_pages(params):\n", " '''\n", " Gets the total number of pages in a set of results.\n", " '''\n", " these_params = params.copy()\n", " these_params['showNumPages'] = 'true'\n", " response = s.get('http://web.archive.org/cdx/search/cdx', params=these_params, headers={'User-Agent': ''})\n", " return int(response.text)\n", "\n", "def prepare_params(url, **kwargs):\n", " '''\n", " Prepare the parameters for a CDX API request.\n", " Adds all supplied keyword arguments as parameters (changing from_ to from).\n", " Adds in a few 
necessary parameters.\n", " '''\n", " params = kwargs\n", " params['url'] = url\n", " params['output'] = 'json'\n", " # CDX accepts a 'from' parameter, but this is a reserved word in Python\n", " # Use 'from_' to pass the value to the function & here we'll change it back to 'from'.\n", " if 'from_' in params:\n", " params['from'] = params['from_']\n", " del(params['from_'])\n", " return params\n", "\n", "def get_cdx_data(params):\n", " '''\n", " Make a request to the CDX API using the supplied parameters and return the JSON results.\n", " '''\n", " response = s.get('http://web.archive.org/cdx/search/cdx', params=params, headers={'User-Agent': ''})\n", " response.raise_for_status()\n", " results = response.json()\n", " if not response.from_cache:\n", " time.sleep(0.2)\n", " return results\n", "\n", "def convert_lists_to_dicts(results):\n", " if results:\n", " keys = results[0]\n", " results_as_dicts = [dict(zip(keys, v)) for v in results[1:]]\n", " else:\n", " results_as_dicts = results\n", " return results_as_dicts\n", "\n", "def get_cdx_data_by_page(url, **kwargs):\n", " page = 0\n", " params = prepare_params(url, **kwargs)\n", " total_pages = get_total_pages(params)\n", " # We'll use a timestamp to distinguish between versions\n", " timestamp = arrow.now().format('YYYYMMDDHHmmss')\n", " file_path = Path(f'{slugify(domain)}-cdx-data-{timestamp}.ndjson')\n", " # Remove any old versions of the data file\n", " try:\n", " file_path.unlink()\n", " except FileNotFoundError:\n", " pass\n", " with tqdm(total=total_pages-page) as pbar1:\n", " with tqdm() as pbar2:\n", " while page < total_pages:\n", " params['page'] = page\n", " results = get_cdx_data(params)\n", " with file_path.open('a') as f:\n", " writer = ndjson.writer(f, ensure_ascii=False)\n", " for result in convert_lists_to_dicts(results):\n", " writer.writerow(result)\n", " page += 1\n", " pbar1.update(1)\n", " # Subtract 1 for the header row; guard against empty pages\n", " pbar2.update(max(len(results) - 1, 0))" ] }, { 
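The CDX API's JSON output is a list of lists, with the field names as the first row, which is why `convert_lists_to_dicts` skips `results[0]`. A quick offline check of the same logic with made-up sample rows:

```python
def convert_lists_to_dicts(results):
    # The first row of the CDX JSON response holds the field names
    if results:
        keys = results[0]
        return [dict(zip(keys, v)) for v in results[1:]]
    return results

# Made-up sample rows in the shape the CDX API returns
sample = [
    ["urlkey", "timestamp", "original"],
    ["au,gov,nla)/", "20050101000000", "http://www.nla.gov.au/"],
]
print(convert_lists_to_dicts(sample))
# [{'urlkey': 'au,gov,nla)/', 'timestamp': '20050101000000', 'original': 'http://www.nla.gov.au/'}]
```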
"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Note that harvesting a domain has the same number of pages (ie requests) no matter what filters are applied -- it's just that some pages will be empty.\n", "# So repeating a domain harvest with different filters will mean less data, but the same number of requests.\n", "# What's most efficient? I dunno.\n", "get_cdx_data_by_page(f'*.{domain}', filter=['statuscode:200', 'mimetype:text/html'], collapse='urlkey', fl='urlkey,timestamp,original', pageSize=5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Process the harvested data\n", "\n", "After many hours, and many interruptions, the harvesting process finally finished. I ended up with a 65GB NDJSON file. How many captures does it include?" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "189,639,944\n", "CPU times: user 1min 4s, sys: 27 s, total: 1min 31s\n", "Wall time: 2min 26s\n" ] } ], "source": [ "%%time\n", "count = 0\n", "with open('gov-au-cdx-data.ndjson') as f:\n", " for line in f:\n", " count += 1\n", "print(f'{count:,}') " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Find unique domains" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's extract a list of unique domains from all of those page captures. In the code below we extract domains from the `urlkey` and add them to a list. After every 100,000 lines, we use `set` to remove duplicates from the list. This is an attempt to find a reasonable balance between speed and memory consumption."
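The periodic de-duplication pattern can be tried out on toy data – a slightly simplified sketch (function name and small chunk size are my own, for demonstration):

```python
def unique_values(lines, chunk=3):
    # Collapse the accumulated list to unique values every `chunk` items,
    # so memory use is bounded even for very long inputs
    # (the notebook uses a chunk of 100,000)
    values, count = [], 0
    for line in lines:
        values.append(line)
        count += 1
        if count >= chunk:
            values = list(set(values))
            count = 0
    return set(values)

print(sorted(unique_values(["a", "b", "a", "c", "b", "a"])))
# ['a', 'b', 'c']
```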
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "c6f2cf0ce77d4e7db04744a152cfdf2a", "version_major": 2, "version_minor": 0 }, "text/plain": [ "HBox(children=(FloatProgress(value=1.0, bar_style='info', max=1.0), HTML(value='')))" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "CPU times: user 16min 7s, sys: 30.8 s, total: 16min 38s\n", "Wall time: 16min 40s\n" ] } ], "source": [ "%%time\n", "# This is slow, but will avoid eating up memory\n", "domains = []\n", "with open('gov-au-cdx-data.ndjson') as f:\n", " count = 0\n", " with tqdm() as pbar:\n", " for line in f:\n", " capture = json.loads(line)\n", " # Split the urlkey on ) to separate domain from path\n", " domain = capture['urlkey'].split(')')[0]\n", " # Remove port numbers\n", " domain = re.sub(r'\\:\\d+', '', domain)\n", " domains.append(domain)\n", " count += 1\n", " # Remove duplicates after every 100,000 lines to conserve memory\n", " if count > 100000:\n", " domains = list(set(domains))\n", " pbar.update(count)\n", " count = 0\n", "domains = list(set(domains))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How many unique domains are there?" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "26233" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(domains)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "df = pd.DataFrame(domains, columns=['urlkey'])\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Save the list of domains to a CSV file so we don't have to extract them again."
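The source of the saving cell hasn't survived in this copy; a minimal version, using the filename visible in the output below and toy stand-in data, might look like:

```python
from pathlib import Path

import pandas as pd

# Toy stand-in for the real domains DataFrame
df = pd.DataFrame(["au,gov,nla", "au,gov,nsw,sl"], columns=["urlkey"])

# Save to the path shown in the cell output
Path("domains").mkdir(exist_ok=True)
df.to_csv("domains/gov-au-unique-domains.csv", index=False)
```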
] }, { "cell_type": "code", "execution_count": 144, "metadata": {}, "outputs": [ { "data": { "text/html": [ "domains/gov-au-unique-domains.csv

|   | urlkey | number_of_pages |
|---|---|---|
| 0 | au,gov,qld,justice,mogservices | 6 |
| 1 | au,gov,health,business | 9 |
| 2 | au,gov,qld,sasvrc | 173 |
| 3 | au,gov,qld,qfes,dmlms | 4 |
| 4 | au,gov,wa,kwinana,maps | 4 |
|   | urlkey | number_of_pages | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | domain |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | au,gov,qld,justice,mogservices | 6 | au | gov | qld | justice | mogservices | None | None | None | None | None | mogservices.justice.qld.gov.au |
| 1 | au,gov,health,business | 9 | au | gov | health | business | None | None | None | None | None | None | business.health.gov.au |
| 2 | au,gov,qld,sasvrc | 173 | au | gov | qld | sasvrc | None | None | None | None | None | None | sasvrc.qld.gov.au |
| 3 | au,gov,qld,qfes,dmlms | 4 | au | gov | qld | qfes | dmlms | None | None | None | None | None | dmlms.qfes.qld.gov.au |
| 4 | au,gov,wa,kwinana,maps | 4 | au | gov | wa | kwinana | maps | None | None | None | None | None | maps.kwinana.wa.gov.au |
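The `urlkey` values are in SURT form (reversed, comma-separated labels). The conversion to the conventional domain names shown above can be sketched as follows (the function name is my own):

```python
import re

def urlkey_to_domain(urlkey):
    # Everything before ')' is the SURT-form domain, e.g. 'au,gov,nla)/about'
    surt = urlkey.split(")")[0]
    # Strip any port number, e.g. 'au,gov,nla:8080'
    surt = re.sub(r":\d+", "", surt)
    # Reverse the comma-separated labels to get the conventional form
    return ".".join(reversed(surt.split(",")))

print(urlkey_to_domain("au,gov,qld,justice,mogservices)/index.html"))
# mogservices.justice.qld.gov.au
```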
|   | domain | number_of_pages |
|---|---|---|
| 24311 | trove.nla.gov.au | 9,285,603 |
| 8551 | nla.gov.au | 2,592,182 |
| 17232 | collectionsearch.nma.gov.au | 2,422,514 |
| 4551 | passwordreset.parliament.qld.gov.au | 2,089,256 |
| 18817 | parlinfo.aph.gov.au | 1,882,646 |
| 2050 | aph.gov.au | 1,731,559 |
| 11539 | bmcc.nsw.gov.au | 1,414,711 |
| 18038 | jobsearch.gov.au | 1,293,760 |
| 4556 | arpansa.gov.au | 1,278,603 |
| 22182 | abs.gov.au | 961,526 |
| 1844 | libero.gtcc.nsw.gov.au | 959,490 |
| 24888 | canterbury.nsw.gov.au | 956,500 |
| 20982 | library.campbelltown.nsw.gov.au | 932,933 |
| 9451 | defencejobs.gov.au | 894,770 |
| 18377 | webopac.gosford.nsw.gov.au | 854,395 |
| 3162 | library.lachlan.nsw.gov.au | 838,972 |
| 6141 | library.shoalhaven.nsw.gov.au | 800,541 |
| 16750 | catalogue.nla.gov.au | 787,616 |
| 25461 | library.bankstown.nsw.gov.au | 767,550 |
| 14964 | myagedcare.gov.au | 759,384 |
[Output: interactive tabs listing domains grouped by state – NSW, VIC, QLD, SA, WA, TAS, NT, ACT]
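The state groupings above (NSW, VIC, QLD, SA, WA, TAS, NT, ACT) can be derived from the label immediately before `gov.au` – a minimal sketch, with the helper name `state_of` my own (federal domains such as `trove.nla.gov.au` fall outside the groups):

```python
STATES = {"nsw", "vic", "qld", "sa", "wa", "tas", "nt", "act"}

def state_of(domain):
    # e.g. 'library.campbelltown.nsw.gov.au' -> 'nsw';
    # federal domains like 'trove.nla.gov.au' return None
    parts = domain.split(".")
    if len(parts) >= 3 and parts[-3] in STATES:
        return parts[-3]
    return None

print(state_of("bmcc.nsw.gov.au"))   # nsw
print(state_of("trove.nla.gov.au"))  # None
```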