{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exploring subdomains in the whole of gov.au\n",
    "\n",
    "<p class=\"alert alert-info\">New to Jupyter notebooks? Try <a href=\"getting-started/Using_Jupyter_notebooks.ipynb\"><b>Using Jupyter notebooks</b></a> for a quick introduction.</p>\n",
    "\n",
    "Most of the notebooks in this repository work with small slices of web archive data. In this notebook we'll scale things up a bit to try and find all of the subdomains that have existed in the `gov.au` domain. As in other notebooks, we'll obtain the data by querying the Internet Archive's CDX API. The only real difference is that it will take some hours to harvest all the data.\n",
    "\n",
    "All we're interested in this time are unique domain names, so to minimise the amount of data we'll be harvesting we can make use of the CDX API's `collapse` parameter. By setting `collapse=urlkey` we can tell the CDX API to drop records with duplicate `urlkey` values – this should mean we only get one capture per page. However, this only works if the capture records are in adjacent rows, so there probably will still be some duplicates. We'll also use the `fl` to limit the fields returned, and the `filter` parameter to limit results by `statuscode` and `mimetype`. So the parameters we'll use are:\n",
    "\n",
    "* `url=*.gov.au` – all of the pages in all of the subdomains under `gov.au`\n",
    "* `collapse=urlkey` – as few captures per page as possible\n",
    "* `filter=statuscode:200,mimetype:text/html` – only successful captures of HTML pages\n",
    "* `fl=urlkey,timestamp,original` – only these fields\n",
    "\n",
    "Even with these limits, the query will retrieve a LOT of data. To make the harvesting process easier to manage and more robust, I'm going to make use of the `requests-cache` module. This will capture the results of all requests, so that if things get interrupted and we have to restart, we can retrieve already harvested requests from the cache without downloading them again. We'll also write the harvested results directly to disk rather than consuming all our computer's memory. The file format will be the NDJSON (Newline Delineated JSON) format – because each line is a separate JSON object we can just write it a line at a time as the data is received.\n",
    "\n",
    "For a general approach to harvesting domain-level information from the IA CDX API see [Harvesting data about a domain using the IA CDX API](harvesting_domain_data.ipynb)\n",
    "\n",
    "If you'd like to access pre-harvested datasets, you can download the following files from Cloudstor:\n",
    "\n",
    "* [gov-au-cdx-data-20220406105227.ndjson](https://cloudstor.aarnet.edu.au/plus/s/F3BjoCaS5U3BHCh/download?path=gov-au-cdx-data-20220406105227.ndjson) (75.7gb) – this is the raw data saved as newline delimited JSON\n",
    "* [gov-au-domains-split-20220406155220.csv](https://cloudstor.aarnet.edu.au/plus/s/F3BjoCaS5U3BHCh/download?path=gov-au-domains-split-20220406155220.csv) (2.4mb) – this is a CSV file containing unique domains, split into subdomains\n",
    "* [gov-au-unique-domains-20220406131052.csv](https://cloudstor.aarnet.edu.au/plus/s/F3BjoCaS5U3BHCh/download?path=gov-au-unique-domains-20220406131052.csv) – this is a CSV file containing unique domains in SURT format\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "import os\n",
    "import re\n",
    "import time\n",
    "from pathlib import Path\n",
    "\n",
    "import arrow\n",
    "import ndjson\n",
    "import newick\n",
    "import pandas as pd\n",
    "import requests\n",
    "from ete3 import Tree, TreeStyle\n",
    "from IPython.display import HTML, FileLink, display\n",
    "from newick import Node\n",
    "from requests.adapters import HTTPAdapter\n",
    "from requests.packages.urllib3.util.retry import Retry\n",
    "from requests_cache import CachedSession\n",
    "from slugify import slugify\n",
    "from tqdm.auto import tqdm\n",
    "\n",
    "os.environ[\"QT_QPA_PLATFORM\"] = \"offscreen\"\n",
    "\n",
    "s = CachedSession()\n",
    "retries = Retry(total=10, backoff_factor=1, status_forcelist=[502, 503, 504])\n",
    "s.mount(\"https://\", HTTPAdapter(max_retries=retries))\n",
    "s.mount(\"http://\", HTTPAdapter(max_retries=retries))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "domain = \"gov.au\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_total_pages(params):\n",
    "    \"\"\"\n",
    "    Gets the total number of pages in a set of results.\n",
    "    \"\"\"\n",
    "    these_params = params.copy()\n",
    "    these_params[\"showNumPages\"] = \"true\"\n",
    "    response = s.get(\n",
    "        \"http://web.archive.org/cdx/search/cdx\",\n",
    "        params=these_params,\n",
    "        headers={\"User-Agent\": \"\"},\n",
    "    )\n",
    "    return int(response.text)\n",
    "\n",
    "\n",
    "def prepare_params(url, **kwargs):\n",
    "    \"\"\"\n",
    "    Prepare the parameters for a CDX API requests.\n",
    "    Adds all supplied keyword arguments as parameters (changing from_ to from).\n",
    "    Adds in a few necessary parameters.\n",
    "    \"\"\"\n",
    "    params = kwargs\n",
    "    params[\"url\"] = url\n",
    "    params[\"output\"] = \"json\"\n",
    "    # CDX accepts a 'from' parameter, but this is a reserved word in Python\n",
    "    # Use 'from_' to pass the value to the function & here we'll change it back to 'from'.\n",
    "    if \"from_\" in params:\n",
    "        params[\"from\"] = params[\"from_\"]\n",
    "        del params[\"from_\"]\n",
    "    return params\n",
    "\n",
    "\n",
    "def get_cdx_data(params):\n",
    "    \"\"\"\n",
    "    Make a request to the CDX API using the supplied parameters.\n",
    "    Check the results for a resumption key, and return the key (if any) and the results.\n",
    "    \"\"\"\n",
    "    try:\n",
    "        response = s.get(\n",
    "            \"http://web.archive.org/cdx/search/cdx\", params=params, timeout=120\n",
    "        )\n",
    "    # Some pages generate errors -- seems to be a problem at server end, so we'll ignore.\n",
    "    # This could mean some data is lost?\n",
    "    except requests.exceptions.ChunkedEncodingError:\n",
    "        print(f'Error page {params[\"page\"]}')\n",
    "        return None\n",
    "    else:\n",
    "        response.raise_for_status()\n",
    "        results = response.json()\n",
    "        if not response.from_cache:\n",
    "            time.sleep(0.2)\n",
    "    return results\n",
    "\n",
    "\n",
    "def convert_lists_to_dicts(results):\n",
    "    if results:\n",
    "        keys = results[0]\n",
    "        results_as_dicts = [dict(zip(keys, v)) for v in results[1:]]\n",
    "    else:\n",
    "        results_as_dicts = results\n",
    "    return results_as_dicts\n",
    "\n",
    "\n",
    "def get_cdx_data_by_page(url, **kwargs):\n",
    "    page = 0\n",
    "    params = prepare_params(url, **kwargs)\n",
    "    total_pages = get_total_pages(params)\n",
    "    # We'll use a timestamp to distinguish between versions\n",
    "    timestamp = arrow.now().format(\"YYYYMMDDHHmmss\")\n",
    "    file_path = Path(f\"{slugify(domain)}-cdx-data-{timestamp}.ndjson\")\n",
    "    # Remove any old versions of the data file\n",
    "    try:\n",
    "        file_path.unlink()\n",
    "    except FileNotFoundError:\n",
    "        pass\n",
    "    with tqdm(total=total_pages - page) as pbar1:\n",
    "        with tqdm() as pbar2:\n",
    "            while page < total_pages:\n",
    "                params[\"page\"] = page\n",
    "                results = get_cdx_data(params)\n",
    "                if results:\n",
    "                    with file_path.open(\"a\") as f:\n",
    "                        writer = ndjson.writer(f, ensure_ascii=False)\n",
    "                        for result in convert_lists_to_dicts(results):\n",
    "                            writer.writerow(result)\n",
    "                    pbar2.update(len(results) - 1)\n",
    "                page += 1\n",
    "                pbar1.update(1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "nbval-skip"
    ]
   },
   "outputs": [],
   "source": [
    "# Note than harvesting a domain has the same number of pages (ie requests) no matter what filters are applied -- it's just that some pages will be empty.\n",
    "# So repeating a domain harvest with different filters will mean less data, but the same number of requests.\n",
    "# What's most efficient? I dunno.\n",
    "get_cdx_data_by_page(\n",
    "    f\"*.{domain}\",\n",
    "    filter=[\"statuscode:200\", \"mimetype:text/html\"],\n",
    "    collapse=\"urlkey\",\n",
    "    fl=\"urlkey,timestamp,original\",\n",
    "    pageSize=5,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Process the harvested data\n",
    "\n",
    "After many hours, and many interruptions, the harvesting process finally finished. I ended up with a 65gb ndjson file. How many captures does it include?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "tags": [
     "nbval-skip"
    ]
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "213,107,491\n",
      "CPU times: user 38.5 s, sys: 29.3 s, total: 1min 7s\n",
      "Wall time: 1min 24s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "\n",
    "latest_data = sorted(list(Path(\".\").glob(\"gov-au-cdx-data-*\")), reverse=True)[0]\n",
    "\n",
    "count = 0\n",
    "with latest_data.open() as f:\n",
    "    for line in f:\n",
    "        count += 1\n",
    "print(f\"{count:,}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Find unique domains"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's get extract a list of unique domains from all of those page captures. In the code below we extract domains from the `urlkey` and add them to a list. After every 100,000 lines, we use `set` to remove duplicates from the list. This is an attempt to find a reasonable balance between speed and memory consumption."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {
    "tags": [
     "nbval-skip"
    ]
   },
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "da953625af9e49a6a04672c2dac6d62e",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "0it [00:00, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 10min 47s, sys: 21.6 s, total: 11min 9s\n",
      "Wall time: 11min 8s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "# This is slow, but will avoid eating up memory\n",
    "domains = []\n",
    "with latest_data.open() as f:\n",
    "    count = 0\n",
    "    with tqdm() as pbar:\n",
    "        for line in f:\n",
    "            capture = json.loads(line)\n",
    "            # Split the urlkey on ) to separate domain from path\n",
    "            domain = capture[\"urlkey\"].split(\")\")[0]\n",
    "            # Remove port numbers\n",
    "            domain = re.sub(r\"\\:\\d+\", \"\", domain)\n",
    "            domains.append(domain)\n",
    "            count += 1\n",
    "            # Remove duplicates after every 100,000 lines to conserve memory\n",
    "            if count > 100000:\n",
    "                domains = list(set(domains))\n",
    "                pbar.update(count)\n",
    "                count = 0\n",
    "domains = list(set(domains))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "How many unique domains are there?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {
    "tags": [
     "nbval-skip"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "28461"
      ]
     },
     "execution_count": 40,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(domains)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {
    "tags": [
     "nbval-skip"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>urlkey</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>au,gov,qld,rockhampton</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>au,gov,ag,sat</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>au,gov,vic,ffm,confluence</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>au,gov,nsw,dumaresq</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>au,gov,wa,dpc,scienceandinnovation</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                               urlkey\n",
       "0              au,gov,qld,rockhampton\n",
       "1                       au,gov,ag,sat\n",
       "2           au,gov,vic,ffm,confluence\n",
       "3                 au,gov,nsw,dumaresq\n",
       "4  au,gov,wa,dpc,scienceandinnovation"
      ]
     },
     "execution_count": 41,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.DataFrame(domains, columns=[\"urlkey\"])\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Save the list of domains to a CSV file to save us having to extract them again."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {
    "tags": [
     "nbval-skip"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<a href='domains/gov-au-unique-domains-20220406131052.csv' target='_blank'>domains/gov-au-unique-domains-20220406131052.csv</a><br>"
      ],
      "text/plain": [
       "/home/tim/Workspace/mycode/glam-workbench/webarchives/notebooks/domains/gov-au-unique-domains-20220406131052.csv"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "unique_filename = (\n",
    "    f'domains/gov-au-unique-domains-{arrow.now().format(\"YYYYMMDDHHmmss\")}.csv'\n",
    ")\n",
    "df.to_csv(unique_filename, index=False)\n",
    "display(FileLink(unique_filename))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Reload the list of domains from the CSV if necessary."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "latest_domains = sorted(\n",
    "    list(Path(\"domains\").glob(\"gov-au-unique-domains-*\")), reverse=True\n",
    ")[0]\n",
    "domains = pd.read_csv(latest_domains)[\"urlkey\"].to_list()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Number of unique urls per subdomain\n",
    "\n",
    "Now that we have a list of unique domains we can use this to generate a count of unique urls per subdomain. This won't be exact. As noted previously, even with `collapse` set to `urlkey` there are likely to be duplicate urls. Getting rid of all the duplicates in such a large file would require a fair bit of processing, and I'm not sure it's worth it at this point. We really just want a sense of how subdomains are actually used."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {
    "tags": [
     "nbval-skip"
    ]
   },
   "outputs": [],
   "source": [
    "# Create a dictionary with the domains as keys and the values set to zero\n",
    "domain_counts = dict(zip(domains, [0] * len(domains)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {
    "tags": [
     "nbval-skip"
    ]
   },
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "5bf731c56cb6483b8223ff3a3ab9bf3c",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "0it [00:00, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 10min 56s, sys: 18.7 s, total: 11min 15s\n",
      "Wall time: 11min 14s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "# FIND NUMBER OF URLS PER DOMAIN\n",
    "# As above we'll go though the file line by line\n",
    "# but this time we'll extract the domain and increment the corresponding value in the dict.\n",
    "with latest_data.open() as f:\n",
    "    count = 0\n",
    "    with tqdm() as pbar:\n",
    "        for line in f:\n",
    "            capture = json.loads(line)\n",
    "            # Split the urlkey on ) to separate domain from path\n",
    "            domain = capture[\"urlkey\"].split(\")\")[0]\n",
    "            domain = re.sub(r\"\\:\\d+\", \"\", domain)\n",
    "            # Increment domain count\n",
    "            domain_counts[domain] += 1\n",
    "            count += 1\n",
    "            # This is just to update the progress bar\n",
    "            if count > 100000:\n",
    "                pbar.update(count)\n",
    "                count = 0"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Convert to a dataframe\n",
    "\n",
    "We'll now convert the data to a dataframe and do a bit more processing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {
    "tags": [
     "nbval-skip"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>urlkey</th>\n",
       "      <th>number_of_pages</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>au,gov,qld,rockhampton</td>\n",
       "      <td>10069</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>au,gov,ag,sat</td>\n",
       "      <td>14</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>au,gov,vic,ffm,confluence</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>au,gov,nsw,dumaresq</td>\n",
       "      <td>33</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>au,gov,wa,dpc,scienceandinnovation</td>\n",
       "      <td>438</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                               urlkey  number_of_pages\n",
       "0              au,gov,qld,rockhampton            10069\n",
       "1                       au,gov,ag,sat               14\n",
       "2           au,gov,vic,ffm,confluence                5\n",
       "3                 au,gov,nsw,dumaresq               33\n",
       "4  au,gov,wa,dpc,scienceandinnovation              438"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Reshape dict as a list of dicts\n",
    "domain_counts_as_list = [\n",
    "    {\"urlkey\": k, \"number_of_pages\": v} for k, v in domain_counts.items()\n",
    "]\n",
    "\n",
    "# Convert to dataframe\n",
    "df_counts = pd.DataFrame(domain_counts_as_list)\n",
    "df_counts.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we're going to split the `urlkey` into its separate subdomains."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {
    "tags": [
     "nbval-skip"
    ]
   },
   "outputs": [],
   "source": [
    "# Split the urlkey on commas into separate columns -- this creates a new df\n",
    "df_split = df_counts[\"urlkey\"].str.split(\",\", expand=True)\n",
    "\n",
    "# Merge the new df back with the original so we have both the urlkey and it's components\n",
    "df_merged = pd.merge(df_counts, df_split, left_index=True, right_index=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, we'll stich the subdomains back together in a traditional domain format just for readability."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {
    "tags": [
     "nbval-skip"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>urlkey</th>\n",
       "      <th>number_of_pages</th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "      <th>4</th>\n",
       "      <th>5</th>\n",
       "      <th>6</th>\n",
       "      <th>7</th>\n",
       "      <th>8</th>\n",
       "      <th>9</th>\n",
       "      <th>domain</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>au,gov,qld,rockhampton</td>\n",
       "      <td>10069</td>\n",
       "      <td>au</td>\n",
       "      <td>gov</td>\n",
       "      <td>qld</td>\n",
       "      <td>rockhampton</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>rockhampton.qld.gov.au</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>au,gov,ag,sat</td>\n",
       "      <td>14</td>\n",
       "      <td>au</td>\n",
       "      <td>gov</td>\n",
       "      <td>ag</td>\n",
       "      <td>sat</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>sat.ag.gov.au</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>au,gov,vic,ffm,confluence</td>\n",
       "      <td>5</td>\n",
       "      <td>au</td>\n",
       "      <td>gov</td>\n",
       "      <td>vic</td>\n",
       "      <td>ffm</td>\n",
       "      <td>confluence</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>confluence.ffm.vic.gov.au</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>au,gov,nsw,dumaresq</td>\n",
       "      <td>33</td>\n",
       "      <td>au</td>\n",
       "      <td>gov</td>\n",
       "      <td>nsw</td>\n",
       "      <td>dumaresq</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>dumaresq.nsw.gov.au</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>au,gov,wa,dpc,scienceandinnovation</td>\n",
       "      <td>438</td>\n",
       "      <td>au</td>\n",
       "      <td>gov</td>\n",
       "      <td>wa</td>\n",
       "      <td>dpc</td>\n",
       "      <td>scienceandinnovation</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>scienceandinnovation.dpc.wa.gov.au</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                               urlkey  number_of_pages   0    1    2  \\\n",
       "0              au,gov,qld,rockhampton            10069  au  gov  qld   \n",
       "1                       au,gov,ag,sat               14  au  gov   ag   \n",
       "2           au,gov,vic,ffm,confluence                5  au  gov  vic   \n",
       "3                 au,gov,nsw,dumaresq               33  au  gov  nsw   \n",
       "4  au,gov,wa,dpc,scienceandinnovation              438  au  gov   wa   \n",
       "\n",
       "             3                     4     5     6     7     8     9  \\\n",
       "0  rockhampton                  None  None  None  None  None  None   \n",
       "1          sat                  None  None  None  None  None  None   \n",
       "2          ffm            confluence  None  None  None  None  None   \n",
       "3     dumaresq                  None  None  None  None  None  None   \n",
       "4          dpc  scienceandinnovation  None  None  None  None  None   \n",
       "\n",
       "                               domain  \n",
       "0              rockhampton.qld.gov.au  \n",
       "1                       sat.ag.gov.au  \n",
       "2           confluence.ffm.vic.gov.au  \n",
       "3                 dumaresq.nsw.gov.au  \n",
       "4  scienceandinnovation.dpc.wa.gov.au  "
      ]
     },
     "execution_count": 40,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "def join_domain(x):\n",
    "    parts = x.split(\",\")\n",
    "    parts.reverse()\n",
    "    return \".\".join(parts)\n",
    "\n",
    "\n",
    "df_merged[\"domain\"] = df_merged[\"urlkey\"].apply(join_domain)\n",
    "df_merged.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {
    "tags": [
     "nbval-skip"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<a href='domains/gov-au-domains-split-20220406155220.csv' target='_blank'>domains/gov-au-domains-split-20220406155220.csv</a><br>"
      ],
      "text/plain": [
       "/home/tim/Workspace/mycode/glam-workbench/webarchives/notebooks/domains/gov-au-domains-split-20220406155220.csv"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "split_filename = (\n",
    "    f'domains/gov-au-domains-split-{arrow.now().format(\"YYYYMMDDHHmmss\")}.csv'\n",
    ")\n",
    "df_merged.to_csv(split_filename, index=False)\n",
    "display(FileLink(split_filename))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "latest_split = sorted(\n",
    "    list(Path(\"domains\").glob(\"gov-au-domains-split-*\")), reverse=True\n",
    ")[0]\n",
    "df_merged = pd.read_csv(latest_split)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Let's count things!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "How many third level domains are there?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1825"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(pd.unique(df_merged[\"2\"]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Which third level domains have the most subdomains?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "nsw         8087\n",
       "vic         3796\n",
       "qld         3021\n",
       "wa          2862\n",
       "sa          1892\n",
       "tas         1026\n",
       "nt           792\n",
       "act          385\n",
       "embassy      152\n",
       "nla          140\n",
       "govspace     111\n",
       "ga            98\n",
       "health        84\n",
       "ato           82\n",
       "govcms        79\n",
       "deewr         77\n",
       "treasury      75\n",
       "dest          69\n",
       "abs           68\n",
       "bom           63\n",
       "Name: 2, dtype: int64"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_merged[\"2\"].value_counts()[:20]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Which domains have the most unique pages?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<style type=\"text/css\">\n",
       "</style>\n",
       "<table id=\"T_e14a8\">\n",
       "  <thead>\n",
       "    <tr>\n",
       "      <th class=\"blank level0\" >&nbsp;</th>\n",
       "      <th id=\"T_e14a8_level0_col0\" class=\"col_heading level0 col0\" >domain</th>\n",
       "      <th id=\"T_e14a8_level0_col1\" class=\"col_heading level0 col1\" >number_of_pages</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th id=\"T_e14a8_level0_row0\" class=\"row_heading level0 row0\" >6469</th>\n",
       "      <td id=\"T_e14a8_row0_col0\" class=\"data row0 col0\" >trove.nla.gov.au</td>\n",
       "      <td id=\"T_e14a8_row0_col1\" class=\"data row0 col1\" >14,053,042</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th id=\"T_e14a8_level0_row1\" class=\"row_heading level0 row1\" >20708</th>\n",
       "      <td id=\"T_e14a8_row1_col0\" class=\"data row1 col0\" >nla.gov.au</td>\n",
       "      <td id=\"T_e14a8_row1_col1\" class=\"data row1 col1\" >8,852,946</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th id=\"T_e14a8_level0_row2\" class=\"row_heading level0 row2\" >7925</th>\n",
       "      <td id=\"T_e14a8_row2_col0\" class=\"data row2 col0\" >collectionsearch.nma.gov.au</td>\n",
       "      <td id=\"T_e14a8_row2_col1\" class=\"data row2 col1\" >2,548,712</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th id=\"T_e14a8_level0_row3\" class=\"row_heading level0 row3\" >20468</th>\n",
       "      <td id=\"T_e14a8_row3_col0\" class=\"data row3 col0\" >passwordreset.parliament.qld.gov.au</td>\n",
       "      <td id=\"T_e14a8_row3_col1\" class=\"data row3 col1\" >2,089,256</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th id=\"T_e14a8_level0_row4\" class=\"row_heading level0 row4\" >18518</th>\n",
       "      <td id=\"T_e14a8_row4_col0\" class=\"data row4 col0\" >parlinfo.aph.gov.au</td>\n",
       "      <td id=\"T_e14a8_row4_col1\" class=\"data row4 col1\" >2,060,004</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th id=\"T_e14a8_level0_row5\" class=\"row_heading level0 row5\" >2400</th>\n",
       "      <td id=\"T_e14a8_row5_col0\" class=\"data row5 col0\" >aph.gov.au</td>\n",
       "      <td id=\"T_e14a8_row5_col1\" class=\"data row5 col1\" >1,776,889</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th id=\"T_e14a8_level0_row6\" class=\"row_heading level0 row6\" >28205</th>\n",
       "      <td id=\"T_e14a8_row6_col0\" class=\"data row6 col0\" >bmcc.nsw.gov.au</td>\n",
       "      <td id=\"T_e14a8_row6_col1\" class=\"data row6 col1\" >1,419,442</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th id=\"T_e14a8_level0_row7\" class=\"row_heading level0 row7\" >10710</th>\n",
       "      <td id=\"T_e14a8_row7_col0\" class=\"data row7 col0\" >jobsearch.gov.au</td>\n",
       "      <td id=\"T_e14a8_row7_col1\" class=\"data row7 col1\" >1,294,115</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th id=\"T_e14a8_level0_row8\" class=\"row_heading level0 row8\" >27173</th>\n",
       "      <td id=\"T_e14a8_row8_col0\" class=\"data row8 col0\" >arpansa.gov.au</td>\n",
       "      <td id=\"T_e14a8_row8_col1\" class=\"data row8 col1\" >1,279,296</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th id=\"T_e14a8_level0_row9\" class=\"row_heading level0 row9\" >27953</th>\n",
       "      <td id=\"T_e14a8_row9_col0\" class=\"data row9 col0\" >abs.gov.au</td>\n",
       "      <td id=\"T_e14a8_row9_col1\" class=\"data row9 col1\" >992,726</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th id=\"T_e14a8_level0_row10\" class=\"row_heading level0 row10\" >22648</th>\n",
       "      <td id=\"T_e14a8_row10_col0\" class=\"data row10 col0\" >catalogue.nla.gov.au</td>\n",
       "      <td id=\"T_e14a8_row10_col1\" class=\"data row10 col1\" >987,993</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th id=\"T_e14a8_level0_row11\" class=\"row_heading level0 row11\" >7521</th>\n",
       "      <td id=\"T_e14a8_row11_col0\" class=\"data row11 col0\" >libero.gtcc.nsw.gov.au</td>\n",
       "      <td id=\"T_e14a8_row11_col1\" class=\"data row11 col1\" >959,539</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th id=\"T_e14a8_level0_row12\" class=\"row_heading level0 row12\" >28006</th>\n",
       "      <td id=\"T_e14a8_row12_col0\" class=\"data row12 col0\" >canterbury.nsw.gov.au</td>\n",
       "      <td id=\"T_e14a8_row12_col1\" class=\"data row12 col1\" >957,261</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th id=\"T_e14a8_level0_row13\" class=\"row_heading level0 row13\" >13050</th>\n",
       "      <td id=\"T_e14a8_row13_col0\" class=\"data row13 col0\" >library.campbelltown.nsw.gov.au</td>\n",
       "      <td id=\"T_e14a8_row13_col1\" class=\"data row13 col1\" >935,191</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th id=\"T_e14a8_level0_row14\" class=\"row_heading level0 row14\" >3709</th>\n",
       "      <td id=\"T_e14a8_row14_col0\" class=\"data row14 col0\" >defencejobs.gov.au</td>\n",
       "      <td id=\"T_e14a8_row14_col1\" class=\"data row14 col1\" >895,158</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th id=\"T_e14a8_level0_row15\" class=\"row_heading level0 row15\" >11973</th>\n",
       "      <td id=\"T_e14a8_row15_col0\" class=\"data row15 col0\" >health.gov.au</td>\n",
       "      <td id=\"T_e14a8_row15_col1\" class=\"data row15 col1\" >882,471</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th id=\"T_e14a8_level0_row16\" class=\"row_heading level0 row16\" >15637</th>\n",
       "      <td id=\"T_e14a8_row16_col0\" class=\"data row16 col0\" >webopac.gosford.nsw.gov.au</td>\n",
       "      <td id=\"T_e14a8_row16_col1\" class=\"data row16 col1\" >854,803</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th id=\"T_e14a8_level0_row17\" class=\"row_heading level0 row17\" >1550</th>\n",
       "      <td id=\"T_e14a8_row17_col0\" class=\"data row17 col0\" >library.lachlan.nsw.gov.au</td>\n",
       "      <td id=\"T_e14a8_row17_col1\" class=\"data row17 col1\" >838,972</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th id=\"T_e14a8_level0_row18\" class=\"row_heading level0 row18\" >24291</th>\n",
       "      <td id=\"T_e14a8_row18_col0\" class=\"data row18 col0\" >accc.gov.au</td>\n",
       "      <td id=\"T_e14a8_row18_col1\" class=\"data row18 col1\" >828,948</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th id=\"T_e14a8_level0_row19\" class=\"row_heading level0 row19\" >17620</th>\n",
       "      <td id=\"T_e14a8_row19_col0\" class=\"data row19 col0\" >data.aad.gov.au</td>\n",
       "      <td id=\"T_e14a8_row19_col1\" class=\"data row19 col1\" >820,263</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n"
      ],
      "text/plain": [
       "<pandas.io.formats.style.Styler at 0x7f2558ab9dc0>"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "top_20 = df_merged[[\"domain\", \"number_of_pages\"]].sort_values(\n",
    "    by=\"number_of_pages\", ascending=False\n",
    ")[:20]\n",
    "top_20.style.format({\"number_of_pages\": \"{:,}\"})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Are there really domains made up of 10 levels?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['lifejacket.prod.apps.blue.prod.cds.transport.nsw.gov.au',\n",
       " '0-slwa.csiro.patron.eb20.com.henrietta.slwa.wa.gov.au',\n",
       " 'etoll.prod.apps.blue.prod.cds.transport.nsw.gov.au',\n",
       " '0-www.library.eb.com.au.henrietta.slwa.wa.gov.au',\n",
       " 'test-your-tired-self-prod.apps.p.dmp.aws.hosting.transport.nsw.gov.au']"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_merged.loc[df_merged[\"9\"].notnull()][\"domain\"].to_list()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Let's visualise things!\n",
    "\n",
    "I thought it would be interesting to try and visualise all the subdomains as a circular dendrogram. After a bit of investigation I discovered the [ETE Toolkit](http://etetoolkit.org/) for the visualisation of phylogenetic trees – it seemed perfect. But to get data into ETE I first had to convert it into a [Newick formatted](https://en.wikipedia.org/wiki/Newick_format) string. Fortunately, there's a [Python package](https://pypi.org/project/newick/) for that.\n",
    "\n",
    "Warning! While the code below will indeed generate circular dendrograms from a domain name hierarchy, if you have more than a few hundred domains you'll find that the image gets very big, very quickly. I successfully saved the whole of the `gov.au` domain as a 32mb SVG file, which you can (very slowly) view in a web browser or graphics program. But any attempt to save into another image format at a size that would make the text readable consumed huge amounts of memory and forced me to pull the plug."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "def make_domain_tree(domains):\n",
    "    \"\"\"\n",
    "    Converts a list of urlkeys into a Newick tree via nodes.\n",
    "    \"\"\"\n",
    "    d_tree = Node()\n",
    "    for domain in domains:\n",
    "        domain = re.sub(r\"\\:\\d+\", \"\", domain)\n",
    "        sds = domain.split(\",\")\n",
    "        for i, sd in enumerate(sds):\n",
    "            parent = \".\".join(reversed(sds[0:i])) if i > 0 else None\n",
    "            label = \".\".join(reversed(sds[: i + 1]))\n",
    "            if not d_tree.get_node(label):\n",
    "                if parent:\n",
    "                    d_tree.get_node(parent).add_descendant(Node(label))\n",
    "                else:\n",
    "                    d_tree.add_descendant(Node(label))\n",
    "    return newick.dumps(d_tree)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "tags": [
     "nbval-skip"
    ]
   },
   "outputs": [],
   "source": [
    "# Convert domains to a Newick tree\n",
    "full_tree = make_domain_tree(domains)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "def save_dendrogram_to_file(tree, width, output_file):\n",
    "    t = Tree(tree, format=1)\n",
    "    circular_style = TreeStyle()\n",
    "    circular_style.mode = \"c\"  # draw tree in circular mode\n",
    "    circular_style.optimal_scale_level = \"full\"\n",
    "    circular_style.root_opening_factor = 0\n",
    "    circular_style.show_scale = False\n",
    "    t.render(output_file, w=width, tree_style=circular_style)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First let's play safe by creating a PNG with a fixed width."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "tags": [
     "nbval-skip"
    ]
   },
   "outputs": [],
   "source": [
    "# Saving a PNG with a fixed width will work, but you won't be able to read any text\n",
    "save_dendrogram_to_file(\n",
    "    full_tree, 1000, f'images/govau-all-{arrow.now().format(\"YYYYMMDDHHmmss\")}-1000.png'\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here's the result!\n",
    "\n",
    "![Circular dendrogram of all gov.au domains](images/govau-all-20220412104624-1000.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This will save a zoomable SVG version that allows you to read the labels, but it will be very slow to use, and difficult to convert into other formats."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "tags": [
     "nbval-skip"
    ]
   },
   "outputs": [],
   "source": [
    "# Here be dendrodragons!\n",
    "# I don't think width does anything if you save to SVG\n",
    "save_dendrogram_to_file(\n",
    "    full_tree, 5000, f'images/govau-all-{arrow.now().format(\"YYYYMMDDHHmmss\")}.svg'\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's try some third level domains."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "def display_dendrogram(label, level=\"2\", df=df_merged, width=300):\n",
    "    domains = df.loc[df[\"2\"] == label][\"urlkey\"].to_list()\n",
    "    tree = make_domain_tree(domains)\n",
    "    filename = (\n",
    "        f'images/{label}-domains-{arrow.now().format(\"YYYYMMDDHHmmss\")}-{width}.png'\n",
    "    )\n",
    "    save_dendrogram_to_file(\n",
    "        tree,\n",
    "        width,\n",
    "        filename,\n",
    "    )\n",
    "    return f'<div style=\"width: 300px; float: left; margin-right: 10px;\"><img src=\"{filename}\" style=\"\"><p style=\"text-align: center;\">{label.upper()}</p></div>'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "tags": [
     "nbval-skip"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div style=\"width: 300px; float: left; margin-right: 10px;\"><img src=\"images/nsw-domains-20220412104710-300.png\" style=\"\"><p style=\"text-align: center;\">NSW</p></div><div style=\"width: 300px; float: left; margin-right: 10px;\"><img src=\"images/vic-domains-20220412104717-300.png\" style=\"\"><p style=\"text-align: center;\">VIC</p></div><div style=\"width: 300px; float: left; margin-right: 10px;\"><img src=\"images/qld-domains-20220412104721-300.png\" style=\"\"><p style=\"text-align: center;\">QLD</p></div><div style=\"width: 300px; float: left; margin-right: 10px;\"><img src=\"images/sa-domains-20220412104724-300.png\" style=\"\"><p style=\"text-align: center;\">SA</p></div><div style=\"width: 300px; float: left; margin-right: 10px;\"><img src=\"images/wa-domains-20220412104727-300.png\" style=\"\"><p style=\"text-align: center;\">WA</p></div><div style=\"width: 300px; float: left; margin-right: 10px;\"><img src=\"images/tas-domains-20220412104728-300.png\" style=\"\"><p style=\"text-align: center;\">TAS</p></div><div style=\"width: 300px; float: left; margin-right: 10px;\"><img src=\"images/nt-domains-20220412104729-300.png\" style=\"\"><p style=\"text-align: center;\">NT</p></div><div style=\"width: 300px; float: left; margin-right: 10px;\"><img src=\"images/act-domains-20220412104729-300.png\" style=\"\"><p style=\"text-align: center;\">ACT</p></div>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Create dendrograms for each state/territory\n",
    "html = \"\"\n",
    "for state in [\"nsw\", \"vic\", \"qld\", \"sa\", \"wa\", \"tas\", \"nt\", \"act\"]:\n",
    "    html += display_dendrogram(state)\n",
    "\n",
    "display(HTML(html))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If there are fewer domains you can see more detail."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "tags": [
     "nbval-skip"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div style=\"width: 300px; float: left; margin-right: 10px;\"><img src=\"images/act-domains-20220412105015-1000.png\" style=\"\"><p style=\"text-align: center;\">ACT</p></div>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "act = display_dendrogram(state, width=1000)\n",
    "display(HTML(act))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "I've generated and saved 5000px wide versions of the national and state dendrograms [in a Cloudstor shared folder](https://cloudstor.aarnet.edu.au/plus/s/F3BjoCaS5U3BHCh)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "----\n",
    "Created by [Tim Sherratt](https://timsherratt.org) for the [GLAM Workbench](https://glam-workbench.github.io). Support me by becoming a [GitHub sponsor](https://github.com/sponsors/wragge)!\n",
    "\n",
    "Work on this notebook was supported by the [IIPC Discretionary Funding Programme 2019-2020](http://netpreserve.org/projects/).\n",
    "\n",
    "The Web Archives section of the GLAM Workbench is sponsored by the [British Library](https://www.bl.uk/)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.12"
  },
  "widgets": {
   "application/vnd.jupyter.widget-state+json": {
    "state": {},
    "version_major": 2,
    "version_minor": 0
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}