{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Observing change in a web page over time\n", "\n", "
New to Jupyter notebooks? Try Using Jupyter notebooks for a quick introduction.
\n", "\n", "This notebook explores what we can find when we look at all captures of a single page over time." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Work in progress – this notebook isn't finished yet. Check back later for more...
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import re\n", "\n", "import altair as alt\n", "import pandas as pd\n", "import requests" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "def query_cdx(url, **kwargs):\n", " params = kwargs\n", " params[\"url\"] = url\n", " params[\"output\"] = \"json\"\n", " response = requests.get(\n", " \"http://web.archive.org/cdx/search/cdx\",\n", " params=params,\n", " headers={\"User-Agent\": \"\"},\n", " )\n", " response.raise_for_status()\n", " return response.json()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "url = \"http://nla.gov.au\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Getting the data\n", "\n", "In this example we're using the IA CDX API, but this could easily be adapted to use [Timemaps](find_all_captures.ipynb) from a range of repositories. " ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "data = query_cdx(url)\n", "\n", "# Convert to a dataframe\n", "# The column names are in the first row\n", "df = pd.DataFrame(data[1:], columns=data[0])\n", "\n", "# Convert the timestamp string into a datetime object\n", "df[\"date\"] = pd.to_datetime(df[\"timestamp\"])\n", "df.sort_values(by=\"date\", inplace=True, ignore_index=True)\n", "\n", "# Convert the length from a string into an integer\n", "df[\"length\"] = df[\"length\"].astype(\"int\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As noted in the notebook [comparing the CDX API with Timemaps](getting_all_snapshots_timemap_vs_cdx.ipynb), there are a number of duplicate snapshots in the CDX results, so let's remove them." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Before: 4451\n", "After: 4350\n" ] } ], "source": [ "print(f\"Before: {df.shape[0]}\")\n", "df.drop_duplicates(\n", " subset=[\"timestamp\", \"original\", \"digest\", \"statuscode\", \"mimetype\"],\n", " keep=\"first\",\n", " inplace=True,\n", ")\n", "print(f\"After: {df.shape[0]}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The basic shape" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Timestamp('1996-10-19 06:42:23')" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"date\"].min()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Timestamp('2022-04-10 18:38:06')" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"date\"].max()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 4350.000000\n", "mean 8318.689655\n", "std 7854.281544\n", "min 235.000000\n", "25% 533.000000\n", "50% 5699.000000\n", "75% 14852.750000\n", "max 30062.000000\n", "Name: length, dtype: float64" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"length\"].describe()" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "200 2948\n", "301 775\n", "- 315\n", "302 309\n", "503 3\n", "Name: statuscode, dtype: int64" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"statuscode\"].value_counts()" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "text/html 4033\n", "warc/revisit 315\n", "unk 2\n", "Name: mimetype, dtype: int64" ] }, "execution_count": 10, 
"metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"mimetype\"].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Plotting snapshots over time" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# This is just a bit of fancy customisation to group the status codes by color\n", "# See https://altair-viz.github.io/user_guide/customization.html#customizing-colors\n", "domain = [\"-\", \"200\", \"301\", \"302\", \"404\", \"503\"]\n", "# grey for no status code, green for ok, blue for redirects, red for errors\n", "range_ = [\"#888888\", \"#39a035\", \"#5ba3cf\", \"#125ca4\", \"#e13128\", \"#b21218\"]\n", "\n", "alt.Chart(df).mark_point().encode(\n", " x=\"date:T\",\n", " y=\"length:Q\",\n", " color=alt.Color(\"statuscode\", scale=alt.Scale(domain=domain, range=range_)),\n", " tooltip=[\"date\", \"length\", \"statuscode\"],\n", ").properties(width=700, height=300)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Looking at domains, protocols, and redirects\n", "\n", "Looking at the chart above, it's hard to understand why a request for a page is sometimes redirected, and sometimes not. To understand this we have to look a bit closer at what pages are actually being archived. Let's look at the breakdown of values in the `original` column. These are the URLs being requested by the archiving bot." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "https://www.nla.gov.au/ 1508\n", "http://www.nla.gov.au/ 1178\n", "http://www.nla.gov.au:80/ 868\n", "http://nla.gov.au/ 588\n", "http://nla.gov.au:80/ 77\n", "https://nla.gov.au/ 62\n", "http://www.nla.gov.au// 21\n", "http://www.nla.gov.au 11\n", "http://www2.nla.gov.au:80/ 10\n", "https://www.nla.gov.au 10\n", "http://Trove@nla.gov.au/ 6\n", "http://www.nla.gov.au:80/? 
2\n", "http://www.nla.gov.au./ 2\n", "http://nla.gov.au 1\n", "http://mailto:media@nla.gov.au/ 1\n", "http://cmccarthy@nla.gov.au/ 1\n", "http://mailto:development@nla.gov.au/ 1\n", "http://mailto:www@nla.gov.au/ 1\n", "http://www.nla.gov.au:80// 1\n", "http://www.nla.gov.au/? 1\n", "Name: original, dtype: int64" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"original\"].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ah ok, so there's actually a mix of things in here – some include the 'www' prefix and some don't, some use the 'https' protocol and some just plain old 'http'. There's also a bit of junk in there from badly parsed `mailto` links. To look at the differences in more detail, let's create new columns for `subdomain` and `protocol`." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "www 3602\n", " 738\n", "www2 10\n", "Name: subdomain, dtype: int64" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "base_domain = re.search(r\"https?:\\/\\/(\\w*)\\.\", url).group(1)\n", "df[\"subdomain\"] = df[\"original\"].str.extract(\n", " r\"^https?:\\/\\/(\\w*)\\.{}\\.\".format(base_domain), flags=re.IGNORECASE\n", ")\n", "df[\"subdomain\"] = df[\"subdomain\"].fillna(\"\")\n", "df[\"subdomain\"].value_counts()" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "http 2770\n", "https 1580\n", "Name: protocol, dtype: int64" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"protocol\"] = df[\"original\"].str.extract(r\"^(https?):\")\n", "df[\"protocol\"].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Change in protocol\n", "\n", "Let's look to see how the proportion of requests using each of the protocols changes over time. 
Here we're grouping the rows by year." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(df).mark_bar().encode(\n", " x=\"year(date):T\",\n", " y=alt.Y(\"count()\", stack=\"normalize\"),\n", " color=\"protocol:N\",\n", " # tooltip=['date', 'length', 'subdomain:N']\n", ").properties(width=700, height=200)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "No real surprise there given the increased use of https generally." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Status codes by subdomain\n", "\n", "Let's now compare the proportion of status codes between the bare `nla.gov.au` domain and the `www` subdomain." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(\n", " df.loc[(df[\"statuscode\"] != \"-\") & (df[\"subdomain\"] != \"www2\")]\n", ").mark_bar().encode(\n", " x=\"year(date):T\",\n", " y=alt.Y(\"count()\", stack=\"normalize\"),\n", " color=alt.Color(\"statuscode\", scale=alt.Scale(domain=domain, range=range_)),\n", " row=\"subdomain\",\n", " tooltip=[\"year(date):T\", \"statuscode\"],\n", ").properties(\n", " width=700, height=100\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I think we can start to see what's going on. Around about 2004, requests to `nla.gov.au` started to be redirected to `www.nla.gov.au` giving a [302](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/302) response, indicating that the page had been moved temporarily. 
But why the growth in [301](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/301) (moved permanently) responses from both domains after 2018? If we look at the chart above showing the increased use of the `https` protocol, I think we could guess that `http` requests in both domains are being redirected to `https`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Status codes by protocol\n", "\n", "Let's test that hypothesis by looking at the distribution of status codes by protocol." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(\n", " df.loc[(df[\"statuscode\"] != \"-\") & (df[\"subdomain\"] != \"www2\")]\n", ").mark_bar().encode(\n", " x=\"year(date):T\",\n", " y=alt.Y(\"count()\", stack=\"normalize\"),\n", " color=alt.Color(\"statuscode\", scale=alt.Scale(domain=domain, range=range_)),\n", " row=\"protocol\",\n", " tooltip=[\"year(date):T\", \"protocol\", \"statuscode\"],\n", ").properties(\n", " width=700, height=100\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that by 2019, all requests using `http` are being redirected to `https`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Looking for major changes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we understand what's going on with the different domains and status codes, I think we can focus on just the 'www' domain and the '200' responses. The cell below also drops captures with a `length` of 1000 or less." 
] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_200 = df.copy().loc[\n", " (df[\"statuscode\"] == \"200\") & (df[\"subdomain\"] == \"www\") & (df[\"length\"] > 1000)\n", "]\n", "\n", "alt.Chart(df_200).mark_point().encode(\n", " x=\"date:T\", y=\"length:Q\", tooltip=[\"date\", \"length\"]\n", ").properties(width=700, height=300)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pandas makes it easy to calculate the difference between two adjacent values, so let's find the absolute difference in length between each capture and the one before it." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "df_200[\"change_in_length\"] = abs(df_200[\"length\"].diff())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can look at the captures that varied most in length from their predecessor." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "| | urlkey | timestamp | original | mimetype | statuscode | digest | length | date | subdomain | protocol | change_in_length |\n",
"|---|---|---|---|---|---|---|---|---|---|---|---|\n",
"| 3656 | au,gov,nla)/ | 20210701042826 | https://www.nla.gov.au/ | text/html | 200 | 6PC4KROOPHEBXIC6A3GRFFHBMO2EJ4F7 | 29215 | 2021-07-01 04:28:26 | www | https | 13933.0 |\n",
"| 4134 | au,gov,nla)/ | 20220202054835 | https://www.nla.gov.au/ | text/html | 200 | O6VCJWSKLU3EAFDSH4IGOIW264Y6IIXZ | 27495 | 2022-02-02 05:48:35 | www | https | 4648.0 |\n",
"| 3954 | au,gov,nla)/ | 20220105025646 | https://www.nla.gov.au/ | text/html | 200 | MRIUSTANGOWT3CT5QSSRJ7NJPEN2RSEN | 27273 | 2022-01-05 02:56:46 | www | https | 4463.0 |\n",
"| 4058 | au,gov,nla)/ | 20220121065839 | https://www.nla.gov.au/ | text/html | 200 | UAGVH7YN6ZPJYQUIZTP32G4GF3JJ2N7J | 22948 | 2022-01-21 06:58:39 | www | https | 4394.0 |\n",
"| 4417 | au,gov,nla)/ | 20220405063728 | https://www.nla.gov.au/ | text/html | 200 | HXSLRIVPKI3ECPC5V6NNEJOIHTDKZ5JJ | 22682 | 2022-04-05 06:37:28 | www | https | 4375.0 |\n",
"| 3921 | au,gov,nla)/ | 20211228211936 | https://www.nla.gov.au/ | text/html | 200 | FQZBN2C7DPFPC26F6HX3KDNGWCWVOWWX | 22831 | 2021-12-28 21:19:36 | www | https | 4374.0 |\n",
"| 3946 | au,gov,nla)/ | 20220103064507 | https://www.nla.gov.au/ | text/html | 200 | PUJLOJI7OUJ4XFKUDI47HOQLFOMEQLKJ | 22917 | 2022-01-03 06:45:07 | www | https | 4367.0 |\n",
"| 4322 | au,gov,nla)/ | 20220323022916 | https://www.nla.gov.au/ | text/html | 200 | WEP3PTHC7CAEF22S3NDIMZHFDAQIK65J | 26919 | 2022-03-23 02:29:16 | www | https | 4359.0 |\n",
"| 3939 | au,gov,nla)/ | 20220102132710 | https://www.nla.gov.au/ | text/html | 200 | SBM5ATRRZWVT7HYA6J3BMMOXG4HTDZFD | 27251 | 2022-01-02 13:27:10 | www | https | 4352.0 |\n",
"| 4215 | au,gov,nla)/ | 20220224211329 | https://www.nla.gov.au/ | text/html | 200 | WCIDXIQ22M35PWXUGK7GCXG2LEJUAJ5J | 23208 | 2022-02-24 21:13:29 | www | https | 4351.0 |\n", "
\n", "| | urlkey | timestamp | original | mimetype | statuscode | digest | length | date | subdomain | protocol | change_in_length | pct_change_in_length |\n",
"|---|---|---|---|---|---|---|---|---|---|---|---|---|\n",
"| 3656 | au,gov,nla)/ | 20210701042826 | https://www.nla.gov.au/ | text/html | 200 | 6PC4KROOPHEBXIC6A3GRFFHBMO2EJ4F7 | 29215 | 2021-07-01 04:28:26 | www | https | 13933.0 | 0.911726 |\n",
"| 13 | au,gov,nla)/ | 19980205162107 | http://www.nla.gov.au:80/ | text/html | 200 | LIXK3YXSUFO5KPOO22XIMQGPWXKNHV6X | 1920 | 1998-02-05 16:21:07 | www | http | 757.0 | 0.650903 |\n",
"| 79 | au,gov,nla)/ | 20011003175018 | http://www.nla.gov.au:80/ | text/html | 200 | BWGDP6NTGVOI2TBA62P7IZ2PPWRLOODN | 3367 | 2001-10-03 17:50:18 | www | http | 1004.0 | 0.424884 |\n",
"| 1519 | au,gov,nla)/ | 20160901112433 | http://www.nla.gov.au/ | text/html | 200 | MZD7NTLMH5HBXSFQIHTGTC6IELQDYBN2 | 11541 | 2016-09-01 11:24:33 | www | http | 2738.0 | 0.311030 |\n",
"| 1184 | au,gov,nla)/ | 20130211044309 | http://www.nla.gov.au/ | text/html | 200 | QWCVHAK2Y6WXLDNIMTJLZT5RY6YCJ7UN | 8521 | 2013-02-11 04:43:09 | www | http | 1698.0 | 0.248864 |\n",
"| 1067 | au,gov,nla)/ | 20110611064218 | http://www.nla.gov.au/ | text/html | 200 | Z7PQG2MVOOQUZ62ASNRAWDFLSYFHATQT | 5601 | 2011-06-11 06:42:18 | www | http | 1739.0 | 0.236921 |\n",
"| 2049 | au,gov,nla)/ | 20181212014241 | https://www.nla.gov.au/ | text/html | 200 | C3YSGGHG52WIG6U6X7C3XOL5LOHVIINM | 14813 | 2018-12-12 01:42:41 | www | https | 2831.0 | 0.236271 |\n",
"| 786 | au,gov,nla)/ | 20061107083938 | http://www.nla.gov.au:80/ | text/html | 200 | HOW52ARISTA4HTCLFPYNCQWR6NK2N2NF | 5662 | 2006-11-07 08:39:38 | www | http | 1561.0 | 0.216115 |\n",
"| 4134 | au,gov,nla)/ | 20220202054835 | https://www.nla.gov.au/ | text/html | 200 | O6VCJWSKLU3EAFDSH4IGOIW264Y6IIXZ | 27495 | 2022-02-02 05:48:35 | www | https | 4648.0 | 0.203440 |\n",
"| 3864 | au,gov,nla)/ | 20211121150551 | https://www.nla.gov.au/ | text/html | 200 | GEPXW6EKI7GBG22SMUFIIXQW3KIQXNIY | 25912 | 2021-11-21 15:05:51 | www | https | 4316.0 | 0.199852 |\n", "