{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Observing change in a web page over time\n", "\n", "
New to Jupyter notebooks? Try Using Jupyter notebooks for a quick introduction.
\n", "\n", "This notebook explores what we can find when we look at all captures of a single page over time." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Work in progress – this notebook isn't finished yet. Check back later for more...
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "vegafusion.enable(mimetype='html', row_limit=30000, embed_options=None)" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import re\n", "\n", "import altair as alt\n", "import pandas as pd\n", "import requests\n", "import vegafusion as vf\n", "\n", "# alt.data_transformers.disable_max_rows()\n", "\n", "vf.enable(row_limit=30000)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "def query_cdx(url, **kwargs):\n", " params = kwargs\n", " params[\"url\"] = url\n", " params[\"output\"] = \"json\"\n", " response = requests.get(\n", " \"http://web.archive.org/cdx/search/cdx\",\n", " params=params,\n", " headers={\"User-Agent\": \"\"},\n", " )\n", " response.raise_for_status()\n", " return response.json()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "url = \"http://nla.gov.au\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Getting the data\n", "\n", "In this example we're using the IA CDX API, but this could easily be adapted to use [Timemaps](find_all_captures.ipynb) from a range of repositories. 
" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "data = query_cdx(url)\n", "\n", "# Convert to a dataframe\n", "# The column names are in the first row\n", "df = pd.DataFrame(data[1:], columns=data[0])\n", "\n", "# Convert the timestamp string into a datetime object\n", "df[\"date\"] = pd.to_datetime(df[\"timestamp\"])\n", "df.sort_values(by=\"date\", inplace=True, ignore_index=True)\n", "\n", "# Convert the length from a string into an integer\n", "df[\"length\"] = df[\"length\"].astype(\"int\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As noted in the notebook [comparing the CDX API with Timemaps](getting_all_snapshots_timemap_vs_cdx.ipynb), there are a number of duplicate snapshots in the CDX results, so let's remove them." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Before: 29691\n", "After: 28461\n" ] } ], "source": [ "print(f\"Before: {df.shape[0]}\")\n", "df.drop_duplicates(\n", " subset=[\"timestamp\", \"original\", \"digest\", \"statuscode\", \"mimetype\"],\n", " keep=\"first\",\n", " inplace=True,\n", ")\n", "print(f\"After: {df.shape[0]}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The basic shape" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Timestamp('1996-10-19 06:42:23')" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"date\"].min()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Timestamp('2023-05-01 08:52:34')" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"date\"].max()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 28461.000000\n", "mean 1954.959734\n", "std 5181.999394\n", "min 
235.000000\n", "25% 327.000000\n", "50% 330.000000\n", "75% 403.000000\n", "max 30062.000000\n", "Name: length, dtype: float64" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"length\"].describe()" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "301 18355\n", "- 6326\n", "200 3367\n", "302 410\n", "503 3\n", "Name: statuscode, dtype: int64" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"statuscode\"].value_counts()" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "text/html 22133\n", "warc/revisit 6326\n", "unk 2\n", "Name: mimetype, dtype: int64" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"mimetype\"].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Plotting snapshots over time" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# This is just a bit of fancy customisation to group the types of errors by color\n", "# See https://altair-viz.github.io/user_guide/customization.html#customizing-colors\n", "domain = [\"-\", \"200\", \"301\", \"302\", \"404\", \"503\"]\n", "# green for ok, blue for redirects, red for errors\n", "range_ = [\"#888888\", \"#39a035\", \"#5ba3cf\", \"#125ca4\", \"#e13128\", \"#b21218\"]\n", "\n", "alt.Chart(df).mark_point().encode(\n", " x=\"date:T\",\n", " y=\"length:Q\",\n", " color=alt.Color(\"statuscode\", scale=alt.Scale(domain=domain, range=range_)),\n", " tooltip=[\"date\", \"length\", \"statuscode\"],\n", ").properties(width=700, height=300)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Looking at domains, protocols, and redirects\n", "\n", "Looking at the chart above, it's hard to understand why a request for a page is sometimes redirected, and sometimes not. To understand this we have to look a bit closer at what pages are actually being archived. Let's look at the breakdown of values in the `original` column. These are the urls being requested by the archiving bot." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "http://nla.gov.au/ 23912\n", "https://www.nla.gov.au/ 1924\n", "http://www.nla.gov.au/ 1415\n", "http://www.nla.gov.au:80/ 868\n", "https://nla.gov.au/ 172\n", "http://nla.gov.au:80/ 77\n", "https://www.nla.gov.au 28\n", "http://www.nla.gov.au// 23\n", "http://www.nla.gov.au 11\n", "http://www2.nla.gov.au:80/ 10\n", "http://Trove@nla.gov.au/ 6\n", "http://www.nla.gov.au./ 4\n", "http://www.nla.gov.au:80/? 2\n", "http://nla.gov.au 2\n", "http://www.nla.gov.au/? 
1\n", "http://cmccarthy@nla.gov.au/ 1\n", "http://mailto:media@nla.gov.au/ 1\n", "http://mailto:development@nla.gov.au/ 1\n", "http://mailto:www@nla.gov.au/ 1\n", "http://www.nla.gov.au:80// 1\n", "https://www.nla.gov.au// 1\n", "Name: original, dtype: int64" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"original\"].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ah ok, so there's actually a mix of things in here – some include the 'www' prefix and some don't, some use the 'https' protocol and some just plain old 'http'. There's also a bit of junk in there from badly parsed `mailto` links. To look at the differences in more detail, let's create new columns for `subdomain` and `protocol`." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ " 24173\n", "www 4278\n", "www2 10\n", "Name: subdomain, dtype: int64" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Pull the first label of the hostname (e.g. 'nla') out of the url we queried\n", "base_domain = re.search(r\"https?:\\/\\/(\\w*)\\.\", url).group(1)\n", "# Anything between the protocol and the base domain is treated as the subdomain\n", "df[\"subdomain\"] = df[\"original\"].str.extract(\n", " r\"^https?:\\/\\/(\\w*)\\.{}\\.\".format(base_domain), flags=re.IGNORECASE\n", ")\n", "# Rows with no subdomain get an empty string rather than NaN\n", "df[\"subdomain\"] = df[\"subdomain\"].fillna(\"\")\n", "df[\"subdomain\"].value_counts()" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "http 26336\n", "https 2125\n", "Name: protocol, dtype: int64" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"protocol\"] = df[\"original\"].str.extract(r\"^(https?):\")\n", "df[\"protocol\"].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Change in protocol\n", "\n", "Let's see how the proportion of requests using each of the protocols changes over time. Here we're grouping the rows by year." 
] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(df).mark_bar().encode(\n", " x=\"year(date):T\",\n", " y=alt.Y(\"count()\", stack=\"normalize\"),\n", " color=\"protocol:N\",\n", " # tooltip=['date', 'length', 'subdomain:N']\n", ").properties(width=700, height=200)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "No real surprise there given the increased use of https generally." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Status codes by subdomain\n", "\n", "Let's now compare the proportion of status codes between the bare `nla.gov.au` domain and the `www` subdomain." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(\n", " df.loc[(df[\"statuscode\"] != \"-\") & (df[\"subdomain\"] != \"www2\")]\n", ").mark_bar().encode(\n", " x=\"year(date):T\",\n", " y=alt.Y(\"count()\", stack=\"normalize\"),\n", " color=alt.Color(\"statuscode\", scale=alt.Scale(domain=domain, range=range_)),\n", " row=\"subdomain\",\n", " tooltip=[\"year(date):T\", \"statuscode\"],\n", ").properties(\n", " width=700, height=100\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I think we can start to see what's going on. Around about 2004, requests to `nla.gov.au` started to be redirected to `www.nla.gov.au` giving a [302](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/302) response, indicating that the page had been moved temporarily. But why the growth in [301](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/301) (moved permanently) responses from both domains after 2018? 
If we look at the chart above showing the increased use of the `https` protocol, I think we could guess that `http` requests in both domains are being redirected to `https`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Status codes by protocol\n", "\n", "Let's test that hypothesis by looking at the distribution of status codes by protocol." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(\n", " df.loc[(df[\"statuscode\"] != \"-\") & (df[\"subdomain\"] != \"www2\")]\n", ").mark_bar().encode(\n", " x=\"year(date):T\",\n", " y=alt.Y(\"count()\", stack=\"normalize\"),\n", " color=alt.Color(\"statuscode\", scale=alt.Scale(domain=domain, range=range_)),\n", " row=\"protocol\",\n", " tooltip=[\"year(date):T\", \"protocol\", \"statuscode\"],\n", ").properties(\n", " width=700, height=100\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that by 2019, all requests using `http` are being redirected to `https`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Looking for major changes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we understand what's going on with the different domains and status codes, I think we can focus on just the 'www' domain and the '200' responses." 
] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Filter first, then copy, so that new columns can be added to df_200 safely\n", "df_200 = df.loc[\n", " (df[\"statuscode\"] == \"200\") & (df[\"subdomain\"] == \"www\") & (df[\"length\"] > 1000)\n", "].copy()\n", "\n", "alt.Chart(df_200).mark_point().encode(\n", " x=\"date:T\", y=\"length:Q\", tooltip=[\"date\", \"length\"]\n", ").properties(width=700, height=300)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pandas makes it easy to calculate the difference between two adjacent values, so let's find the absolute difference in length between each capture and the one before it." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "df_200[\"change_in_length\"] = abs(df_200[\"length\"].diff())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can look at the captures that varied most in length from their predecessor." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
| | urlkey | timestamp | original | mimetype | statuscode | digest | length | date | subdomain | protocol | change_in_length |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 3662 | au,gov,nla)/ | 20210701042826 | https://www.nla.gov.au/ | text/html | 200 | 6PC4KROOPHEBXIC6A3GRFFHBMO2EJ4F7 | 29215 | 2021-07-01 04:28:26 | www | https | 13933.0 |
| 24747 | au,gov,nla)/ | 20230327064440 | https://www.nla.gov.au/ | text/html | 200 | QMUYKNTWO6E2JOMY4AUCTTVKS4KTKIQR | 23586 | 2023-03-27 06:44:40 | www | https | 4779.0 |
| 4179 | au,gov,nla)/ | 20220202054835 | https://www.nla.gov.au/ | text/html | 200 | O6VCJWSKLU3EAFDSH4IGOIW264Y6IIXZ | 27495 | 2022-02-02 05:48:35 | www | https | 4648.0 |
| 4109 | au,gov,nla)/ | 20220123061020 | https://www.nla.gov.au/ | text/html | 200 | 77AAUDAPKSHUK2ST233PMJGWEZPSKQ53 | 27441 | 2022-01-23 06:10:20 | www | https | 4513.0 |
| 4124 | au,gov,nla)/ | 20220126082745 | https://www.nla.gov.au/ | text/html | 200 | NTVX4UULUQWDYZXQN7TYG4AFUBFQR2TT | 22941 | 2022-01-26 08:27:45 | www | https | 4500.0 |
| 4152 | au,gov,nla)/ | 20220128184507 | https://www.nla.gov.au/ | text/html | 200 | HBACKZYKQANIT6GXONE732TA35NUH6EH | 22949 | 2022-01-28 18:45:07 | www | https | 4491.0 |
| 4150 | au,gov,nla)/ | 20220128092906 | https://www.nla.gov.au/ | text/html | 200 | HBACKZYKQANIT6GXONE732TA35NUH6EH | 27440 | 2022-01-28 09:29:06 | www | https | 4491.0 |
| 4104 | au,gov,nla)/ | 20220122121955 | https://www.nla.gov.au | text/html | 200 | PUJA36QMIDUPFUVWKSFVVBO2KDIB37E2 | 27447 | 2022-01-22 12:19:55 | www | https | 4486.0 |
| 4122 | au,gov,nla)/ | 20220125193852 | https://www.nla.gov.au/ | text/html | 200 | M5MC73PMLOIWCMC54CQOCL26ZMLP5IA5 | 27441 | 2022-01-25 19:38:52 | www | https | 4485.0 |
| 4111 | au,gov,nla)/ | 20220123064328 | https://www.nla.gov.au/ | text/html | 200 | 77AAUDAPKSHUK2ST233PMJGWEZPSKQ53 | 22957 | 2022-01-23 06:43:28 | www | https | 4484.0 |
\n", "
| | urlkey | timestamp | original | mimetype | statuscode | digest | length | date | subdomain | protocol | change_in_length | pct_change_in_length |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3662 | au,gov,nla)/ | 20210701042826 | https://www.nla.gov.au/ | text/html | 200 | 6PC4KROOPHEBXIC6A3GRFFHBMO2EJ4F7 | 29215 | 2021-07-01 04:28:26 | www | https | 13933.0 | 0.911726 |
| 13 | au,gov,nla)/ | 19980205162107 | http://www.nla.gov.au:80/ | text/html | 200 | LIXK3YXSUFO5KPOO22XIMQGPWXKNHV6X | 1920 | 1998-02-05 16:21:07 | www | http | 757.0 | 0.650903 |
| 79 | au,gov,nla)/ | 20011003175018 | http://www.nla.gov.au:80/ | text/html | 200 | BWGDP6NTGVOI2TBA62P7IZ2PPWRLOODN | 3367 | 2001-10-03 17:50:18 | www | http | 1004.0 | 0.424884 |
| 1519 | au,gov,nla)/ | 20160901112433 | http://www.nla.gov.au/ | text/html | 200 | MZD7NTLMH5HBXSFQIHTGTC6IELQDYBN2 | 11541 | 2016-09-01 11:24:33 | www | http | 2738.0 | 0.311030 |
| 1184 | au,gov,nla)/ | 20130211044309 | http://www.nla.gov.au/ | text/html | 200 | QWCVHAK2Y6WXLDNIMTJLZT5RY6YCJ7UN | 8521 | 2013-02-11 04:43:09 | www | http | 1698.0 | 0.248864 |
| 1067 | au,gov,nla)/ | 20110611064218 | http://www.nla.gov.au/ | text/html | 200 | Z7PQG2MVOOQUZ62ASNRAWDFLSYFHATQT | 5601 | 2011-06-11 06:42:18 | www | http | 1739.0 | 0.236921 |
| 2049 | au,gov,nla)/ | 20181212014241 | https://www.nla.gov.au/ | text/html | 200 | C3YSGGHG52WIG6U6X7C3XOL5LOHVIINM | 14813 | 2018-12-12 01:42:41 | www | https | 2831.0 | 0.236271 |
| 786 | au,gov,nla)/ | 20061107083938 | http://www.nla.gov.au:80/ | text/html | 200 | HOW52ARISTA4HTCLFPYNCQWR6NK2N2NF | 5662 | 2006-11-07 08:39:38 | www | http | 1561.0 | 0.216115 |
| 4179 | au,gov,nla)/ | 20220202054835 | https://www.nla.gov.au/ | text/html | 200 | O6VCJWSKLU3EAFDSH4IGOIW264Y6IIXZ | 27495 | 2022-02-02 05:48:35 | www | https | 4648.0 | 0.203440 |
| 3890 | au,gov,nla)/ | 20211121150551 | https://www.nla.gov.au/ | text/html | 200 | GEPXW6EKI7GBG22SMUFIIXQW3KIQXNIY | 25912 | 2021-11-21 15:05:51 | www | https | 4316.0 | 0.199852 |