{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Observing change in a web page over time\n", "\n", "

New to Jupyter notebooks? Try Using Jupyter notebooks for a quick introduction.

\n", "\n", "This notebook explores what we can find when we look at all captures of a single page over time." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Work in progress – this notebook isn't finished yet. Check back later for more...

" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "vegafusion.enable(mimetype='html', row_limit=30000, embed_options=None)" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import re\n", "\n", "import altair as alt\n", "import pandas as pd\n", "import requests\n", "import vegafusion as vf\n", "\n", "# alt.data_transformers.disable_max_rows()\n", "\n", "vf.enable(row_limit=30000)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "def query_cdx(url, **kwargs):\n", " params = kwargs\n", " params[\"url\"] = url\n", " params[\"output\"] = \"json\"\n", " response = requests.get(\n", " \"http://web.archive.org/cdx/search/cdx\",\n", " params=params,\n", " headers={\"User-Agent\": \"\"},\n", " )\n", " response.raise_for_status()\n", " return response.json()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "url = \"http://nla.gov.au\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Getting the data\n", "\n", "In this example we're using the IA CDX API, but this could easily be adapted to use [Timemaps](find_all_captures.ipynb) from a range of repositories. 
" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "data = query_cdx(url)\n", "\n", "# Convert to a dataframe\n", "# The column names are in the first row\n", "df = pd.DataFrame(data[1:], columns=data[0])\n", "\n", "# Convert the timestamp string into a datetime object\n", "df[\"date\"] = pd.to_datetime(df[\"timestamp\"])\n", "df.sort_values(by=\"date\", inplace=True, ignore_index=True)\n", "\n", "# Convert the length from a string into an integer\n", "df[\"length\"] = df[\"length\"].astype(\"int\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As noted in the notebook [comparing the CDX API with Timemaps](getting_all_snapshots_timemap_vs_cdx.ipynb), there are a number of duplicate snapshots in the CDX results, so let's remove them." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Before: 29691\n", "After: 28461\n" ] } ], "source": [ "print(f\"Before: {df.shape[0]}\")\n", "df.drop_duplicates(\n", " subset=[\"timestamp\", \"original\", \"digest\", \"statuscode\", \"mimetype\"],\n", " keep=\"first\",\n", " inplace=True,\n", ")\n", "print(f\"After: {df.shape[0]}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The basic shape" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Timestamp('1996-10-19 06:42:23')" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"date\"].min()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Timestamp('2023-05-01 08:52:34')" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"date\"].max()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 28461.000000\n", "mean 1954.959734\n", "std 5181.999394\n", "min 
235.000000\n", "25% 327.000000\n", "50% 330.000000\n", "75% 403.000000\n", "max 30062.000000\n", "Name: length, dtype: float64" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"length\"].describe()" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "301 18355\n", "- 6326\n", "200 3367\n", "302 410\n", "503 3\n", "Name: statuscode, dtype: int64" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"statuscode\"].value_counts()" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "text/html 22133\n", "warc/revisit 6326\n", "unk 2\n", "Name: mimetype, dtype: int64" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"mimetype\"].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Plotting snapshots over time" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "

\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# This is just a bit of fancy customisation to group the types of errors by color\n", "# See https://altair-viz.github.io/user_guide/customization.html#customizing-colors\n", "domain = [\"-\", \"200\", \"301\", \"302\", \"404\", \"503\"]\n", "# green for ok, blue for redirects, red for errors\n", "range_ = [\"#888888\", \"#39a035\", \"#5ba3cf\", \"#125ca4\", \"#e13128\", \"#b21218\"]\n", "\n", "alt.Chart(df).mark_point().encode(\n", " x=\"date:T\",\n", " y=\"length:Q\",\n", " color=alt.Color(\"statuscode\", scale=alt.Scale(domain=domain, range=range_)),\n", " tooltip=[\"date\", \"length\", \"statuscode\"],\n", ").properties(width=700, height=300)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Looking at domains, protocols, and redirects\n", "\n", "Looking at the chart above, it's hard to understand why a request for a page is sometimes redirected, and sometimes not. To understand this we have to look a bit closer at what pages are actually being archived. Let's look at the breakdown of values in the `original` column. These are the urls being requested by the archiving bot." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "http://nla.gov.au/ 23912\n", "https://www.nla.gov.au/ 1924\n", "http://www.nla.gov.au/ 1415\n", "http://www.nla.gov.au:80/ 868\n", "https://nla.gov.au/ 172\n", "http://nla.gov.au:80/ 77\n", "https://www.nla.gov.au 28\n", "http://www.nla.gov.au// 23\n", "http://www.nla.gov.au 11\n", "http://www2.nla.gov.au:80/ 10\n", "http://Trove@nla.gov.au/ 6\n", "http://www.nla.gov.au./ 4\n", "http://www.nla.gov.au:80/? 2\n", "http://nla.gov.au 2\n", "http://www.nla.gov.au/? 
1\n", "http://cmccarthy@nla.gov.au/ 1\n", "http://mailto:media@nla.gov.au/ 1\n", "http://mailto:development@nla.gov.au/ 1\n", "http://mailto:www@nla.gov.au/ 1\n", "http://www.nla.gov.au:80// 1\n", "https://www.nla.gov.au// 1\n", "Name: original, dtype: int64" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"original\"].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ah ok, so there's actually a mix of things in here – some include the 'www' prefix and some don't, some use the 'https' protocol and some just plain old 'http'. There's also a bit of junk in there from badly parsed `mailto` links. To look at the differences in more detail, let's create new columns for `subdomain` and `protocol`." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ " 24173\n", "www 4278\n", "www2 10\n", "Name: subdomain, dtype: int64" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The base domain is the first part of the hostname in url, in this case 'nla'\n", "base_domain = re.search(r\"https?:\\/\\/(\\w*)\\.\", url).group(1)\n", "# Capture whatever sits between the protocol and the base domain\n", "df[\"subdomain\"] = df[\"original\"].str.extract(\n", "    r\"^https?:\\/\\/(\\w*)\\.{}\\.\".format(base_domain), flags=re.IGNORECASE\n", ")\n", "# Rows without a subdomain come back as NaN, so replace them with an empty string\n", "df[\"subdomain\"] = df[\"subdomain\"].fillna(\"\")\n", "df[\"subdomain\"].value_counts()" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "http 26336\n", "https 2125\n", "Name: protocol, dtype: int64" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"protocol\"] = df[\"original\"].str.extract(r\"^(https?):\")\n", "df[\"protocol\"].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Change in protocol\n", "\n", "Let's see how the proportion of requests using each of the protocols changes over time. Here we're grouping the rows by year."
] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(df).mark_bar().encode(\n", " x=\"year(date):T\",\n", " y=alt.Y(\"count()\", stack=\"normalize\"),\n", " color=\"protocol:N\",\n", " # tooltip=['date', 'length', 'subdomain:N']\n", ").properties(width=700, height=200)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "No real surprise there given the increased use of https generally." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Status codes by subdomain\n", "\n", "Let's now compare the proportion of status codes between the bare `nla.gov.au` domain and the `www` subdomain." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(\n", " df.loc[(df[\"statuscode\"] != \"-\") & (df[\"subdomain\"] != \"www2\")]\n", ").mark_bar().encode(\n", " x=\"year(date):T\",\n", " y=alt.Y(\"count()\", stack=\"normalize\"),\n", " color=alt.Color(\"statuscode\", scale=alt.Scale(domain=domain, range=range_)),\n", " row=\"subdomain\",\n", " tooltip=[\"year(date):T\", \"statuscode\"],\n", ").properties(\n", " width=700, height=100\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I think we can start to see what's going on. Around about 2004, requests to `nla.gov.au` started to be redirected to `www.nla.gov.au` giving a [302](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/302) response, indicating that the page had been moved temporarily. But why the growth in [301](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/301) (moved permanently) responses from both domains after 2018? If we look at the chart above showing the increased use of the `https` protocol, I think we could guess that `http` requests in both domains are being redirected to `https`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Status codes by protocol\n", "\n", "Let's test that hypothesis by looking at the distribution of status codes by protocol." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(\n", " df.loc[(df[\"statuscode\"] != \"-\") & (df[\"subdomain\"] != \"www2\")]\n", ").mark_bar().encode(\n", " x=\"year(date):T\",\n", " y=alt.Y(\"count()\", stack=\"normalize\"),\n", " color=alt.Color(\"statuscode\", scale=alt.Scale(domain=domain, range=range_)),\n", " row=\"protocol\",\n", " tooltip=[\"year(date):T\", \"protocol\", \"statuscode\"],\n", ").properties(\n", " width=700, height=100\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that by 2019, all requests using `http` are being redirected to `https`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Looking for major changes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we understand what's going on with the different domains and status codes, I think we can focus on just the 'www' domain and the '200' responses." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Keep only the 200 responses from the www subdomain with a length greater than 1000\n", "df_200 = df.loc[\n", "    (df[\"statuscode\"] == \"200\") & (df[\"subdomain\"] == \"www\") & (df[\"length\"] > 1000)\n", "].copy()\n", "\n", "alt.Chart(df_200).mark_point().encode(\n", "    x=\"date:T\", y=\"length:Q\", tooltip=[\"date\", \"length\"]\n", ").properties(width=700, height=300)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pandas makes it easy to calculate the difference between two adjacent values, so let's find the absolute difference in length between each capture and the one before it." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "df_200[\"change_in_length\"] = abs(df_200[\"length\"].diff())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can look at the captures that varied most in length from their predecessor." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " urlkey timestamp original mimetype \\\n", "3662 au,gov,nla)/ 20210701042826 https://www.nla.gov.au/ text/html \n", "24747 au,gov,nla)/ 20230327064440 https://www.nla.gov.au/ text/html \n", "4179 au,gov,nla)/ 20220202054835 https://www.nla.gov.au/ text/html \n", "4109 au,gov,nla)/ 20220123061020 https://www.nla.gov.au/ text/html \n", "4124 au,gov,nla)/ 20220126082745 https://www.nla.gov.au/ text/html \n", "4152 au,gov,nla)/ 20220128184507 https://www.nla.gov.au/ text/html \n", "4150 au,gov,nla)/ 20220128092906 https://www.nla.gov.au/ text/html \n", "4104 au,gov,nla)/ 20220122121955 https://www.nla.gov.au text/html \n", "4122 au,gov,nla)/ 20220125193852 https://www.nla.gov.au/ text/html \n", "4111 au,gov,nla)/ 20220123064328 https://www.nla.gov.au/ text/html \n", "\n", " statuscode digest length \\\n", "3662 200 6PC4KROOPHEBXIC6A3GRFFHBMO2EJ4F7 29215 \n", "24747 200 QMUYKNTWO6E2JOMY4AUCTTVKS4KTKIQR 23586 \n", "4179 200 O6VCJWSKLU3EAFDSH4IGOIW264Y6IIXZ 27495 \n", "4109 200 77AAUDAPKSHUK2ST233PMJGWEZPSKQ53 27441 \n", "4124 200 NTVX4UULUQWDYZXQN7TYG4AFUBFQR2TT 22941 \n", "4152 200 HBACKZYKQANIT6GXONE732TA35NUH6EH 22949 \n", "4150 200 HBACKZYKQANIT6GXONE732TA35NUH6EH 27440 \n", "4104 200 PUJA36QMIDUPFUVWKSFVVBO2KDIB37E2 27447 \n", "4122 200 M5MC73PMLOIWCMC54CQOCL26ZMLP5IA5 27441 \n", "4111 200 77AAUDAPKSHUK2ST233PMJGWEZPSKQ53 22957 \n", "\n", " date subdomain protocol change_in_length \n", "3662 2021-07-01 04:28:26 www https 13933.0 \n", "24747 2023-03-27 06:44:40 www https 4779.0 \n", "4179 2022-02-02 05:48:35 www https 4648.0 \n", "4109 2022-01-23 06:10:20 www https 4513.0 \n", "4124 2022-01-26 08:27:45 www https 4500.0 \n", "4152 2022-01-28 18:45:07 www https 4491.0 \n", "4150 2022-01-28 09:29:06 www https 4491.0 \n", "4104 2022-01-22 12:19:55 www https 4486.0 \n", "4122 2022-01-25 19:38:52 www https 4485.0 \n", "4111 2022-01-23 06:43:28 www https 4484.0 " ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": 
[ "top_ten_changes = df_200.sort_values(by=\"change_in_length\", ascending=False)[:10]\n", "top_ten_changes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's try visualising this by highlighting the major changes in length." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.LayerChart(...)" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "points = (\n", "    alt.Chart(df_200)\n", "    .mark_point()\n", "    .encode(x=\"date:T\", y=\"length:Q\", tooltip=[\"date\", \"length\"])\n", "    .properties(width=700, height=300)\n", ")\n", "\n", "lines = (\n", "    alt.Chart(top_ten_changes)\n", "    .mark_rule(color=\"red\")\n", "    .encode(x=\"date:T\", tooltip=[\"date\"])\n", "    .properties(width=700, height=300)\n", ")\n", "\n", "points + lines" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Rather than just a raw number, perhaps the percentage change in length would be more useful. Once again, Pandas makes this easy to calculate. The `pct_change()` method calculates the change relative to the previous value, that is `(length2 - length1) / length1`. The result is a fraction rather than a percentage, so 0.1 means a change of 10%." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "df_200[\"pct_change_in_length\"] = abs(df_200[\"length\"].pct_change())" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " urlkey timestamp original mimetype \\\n", "3662 au,gov,nla)/ 20210701042826 https://www.nla.gov.au/ text/html \n", "13 au,gov,nla)/ 19980205162107 http://www.nla.gov.au:80/ text/html \n", "79 au,gov,nla)/ 20011003175018 http://www.nla.gov.au:80/ text/html \n", "1519 au,gov,nla)/ 20160901112433 http://www.nla.gov.au/ text/html \n", "1184 au,gov,nla)/ 20130211044309 http://www.nla.gov.au/ text/html \n", "1067 au,gov,nla)/ 20110611064218 http://www.nla.gov.au/ text/html \n", "2049 au,gov,nla)/ 20181212014241 https://www.nla.gov.au/ text/html \n", "786 au,gov,nla)/ 20061107083938 http://www.nla.gov.au:80/ text/html \n", "4179 au,gov,nla)/ 20220202054835 https://www.nla.gov.au/ text/html \n", "3890 au,gov,nla)/ 20211121150551 https://www.nla.gov.au/ text/html \n", "\n", " statuscode digest length date \\\n", "3662 200 6PC4KROOPHEBXIC6A3GRFFHBMO2EJ4F7 29215 2021-07-01 04:28:26 \n", "13 200 LIXK3YXSUFO5KPOO22XIMQGPWXKNHV6X 1920 1998-02-05 16:21:07 \n", "79 200 BWGDP6NTGVOI2TBA62P7IZ2PPWRLOODN 3367 2001-10-03 17:50:18 \n", "1519 200 MZD7NTLMH5HBXSFQIHTGTC6IELQDYBN2 11541 2016-09-01 11:24:33 \n", "1184 200 QWCVHAK2Y6WXLDNIMTJLZT5RY6YCJ7UN 8521 2013-02-11 04:43:09 \n", "1067 200 Z7PQG2MVOOQUZ62ASNRAWDFLSYFHATQT 5601 2011-06-11 06:42:18 \n", "2049 200 C3YSGGHG52WIG6U6X7C3XOL5LOHVIINM 14813 2018-12-12 01:42:41 \n", "786 200 HOW52ARISTA4HTCLFPYNCQWR6NK2N2NF 5662 2006-11-07 08:39:38 \n", "4179 200 O6VCJWSKLU3EAFDSH4IGOIW264Y6IIXZ 27495 2022-02-02 05:48:35 \n", "3890 200 GEPXW6EKI7GBG22SMUFIIXQW3KIQXNIY 25912 2021-11-21 15:05:51 \n", "\n", " subdomain protocol change_in_length pct_change_in_length \n", "3662 www https 13933.0 0.911726 \n", "13 www http 757.0 0.650903 \n", "79 www http 1004.0 0.424884 \n", "1519 www http 2738.0 0.311030 \n", "1184 www http 1698.0 0.248864 \n", "1067 www http 1739.0 0.236921 \n", "2049 www https 2831.0 0.236271 \n", "786 www http 1561.0 0.216115 \n", "4179 www https 4648.0 0.203440 \n", "3890 www https 4316.0 0.199852 " ] }, 
"execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "top_ten_changes_pct = df_200.sort_values(by=\"pct_change_in_length\", ascending=False)[\n", " :10\n", "]\n", "top_ten_changes_pct" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.LayerChart(...)" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lines = (\n", " alt.Chart(top_ten_changes_pct)\n", " .mark_rule(color=\"red\")\n", " .encode(x=\"date:T\", tooltip=[\"date\"])\n", " .properties(width=700, height=300)\n", ")\n", "\n", "points + lines" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By focusing on percentage difference we can see that more prominence is given to the change in 2001. But rather than just the top 10, should we look at changes greater than 10% or some other threshold?" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.LayerChart(...)" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lines = (\n", " alt.Chart(df_200.loc[df_200[\"pct_change_in_length\"] > 0.1])\n", " .mark_rule(color=\"red\")\n", " .encode(x=\"date:T\", tooltip=[\"date\"])\n", " .properties(width=700, height=300)\n", ")\n", "\n", "points + lines" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Other possibilities to explore\n", "\n", "* Rate of change – what proportion of the snapshots each year are *different*?\n", "* Use similarity measures to identify changes." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Comparing individual captures\n", "\n", "Once major changes, such as those above, have been identified, we can use some of the other notebooks in this repository to compare individual captures. For example:\n", "\n", "* [Compare two versions of an archived web page](show_diffs.ipynb)\n", "* [Create and compare full page screenshots from archived web pages](save_screenshot.ipynb)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "Created by [Tim Sherratt](https://timsherratt.org) for the [GLAM Workbench](https://glam-workbench.github.io). Support me by becoming a [GitHub sponsor](https://github.com/sponsors/wragge)!\n", "\n", "Work on this notebook was supported by the [IIPC Discretionary Funding Programme 2019-2020](http://netpreserve.org/projects/).\n", "\n", "The Web Archives section of the GLAM Workbench is sponsored by the [British Library](https://www.bl.uk/)." 
] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.12" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 4 }