# Observing change in a web page over time

<p class="alert alert-info">New to Jupyter notebooks? Try <a href="getting-started/Using_Jupyter_notebooks.ipynb"><b>Using Jupyter notebooks</b></a> for a quick introduction.</p>

This notebook explores what we can find by looking at all the captures of a single page over time.

<p class="alert alert-warning">Work in progress – this notebook isn't finished yet. Check back later for more...</p>

In [1]:
import re

import altair as alt
import pandas as pd
import requests

In [2]:
def query_cdx(url, **kwargs):
    params = kwargs
    params["url"] = url
    params["output"] = "json"
    response = requests.get(
        "http://web.archive.org/cdx/search/cdx",
        params=params,
        headers={"User-Agent": ""},
    )
    response.raise_for_status()
    return response.json()
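Any extra keyword arguments end up as CDX query parameters, so the same pattern can narrow a query. The sketch below is only an illustration: `build_cdx_url` is a hypothetical helper (not part of this notebook) that assembles the request url without sending anything, and `filter`, `collapse`, `from` and `to` are parameters of the Wayback CDX server API. Note that `from` is a reserved word in Python, so it has to be passed through a dictionary.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def build_cdx_url(url, **kwargs):
    """Hypothetical helper: build the url a CDX query would request, without sending it."""
    params = dict(kwargs)
    params["url"] = url
    params["output"] = "json"
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


# 'from' can't be used as a keyword argument, so unpack it from a dict
example = build_cdx_url(
    "http://nla.gov.au",
    filter="statuscode:200",
    collapse="digest",
    **{"from": "2019", "to": "2020"},
)
print(example)
```

The same keyword arguments could be handed to `query_cdx()` to run the narrowed query for real.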

In [3]:
url = "http://nla.gov.au"

## Getting the data

In this example we're using the IA CDX API, but this could easily be adapted to use [Timemaps](find_all_captures.ipynb) from a range of repositories. 

In [4]:
data = query_cdx(url)

# Convert to a dataframe
# The column names are in the first row
df = pd.DataFrame(data[1:], columns=data[0])

# Convert the timestamp string into a datetime object
df["date"] = pd.to_datetime(df["timestamp"], format="%Y%m%d%H%M%S")
df.sort_values(by="date", inplace=True, ignore_index=True)

# Convert the length from a string into an integer
df["length"] = df["length"].astype("int")
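To make the conversion steps easier to follow, here's a small offline sketch of the list-of-lists shape that the CDX API returns when `output=json` is set. The two data rows are copied from results shown later in this notebook; the names `sample` and `sample_df` are just for illustration.

```python
import pandas as pd

# The first row holds the column names; every value arrives as a string
sample = [
    ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"],
    [
        "au,gov,nla)/",
        "20210701042826",
        "https://www.nla.gov.au/",
        "text/html",
        "200",
        "6PC4KROOPHEBXIC6A3GRFFHBMO2EJ4F7",
        "29215",
    ],
    [
        "au,gov,nla)/",
        "19980205162107",
        "http://www.nla.gov.au:80/",
        "text/html",
        "200",
        "LIXK3YXSUFO5KPOO22XIMQGPWXKNHV6X",
        "1920",
    ],
]

sample_df = pd.DataFrame(sample[1:], columns=sample[0])
# Timestamps are 14-digit strings in YYYYMMDDhhmmss form
sample_df["date"] = pd.to_datetime(sample_df["timestamp"], format="%Y%m%d%H%M%S")
sample_df["length"] = sample_df["length"].astype("int")
sample_df = sample_df.sort_values(by="date", ignore_index=True)
print(sample_df[["date", "original", "length"]])
```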

As noted in the notebook [comparing the CDX API with Timemaps](getting_all_snapshots_timemap_vs_cdx.ipynb), there are a number of duplicate snapshots in the CDX results, so let's remove them.

In [5]:
print(f"Before: {df.shape[0]}")
df.drop_duplicates(
    subset=["timestamp", "original", "digest", "statuscode", "mimetype"],
    keep="first",
    inplace=True,
)
print(f"After: {df.shape[0]}")

Before: 4451
After: 4350
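If you want to see which rows are being treated as duplicates before dropping them, something like the following might help. It's a sketch using made-up rows (the `captures` dataframe is invented for illustration), with the same `subset` of columns as above.

```python
import pandas as pd

# Invented rows for illustration: the first two are identical captures,
# the third shares a digest but was captured at a different time
captures = pd.DataFrame(
    {
        "timestamp": ["20200101000000", "20200101000000", "20200102000000"],
        "original": ["http://example.com/"] * 3,
        "digest": ["AAA", "AAA", "AAA"],
        "statuscode": ["200", "200", "200"],
        "mimetype": ["text/html"] * 3,
    }
)
subset = ["timestamp", "original", "digest", "statuscode", "mimetype"]

# keep=False flags every member of a duplicate group, not just the repeats
flagged = captures[captures.duplicated(subset=subset, keep=False)]
print(flagged)

deduped = captures.drop_duplicates(subset=subset, keep="first")
print(f"Before: {len(captures)}, after: {len(deduped)}")
```

Because `timestamp` is part of the subset, an unchanged page captured at a different time is kept as a separate snapshot.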


## The basic shape

In [6]:
df["date"].min()

Timestamp('1996-10-19 06:42:23')

In [7]:
df["date"].max()

Timestamp('2022-04-10 18:38:06')

In [8]:
df["length"].describe()

count     4350.000000
mean      8318.689655
std       7854.281544
min        235.000000
25%        533.000000
50%       5699.000000
75%      14852.750000
max      30062.000000
Name: length, dtype: float64

In [9]:
df["statuscode"].value_counts()

200    2948
301     775
-       315
302     309
503       3
Name: statuscode, dtype: int64

In [10]:
df["mimetype"].value_counts()

text/html       4033
warc/revisit     315
unk                2
Name: mimetype, dtype: int64

## Plotting snapshots over time

In [11]:
# This is just a bit of fancy customisation to group the types of errors by color
# See https://altair-viz.github.io/user_guide/customization.html#customizing-colors
domain = ["-", "200", "301", "302", "404", "503"]
# green for ok, blue for redirects, red for errors
range_ = ["#888888", "#39a035", "#5ba3cf", "#125ca4", "#e13128", "#b21218"]

alt.Chart(df).mark_point().encode(
    x="date:T",
    y="length:Q",
    color=alt.Color("statuscode", scale=alt.Scale(domain=domain, range=range_)),
    tooltip=["date", "length", "statuscode"],
).properties(width=700, height=300)

## Looking at domains, protocols, and redirects

Looking at the chart above, it's hard to understand why a request for a page is sometimes redirected, and sometimes not. To understand this we have to look a bit closer at what pages are actually being archived. Let's look at the breakdown of values in the `original` column. These are the URLs being requested by the archiving bot.

In [12]:
df["original"].value_counts()

https://www.nla.gov.au/                  1508
http://www.nla.gov.au/                   1178
http://www.nla.gov.au:80/                 868
http://nla.gov.au/                        588
http://nla.gov.au:80/                      77
https://nla.gov.au/                        62
http://www.nla.gov.au//                    21
http://www.nla.gov.au                      11
http://www2.nla.gov.au:80/                 10
https://www.nla.gov.au                     10
http://Trove@nla.gov.au/                    6
http://www.nla.gov.au:80/?                  2
http://www.nla.gov.au./                     2
http://nla.gov.au                           1
http://mailto:media@nla.gov.au/             1
http://cmccarthy@nla.gov.au/                1
http://mailto:development@nla.gov.au/       1
http://mailto:www@nla.gov.au/               1
http://www.nla.gov.au:80//                  1
http://www.nla.gov.au/?                     1
Name: original, dtype: int64

Ah ok, so there's actually a mix of things in here – some include the 'www' prefix and some don't, some use the 'https' protocol and some just plain old 'http'. There's also a bit of junk in there from badly parsed `mailto` links. To look at the differences in more detail, let's create new columns for `subdomain` and `protocol`.

In [13]:
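# Pull the first label of the domain out of the url we started with (here, 'nla')
base_domain = re.search(r"https?://(\w*)\.", url).group(1)
# Capture any subdomain (such as 'www') that comes before that label
df["subdomain"] = df["original"].str.extract(
    r"^https?://(\w*)\.{}\.".format(re.escape(base_domain)),
    flags=re.IGNORECASE,
    expand=False,
)
# Rows without a subdomain get an empty string instead of NaN
df["subdomain"] = df["subdomain"].fillna("")
df["subdomain"].value_counts()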

www     3602
         738
www2      10
Name: subdomain, dtype: int64

In [14]:
df["protocol"] = df["original"].str.extract(r"^(https?):", expand=False)
df["protocol"].value_counts()

http     2770
https    1580
Name: protocol, dtype: int64
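As an alternative to regular expressions, Python's standard `urllib.parse` module can split these values into their parts, which also shows why the badly parsed `mailto` links look the way they do. This is a sketch only; `describe` is a hypothetical helper, and the example urls are taken from the `original` values listed above.

```python
from urllib.parse import urlsplit


def describe(original):
    """Hypothetical helper: return (scheme, hostname, port, username) for a url."""
    parts = urlsplit(original)
    return parts.scheme, parts.hostname, parts.port, parts.username


examples = [
    "http://www.nla.gov.au:80/",
    "https://nla.gov.au/",
    "http://Trove@nla.gov.au/",
    "http://mailto:media@nla.gov.au/",
]
for original in examples:
    print(original, describe(original))
```

Anything before an `@` is treated as user information rather than part of the host, so the email-style values still resolve to the bare `nla.gov.au` host.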

### Change in protocol

Let's look to see how the proportion of requests using each of the protocols changes over time. Here we're grouping the rows by year.

In [15]:
alt.Chart(df).mark_bar().encode(
    x="year(date):T",
    y=alt.Y("count()", stack="normalize"),
    color="protocol:N",
    # tooltip=['date', 'length', 'subdomain:N']
).properties(width=700, height=200)

No real surprise there given the increased use of https generally.
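If you'd like the numbers behind a normalised bar chart like this one, `pd.crosstab` can calculate the yearly proportions directly. The sketch below uses a handful of invented captures (`toy` is not real data) to show the idea.

```python
import pandas as pd

# Invented captures for illustration only
toy = pd.DataFrame(
    {
        "date": pd.to_datetime(
            [
                "2016-03-01",
                "2016-09-01",
                "2019-02-01",
                "2019-06-01",
                "2019-11-01",
                "2019-12-01",
            ]
        ),
        "protocol": ["http", "http", "http", "https", "https", "https"],
    }
)

# normalize="index" makes each year's row sum to 1
shares = pd.crosstab(toy["date"].dt.year, toy["protocol"], normalize="index")
print(shares)
```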

### Status codes by subdomain

Let's now compare the proportion of status codes between the bare `nla.gov.au` domain and the `www` subdomain.

In [16]:
alt.Chart(
    df.loc[(df["statuscode"] != "-") & (df["subdomain"] != "www2")]
).mark_bar().encode(
    x="year(date):T",
    y=alt.Y("count()", stack="normalize"),
    color=alt.Color("statuscode", scale=alt.Scale(domain=domain, range=range_)),
    row="subdomain",
    tooltip=["year(date):T", "statuscode"],
).properties(
    width=700, height=100
)

I think we can start to see what's going on. Around 2004, requests to `nla.gov.au` started to be redirected to `www.nla.gov.au`, giving a [302](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/302) response, which indicates that the page had been moved temporarily. But why the growth in [301](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/301) (moved permanently) responses from both domains after 2018? If we look at the chart above showing the increased use of the `https` protocol, I think we could guess that `http` requests in both domains are being redirected to `https`.

### Status codes by protocol

Let's test that hypothesis by looking at the distribution of status codes by protocol.

In [17]:
alt.Chart(
    df.loc[(df["statuscode"] != "-") & (df["subdomain"] != "www2")]
).mark_bar().encode(
    x="year(date):T",
    y=alt.Y("count()", stack="normalize"),
    color=alt.Color("statuscode", scale=alt.Scale(domain=domain, range=range_)),
    row="protocol",
    tooltip=["year(date):T", "protocol", "statuscode"],
).properties(
    width=700, height=100
)

We can see that by 2019, all requests using `http` are being redirected to `https`.
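A chart reading like this could also be checked numerically by calculating, for each year, the share of `http` captures that returned a redirect. Here's a minimal sketch with invented rows (`toy` is not real data), treating 301 and 302 as the redirect codes.

```python
import pandas as pd

# Invented captures for illustration only
toy = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2017-05-01", "2017-08-01", "2019-03-01", "2019-07-01", "2019-09-01"]
        ),
        "protocol": ["http", "http", "http", "http", "https"],
        "statuscode": ["200", "301", "301", "301", "200"],
    }
)

http_only = toy.loc[toy["protocol"] == "http"]
# True where the capture was a redirect
is_redirect = http_only["statuscode"].isin(["301", "302"])
# The mean of a boolean series is the proportion of True values
redirect_share = is_redirect.groupby(http_only["date"].dt.year).mean()
print(redirect_share)
```

A value of 1.0 for a year would mean every `http` capture in that year was redirected.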

## Looking for major changes

Now that we understand what's going on with the different domains and status codes, I think we can focus on just the 'www' domain and the '200' responses.

In [18]:
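# Keep successful captures from the 'www' subdomain only.
# The length filter also drops very small captures (1000 or less).
# Copy after filtering so that new columns can be added safely.
df_200 = df.loc[
    (df["statuscode"] == "200") & (df["subdomain"] == "www") & (df["length"] > 1000)
].copy()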

alt.Chart(df_200).mark_point().encode(
    x="date:T", y="length:Q", tooltip=["date", "length"]
).properties(width=700, height=300)

Pandas makes it easy to calculate the difference between two adjacent values, so let's find the absolute difference in length between each capture and the one before it.

In [19]:
df_200["change_in_length"] = abs(df_200["length"].diff())
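Here's a tiny worked example of what `diff()` does, using invented lengths. The first value has no predecessor, so its difference is `NaN`, and taking the absolute value means growth and shrinkage are treated alike.

```python
import pandas as pd

# Invented lengths for illustration only
lengths = pd.Series([5000, 5200, 9000, 8900], name="length")

signed = lengths.diff()  # NaN, 200, 3800, -100
absolute = signed.abs()  # NaN, 200, 3800, 100
print(absolute)

# nlargest() is another way of picking out the biggest changes
print(absolute.nlargest(2))
```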

Now we can look at the captures that varied most in length from their predecessor.

In [20]:
top_ten_changes = df_200.sort_values(by="change_in_length", ascending=False)[:10]
top_ten_changes

| | urlkey | timestamp | original | mimetype | statuscode | digest | length | date | subdomain | protocol | change_in_length |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 3656 | au,gov,nla)/ | 20210701042826 | https://www.nla.gov.au/ | text/html | 200 | 6PC4KROOPHEBXIC6A3GRFFHBMO2EJ4F7 | 29215 | 2021-07-01 04:28:26 | www | https | 13933.0 |
| 4134 | au,gov,nla)/ | 20220202054835 | https://www.nla.gov.au/ | text/html | 200 | O6VCJWSKLU3EAFDSH4IGOIW264Y6IIXZ | 27495 | 2022-02-02 05:48:35 | www | https | 4648.0 |
| 3954 | au,gov,nla)/ | 20220105025646 | https://www.nla.gov.au/ | text/html | 200 | MRIUSTANGOWT3CT5QSSRJ7NJPEN2RSEN | 27273 | 2022-01-05 02:56:46 | www | https | 4463.0 |
| 4058 | au,gov,nla)/ | 20220121065839 | https://www.nla.gov.au/ | text/html | 200 | UAGVH7YN6ZPJYQUIZTP32G4GF3JJ2N7J | 22948 | 2022-01-21 06:58:39 | www | https | 4394.0 |
| 4417 | au,gov,nla)/ | 20220405063728 | https://www.nla.gov.au/ | text/html | 200 | HXSLRIVPKI3ECPC5V6NNEJOIHTDKZ5JJ | 22682 | 2022-04-05 06:37:28 | www | https | 4375.0 |
| 3921 | au,gov,nla)/ | 20211228211936 | https://www.nla.gov.au/ | text/html | 200 | FQZBN2C7DPFPC26F6HX3KDNGWCWVOWWX | 22831 | 2021-12-28 21:19:36 | www | https | 4374.0 |
| 3946 | au,gov,nla)/ | 20220103064507 | https://www.nla.gov.au/ | text/html | 200 | PUJLOJI7OUJ4XFKUDI47HOQLFOMEQLKJ | 22917 | 2022-01-03 06:45:07 | www | https | 4367.0 |
| 4322 | au,gov,nla)/ | 20220323022916 | https://www.nla.gov.au/ | text/html | 200 | WEP3PTHC7CAEF22S3NDIMZHFDAQIK65J | 26919 | 2022-03-23 02:29:16 | www | https | 4359.0 |
| 3939 | au,gov,nla)/ | 20220102132710 | https://www.nla.gov.au/ | text/html | 200 | SBM5ATRRZWVT7HYA6J3BMMOXG4HTDZFD | 27251 | 2022-01-02 13:27:10 | www | https | 4352.0 |
| 4215 | au,gov,nla)/ | 20220224211329 | https://www.nla.gov.au/ | text/html | 200 | WCIDXIQ22M35PWXUGK7GCXG2LEJUAJ5J | 23208 | 2022-02-24 21:13:29 | www | https | 4351.0 |


Let's try visualising this by highlighting the major changes in length.

In [21]:
points = (
    alt.Chart(df_200)
    .mark_point()
    .encode(x="date:T", y="length:Q", tooltip=["date", "length"])
    .properties(width=700, height=300)
)

lines = (
    alt.Chart(top_ten_changes)
    .mark_rule(color="red")
    .encode(x="date:T", tooltip=["date"])
    .properties(width=700, height=300)
)

points + lines

Rather than just a raw number, perhaps the percentage change in length would be more useful. Once again, Pandas makes this easy to calculate. This calculates the change relative to the previous value, that is `(length2 - length1) / length1`, expressed as a fraction rather than multiplied by 100.

In [22]:
df_200["pct_change_in_length"] = abs(df_200["length"].pct_change())
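To check that `pct_change()` matches the formula, here's a small sketch using one pair of values from this dataset. The 1998 capture has a length of 1920 and a `change_in_length` of 757, so I'm assuming its predecessor had a length of 1163 (that value is inferred, not read directly from the data).

```python
import pandas as pd

# 1163 is inferred (1920 - 757); 1920 comes from the 1998 capture
lengths = pd.Series([1163, 1920], name="length")

pct = lengths.pct_change().abs()

# The same calculation done by hand: (length2 - length1) / length1
by_hand = (lengths.iloc[1] - lengths.iloc[0]) / lengths.iloc[0]

print(round(pct.iloc[1], 6), round(by_hand, 6))
```

The result is a fraction, so a value of about 0.65 means the capture was roughly 65% longer than the one before it.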

In [23]:
top_ten_changes_pct = df_200.sort_values(by="pct_change_in_length", ascending=False)[
    :10
]
top_ten_changes_pct

| | urlkey | timestamp | original | mimetype | statuscode | digest | length | date | subdomain | protocol | change_in_length | pct_change_in_length |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3656 | au,gov,nla)/ | 20210701042826 | https://www.nla.gov.au/ | text/html | 200 | 6PC4KROOPHEBXIC6A3GRFFHBMO2EJ4F7 | 29215 | 2021-07-01 04:28:26 | www | https | 13933.0 | 0.911726 |
| 13 | au,gov,nla)/ | 19980205162107 | http://www.nla.gov.au:80/ | text/html | 200 | LIXK3YXSUFO5KPOO22XIMQGPWXKNHV6X | 1920 | 1998-02-05 16:21:07 | www | http | 757.0 | 0.650903 |
| 79 | au,gov,nla)/ | 20011003175018 | http://www.nla.gov.au:80/ | text/html | 200 | BWGDP6NTGVOI2TBA62P7IZ2PPWRLOODN | 3367 | 2001-10-03 17:50:18 | www | http | 1004.0 | 0.424884 |
| 1519 | au,gov,nla)/ | 20160901112433 | http://www.nla.gov.au/ | text/html | 200 | MZD7NTLMH5HBXSFQIHTGTC6IELQDYBN2 | 11541 | 2016-09-01 11:24:33 | www | http | 2738.0 | 0.31103 |
| 1184 | au,gov,nla)/ | 20130211044309 | http://www.nla.gov.au/ | text/html | 200 | QWCVHAK2Y6WXLDNIMTJLZT5RY6YCJ7UN | 8521 | 2013-02-11 04:43:09 | www | http | 1698.0 | 0.248864 |
| 1067 | au,gov,nla)/ | 20110611064218 | http://www.nla.gov.au/ | text/html | 200 | Z7PQG2MVOOQUZ62ASNRAWDFLSYFHATQT | 5601 | 2011-06-11 06:42:18 | www | http | 1739.0 | 0.236921 |
| 2049 | au,gov,nla)/ | 20181212014241 | https://www.nla.gov.au/ | text/html | 200 | C3YSGGHG52WIG6U6X7C3XOL5LOHVIINM | 14813 | 2018-12-12 01:42:41 | www | https | 2831.0 | 0.236271 |
| 786 | au,gov,nla)/ | 20061107083938 | http://www.nla.gov.au:80/ | text/html | 200 | HOW52ARISTA4HTCLFPYNCQWR6NK2N2NF | 5662 | 2006-11-07 08:39:38 | www | http | 1561.0 | 0.216115 |
| 4134 | au,gov,nla)/ | 20220202054835 | https://www.nla.gov.au/ | text/html | 200 | O6VCJWSKLU3EAFDSH4IGOIW264Y6IIXZ | 27495 | 2022-02-02 05:48:35 | www | https | 4648.0 | 0.20344 |
| 3864 | au,gov,nla)/ | 20211121150551 | https://www.nla.gov.au/ | text/html | 200 | GEPXW6EKI7GBG22SMUFIIXQW3KIQXNIY | 25912 | 2021-11-21 15:05:51 | www | https | 4316.0 | 0.199852 |


In [24]:
lines = (
    alt.Chart(top_ten_changes_pct)
    .mark_rule(color="red")
    .encode(x="date:T", tooltip=["date"])
    .properties(width=700, height=300)
)

points + lines

By focusing on percentage difference we can see that more prominence is given to the change in 2001. But rather than just the top 10, should we look at changes greater than 10% or some other threshold?

In [25]:
lines = (
    alt.Chart(df_200.loc[df_200["pct_change_in_length"] > 0.1])
    .mark_rule(color="red")
    .encode(x="date:T", tooltip=["date"])
    .properties(width=700, height=300)
)

points + lines
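One way of choosing a threshold might be to count how many captures each candidate value would flag. The sketch below uses invented fractional changes (`pct` is not real data), and the thresholds are arbitrary examples.

```python
import pandas as pd

# Invented fractional changes for illustration only
pct = pd.Series([0.02, 0.12, 0.25, 0.08, 0.91, 0.04])

flagged_counts = {}
for threshold in (0.05, 0.1, 0.2):
    # Count the values that exceed this threshold
    flagged_counts[threshold] = int((pct > threshold).sum())

for threshold, count in flagged_counts.items():
    print(f"More than {threshold:.0%}: {count} captures")
```

A lower threshold picks up more of the routine variation between captures, while a higher one leaves only the largest jumps.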

## Other possibilities to explore

* Rate of change – what proportion of the snapshots each year are *different*?
* Use similarity measures to identify changes.
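
As a starting point for the first idea, the rate of change could be estimated by comparing each capture's `digest` with the one before it. This is only a sketch under some assumptions: the rows in `toy` are invented, a changed digest is treated as a changed page, and the very first capture is counted as unchanged because it has nothing to be compared with.

```python
import pandas as pd

# Invented captures for illustration only
toy = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2020-01-01", "2020-02-01", "2020-03-01", "2021-01-01", "2021-02-01"]
        ),
        "digest": ["A", "A", "B", "B", "C"],
    }
).sort_values("date", ignore_index=True)

# True where the digest differs from the previous capture
toy["changed"] = toy["digest"] != toy["digest"].shift()
# The first capture has no predecessor, so don't count it as a change
toy.loc[0, "changed"] = False

# The mean of a boolean column is the proportion of True values
rate = toy.groupby(toy["date"].dt.year)["changed"].mean()
print(rate)
```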

## Comparing individual captures

Once major changes, such as those above, have been identified, we can use some of the other notebooks in this repository to compare individual captures. For example:

* [Compare two versions of an archived web page](show_diffs.ipynb)
* [Create and compare full page screenshots from archived web pages](save_screenshot.ipynb)

----
Created by [Tim Sherratt](https://timsherratt.org) for the [GLAM Workbench](https://glam-workbench.github.io). Support me by becoming a [GitHub sponsor](https://github.com/sponsors/wragge)!

Work on this notebook was supported by the [IIPC Discretionary Funding Programme 2019-2020](http://netpreserve.org/projects/)