# Observing change in a web page over time

<p class="alert alert-info">New to Jupyter notebooks? Try <a href="getting-started/Using_Jupyter_notebooks.ipynb"><b>Using Jupyter notebooks</b></a> for a quick introduction.</p>

This notebook explores what we can find when we look at all captures of a single page over time.

<p class="alert alert-warning">Work in progress – this notebook isn't finished yet. Check back later for more...</p>

In [71]:
import requests
import pandas as pd
import altair as alt
import re
from difflib import HtmlDiff
from IPython.display import display, HTML
import arrow

In [72]:
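def query_cdx(url, **kwargs):
    '''
    Query the Internet Archive's CDX API for captures of a url.
    Any keyword arguments are passed to the API as query parameters.
    Returns the JSON results as a list of rows, with the column names in the first row.
    '''
    params = kwargs
    params['url'] = url
    params['output'] = 'json'
    response = requests.get('http://web.archive.org/cdx/search/cdx', params=params, headers={'User-Agent': ''})
    response.raise_for_status()
    return response.json()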
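
Because `query_cdx()` passes any extra keyword arguments through as query parameters, it can also be used to request a narrower set of captures. The sketch below only builds the request URL (it makes no network call) to show how the parameters are combined. `build_cdx_url()` is a hypothetical helper, not part of this notebook; `filter`, `collapse`, `from`, and `to` are CDX API parameters, and the values used here are purely illustrative.

```python
from urllib.parse import urlencode, urlparse, parse_qs

CDX_URL = 'http://web.archive.org/cdx/search/cdx'


def build_cdx_url(url, **kwargs):
    '''Hypothetical helper: return the URL that a CDX query with these parameters would request.'''
    params = dict(kwargs)
    params['url'] = url
    params['output'] = 'json'
    return f'{CDX_URL}?{urlencode(params)}'


# 'from' is a reserved word in Python, so it has to be supplied by unpacking a dict
example_url = build_cdx_url(
    'http://nla.gov.au',
    filter='statuscode:200',
    collapse='digest',
    **{'from': '2018', 'to': '2019'}
)
print(example_url)

# Parse the query string back into a dict to check what would be sent
query = parse_qs(urlparse(example_url).query)
print(query)
```

The same keyword arguments could be given to `query_cdx()` itself, for example `query_cdx(url, filter='statuscode:200')`.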

In [73]:
url = 'http://nla.gov.au'

## Getting the data

In this example we're using the Internet Archive's CDX API, but this could easily be adapted to use [Timemaps](find_all_captures.ipynb) from a range of repositories.

In [74]:
data = query_cdx(url)

# Convert to a dataframe
# The column names are in the first row
df = pd.DataFrame(data[1:], columns=data[0])

# Convert the timestamp string into a datetime object
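# Timestamps are 14 digit strings (YYYYMMDDhhmmss), so give the format explicitly
df['date'] = pd.to_datetime(df['timestamp'], format='%Y%m%d%H%M%S')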
df.sort_values(by='date', inplace=True, ignore_index=True)

# Convert the length from a string into an integer
df['length'] = df['length'].astype('int')
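
To see what these conversions do, here's a minimal sketch using three rows copied from the tables of results in this notebook, arranged in the same list-of-lists shape that the code above expects. It assumes the timestamps always have the 14 digit `YYYYMMDDhhmmss` form, so the format is stated explicitly rather than inferred.

```python
import pandas as pd

# A header row followed by three captures (values copied from the results tables in this notebook)
data = [
    ['urlkey', 'timestamp', 'original', 'mimetype', 'statuscode', 'digest', 'length'],
    ['au,gov,nla)/', '20181212014241', 'https://www.nla.gov.au/', 'text/html', '200', 'C3YSGGHG52WIG6U6X7C3XOL5LOHVIINM', '14813'],
    ['au,gov,nla)/', '19980205162107', 'http://www.nla.gov.au:80/', 'text/html', '200', 'LIXK3YXSUFO5KPOO22XIMQGPWXKNHV6X', '1920'],
    ['au,gov,nla)/', '20110611064218', 'http://www.nla.gov.au/', 'text/html', '200', 'Z7PQG2MVOOQUZ62ASNRAWDFLSYFHATQT', '5601'],
]

# The column names are in the first row
sample = pd.DataFrame(data[1:], columns=data[0])

# Every value arrives as a string, so convert the timestamps and lengths
sample['date'] = pd.to_datetime(sample['timestamp'], format='%Y%m%d%H%M%S')
sample = sample.sort_values(by='date', ignore_index=True)
sample['length'] = sample['length'].astype('int')

print(sample[['timestamp', 'date', 'length']])
```

After sorting, the 1998 capture comes first and `length` can be used in numeric comparisons and charts.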

As noted in the notebook [comparing the CDX API with Timemaps](getting_all_snapshots_timemap_vs_cdx.ipynb), there are a number of duplicate snapshots in the CDX results, so let's remove them.

In [75]:
print(f'Before: {df.shape[0]}')
df.drop_duplicates(subset=['timestamp', 'original', 'digest', 'statuscode', 'mimetype'], keep='first', inplace=True)
print(f'After: {df.shape[0]}')

Before: 2840
After: 2740
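
As a small illustration of what `drop_duplicates()` is doing here, the sketch below uses made-up rows (not real CDX results). Two rows only count as duplicates when every column listed in `subset` matches, so a capture with the same timestamp but a different digest is kept.

```python
import pandas as pd

# Made-up rows for illustration only
captures = pd.DataFrame([
    {'timestamp': '20200101000000', 'original': 'http://example.com/', 'digest': 'AAA', 'statuscode': '200', 'mimetype': 'text/html'},
    # An exact repeat of the first row, so it will be dropped
    {'timestamp': '20200101000000', 'original': 'http://example.com/', 'digest': 'AAA', 'statuscode': '200', 'mimetype': 'text/html'},
    # Same timestamp but a different digest, so it will be kept
    {'timestamp': '20200101000000', 'original': 'http://example.com/', 'digest': 'BBB', 'statuscode': '200', 'mimetype': 'text/html'},
])

deduped = captures.drop_duplicates(
    subset=['timestamp', 'original', 'digest', 'statuscode', 'mimetype'],
    keep='first'
)

print(f'Before: {captures.shape[0]}')
print(f'After: {deduped.shape[0]}')
```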


## The basic shape

In [35]:
df['date'].min()

Timestamp('1996-10-19 06:42:23')

In [36]:
df['date'].max()

Timestamp('2020-04-27 07:42:20')

In [37]:
df['length'].describe()

count     2740.000000
mean      6497.322263
std       5027.627203
min        296.000000
25%        643.000000
50%       5405.500000
75%      11409.500000
max      15950.000000
Name: length, dtype: float64

In [38]:
df['statuscode'].value_counts()

200    2036
301     273
302     263
-       166
503       2
Name: statuscode, dtype: int64

In [39]:
df['mimetype'].value_counts()

text/html       2574
warc/revisit     166
Name: mimetype, dtype: int64

## Plotting snapshots over time

In [40]:
# This is just a bit of fancy customisation to group the status codes by color
# See https://altair-viz.github.io/user_guide/customization.html#customizing-colors
domain = ['-', '200', '301', '302', '404', '503']
# grey for no status code, green for ok, blue for redirects, red for errors
range_ = ['#888888', '#39a035', '#5ba3cf', '#125ca4', '#e13128', '#b21218']

alt.Chart(df).mark_point().encode(
    x='date:T',
    y='length:Q',
    color=alt.Color('statuscode', scale=alt.Scale(domain=domain, range=range_)),
    tooltip=['date', 'length', 'statuscode']
).properties(width=700, height=300)

## Looking at domains, protocols, and redirects

Looking at the chart above, it's hard to understand why a request for a page is sometimes redirected, and sometimes not. To understand this we have to look more closely at which pages are actually being archived. Let's look at the breakdown of values in the `original` column. These are the URLs requested by the archiving bot.

In [41]:
df['original'].value_counts()

http://www.nla.gov.au:80/                863
http://www.nla.gov.au/                   728
https://www.nla.gov.au/                  590
http://nla.gov.au/                       421
http://nla.gov.au:80/                     74
http://www.nla.gov.au//                   17
https://nla.gov.au/                       14
http://www.nla.gov.au                     11
http://www2.nla.gov.au:80/                10
http://Trove@nla.gov.au/                   6
http://www.nla.gov.au:80/?                 2
http://www.nla.gov.au:80//                 1
http://www.nla.gov.au./                    1
http://mailto:development@nla.gov.au/      1
http://mailto:www@nla.gov.au/              1
Name: original, dtype: int64

Ah ok, so there's actually a mix of things in here – some include the 'www' prefix and some don't, some use the 'https' protocol and some just plain old 'http'. There's also a bit of junk in there from badly parsed `mailto` links. To look at the differences in more detail, let's create new columns for `subdomain` and `protocol`.

In [42]:
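# Get the first part of the domain name from the url, eg 'nla'
base_domain = re.search(r'https?://(\w*)\.', url).group(1)
# Extract whatever sits between the protocol and the base domain, eg 'www'
df['subdomain'] = df['original'].str.extract(r'^https?://(\w*)\.{}\.'.format(base_domain), flags=re.IGNORECASE, expand=False)
# Urls without a subdomain don't match the pattern, so replace the resulting NaNs with empty strings
df['subdomain'] = df['subdomain'].fillna('')
df['subdomain'].value_counts()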

www     2213
         517
www2      10
Name: subdomain, dtype: int64
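
To see how the subdomain pattern treats the different kinds of url, here's a minimal sketch that applies the same regular expression, using Python's `re` module directly, to a few of the `original` values listed above. `get_subdomain()` is a hypothetical helper added for illustration. Note that the `base_domain` step assumes the starting `url` has no subdomain of its own; given `http://www.nla.gov.au` it would return `www` rather than `nla`.

```python
import re

url = 'http://nla.gov.au'

# The first part of the domain name, in this case 'nla'
base_domain = re.search(r'https?://(\w*)\.', url).group(1)

# Whatever sits between the protocol and the base domain is treated as the subdomain
pattern = re.compile(r'^https?://(\w*)\.{}\.'.format(re.escape(base_domain)), re.IGNORECASE)


def get_subdomain(original):
    '''Hypothetical helper: return the subdomain, or an empty string if there's no match.'''
    match = pattern.search(original)
    return match.group(1) if match else ''


# A selection of values from the `original` column
originals = [
    'http://www.nla.gov.au:80/',
    'http://nla.gov.au/',
    'http://www2.nla.gov.au:80/',
    'http://www.nla.gov.au./',
    'http://Trove@nla.gov.au/',
    'http://mailto:development@nla.gov.au/',
]

subdomains = {original: get_subdomain(original) for original in originals}
for original, subdomain in subdomains.items():
    print(f'{original!r} -> {subdomain!r}')
```

The bare domain and the badly parsed `mailto` and `Trove@` links don't match the pattern, which is why they all end up grouped under the empty string in the counts above.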

In [43]:
# Capture 'http' or 'https' from the start of each url
df['protocol'] = df['original'].str.extract(r'^(https?):', expand=False)
df['protocol'].value_counts()

http     2136
https     604
Name: protocol, dtype: int64

### Change in protocol

Let's look to see how the proportion of requests using each of the protocols changes over time. Here we're grouping the rows by year.

In [44]:
alt.Chart(df).mark_bar().encode(
    x='year(date):T',
    y=alt.Y('count()',stack="normalize"),
    color='protocol:N',
    #tooltip=['date', 'length', 'subdomain:N']
).properties(width=700, height=200)

No real surprise there given the increased use of https generally.
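
The chart normalises the counts within each year, so each bar shows proportions rather than totals. If you want the numbers behind a chart like this, one option is `pd.crosstab()`. The sketch below uses a handful of made-up captures (not the real data) and assumes the same `date` and `protocol` columns as the dataframe above.

```python
import pandas as pd

# Made-up captures for illustration only
captures = pd.DataFrame({
    'date': pd.to_datetime([
        '2017-01-10', '2017-03-05', '2017-06-20', '2017-11-02',
        '2019-02-14', '2019-05-01', '2019-08-23', '2019-12-30',
    ]),
    'protocol': ['http', 'http', 'http', 'https', 'http', 'https', 'https', 'https'],
})

# Count captures by year and protocol, then divide each row by its total
proportions = pd.crosstab(
    captures['date'].dt.year,
    captures['protocol'],
    normalize='index'
)
print(proportions)
```

Each row of the result sums to 1, matching the `stack="normalize"` setting in the chart.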

### Status codes by subdomain

Let's now compare the proportion of status codes between the bare `nla.gov.au` domain and the `www` subdomain.

In [45]:
alt.Chart(df.loc[(df['statuscode'] != '-') & (df['subdomain'] != 'www2')]).mark_bar().encode(
    x='year(date):T',
    y=alt.Y('count()',stack="normalize"),
    color=alt.Color('statuscode', scale=alt.Scale(domain=domain, range=range_)),
    row='subdomain',
    tooltip=['year(date):T', 'statuscode']
).properties(width=700, height=100)

I think we can start to see what's going on. Around about 2004, requests to `nla.gov.au` started to be redirected to `www.nla.gov.au` giving a [302](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/302) response, indicating that the page had been moved temporarily. But why the growth in [301](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/301) (moved permanently) responses from both domains after 2018? If we look at the chart above showing the increased use of the `https` protocol, I think we could guess that `http` requests in both domains are being redirected to `https`.

### Status codes by protocol

Let's test that hypothesis by looking at the distribution of status codes by protocol.

In [46]:
alt.Chart(df.loc[(df['statuscode'] != '-') & (df['subdomain'] != 'www2')]).mark_bar().encode(
    x='year(date):T',
    y=alt.Y('count()',stack="normalize"),
    color=alt.Color('statuscode', scale=alt.Scale(domain=domain, range=range_)),
    row='protocol',
    tooltip=['year(date):T', 'protocol', 'statuscode']
).properties(width=700, height=100)

We can see that by 2019, all requests using `http` are being redirected to `https`.

## Looking for major changes

Now that we understand what's going on with the different domains and status codes, I think we can focus on just the 'www' domain and the '200' responses. The code below also leaves out captures with a length of 1000 or less.

In [47]:
# Filter first, then copy, so that new columns can be added to the filtered dataframe
df_200 = df.loc[(df['statuscode'] == '200') & (df['subdomain'] == 'www') & (df['length'] > 1000)].copy()

alt.Chart(df_200).mark_point().encode(
    x='date:T',
    y='length:Q',
    tooltip=['date', 'length']
).properties(width=700, height=300)

Pandas makes it easy to calculate the difference between two adjacent values, so let's find the absolute difference in length between each capture and the one before it.

In [48]:
df_200['change_in_length'] = df_200['length'].diff().abs()

Now we can look at the captures that varied most in length from their predecessor.

In [49]:
top_ten_changes = df_200.sort_values(by='change_in_length', ascending=False)[:10]
top_ten_changes

| | urlkey | timestamp | original | mimetype | statuscode | digest | length | date | subdomain | protocol | change_in_length |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2043 | au,gov,nla)/ | 20181212014241 | https://www.nla.gov.au/ | text/html | 200 | C3YSGGHG52WIG6U6X7C3XOL5LOHVIINM | 14813 | 2018-12-12 01:42:41 | www | https | 2831.0 |
| 1519 | au,gov,nla)/ | 20160901112433 | http://www.nla.gov.au/ | text/html | 200 | MZD7NTLMH5HBXSFQIHTGTC6IELQDYBN2 | 11541 | 2016-09-01 11:24:33 | www | http | 2738.0 |
| 1067 | au,gov,nla)/ | 20110611064218 | http://www.nla.gov.au/ | text/html | 200 | Z7PQG2MVOOQUZ62ASNRAWDFLSYFHATQT | 5601 | 2011-06-11 06:42:18 | www | http | 1739.0 |
| 1183 | au,gov,nla)/ | 20130211044309 | http://www.nla.gov.au/ | text/html | 200 | QWCVHAK2Y6WXLDNIMTJLZT5RY6YCJ7UN | 8521 | 2013-02-11 04:43:09 | www | http | 1698.0 |
| 786 | au,gov,nla)/ | 20061107083938 | http://www.nla.gov.au:80/ | text/html | 200 | HOW52ARISTA4HTCLFPYNCQWR6NK2N2NF | 5662 | 2006-11-07 08:39:38 | www | http | 1561.0 |
| 1185 | au,gov,nla)/ | 20130302083331 | http://www.nla.gov.au/ | text/html | 200 | 77Y6PJF3MYUZ4JUSTK4T237RRUASTO7X | 6965 | 2013-03-02 08:33:31 | www | http | 1556.0 |
| 79 | au,gov,nla)/ | 20011003175018 | http://www.nla.gov.au:80/ | text/html | 200 | BWGDP6NTGVOI2TBA62P7IZ2PPWRLOODN | 3367 | 2001-10-03 17:50:18 | www | http | 1004.0 |
| 906 | au,gov,nla)/ | 20090622194559 | http://www.nla.gov.au:80/? | text/html | 200 | X6KRELQBTLUYZT7NWH6JRVJGCBF7YFQB | 6495 | 2009-06-22 19:45:59 | www | http | 925.0 |
| 2131 | au,gov,nla)/ | 20190319065001 | https://www.nla.gov.au/ | text/html | 200 | ZGSCTK3IMTBSAJ7PAQUWOH7GATGT5MB4 | 14478 | 2019-03-19 06:50:01 | www | https | 854.0 |
| 13 | au,gov,nla)/ | 19980205162107 | http://www.nla.gov.au:80/ | text/html | 200 | LIXK3YXSUFO5KPOO22XIMQGPWXKNHV6X | 1920 | 1998-02-05 16:21:07 | www | http | 757.0 |


Let's try visualising this by highlighting the major changes in length.

In [50]:
points = alt.Chart(df_200).mark_point().encode(
    x='date:T',
    y='length:Q',
    tooltip=['date', 'length']
).properties(width=700, height=300)

lines = alt.Chart(top_ten_changes).mark_rule(color='red').encode(
    x='date:T',
    tooltip=['date']
).properties(width=700, height=300)

points + lines

Rather than just a raw number, perhaps the percentage change in length would be more useful. Once again, Pandas makes this easy to calculate. This calculates the change relative to the previous value – so `(length2 - length1) / length1`. Note that the result is expressed as a fraction, so 0.1 means 10%.

In [51]:
df_200['pct_change_in_length'] = abs(df_200['length'].pct_change())
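
To make the two calculations concrete, here's a minimal sketch based on the 1998 capture that appears in the top ten results. Its length is 1920 and its `change_in_length` is 757. The length of the capture before it isn't shown in this notebook, so the value of 1163 used here is an assumption, inferred as 1920 - 757 (that is, assuming the page grew). That assumption is consistent with the `pct_change_in_length` of 0.650903 reported for this capture.

```python
import pandas as pd

# 1163 is inferred (1920 - 757), not taken directly from the CDX results
lengths = pd.Series([1163, 1920], name='length')

# The first value has no predecessor, so both results start with NaN
change = lengths.diff().abs()
pct_change = lengths.pct_change().abs()

print(change.iloc[1])                # 757.0
print(round(pct_change.iloc[1], 6))  # (1920 - 1163) / 1163
```

Because the first row of each result is `NaN`, the earliest capture in `df_200` can never show up as a major change.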

In [52]:
top_ten_changes_pct = df_200.sort_values(by='pct_change_in_length', ascending=False)[:10]
top_ten_changes_pct

| | urlkey | timestamp | original | mimetype | statuscode | digest | length | date | subdomain | protocol | change_in_length | pct_change_in_length |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 13 | au,gov,nla)/ | 19980205162107 | http://www.nla.gov.au:80/ | text/html | 200 | LIXK3YXSUFO5KPOO22XIMQGPWXKNHV6X | 1920 | 1998-02-05 16:21:07 | www | http | 757.0 | 0.650903 |
| 79 | au,gov,nla)/ | 20011003175018 | http://www.nla.gov.au:80/ | text/html | 200 | BWGDP6NTGVOI2TBA62P7IZ2PPWRLOODN | 3367 | 2001-10-03 17:50:18 | www | http | 1004.0 | 0.424884 |
| 1519 | au,gov,nla)/ | 20160901112433 | http://www.nla.gov.au/ | text/html | 200 | MZD7NTLMH5HBXSFQIHTGTC6IELQDYBN2 | 11541 | 2016-09-01 11:24:33 | www | http | 2738.0 | 0.31103 |
| 1183 | au,gov,nla)/ | 20130211044309 | http://www.nla.gov.au/ | text/html | 200 | QWCVHAK2Y6WXLDNIMTJLZT5RY6YCJ7UN | 8521 | 2013-02-11 04:43:09 | www | http | 1698.0 | 0.248864 |
| 1067 | au,gov,nla)/ | 20110611064218 | http://www.nla.gov.au/ | text/html | 200 | Z7PQG2MVOOQUZ62ASNRAWDFLSYFHATQT | 5601 | 2011-06-11 06:42:18 | www | http | 1739.0 | 0.236921 |
| 2043 | au,gov,nla)/ | 20181212014241 | https://www.nla.gov.au/ | text/html | 200 | C3YSGGHG52WIG6U6X7C3XOL5LOHVIINM | 14813 | 2018-12-12 01:42:41 | www | https | 2831.0 | 0.236271 |
| 786 | au,gov,nla)/ | 20061107083938 | http://www.nla.gov.au:80/ | text/html | 200 | HOW52ARISTA4HTCLFPYNCQWR6NK2N2NF | 5662 | 2006-11-07 08:39:38 | www | http | 1561.0 | 0.216115 |
| 1185 | au,gov,nla)/ | 20130302083331 | http://www.nla.gov.au/ | text/html | 200 | 77Y6PJF3MYUZ4JUSTK4T237RRUASTO7X | 6965 | 2013-03-02 08:33:31 | www | http | 1556.0 | 0.182608 |
| 135 | au,gov,nla)/ | 20031230162952 | http://www.nla.gov.au:80/ | text/html | 200 | F2VL75K4I4ZZDIOZVX4D5W7Y5UXHKLTO | 4394 | 2003-12-30 16:29:52 | www | http | 655.0 | 0.175181 |
| 906 | au,gov,nla)/ | 20090622194559 | http://www.nla.gov.au:80/? | text/html | 200 | X6KRELQBTLUYZT7NWH6JRVJGCBF7YFQB | 6495 | 2009-06-22 19:45:59 | www | http | 925.0 | 0.124663 |


In [53]:
lines = alt.Chart(top_ten_changes_pct).mark_rule(color='red').encode(
    x='date:T',
    tooltip=['date']
).properties(width=700, height=300)

points + lines

By focusing on percentage difference we can see that more prominence is given to the change in 2001. But rather than just the top 10, should we look at changes greater than 10% or some other threshold?

In [54]:
lines = alt.Chart(df_200.loc[df_200['pct_change_in_length'] > 0.1]).mark_rule(color='red').encode(
    x='date:T',
    tooltip=['date']
).properties(width=700, height=300)

points + lines
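
One way to get a feel for the effect of different thresholds is to count how many captures exceed each one. The sketch below does this using only the ten `pct_change_in_length` values shown in the top ten table above, so the numbers are a lower bound: the full `df_200` dataframe may well contain more captures above the lower thresholds.

```python
import pandas as pd

# The ten largest pct_change_in_length values, copied from the table above
top_ten_pct = pd.Series([
    0.650903, 0.424884, 0.31103, 0.248864, 0.236921,
    0.236271, 0.216115, 0.182608, 0.175181, 0.124663,
])

# Count the captures whose change is greater than each threshold
thresholds = [0.1, 0.2, 0.3]
counts = {threshold: int((top_ten_pct > threshold).sum()) for threshold in thresholds}
print(counts)  # {0.1: 10, 0.2: 7, 0.3: 3}
```

To run the same check against all of the captures, replace `top_ten_pct` with `df_200['pct_change_in_length']`.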

## Other possibilities to explore

* Rate of change – what proportion of the snapshots each year are *different*?
* Use similarity measures to identify changes.
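
As a starting point for the first of these, here's a rough sketch of how a yearly rate of change might be calculated. It assumes that captures with the same `digest` have the same content, and it uses made-up dates and digests rather than the real data. A capture is counted as *different* if its digest doesn't match the digest of the capture immediately before it.

```python
import pandas as pd

# Made-up captures for illustration only
captures = pd.DataFrame({
    'date': pd.to_datetime([
        '2018-02-01', '2018-06-01', '2018-10-01',
        '2019-02-01', '2019-06-01', '2019-10-01',
    ]),
    'digest': ['AAA', 'AAA', 'BBB', 'BBB', 'BBB', 'CCC'],
}).sort_values(by='date', ignore_index=True)

# Flag captures whose digest differs from the previous capture's digest
captures['changed'] = captures['digest'] != captures['digest'].shift()

# The first capture has nothing to be compared with, so leave it out
compared = captures.iloc[1:]

# The mean of a boolean column is the proportion of True values
rate_of_change = compared.groupby(compared['date'].dt.year)['changed'].mean()
print(rate_of_change)
```

With the real data, `captures` would be replaced by a dataframe such as `df_200`, which already has `date` and `digest` columns.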

## Comparing individual captures

Once major changes, such as those above, have been identified, we can use some of the other notebooks in this repository to compare individual captures. For example:

* [Compare two versions of an archived web page](show_diffs.ipynb)
* [Create and compare full page screenshots from archived web pages](save_screenshot.ipynb)

----
Created by [Tim Sherratt](https://timsherratt.org) for the [GLAM Workbench](https://glam-workbench.github.io).

Work on this notebook was supported by the [IIPC Discretionary Funding Programme 2019-2020](http://netpreserve.org/projects/)