{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Observing change in a web page over time\n", "\n", "
New to Jupyter notebooks? Try Using Jupyter notebooks for a quick introduction.
\n", "\n", "This notebook explores what we can find when we look at all captures of a single page over time." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Work in progress – this notebook isn't finished yet. Check back later for more...
" ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [], "source": [ "import requests\n", "import pandas as pd\n", "import altair as alt\n", "import re\n", "from difflib import HtmlDiff\n", "from IPython.display import display, HTML\n", "import arrow" ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [], "source": [ "def query_cdx(url, **kwargs):\n", " params = kwargs\n", " params['url'] = url\n", " params['output'] = 'json'\n", " response = requests.get('http://web.archive.org/cdx/search/cdx', params=params, headers={'User-Agent': ''})\n", " response.raise_for_status()\n", " return response.json()" ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [], "source": [ "url = 'http://nla.gov.au'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Getting the data\n", "\n", "In this example we're using the IA CDX API, but this could easily be adapted to use [Timemaps](find_all_captures.ipynb) from a range of repositories. " ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [], "source": [ "data = query_cdx(url)\n", "\n", "# Convert to a dataframe\n", "# The column names are in the first row\n", "df = pd.DataFrame(data[1:], columns=data[0])\n", "\n", "# Convert the timestamp string into a datetime object\n", "df['date'] = pd.to_datetime(df['timestamp'])\n", "df.sort_values(by='date', inplace=True, ignore_index=True)\n", "\n", "# Convert the length from a string into an integer\n", "df['length'] = df['length'].astype('int')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As noted in the notebook [comparing the CDX API with Timemaps](getting_all_snapshots_timemap_vs_cdx.ipynb), there are a number of duplicate snapshots in the CDX results, so let's remove them." 
] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Before: 2840\n", "After: 2740\n" ] } ], "source": [ "print(f'Before: {df.shape[0]}')\n", "df.drop_duplicates(subset=['timestamp', 'original', 'digest', 'statuscode', 'mimetype'], keep='first', inplace=True)\n", "print(f'After: {df.shape[0]}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The basic shape" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Timestamp('1996-10-19 06:42:23')" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['date'].min()" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Timestamp('2020-04-27 07:42:20')" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['date'].max()" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 2740.000000\n", "mean 6497.322263\n", "std 5027.627203\n", "min 296.000000\n", "25% 643.000000\n", "50% 5405.500000\n", "75% 11409.500000\n", "max 15950.000000\n", "Name: length, dtype: float64" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['length'].describe()" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "200 2036\n", "301 273\n", "302 263\n", "- 166\n", "503 2\n", "Name: statuscode, dtype: int64" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['statuscode'].value_counts()" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "text/html 2574\n", "warc/revisit 166\n", "Name: mimetype, dtype: int64" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], 
"source": [ "df['mimetype'].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Plotting snapshots over time" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# This is just a bit of fancy customisation to group the types of errors by color\n", "# See https://altair-viz.github.io/user_guide/customization.html#customizing-colors\n", "domain = ['-', '200', '301', '302', '404', '503']\n", "# green for ok, blue for redirects, red for errors\n", "range_ = ['#888888', '#39a035', '#5ba3cf', '#125ca4', '#e13128', '#b21218']\n", "\n", "alt.Chart(df).mark_point().encode(\n", " x='date:T',\n", " y='length:Q',\n", " color=alt.Color('statuscode', scale=alt.Scale(domain=domain, range=range_)),\n", " tooltip=['date', 'length', 'statuscode']\n", ").properties(width=700, height=300)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Looking at domains, protocols, and redirects\n", "\n", "Looking at the chart above, it's hard to understand why a request for a page is sometimes redirected, and sometimes not. To understand this we have to look a bit closer at what pages are actually being archived. Let's look at the breakdown of values in the `original` column. These are the urls being requested by the archiving bot." ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "http://www.nla.gov.au:80/ 863\n", "http://www.nla.gov.au/ 728\n", "https://www.nla.gov.au/ 590\n", "http://nla.gov.au/ 421\n", "http://nla.gov.au:80/ 74\n", "http://www.nla.gov.au// 17\n", "https://nla.gov.au/ 14\n", "http://www.nla.gov.au 11\n", "http://www2.nla.gov.au:80/ 10\n", "http://Trove@nla.gov.au/ 6\n", "http://www.nla.gov.au:80/? 
2\n", "http://www.nla.gov.au:80// 1\n", "http://www.nla.gov.au./ 1\n", "http://mailto:development@nla.gov.au/ 1\n", "http://mailto:www@nla.gov.au/ 1\n", "Name: original, dtype: int64" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['original'].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ah ok, so there's actually a mix of things in here – some include the 'www' prefix and some don't, some use the 'https' protocol and some just plain old 'http'. There's also a bit of junk in there from badly parsed `mailto` links. To look at the differences in more detail, let's create new columns for `subdomain` and `protocol`." ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "www 2213\n", " 517\n", "www2 10\n", "Name: subdomain, dtype: int64" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "base_domain = re.search(r'https*:\\/\\/(\\w*)\\.', url).group(1)\n", "df['subdomain'] = df['original'].str.extract(r'^https*:\\/\\/(\\w*)\\.{}\\.'.format(base_domain), flags=re.IGNORECASE)\n", "df['subdomain'].fillna('', inplace=True)\n", "df['subdomain'].value_counts()" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "http 2136\n", "https 604\n", "Name: protocol, dtype: int64" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['protocol'] = df['original'].str.extract(r'^(https*):')\n", "df['protocol'].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Change in protocol\n", "\n", "Let's look to see how the proportion of requests using each of the protocols changes over time. Here we're grouping the rows by year." 
] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(df).mark_bar().encode(\n", " x='year(date):T',\n", " y=alt.Y('count()',stack=\"normalize\"),\n", " color='protocol:N',\n", " #tooltip=['date', 'length', 'subdomain:N']\n", ").properties(width=700, height=200)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "No real surprise there given the increased use of https generally." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Status codes by subdomain\n", "\n", "Let's now compare the proportion of status codes between the bare `nla.gov.au` domain and the `www` subdomain." ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(df.loc[(df['statuscode'] != '-') & (df['subdomain'] != 'www2')]).mark_bar().encode(\n", " x='year(date):T',\n", " y=alt.Y('count()',stack=\"normalize\"),\n", " color=alt.Color('statuscode', scale=alt.Scale(domain=domain, range=range_)),\n", " row='subdomain',\n", " tooltip=['year(date):T', 'statuscode']\n", ").properties(width=700, height=100)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I think we can start to see what's going on. Around about 2004, requests to `nla.gov.au` started to be redirected to `www.nla.gov.au` giving a [302](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/302) response, indicating that the page had been moved temporarily. But why the growth in [301](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/301) (moved permanently) responses from both domains after 2018? 
If we look at the chart above showing the increased use of the `https` protocol, I think we could guess that `http` requests in both domains are being redirected to `https`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Status codes by protocol\n", "\n", "Let's test that hypothesis by looking at the distribution of status codes by protocol." ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(df.loc[(df['statuscode'] != '-') & (df['subdomain'] != 'www2')]).mark_bar().encode(\n", " x='year(date):T',\n", " y=alt.Y('count()',stack=\"normalize\"),\n", " color=alt.Color('statuscode', scale=alt.Scale(domain=domain, range=range_)),\n", " row='protocol',\n", " tooltip=['year(date):T', 'protocol', 'statuscode']\n", ").properties(width=700, height=100)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that by 2019, all requests using `http` are being redirected to `https`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Looking for major changes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we understand what's going on with the different domains and status codes, I think we can focus on just the 'www' domain and the '200' responses." 
] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Keep 200 responses from the www subdomain; the length filter also drops very small captures\n", "df_200 = df.loc[(df['statuscode'] == '200') & (df['subdomain'] == 'www') & (df['length'] > 1000)].copy()\n", "\n", "alt.Chart(df_200).mark_point().encode(\n", " x='date:T',\n", " y='length:Q',\n", " tooltip=['date', 'length']\n", ").properties(width=700, height=300)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pandas makes it easy to calculate the difference between two adjacent values, so let's find the absolute difference in length between each capture and the one before it." ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [], "source": [ "df_200['change_in_length'] = abs(df_200['length'].diff())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can look at the captures that varied most in length from their predecessor." ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n",
"|  | urlkey | timestamp | original | mimetype | statuscode | digest | length | date | subdomain | protocol | change_in_length |\n",
"|---|---|---|---|---|---|---|---|---|---|---|---|\n",
"| 2043 | au,gov,nla)/ | 20181212014241 | https://www.nla.gov.au/ | text/html | 200 | C3YSGGHG52WIG6U6X7C3XOL5LOHVIINM | 14813 | 2018-12-12 01:42:41 | www | https | 2831.0 |\n",
"| 1519 | au,gov,nla)/ | 20160901112433 | http://www.nla.gov.au/ | text/html | 200 | MZD7NTLMH5HBXSFQIHTGTC6IELQDYBN2 | 11541 | 2016-09-01 11:24:33 | www | http | 2738.0 |\n",
"| 1067 | au,gov,nla)/ | 20110611064218 | http://www.nla.gov.au/ | text/html | 200 | Z7PQG2MVOOQUZ62ASNRAWDFLSYFHATQT | 5601 | 2011-06-11 06:42:18 | www | http | 1739.0 |\n",
"| 1183 | au,gov,nla)/ | 20130211044309 | http://www.nla.gov.au/ | text/html | 200 | QWCVHAK2Y6WXLDNIMTJLZT5RY6YCJ7UN | 8521 | 2013-02-11 04:43:09 | www | http | 1698.0 |\n",
"| 786 | au,gov,nla)/ | 20061107083938 | http://www.nla.gov.au:80/ | text/html | 200 | HOW52ARISTA4HTCLFPYNCQWR6NK2N2NF | 5662 | 2006-11-07 08:39:38 | www | http | 1561.0 |\n",
"| 1185 | au,gov,nla)/ | 20130302083331 | http://www.nla.gov.au/ | text/html | 200 | 77Y6PJF3MYUZ4JUSTK4T237RRUASTO7X | 6965 | 2013-03-02 08:33:31 | www | http | 1556.0 |\n",
"| 79 | au,gov,nla)/ | 20011003175018 | http://www.nla.gov.au:80/ | text/html | 200 | BWGDP6NTGVOI2TBA62P7IZ2PPWRLOODN | 3367 | 2001-10-03 17:50:18 | www | http | 1004.0 |\n",
"| 906 | au,gov,nla)/ | 20090622194559 | http://www.nla.gov.au:80/? | text/html | 200 | X6KRELQBTLUYZT7NWH6JRVJGCBF7YFQB | 6495 | 2009-06-22 19:45:59 | www | http | 925.0 |\n",
"| 2131 | au,gov,nla)/ | 20190319065001 | https://www.nla.gov.au/ | text/html | 200 | ZGSCTK3IMTBSAJ7PAQUWOH7GATGT5MB4 | 14478 | 2019-03-19 06:50:01 | www | https | 854.0 |\n",
"| 13 | au,gov,nla)/ | 19980205162107 | http://www.nla.gov.au:80/ | text/html | 200 | LIXK3YXSUFO5KPOO22XIMQGPWXKNHV6X | 1920 | 1998-02-05 16:21:07 | www | http | 757.0 |\n",
"
\n",
"|  | urlkey | timestamp | original | mimetype | statuscode | digest | length | date | subdomain | protocol | change_in_length | pct_change_in_length |\n",
"|---|---|---|---|---|---|---|---|---|---|---|---|---|\n",
"| 13 | au,gov,nla)/ | 19980205162107 | http://www.nla.gov.au:80/ | text/html | 200 | LIXK3YXSUFO5KPOO22XIMQGPWXKNHV6X | 1920 | 1998-02-05 16:21:07 | www | http | 757.0 | 0.650903 |\n",
"| 79 | au,gov,nla)/ | 20011003175018 | http://www.nla.gov.au:80/ | text/html | 200 | BWGDP6NTGVOI2TBA62P7IZ2PPWRLOODN | 3367 | 2001-10-03 17:50:18 | www | http | 1004.0 | 0.424884 |\n",
"| 1519 | au,gov,nla)/ | 20160901112433 | http://www.nla.gov.au/ | text/html | 200 | MZD7NTLMH5HBXSFQIHTGTC6IELQDYBN2 | 11541 | 2016-09-01 11:24:33 | www | http | 2738.0 | 0.311030 |\n",
"| 1183 | au,gov,nla)/ | 20130211044309 | http://www.nla.gov.au/ | text/html | 200 | QWCVHAK2Y6WXLDNIMTJLZT5RY6YCJ7UN | 8521 | 2013-02-11 04:43:09 | www | http | 1698.0 | 0.248864 |\n",
"| 1067 | au,gov,nla)/ | 20110611064218 | http://www.nla.gov.au/ | text/html | 200 | Z7PQG2MVOOQUZ62ASNRAWDFLSYFHATQT | 5601 | 2011-06-11 06:42:18 | www | http | 1739.0 | 0.236921 |\n",
"| 2043 | au,gov,nla)/ | 20181212014241 | https://www.nla.gov.au/ | text/html | 200 | C3YSGGHG52WIG6U6X7C3XOL5LOHVIINM | 14813 | 2018-12-12 01:42:41 | www | https | 2831.0 | 0.236271 |\n",
"| 786 | au,gov,nla)/ | 20061107083938 | http://www.nla.gov.au:80/ | text/html | 200 | HOW52ARISTA4HTCLFPYNCQWR6NK2N2NF | 5662 | 2006-11-07 08:39:38 | www | http | 1561.0 | 0.216115 |\n",
"| 1185 | au,gov,nla)/ | 20130302083331 | http://www.nla.gov.au/ | text/html | 200 | 77Y6PJF3MYUZ4JUSTK4T237RRUASTO7X | 6965 | 2013-03-02 08:33:31 | www | http | 1556.0 | 0.182608 |\n",
"| 135 | au,gov,nla)/ | 20031230162952 | http://www.nla.gov.au:80/ | text/html | 200 | F2VL75K4I4ZZDIOZVX4D5W7Y5UXHKLTO | 4394 | 2003-12-30 16:29:52 | www | http | 655.0 | 0.175181 |\n",
"| 906 | au,gov,nla)/ | 20090622194559 | http://www.nla.gov.au:80/? | text/html | 200 | X6KRELQBTLUYZT7NWH6JRVJGCBF7YFQB | 6495 | 2009-06-22 19:45:59 | www | http | 925.0 | 0.124663 |\n",
"