{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [
"# Using screenshots to visualise change in a page over time\n",
"\n",
"**TIMEMAPS VERSION**\n",
"\n",
"![Screenshots showing changes to the ABC Australia home page over time](images/abc-net-au.png)\n",
"\n",
"New to Jupyter notebooks? Try Using Jupyter notebooks for a quick introduction.\n",
"\n",
"This notebook helps you visualise changes in a web page by generating full page screenshots for each year from the captures available in an archive. You can then combine the individual screenshots into a single composite image.\n",
"\n",
"See [TimeMap Visualization](http://tmvis.ws-dl.cs.odu.edu/) from the Web Science and Digital Libraries Research Group at Old Dominion University for another way of using web archive thumbnails to explore change over time."
] }, { "cell_type": "markdown", "metadata": {}, "source": [
"Creating all the screenshots can be quite slow, and sometimes the captures themselves are incomplete, so I've divided the work into a number of stages:\n",
"\n",
"1. Create the individual screenshots\n",
"2. Review the screenshot results and generate more screenshots if needed\n",
"3. Create the composite image\n",
"\n",
"More details are below.\n",
"\n",
"If you want to create an individual screenshot, or compare a series of selected screenshots, see the [Get full page screenshots from archived web pages](save_screenshot.ipynb) notebook."
] }, { "cell_type": "markdown", "metadata": {}, "source": [
"## Import what we need"
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"import base64\n",
"import io\n",
"import json\n",
"import math\n",
"import re\n",
"import time\n",
"from pathlib import Path\n",
"from urllib.parse import urlparse\n",
"\n",
"import geckodriver_autoinstaller\n",
"import pandas as pd\n",
"import PIL\n",
"import requests\n",
"import selenium\n",
"from PIL import Image, ImageDraw, ImageFont\n",
"from selenium import webdriver\n",
"from selenium.webdriver.common.by import By\n",
"from slugify import slugify\n",
"\n",
"geckodriver_autoinstaller.install()\n",
"\n",
"# See https://github.com/ouseful-template-repos/binder-selenium-demoscraper"
] }, { "cell_type": "markdown", "metadata": {}, "source": [
"## Define some functions"
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"TIMEGATES = {\n",
"    \"nla\": \"https://web.archive.org.au/awa/\",\n",
"    \"nlnz\": \"https://ndhadeliver.natlib.govt.nz/webarchive/\",\n",
"    \"bl\": \"https://www.webarchive.org.uk/wayback/\",\n",
"    \"ia\": \"https://web.archive.org/web/\",\n",
"    \"is\": \"http://wayback.vefsafn.is/wayback/\",\n",
"    \"ukgwa\": \"https://webarchive.nationalarchives.gov.uk/ukgwa/\",\n",
"}\n",
"\n",
"wayback = [\"web.archive.org\", \"wayback.vefsafn.is\"]\n",
"pywb = {\n",
"    \"web.archive.org.au\": \"replayFrame\",\n",
"    \"webarchive.nla.gov.au\": \"replayFrame\",\n",
"    \"webarchive.org.uk\": \"replay_iframe\",\n",
"    \"ndhadeliver.natlib.govt.nz\": \"replayFrame\",\n",
"    \"webarchive.nationalarchives.gov.uk\": \"replay_iframe\",\n",
"}\n",
"\n",
"html_output = []\n",
"\n",
"\n",
"def convert_lists_to_dicts(results):\n",
"    \"\"\"\n",
"    Converts IA style timemap (a JSON array of arrays) to a list of dictionaries.\n",
"    Renames keys to standardise IA with other Timemaps.\n",
"    \"\"\"\n",
"    if results:\n",
"        keys = results[0]\n",
"        results_as_dicts = [dict(zip(keys, v)) for v in results[1:]]\n",
"    else:\n",
"        results_as_dicts = results\n",
"    # Rename keys\n",
"    for d in results_as_dicts:\n",
"        d[\"status\"] = d.pop(\"statuscode\")\n",
"        d[\"mime\"] = d.pop(\"mimetype\")\n",
"        d[\"url\"] = d.pop(\"original\")\n",
"    return results_as_dicts\n",
"\n",
"\n",
"def get_capture_data_from_memento(url, request_type=\"head\"):\n",
"    \"\"\"\n",
"    For OpenWayback systems this can get some extra capture info to insert in Timemaps.\n",
"    \"\"\"\n",
"    if request_type == \"head\":\n",
"        response = requests.head(url)\n",
"    else:\n",
"        response = requests.get(url)\n",
"    headers = response.headers\n",
"    length = headers.get(\"x-archive-orig-content-length\")\n",
"    status = headers.get(\"x-archive-orig-status\")\n",
"    status = status.split(\" \")[0] if status else None\n",
"    mime = headers.get(\"x-archive-orig-content-type\")\n",
"    mime = mime.split(\";\")[0] if mime else None\n",
"    return {\"length\": length, \"status\": status, \"mime\": mime}\n",
"\n",
"\n",
"def convert_link_to_json(results, enrich_data=False):\n",
"    \"\"\"\n",
"    Converts link formatted Timemap to JSON.\n",
"    \"\"\"\n",
"    data = []\n",
"    for line in results.splitlines():\n",
"        parts = line.split(\"; \")\n",
"        if len(parts) > 1:\n",
"            link_type = re.search(\n",
"                r'rel=\"(original|self|timegate|first memento|last memento|memento)\"',\n",
"                parts[1],\n",
"            ).group(1)\n",
"            if link_type == \"memento\":\n",
"                link = parts[0].strip(\"<>\")\n",
"                timestamp, original = re.search(r\"/(\\d{12}|\\d{14})/(.*)$\", link).groups()\n",
"                capture = {\"timestamp\": timestamp, \"url\": original}\n",
"                if enrich_data:\n",
"                    capture.update(get_capture_data_from_memento(link))\n",
"                data.append(capture)\n",
"    return data\n",
"\n",
"\n",
"def get_timemap_as_json(timegate, url):\n",
"    \"\"\"\n",
"    Get a Timemap then normalise results (if necessary) to return a list of dicts.\n",
"    \"\"\"\n",
"    tg_url = f\"{TIMEGATES[timegate]}timemap/json/{url}/\"\n",
"    print(tg_url)\n",
"    response = requests.get(tg_url)\n",
"    response_type = response.headers[\"content-type\"].split(\";\")[0]\n",
"    # pywb style Timemap\n",
"    if response_type == \"text/x-ndjson\":\n",
"        data = [json.loads(line) for line in response.text.splitlines()]\n",
"    # IA Wayback style Timemap\n",
"    elif response_type == \"application/json\":\n",
"        data = convert_lists_to_dicts(response.json())\n",
"    # Link style Timemap (OpenWayback)\n",
"    elif response_type in [\"application/link-format\", \"text/html\"]:\n",
"        data = convert_link_to_json(response.text)\n",
"    # Anything else can't be parsed, so fail with a clear message\n",
"    else:\n",
"        raise ValueError(f\"Unexpected Timemap content type: {response_type}\")\n",
"    return data\n",
"\n",
"\n",
"def get_full_page_screenshot(url, save_width=200):\n",
"    \"\"\"\n",
"    Gets a full page screenshot of the supplied url.\n",
"    By default resizes the screenshot to a maximum width of 200px.\n",
"    Provide a 'save_width' value to change this.\n",
"\n",
"    NOTE the webdriver sometimes fails for unknown reasons. Just try again.\n",
"    \"\"\"\n",
"    global html_output\n",
"    domain = urlparse(url)[1].replace(\"www.\", \"\")\n",
"    # NZ and IA inject content into the page, so we use if_ to get the original page (with rewritten urls)\n",
"    if domain in wayback and \"if_\" not in url:\n",
"        url = re.sub(r\"/(\\d{12}|\\d{14})/http\", r\"/\\1if_/http\", url)\n",
"    print(url)\n",
"    date_str, site = re.search(r\"/(\\d{14}|\\d{12})(?:if_|mp_)*/https*://?(.+/)\", url).groups()\n",
"    ss_dir = Path(\"screenshots\", slugify(site))\n",
"    ss_dir.mkdir(parents=True, exist_ok=True)\n",
"    ss_file = Path(ss_dir, f\"{slugify(site)}-{date_str}-{save_width}.png\")\n",
"    if not ss_file.exists():\n",
"        options = webdriver.FirefoxOptions()\n",
"        # Run Firefox headless; the 'headless' property is deprecated in recent Selenium 4 releases\n",
"        options.add_argument(\"-headless\")\n",
"        driver = webdriver.Firefox(options=options)\n",
"        driver.implicitly_wait(15)\n",
"        driver.get(url)\n",
"        # Give some time for everything to load\n",
"        time.sleep(30)\n",
"        driver.maximize_window()\n",
"        # UK and AU use pywb in framed replay mode, so we need to switch to the framed content\n",
"        if domain in pywb:\n",
"            try:\n",
"                driver.switch_to.frame(pywb[domain])\n",
"            except selenium.common.exceptions.NoSuchFrameException:\n",
"                # If we pass here we'll probably still get a ss, just not full page -- better than failing?\n",
"                pass\n",
"        ss = None\n",
"        for tag in [\"body\", \"html\", \"frameset\"]:\n",
"            try:\n",
"                elem = driver.find_element(By.TAG_NAME, tag)\n",
"                ss = elem.screenshot_as_base64\n",
"                break\n",
"            except (\n",
"                selenium.common.exceptions.NoSuchElementException,\n",
"                selenium.common.exceptions.WebDriverException,\n",
"            ):\n",
"                pass\n",
"        driver.quit()\n",
"        if not ss:\n",
"            print(f\"Couldn't get a screenshot of {url} – sorry...\")\n",
"        else:\n",
"            img = Image.open(io.BytesIO(base64.b64decode(ss)))\n",
"            ratio = save_width / img.width\n",
"            (width, height) = (save_width, math.ceil(img.height * ratio))\n",
"            resized_img = img.resize((width, height), PIL.Image.Resampling.LANCZOS)\n",
"            resized_img.save(ss_file)\n",
"            return ss_file\n",
"    else:\n",
"        return ss_file\n",
"\n",
"\n",
"def get_screenshots(timegate, url, num=1):\n",
"    \"\"\"\n",
"    Generate up to the specified number of screenshots for each year.\n",
"    Queries Timemap for snapshots of the given url from the specified repository,\n",
"    then gets the first 'num' timestamps for each year.\n",
"    \"\"\"\n",
"    data = get_timemap_as_json(timegate, url)\n",
"    df = pd.DataFrame(data)\n",
"    # Convert the timestamp string into a datetime object\n",
"    df[\"date\"] = pd.to_datetime(df[\"timestamp\"])\n",
"    # Sort by date\n",
"    df.sort_values(by=[\"date\"], inplace=True)\n",
"    # Only keep the first instance of each digest if it exists\n",
"    # OpenWayback systems won't return digests\n",
"    try:\n",
"        df.drop_duplicates(subset=[\"digest\"], inplace=True)\n",
"    except KeyError:\n",
"        pass\n",
"    # Extract year from date\n",
"    df[\"year\"] = df[\"date\"].dt.year\n",
"    # Get the first 'num' instances from each year\n",
"    # (you only need one, but if there are failures, you might want a backup)\n",
"    df_years = df.groupby(\"year\", as_index=False).head(num)\n",
"    timestamps = df_years[\"timestamp\"].to_list()\n",
"    for timestamp in timestamps:\n",
"        capture_url = f\"{TIMEGATES[timegate]}{timestamp}/{url}\"\n",
"        capture_url = (\n",
"            f\"{capture_url}/\" if not capture_url.endswith(\"/\") else capture_url\n",
"        )\n",
"        print(f\"Generating screenshot for: {capture_url}...\")\n",
"        get_full_page_screenshot(capture_url)\n",
"\n",
"\n",
"def make_composite(url):\n",
"    \"\"\"\n",
"    Combine single screenshots into a composite image.\n",
"    Loops through images in a directory with the given (slugified) domain.\n",
"    \"\"\"\n",
"    max_height = 0\n",
"    url_path = re.sub(r\"^https*://\", \"\", url)\n",
"    pngs = sorted(Path(\"screenshots\", slugify(url_path)).glob(\"*.png\"))\n",
"    for png in pngs:\n",
"        img = Image.open(png)\n",
"        if img.height > max_height:\n",
"            max_height = img.height\n",
"    height = max_height + 110\n",
"    width = (len(pngs) * 200) + (len(pngs) * 10) + 10\n",
"    comp = Image.new(\"RGB\", (width, height), (90, 90, 90))\n",
"    # Canvas to write in the dates\n",
"    draw = ImageDraw.Draw(comp)\n",
"    # Change this to suit your system\n",
"    # font = ImageFont.truetype(\"/Library/Fonts/Microsoft/Gill Sans MT Bold.ttf\", 36)\n",
"    # Something like this should work on Binder?\n",
"    font = ImageFont.truetype(\n",
"        \"/usr/share/fonts/truetype/liberation/LiberationSansNarrow-Regular.ttf\", 36\n",
"    )\n",
"    draw.text((10, height - 50), url, (255, 255, 255), font=font)\n",
"    for i, png in enumerate(pngs):\n",
"        year = re.search(r\"-(\\d{4})(\\d{10}|\\d{8}).*?\\.png\", png.name).group(1)\n",
"        draw.text(((i * 210) + 10, 10), year, (255, 255, 255), font=font)\n",
"        img = Image.open(png)\n",
"        comp.paste(img, ((i * 210) + 10, 50))\n",
"    comp.save(Path(\"screenshots\", f\"{slugify(url_path)}.png\"))"
] }, { "cell_type": "markdown", "metadata": {}, "source": [
"## Create screenshots\n",
"\n",
"Edit the cell below to change the `timegate` and `url` values as desired. Timegate values can be one of:\n",
"\n",
"* `bl` – UK Web Archive\n",
"* `ukgwa` – UK Government Web Archive\n",
"* `nla` – National Library of Australia\n",
"* `nlnz` – National Library of New Zealand\n",
"* `ia` – Internet Archive\n",
"* `is` – Icelandic Web Archive\n",
"\n",
"The `num` parameter is the maximum number of screenshots to create for each year. You only need one screenshot per year for the composite image, but getting two allows you to select the best capture before generating the composite.\n",
"\n",
"Screenshots will be saved in a sub-directory of the [screenshots](screenshots) directory. The name of the sub-directory will be a slugified version of the url. Screenshots are named with the `url`, `timestamp` and width (which is set at 200 in this notebook). Existing screenshots will not be overwritten, so if the process fails you can easily pick up where it left off by just running the cell again."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"# Edit timegate and url values\n",
"get_screenshots(timegate=\"ukgwa\", url=\"http://mod.uk/\", num=1)"
] }, { "cell_type": "markdown", "metadata": {}, "source": [
"## Review Screenshots\n",
"\n",
"Have a look at the results in the [screenshots](screenshots) directory. If you've created more than one per year, choose the best one and delete the others. If you're not satisfied with a particular capture, you can browse the web interface of the archive until you find the capture you're after. Then just copy the url of the capture (it should have a timestamp in it) and feed it directly to the `get_full_page_screenshot()` function as indicated in the cell below. Once you're happy with your collection of screenshots, move on to the next step."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"# Use this to add individual captures to fill gaps, or improve screenshots\n",
"# Just paste in a url from the archive's web interface\n",
"get_full_page_screenshot(\n",
"    \"http://wayback.vefsafn.is/wayback/20200107111708/https://www.government.is/\"\n",
")"
] }, { "cell_type": "markdown", "metadata": {}, "source": [
"## Create composite\n",
"\n",
"Edit the cell below to include the url that you've generated the screenshots from. The script will grab all of the screenshots in the corresponding directory and build the composite image. The composite will be saved in the `screenshots` directory using a slugified version of the url as a name.\n",
"\n",
"If you're running this on your own machine, you'll probably need to edit the function to set the font location where indicated."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"# Edit url as required\n",
"make_composite(\"http://digitalnz.org/\")"
] }, { "cell_type": "markdown", "metadata": {}, "source": [
"----\n",
"Created by [Tim Sherratt](https://timsherratt.org) for the [GLAM Workbench](https://glam-workbench.github.io). Support me by becoming a [GitHub sponsor](https://github.com/sponsors/wragge)!\n",
"\n",
"Work on this notebook was supported by the [IIPC Discretionary Funding Programme 2019-2020](http://netpreserve.org/projects/).\n",
"\n",
"The Web Archives section of the GLAM Workbench is sponsored by the [British Library](https://www.bl.uk/)."
] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 4 }