{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Find when a piece of text appears in an archived web page\n", "\n", "[View in GitHub](https://github.com/GLAM-Workbench/web-archives/blob/master/find-text-in-page-from-timemap.ipynb) · [View in GLAM Workbench](https://glam-workbench.net/web-archives/#find-when-a-piece-of-text-appears-in-an-archived-web-page)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# This notebook is designed to run in Voila as an app (with the code hidden).\n", "# To launch this notebook in Voila, just select 'View > Open with Voila in New Browser Tab'\n", "# Your browser might ask for permission to open the new tab as a popup." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook helps you find when a particular piece of text appears in, or disappears from, a web page. Using Memento Timemaps, it gets a list of available captures from the selected web archive. It then searches each capture for the desired text, displaying the results.\n", "\n", "You can select the direction in which the notebook searches:\n", "\n", "* **First occurrence** – find the first capture in which the text appears (start from the first capture and come forward in time)\n", "* **Last occurrence** – find the last capture in which the text appears (start from present and go backwards in time)\n", "* **All occurrences** – find all matches (start from the first capture and continue until the last)\n", "\n", "If you select 'All occurrences' the notebook will generate a simple chart showing how the number of matches changes over time.\n", "\n", "By default, the notebook displays possible or 'fuzzy' matches as well as exact matches, but these are not counted in the totals.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import json\n", "import os\n", "import re\n", "\n", "import altair as alt\n", "import arrow\n", "import ipywidgets as widgets\n", "import pandas as pd\n", "import requests\n", "from bs4 import BeautifulSoup\n", "from fuzzysearch import find_near_matches\n", "from IPython.display import HTML, display\n", "\n", "# This is to restyle the standard html table output from difflib\n", "HTML(\n", " \"\"\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Default list of repositories -- you could add to this\n", "TIMEGATES = {\n", " \"nla\": \"https://web.archive.org.au/awa/\",\n", " \"nlnz\": \"https://ndhadeliver.natlib.govt.nz/webarchive/\",\n", " \"bl\": \"https://www.webarchive.org.uk/wayback/archive/\",\n", " \"ia\": \"https://web.archive.org/web/\",\n", "}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def get_html(url):\n", " \"\"\"\n", " Get html from a capture url.\n", " \"\"\"\n", " response = requests.get(url)\n", " # Sometimes the Mementos don't go to captures?!\n", " # Eg https://web.archive.org.au/awa/20090912180610id_/http://www.discontents.com.au/\n", " try:\n", " re.search(r\"/(\\d{14})id_/\", response.url).group(1)\n", " except AttributeError:\n", " return None\n", " return {\"url\": response.url, \"html\": response.content}\n", "\n", "\n", "def format_date(url):\n", " \"\"\"\n", " Extract timestamp from url and format in a human readable way.\n", " \"\"\"\n", " timestamp = re.search(r\"/(\\d{14})id_/\", url).group(1)\n", " return arrow.get(timestamp, \"YYYYMMDDHHmmss\").format(\"D MMMM YYYY\")\n", "\n", "\n", "def format_date_as_iso(url):\n", " \"\"\"\n", " Extract timestamp from url and format as ISO.\n", " \"\"\"\n", " timestamp = re.search(r\"/(\\d{14})id_/\", url).group(1)\n", " return arrow.get(timestamp, \"YYYYMMDDHHmmss\").format(\"YYYY-MM-DD\")\n", "\n", "\n", "def convert_lists_to_dicts(results):\n", " \"\"\"\n", " Converts IA style timemap (a JSON array of arrays) to a list of dictionaries.\n", " Renames keys to standardise IA with other Timemaps.\n", " \"\"\"\n", " if results:\n", " keys = results[0]\n", " results_as_dicts = [dict(zip(keys, v)) for v in results[1:]]\n", " else:\n", " results_as_dicts = results\n", " # Rename keys\n", " for d in results_as_dicts:\n", " d[\"status\"] = d.pop(\"statuscode\")\n", " d[\"mime\"] = d.pop(\"mimetype\")\n", " d[\"url\"] = d.pop(\"original\")\n", " return results_as_dicts\n", "\n", "\n", "def get_capture_data_from_memento(url, request_type=\"head\"):\n", " \"\"\"\n", " For OpenWayback systems this can get some extra cpature info to insert in Timemaps.\n", " \"\"\"\n", " if request_type == \"head\":\n", " response = requests.head(url)\n", " else:\n", " response = requests.get(url)\n", " headers = response.headers\n", " length = headers.get(\"x-archive-orig-content-length\")\n", " status = headers.get(\"x-archive-orig-status\")\n", " status = status.split(\" \")[0] if status else None\n", " mime = headers.get(\"x-archive-orig-content-type\")\n", " mime = mime.split(\";\")[0] if mime else None\n", " return {\"length\": length, \"status\": status, \"mime\": mime}\n", "\n", "\n", "def convert_link_to_json(results, enrich_data=False):\n", " \"\"\"\n", " Converts link formatted Timemap to JSON.\n", " \"\"\"\n", " data = []\n", " for line in results.splitlines():\n", " parts = line.split(\"; \")\n", " if len(parts) > 1:\n", " link_type = re.search(\n", " r'rel=\"(original|self|timegate|first memento|last memento|memento)\"',\n", " parts[1],\n", " ).group(1)\n", " if link_type == \"memento\":\n", " link = parts[0].strip(\"<>\")\n", " timestamp, original = re.search(r\"/(\\d{14})/(.*)$\", link).groups()\n", " capture = {\"timestamp\": timestamp, \"url\": original}\n", " if enrich_data:\n", " capture.update(get_capture_data_from_memento(link))\n", " data.append(capture)\n", " return data\n", "\n", "\n", "def get_timemap_as_json(timegate, url):\n", " \"\"\"\n", " Get a Timemap then normalise results (if necessary) to return a list of dicts.\n", " \"\"\"\n", " tg_url = f\"{TIMEGATES[timegate]}timemap/json/{url}/\"\n", " response = requests.get(tg_url)\n", " response_type = response.headers[\"content-type\"]\n", " # pywb style Timemap\n", " if response_type == \"text/x-ndjson\":\n", " data = [json.loads(line) for line in response.text.splitlines()]\n", " # IA Wayback stype Timemap\n", " elif response_type == \"application/json\":\n", " data = convert_lists_to_dicts(response.json())\n", " # Link style Timemap (OpenWayback)\n", " elif response_type in [\"application/link-format\", \"text/html;charset=utf-8\"]:\n", " data = convert_link_to_json(response.text)\n", " return data\n", "\n", "\n", "def display_chart(matches):\n", " \"\"\"\n", " Visualise matches over time.\n", " \"\"\"\n", " df = pd.DataFrame(matches)\n", " chart = (\n", " alt.Chart(df)\n", " .mark_line(point=True)\n", " .encode(x=\"date:T\", y=\"matches:Q\", tooltip=[\"date:T\", \"matches:Q\"])\n", " )\n", " with chart_display:\n", " display(chart)\n", "\n", "\n", "def process_text(html):\n", " \"\"\"\n", " Extract text from an HTML page and return it as a list of lines.\n", " Removes blank lines.\n", " \"\"\"\n", " lines = [\n", " line\n", " for line in BeautifulSoup(html).get_text().splitlines()\n", " if not re.match(r\"^\\s*$\", line)\n", " ]\n", " return lines\n", "\n", "\n", "def format_date_link(url):\n", " \"\"\"\n", " Extract date from url, format, and display as link.\n", " \"\"\"\n", " date = format_date(url)\n", " return f'{date}'\n", "\n", "\n", "def format_context(text, match):\n", " \"\"\"\n", " Extract, markup, and format context around a match.\n", " \"\"\"\n", " style = \"p-match\" if match.dist > 0 else \"x-match\"\n", " marked_up = f'{text[:match.start]}{text[match.start:match.end]}{text[match.end:]}'\n", " result_string = marked_up[max(0, match.start - 40) : match.end + 40 + 22 + 7]\n", " result_string = result_string[\n", " result_string.index(\" \") : result_string.rindex(\" \")\n", " ].strip()\n", " return f\"...{result_string}...\"\n", "\n", "\n", "def search_page(capture_data, pattern):\n", " \"\"\"\n", " Search for a text string in the html of a page.\n", " \"\"\"\n", " found = 0\n", " text = BeautifulSoup(capture_data[\"html\"]).get_text()\n", " date = format_date_link(capture_data[\"url\"])\n", " matches = find_near_matches(pattern.casefold(), text.casefold(), max_l_dist=1)\n", " if matches:\n", " results = f'

{date}

\"\n", " with out:\n", " display(HTML(results))\n", " return found\n", "\n", "\n", "def update_status(i, total_matches):\n", " \"\"\"\n", " Display numbers of documents processed and matches found.\n", " \"\"\"\n", " with status:\n", " status.clear_output(wait=True)\n", " display(HTML(f\"Captures processed: {i + 1}\"))\n", " display(HTML(f\"Exact matches found: {total_matches}\"))\n", "\n", "\n", "def find_text(timegate, url, pattern, direction):\n", " \"\"\"\n", " Get all captures for a page from a Timemap, then search for requested text in each page,\n", " aggregating the results.\n", " \"\"\"\n", " total_matches = 0\n", " matches = []\n", " with out:\n", " key = 'Key'\n", " display(HTML(key))\n", " timemap = get_timemap_as_json(timegate, url)\n", " if direction == \"last\":\n", " timemap.reverse()\n", " for i, capture in enumerate(timemap):\n", " capture_url = f'{TIMEGATES[timegate]}{capture[\"timestamp\"]}id_/{capture[\"url\"]}'\n", " if timegate == \"nlnz\" or (\n", " capture[\"digest\"] != timemap[i - 1][\"digest\"] and capture[\"status\"] == \"200\"\n", " ):\n", " capture_data = get_html(capture_url)\n", " if capture_data:\n", " found = search_page(capture_data, pattern)\n", " total_matches += found\n", " if found > 0:\n", " matches.append(\n", " {\"date\": format_date_as_iso(capture_url), \"matches\": found}\n", " )\n", " if direction in [\"first\", \"last\"]:\n", " break\n", " update_status(i, total_matches)\n", " if direction in [\"first\", \"last\"]:\n", " update_status(i, total_matches)\n", " else:\n", " display_chart(matches)\n", "\n", "\n", "def start(e):\n", " clear(\"e\")\n", " find_text(\n", " repository.value, target_url.value, search_string.value, search_direction.value\n", " )\n", "\n", "\n", "def clear(e):\n", " status.clear_output()\n", " chart_display.clear_output()\n", " out.clear_output()\n", "\n", "\n", "out = widgets.Output()\n", "status = widgets.Output()\n", "chart_display = widgets.Output()\n", "\n", "repository = widgets.Dropdown(\n", " options=[\n", " (\"---\", \"\"),\n", " (\"UK Web Archive\", \"bl\"),\n", " (\"National Library of Australia\", \"nla\"),\n", " (\"National Library of New Zealand\", \"nlnz\"),\n", " (\"Internet Archive\", \"ia\"),\n", " ],\n", " description=\"Archive:\",\n", " disabled=False,\n", " value=\"\",\n", ")\n", "\n", "search_direction = widgets.Dropdown(\n", " options=[\n", " (\"First occurrence\", \"first\"),\n", " (\"Last occurrence\", \"last\"),\n", " (\"All occurrences\", \"all\"),\n", " ],\n", " description=\"Find:\",\n", " disabled=False,\n", " value=\"first\",\n", ")\n", "\n", "target_url = widgets.Text(description=\"URL:\")\n", "\n", "search_string = widgets.Text(description=\"Search text:\")\n", "\n", "tc_button = widgets.Button(description=\"Find text\", button_style=\"primary\")\n", "tc_button.on_click(start)\n", "clear_button = widgets.Button(description=\"Clear all\")\n", "clear_button.on_click(clear)\n", "\n", "display(widgets.HBox([repository, target_url], layout=widgets.Layout(padding=\"10px\")))\n", "display(\n", " widgets.HBox(\n", " [search_string, search_direction], layout=widgets.Layout(padding=\"10px\")\n", " )\n", ")\n", "display(widgets.HBox([tc_button, clear_button], layout=widgets.Layout(padding=\"10px\")))\n", "display(status)\n", "display(chart_display)\n", "display(out)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%capture\n", "%load_ext dotenv\n", "%dotenv\n", "\n", "# Insert some values for automated testing\n", "\n", "if os.getenv(\"GW_STATUS\") == \"dev\":\n", " target_url.value = \"http://discontents.com.au/\"\n", " repository.value = \"nla\"\n", " search_string.value = \"Trove\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# If values have been provided via url or above, then start automatically.\n", "# Note that Voila widgets don't load immediately, hence the polling to\n", "# make sure the start button exists.\n", "\n", "if target_url.value and search_string.value:\n", " script = \"\"\"\n", " \"\"\"\n", " display(HTML(script))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "Created by [Tim Sherratt](https://timsherratt.org) for the [GLAM Workbench](https://glam-workbench.github.io). Support me by becoming a [GitHub sponsor](https://github.com/sponsors/wragge)!\n", "\n", "Work on this notebook was supported by the [IIPC Discretionary Funding Programme 2019-2020](http://netpreserve.org/projects/)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.12" } }, "nbformat": 4, "nbformat_minor": 4 }