{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Harvesting collections of text from archived web pages\n", "\n", "

New to Jupyter notebooks? Try *Using Jupyter notebooks* for a quick introduction.

" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "This notebook helps you assemble datasets of text extracted from all available captures of archived web pages. You can then feed these datasets to the text analysis tool of your choice to analyse changes over time.\n", "\n", "### Harvest sources\n", "\n", "* Timemaps – harvest text from a single url, or list of urls, using the repository of your choice\n", "* CDX API – harvest text from the results of a query to the Internet Archive's CDX API\n", "\n", "### Options\n", "\n", "* `filter_text=False` (default) – save all of the human visible text on the page, this includes boilerplate, footers, and navigation text.\n", "* `filter_text=True` – save only the significant text on the page, excluding recurring items like boilerplate and navigation. This is done by [Trafilatura](https://trafilatura.readthedocs.io/en/latest/index.html).\n", "\n", "### Usage\n", "\n", "#### Using Timemaps\n", "\n", "``` python\n", "get_texts_for_url([timegate], [url], filter_text=[True or False])\n", "\n", "```\n", "\n", "The `timegate` value should be one of:\n", "\n", "* `nla` – National Library of Australia\n", "* `nlnz` – National Library of New Zealand\n", "* `bl` – UK Web Archive\n", "* `ia` – Internet Archive\n", "* `ukgwa` – UK Government Web Archive\n", "\n", "#### Using the Internet Archive's CDX API\n", "\n", "Use a CDX query to find all urls that include the specified keyword in their url.\n", "\n", "``` python\n", "get_texts_for_cdx_query([url], filter_text=[True or False], filter=['original:.*[keyword].*', 'statuscode:200', 'mimetype:text/html'])\n", "```\n", "\n", "The `url` value can use wildcards to indicate whether it is a domain or prefix query, for example:\n", "\n", "* `nla.gov.au/*` – prefix query, search all files under `nla.gov.au`\n", "* `*.nla.gov.au` – domain query, search all files under `nla.gov.au` and any of its subdomains\n", "\n", "You can use any of the keyword parameters that the CDX 
API recognises, but you probably want to filter for `statuscode` and `mimetype` and apply some sort of regular expression to `original`.\n", "\n", "### Output\n", "\n", "A directory will be created for each url processed. The name of the directory will be a slugified version of the url in SURT (Sort-friendly URI Reordering Transform) format, truncated to 50 characters.\n", "\n", "Each text file will be saved separately within the directory. Filenames follow the pattern:\n", "\n", "```\n", "[slugified SURT url]-[capture timestamp].txt\n", "```\n", "\n", "There's also a `metadata.json` file that includes basic details of the harvest:\n", "\n", "* `timegate` – the repository used\n", "* `url` – the url harvested\n", "* `filter_text` – text filtering option used\n", "* `date` – date and time the harvest was started\n", "* `mementos` – details of each capture, including:\n", " * `url` – link to capture in web archive\n", " * `text_file` – path to the harvested text file" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Import what we need" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import json\n", "import re\n", "import time\n", "from pathlib import Path\n", "\n", "import arrow\n", "import pandas as pd\n", "import requests\n", "import trafilatura\n", "from bs4 import BeautifulSoup\n", "from IPython.display import FileLink, FileLinks, display\n", "from lxml.etree import ParserError\n", "from requests.adapters import HTTPAdapter\n", "from requests.packages.urllib3.util.retry import Retry\n", "from slugify import slugify\n", "from surt import surt\n", "from tqdm.auto import tqdm\n", "\n", "s = requests.Session()\n", "retries = Retry(total=10, backoff_factor=1, status_forcelist=[502, 503, 504])\n", "s.mount(\"https://\", HTTPAdapter(max_retries=retries))\n", "s.mount(\"http://\", HTTPAdapter(max_retries=retries))" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Default list of repositories -- 
you could add to this\n", "TIMEGATES = {\n", " \"nla\": \"https://web.archive.org.au/awa/\",\n", " \"nlnz\": \"https://ndhadeliver.natlib.govt.nz/webarchive/\",\n", " \"bl\": \"https://www.webarchive.org.uk/wayback/archive/\",\n", " \"ia\": \"https://web.archive.org/web/\",\n", " \"ukgwa\": \"https://webarchive.nationalarchives.gov.uk/ukgwa/\",\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Define some functions" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "def is_memento(url):\n", " \"\"\"\n", " Is this url a Memento? Checks for the presence of a timestamp.\n", " \"\"\"\n", " return bool(re.search(r\"/(\\d{12}|\\d{14})(?:id_|mp_|if_)*/http\", url))\n", "\n", "\n", "def get_html(url):\n", " \"\"\"\n", " Retrieve the original HTML content of an archived page.\n", " Follow redirects if they go to another archived page.\n", " Return the (possibly redirected) url from the response and the HTML content.\n", " \"\"\"\n", " # Adding the id_ hint tells the archive to give us the original harvested version, without any rewriting.\n", " url = re.sub(r\"/(\\d{12}|\\d{14})(?:mp_)*/http\", r\"/\\1id_/http\", url)\n", " response = requests.get(url, allow_redirects=True)\n", " # Some captures might redirect themselves to live versions\n", " # If the redirected url doesn't look like a Memento rerun this without redirection\n", " if not is_memento(response.url):\n", " response = requests.get(url, allow_redirects=False)\n", " return {\"url\": response.url, \"html\": response.content}\n", "\n", "\n", "def convert_lists_to_dicts(results):\n", " \"\"\"\n", " Converts IA style timemap (a JSON array of arrays) to a list of dictionaries.\n", " Renames keys to standardise IA with other Timemaps.\n", " \"\"\"\n", " if results:\n", " keys = results[0]\n", " results_as_dicts = [dict(zip(keys, v)) for v in results[1:]]\n", " else:\n", " results_as_dicts = results\n", " # Rename keys\n", " for d in results_as_dicts:\n", " 
d[\"status\"] = d.pop(\"statuscode\")\n", " d[\"mime\"] = d.pop(\"mimetype\")\n", " d[\"url\"] = d.pop(\"original\")\n", " return results_as_dicts\n", "\n", "\n", "def get_capture_data_from_memento(url, request_type=\"head\"):\n", " \"\"\"\n", " For OpenWayback systems this can get some extra cpature info to insert in Timemaps.\n", " \"\"\"\n", " if request_type == \"head\":\n", " response = requests.head(url)\n", " else:\n", " response = requests.get(url)\n", " headers = response.headers\n", " length = headers.get(\"x-archive-orig-content-length\")\n", " status = headers.get(\"x-archive-orig-status\")\n", " status = status.split(\" \")[0] if status else None\n", " mime = headers.get(\"x-archive-orig-content-type\")\n", " mime = mime.split(\";\")[0] if mime else None\n", " return {\"length\": length, \"status\": status, \"mime\": mime}\n", "\n", "\n", "def convert_link_to_json(results, enrich_data=False):\n", " \"\"\"\n", " Converts link formatted Timemap to JSON.\n", " \"\"\"\n", " data = []\n", " for line in results.splitlines():\n", " parts = line.split(\"; \")\n", " if len(parts) > 1:\n", " link_type = re.search(\n", " r'rel=\"(original|self|timegate|first memento|last memento|memento)\"',\n", " parts[1],\n", " ).group(1)\n", " if link_type == \"memento\":\n", " link = parts[0].strip(\"<>\")\n", " timestamp, original = re.search(r\"/(\\d{12}|\\d{14})/(.*)$\", link).groups()\n", " capture = {\"timestamp\": timestamp, \"url\": original}\n", " if enrich_data:\n", " capture.update(get_capture_data_from_memento(link))\n", " data.append(capture)\n", " return data\n", "\n", "\n", "def get_timemap_as_json(timegate, url):\n", " \"\"\"\n", " Get a Timemap then normalise results (if necessary) to return a list of dicts.\n", " \"\"\"\n", " tg_url = f\"{TIMEGATES[timegate]}timemap/json/{url}/\"\n", " response = requests.get(tg_url)\n", " response_type = response.headers[\"content-type\"]\n", " # pywb style Timemap\n", " if response_type == \"text/x-ndjson\":\n", " data = 
[json.loads(line) for line in response.text.splitlines()]\n", " # IA Wayback style Timemap\n", " elif response_type == \"application/json\":\n", " data = convert_lists_to_dicts(response.json())\n", " # Link style Timemap (OpenWayback)\n", " elif response_type in [\"application/link-format\", \"text/html;charset=utf-8\"]:\n", " data = convert_link_to_json(response.text)\n", " else:\n", " # Unrecognised Timemap format -- return an empty list\n", " data = []\n", " return data\n", "\n", "\n", "def get_all_text(capture_data):\n", " \"\"\"\n", " Get all the human-visible text from a web page, including headers, footers, and navigation.\n", " Does some cleaning up to remove multiple spaces, tabs, and newlines.\n", " \"\"\"\n", " try:\n", " text = BeautifulSoup(capture_data[\"html\"]).get_text()\n", " except TypeError:\n", " return None\n", " else:\n", " # Remove multiple newlines\n", " text = re.sub(r\"\\n\\s*\\n\", \"\\n\\n\", text)\n", " # Replace multiple spaces or tabs with a single space\n", " text = re.sub(r\"( |\\t){2,}\", \" \", text)\n", " # Remove leading spaces\n", " text = re.sub(r\"\\n \", \"\\n\", text)\n", " # Remove leading newlines\n", " text = re.sub(r\"^\\n*\", \"\", text)\n", " return text\n", "\n", "\n", "def get_main_text(capture_data):\n", " \"\"\"\n", " Get only the main text from a page, excluding boilerplate and navigation.\n", " \"\"\"\n", " try:\n", " text = trafilatura.extract(capture_data[\"html\"])\n", " except ParserError:\n", " text = \"\"\n", " return text\n", "\n", "\n", "def get_text_from_capture(capture_url, filter_text=False):\n", " \"\"\"\n", " Get text from the given memento.\n", " If filter_text is True, only return the significant text (excluding things like navigation).\n", " \"\"\"\n", " capture_data = get_html(capture_url)\n", " if filter_text:\n", " text = get_main_text(capture_data)\n", " else:\n", " text = get_all_text(capture_data)\n", " return text\n", "\n", "\n", "def process_capture_list(timegate, captures, filter_text=False, url=None):\n", " if not url:\n", " url = captures[0][\"url\"]\n", " metadata = 
{\n", " \"timegate\": TIMEGATES[timegate],\n", " \"url\": url,\n", " \"filter_text\": filter_text,\n", " \"date\": arrow.now().format(\"YYYY-MM-DD HH:mm:ss\"),\n", " \"mementos\": [],\n", " }\n", " try:\n", " urlkey = captures[0][\"urlkey\"]\n", " except KeyError:\n", " urlkey = surt(url)\n", " # Truncate urls longer than 50 chars so that filenames are not too long\n", " output_dir = Path(\"text\", slugify(urlkey)[:50])\n", " output_dir.mkdir(parents=True, exist_ok=True)\n", " for capture in tqdm(captures, desc=\"Captures\"):\n", " file_path = Path(\n", " output_dir, f'{slugify(urlkey)[:50]}-{capture[\"timestamp\"]}.txt'\n", " )\n", " # Don't reharvest if file already exists\n", " if not file_path.exists():\n", " # Only process successful captures\n", " if capture[\"status\"] == \"200\":\n", " capture_url = (\n", " f'{TIMEGATES[timegate]}{capture[\"timestamp\"]}id_/{capture[\"url\"]}'\n", " )\n", " capture_text = get_text_from_capture(capture_url, filter_text)\n", " if capture_text:\n", " # Truncate urls longer than 50 chars so that filenames are not too long\n", " file_path = Path(\n", " output_dir, f'{slugify(urlkey)[:50]}-{capture[\"timestamp\"]}.txt'\n", " )\n", " file_path.write_text(capture_text)\n", " metadata[\"mementos\"].append(\n", " {\"url\": capture_url, \"text_file\": str(file_path)}\n", " )\n", " time.sleep(0.2)\n", " metadata_file = Path(output_dir, \"metadata.json\")\n", " with metadata_file.open(\"wt\") as md_json:\n", " json.dump(metadata, md_json)\n", "\n", "\n", "def save_texts_from_url(timegate, url, filter_text=False):\n", " \"\"\"\n", " Save the text contents of all available captures for a given url from the specified repository.\n", " Saves both the harvested text files and a json file with the harvest metadata.\n", " \"\"\"\n", " timemap = get_timemap_as_json(timegate, url)\n", " if timemap:\n", " process_capture_list(timegate, timemap, url=url, filter_text=filter_text)\n", "\n", "\n", "def prepare_params(url, **kwargs):\n", " \"\"\"\n", 
" Prepare the parameters for a CDX API requests.\n", " Adds all supplied keyword arguments as parameters (changing from_ to from).\n", " Adds in a few necessary parameters.\n", " \"\"\"\n", " params = kwargs\n", " params[\"url\"] = url\n", " params[\"output\"] = \"json\"\n", " params[\"pageSize\"] = 5\n", " # CDX accepts a 'from' parameter, but this is a reserved word in Python\n", " # Use 'from_' to pass the value to the function & here we'll change it back to 'from'.\n", " if \"from_\" in params:\n", " params[\"from\"] = params[\"from_\"]\n", " del params[\"from_\"]\n", " return params\n", "\n", "\n", "def get_total_pages(params):\n", " \"\"\"\n", " Get number of pages in a query.\n", " Note that the number of pages doesn't tell you much about the number of results, as the numbers per page vary.\n", " \"\"\"\n", " these_params = params.copy()\n", " these_params[\"showNumPages\"] = \"true\"\n", " response = s.get(\n", " \"http://web.archive.org/cdx/search/cdx\",\n", " params=these_params,\n", " headers={\"User-Agent\": \"\"},\n", " )\n", " return int(response.text)\n", "\n", "\n", "def get_cdx_data(params):\n", " \"\"\"\n", " Make a request to the CDX API using the supplied parameters.\n", " Return results converted to a list of dicts.\n", " \"\"\"\n", " response = s.get(\"http://web.archive.org/cdx/search/cdx\", params=params)\n", " response.raise_for_status()\n", " results = response.json()\n", " try:\n", " if not response.from_cache:\n", " time.sleep(0.2)\n", " except AttributeError:\n", " # Not using cache\n", " time.sleep(0.2)\n", " return convert_lists_to_dicts(results)\n", "\n", "\n", "def harvest_cdx_query(url, **kwargs):\n", " \"\"\"\n", " Harvest results of query from the IA CDX API using pagination.\n", " Returns captures as a list of dicts.\n", " \"\"\"\n", " results = []\n", " page = 0\n", " params = prepare_params(url, **kwargs)\n", " total_pages = get_total_pages(params)\n", " with tqdm(total=total_pages - page, desc=\"CDX\") as pbar:\n", " while 
page < total_pages:\n", " params[\"page\"] = page\n", " results += get_cdx_data(params)\n", " page += 1\n", " pbar.update(1)\n", " return results\n", "\n", "\n", "def save_texts_from_cdx_query(url, filter_text=False, **kwargs):\n", " \"\"\"\n", " Save text from all captures matching a CDX query, grouping results by url.\n", " \"\"\"\n", " captures = harvest_cdx_query(url, **kwargs)\n", " if captures:\n", " df = pd.DataFrame(captures)\n", " groups = df.groupby(by=\"urlkey\")\n", " print(f\"{len(groups)} matching urls\")\n", " for name, group in groups:\n", " process_capture_list(\n", " \"ia\", group.to_dict(\"records\"), filter_text=filter_text\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Harvesting a single url or list of urls\n", "\n", "Get all human-visible text from all captures of a single url in the Australian Web Archive." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "save_texts_from_url(\"nla\", \"http://discontents.com.au/\", filter_text=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Get only significant text from all captures of a single url in the New Zealand Web Archive." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "save_texts_from_url(\"nlnz\", \"http://digitalnz.org/\", filter_text=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Harvest text from a series of urls." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "urls = [\"http://nla.gov.au\", \"http://nma.gov.au\", \"http://awm.gov.au\"]\n", "\n", "for url in urls:\n", " save_texts_from_url(\"nla\", url, filter_text=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Harvesting matching pages from a domain\n", "\n", "Harvest text from all pages under the `dfat.gov.au` domain that include the word 'policy' in the url. Note the use of the regular expression `.*policy.*` to match the `original` url." 
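, "\n", "The keyword arguments you pass to `save_texts_from_cdx_query` are forwarded to the CDX API. Here's a minimal, self-contained sketch of how `prepare_params` (defined above) assembles them -- the helper name `build_cdx_params` is illustrative only, not part of this notebook:\n", "\n", "``` python\n", "def build_cdx_params(url, **kwargs):\n", "    # Collect any caller-supplied CDX parameters\n", "    params = dict(kwargs)\n", "    # Add the parameters the harvester always needs\n", "    params.update({\"url\": url, \"output\": \"json\", \"pageSize\": 5})\n", "    # 'from' is a reserved word in Python, so 'from_' is accepted\n", "    # and swapped back before the request is sent\n", "    if \"from_\" in params:\n", "        params[\"from\"] = params.pop(\"from_\")\n", "    return params\n", "\n", "build_cdx_params(\"dfat.gov.au/*\", from_=\"2015\", filter=[\"statuscode:200\"])\n", "```\n", "\n", "In the same way, you should be able to limit a harvest by date by supplying `from_` and `to` values to `save_texts_from_cdx_query`."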
] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "save_texts_from_cdx_query(\n", " \"dfat.gov.au/*\",\n", " filter_text=True,\n", " filter=[\"original:.*policy.*\", \"statuscode:200\", \"mimetype:text/html\"],\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Viewing and downloading the results\n", "\n", "If you're using Jupyter Lab, you can browse the results of this notebook by just looking inside the `text` folder. I've also enabled the `jupyter-archive` extension which adds a download option to the right-click menu. Just right click on a folder and you'll see an option to 'Download as an Archive'. This will zip up and download the folder.\n", "\n", "The cells below provide a couple of alternative ways of viewing and downloading the results." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "# Display all the files under the current text folder (this could be a long list)\n", "display(FileLinks(\"text\"))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "# Tar/gzip the current domain folder\n", "!tar -czf text.tar.gz text" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "# Display a link to the gzipped data\n", "# In JupyterLab you'll need to Shift+right-click on the link and choose 'Download link'\n", "display(FileLink(\"text.tar.gz\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "Created by [Tim Sherratt](https://timsherratt.org) for the [GLAM Workbench](https://glam-workbench.github.io). 
Support me by becoming a [GitHub sponsor](https://github.com/sponsors/wragge)!\n", "\n", "Work on this notebook was supported by the [IIPC Discretionary Funding Programme 2019-2020](http://netpreserve.org/projects/).\n", "\n", "The Web Archives section of the GLAM Workbench is sponsored by the [British Library](https://www.bl.uk/)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" } }, "nbformat": 4, "nbformat_minor": 4 }