{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Find and explore Powerpoint presentations from a specific domain\n", "\n", "![Image from Powerpoint](images/au-gov-defence-army-hna-docs-hna-20general-20presentation-ppt-20070911062046.png) ![Image from Powerpoint](images/au-gov-defence-army-hna-docs-ncw-20model-ppt-20070911061846.png)\n", "\n", "
New to Jupyter notebooks? Try Using Jupyter notebooks for a quick introduction.
\n", "\n", "Web archives don't just contain HTML pages! Using the `filter` parameter in CDX queries we can limit our results to particular types of files, for example Powerpoint presentations.\n", "\n", "This notebook helps you find, download, and explore all the presentation files captured from a particular domain, like `defence.gov.au`. It uses the Internet Archive by default, as their CDX API allows domain level queries and pagination, however, you could try using the UKWA or the National Library of Australia (prefix queries only).\n", "\n", "This notebook includes a series of processing steps:\n", "\n", "1. Harvest capture data\n", "2. Remove duplicates from capture data and download files\n", "3. Convert Powerpoint files to PDFs\n", "4. Extract screenshots and text from the PDFs\n", "5. Save metadata, screenshots, and text into an SQLite database for exploration\n", "6. Open the SQLite db in Datasette for exploration\n", "\n", "Here's an [example of the SQLite database](https://defencegovau-powerpoints.glitch.me/) created by harvesting Powerpoint files from the `defence.gov.au` domain, running in Datasette on Glitch.\n", "\n", "Moving large files around and extracting useful data from proprietary formats is not a straightforward process. While this notebook has been tested and will work running on Binder, you'll probably want to shift across to a local machine if you're doing any large-scale harvesting. That'll make it easier for you to deal with corrupted files, broken downloads etc.\n", "\n", "For more examples of harvesting domain-level data see [Harvesting data about a domain using the IA CDX API](harvesting_domain_data.ipynb)." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import io\n", "import math\n", "import time\n", "from pathlib import Path\n", "from urllib.parse import urlparse\n", "\n", "import arrow\n", "\n", "# pyMuPDF (aka Fitz) seems to do a better job of converting PDFs to images than pdf2image\n", "import fitz\n", "import pandas as pd\n", "import PIL\n", "import requests\n", "import sqlite_utils\n", "from IPython.display import HTML, FileLink, FileLinks, display\n", "from jupyter_server import serverapp\n", "from notebook import notebookapp\n", "from PIL import Image\n", "from requests.adapters import HTTPAdapter\n", "from requests.packages.urllib3.exceptions import MaxRetryError\n", "from requests.packages.urllib3.util.retry import Retry\n", "from slugify import slugify\n", "from tqdm.auto import tqdm\n", "\n", "# Also need LibreOffice installed by whatever means is needed for local system\n", "\n", "s = requests.Session()\n", "retries = Retry(total=3, backoff_factor=1, status_forcelist=[502, 503, 504])\n", "s.mount(\"https://\", HTTPAdapter(max_retries=retries))\n", "s.mount(\"http://\", HTTPAdapter(max_retries=retries))" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "def convert_lists_to_dicts(results):\n", " \"\"\"\n", " Converts IA style timemap (a JSON array of arrays) to a list of dictionaries.\n", " Renames keys to standardise IA with other Timemaps.\n", " \"\"\"\n", " if results:\n", " keys = results[0]\n", " results_as_dicts = [dict(zip(keys, v)) for v in results[1:]]\n", " else:\n", " results_as_dicts = results\n", " for d in results_as_dicts:\n", " d[\"status\"] = d.pop(\"statuscode\")\n", " d[\"mime\"] = d.pop(\"mimetype\")\n", " d[\"url\"] = d.pop(\"original\")\n", " return results_as_dicts\n", "\n", "\n", "def get_total_pages(params):\n", " these_params = 
params.copy()\n", " these_params[\"showNumPages\"] = \"true\"\n", " response = s.get(\n", " \"http://web.archive.org/cdx/search/cdx\",\n", " params=these_params,\n", " headers={\"User-Agent\": \"\"},\n", " )\n", " return int(response.text)\n", "\n", "\n", "def query_cdx_ia(url, **kwargs):\n", " results = []\n", " page = 0\n", " params = kwargs\n", " params[\"url\"] = url\n", " params[\"output\"] = \"json\"\n", " total_pages = get_total_pages(params)\n", " # print(total_pages)\n", " with tqdm(total=total_pages - page) as pbar:\n", " while page < total_pages:\n", " params[\"page\"] = page\n", " response = s.get(\n", " \"http://web.archive.org/cdx/search/cdx\",\n", " params=params,\n", " headers={\"User-Agent\": \"\"},\n", " )\n", " # print(response.url)\n", " response.raise_for_status()\n", " results += convert_lists_to_dicts(response.json())\n", " page += 1\n", " pbar.update(1)\n", " time.sleep(0.2)\n", " return results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Harvest capture data\n", "\n", "See [Exploring the Internet Archive's CDX API](exploring_cdx_api.ipynb) and [Comparing CDX APIs](comparing_cdx_apis.ipynb) for more information on getting CDX data. \n", "\n", "Setting `filter=mimetype:application/vnd.ms-powerpoint` would get standard MS `.ppt` files, but what about `.pptx` or Open Office files? Using a regular expression like `filter=mimetype:.*(powerpoint|presentation).*` should get most presentation software variants." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "# Set the domain that we're interested in\n", "domain = \"defence.gov.au\"\n", "\n", "# Create output directory for data\n", "domain_dir = Path(\"domains\", slugify(domain))\n", "domain_dir.mkdir(parents=True, exist_ok=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "# Domain or prefix search? Domain...\n", "# Collapse on digest? Only removes adjacent captures with the same digest, so probably won't make much difference\n", "# Note the use of regex in the mimetype filter -- should capture all(?) presentations.\n", "results = query_cdx_ia(f\"*.{domain}\", filter=\"mimetype:.*(powerpoint|presentation).*\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's convert the harvested data into a dataframe." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "df = pd.DataFrame(results)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How many captures are there?" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "tags": [ "nbval-skip" ] }, "outputs": [ { "data": { "text/plain": [ "(541, 7)" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How many unique files are there? This drops duplicates based on the `digest` field." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": { "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "df_unique = df.drop_duplicates(subset=[\"digest\"], keep=\"first\")" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "tags": [ "nbval-skip" ] }, "outputs": [ { "data": { "text/plain": [ "(269, 7)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_unique.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What `mimetype` values are there?" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "tags": [ "nbval-skip" ] }, "outputs": [ { "data": { "text/plain": [ "application/vnd.ms-powerpoint 267\n", "application/vnd.openxmlformats-officedocument.presentationml.presentation 2\n", "Name: mime, dtype: int64" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_unique[\"mime\"].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Download all the Powerpoint files and save some metadata about them\n", "\n", "In this step we'll work through the capture data and attempt to download each unique file. We'll also add in some extra metadata:\n", "\n", "* `first_capture` – the first time this file appears in the archive (ISO formatted date, YYYY-MM-DD)\n", "* `last_capture` – the last time this file appears in the archive (ISO formatted date, YYYY-MM-DD)\n", "* `current_status` – HTTP status code returned now by the original url\n", "\n", "The downloaded Powerpoint files will be stored in `/domains/[this domain]/powerpoints/original`. The script won't overwrite any existing files, so if it fails and you need to restart, it'll just pick up from where it left off." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "def format_timestamp_as_iso(timestamp):\n", " return arrow.get(timestamp, \"YYYYMMDDHHmmss\").format(\"YYYY-MM-DD\")\n", "\n", "\n", "def get_date_range(df, digest):\n", " \"\"\"\n", " Find the first and last captures identified by digest.\n", " Return the dates in ISO format.\n", " \"\"\"\n", " captures = df.loc[df[\"digest\"] == digest]\n", " from_date = format_timestamp_as_iso(captures[\"timestamp\"].min())\n", " to_date = format_timestamp_as_iso(captures[\"timestamp\"].max())\n", " return (from_date, to_date)\n", "\n", "\n", "def check_if_exists(url):\n", " \"\"\"\n", " Check to see if a file still exists on the web.\n", " Return the current status code.\n", " \"\"\"\n", " try:\n", " response = s.head(url, allow_redirects=True)\n", " # If there are any problems that don't generate codes\n", " except (\n", " requests.exceptions.ConnectionError,\n", " requests.exceptions.RetryError,\n", " MaxRetryError,\n", " ):\n", " return \"unreachable\"\n", " return response.status_code\n", "\n", "\n", "def save_files(df, add_current_status=True):\n", " \"\"\"\n", " Attempts to download each unique file.\n", " Adds first_capture, and last_capture, and file_path to the file metadata.\n", " Optionally checks to see if file is still accessible on the live web, adding current status code to metadata.\n", " \"\"\"\n", " metadata = df.drop_duplicates(subset=[\"digest\"], keep=\"first\").to_dict(\"records\")\n", " for row in tqdm(metadata):\n", " url = f'https://web.archive.org/web/{row[\"timestamp\"]}id_/{row[\"url\"]}'\n", " parsed = urlparse(row[\"url\"])\n", " suffix = Path(parsed.path).suffix\n", " # This should give a sortable and unique filename if there are multiple versions of a file\n", 
" file_name = f'{slugify(row[\"urlkey\"])}-{row[\"timestamp\"]}{suffix}'\n", " # print(filename)\n", " # Create the output directory for the downloaded files\n", " output_dir = Path(domain_dir, \"powerpoints\", \"original\")\n", " output_dir.mkdir(parents=True, exist_ok=True)\n", " file_path = Path(output_dir, file_name)\n", " # Don't re-get files we already got\n", " if not file_path.exists():\n", " response = s.get(url=url)\n", " file_path.write_bytes(response.content)\n", " time.sleep(1)\n", " # Get first and last cpature dates\n", " first, last = get_date_range(df, row[\"digest\"])\n", " row[\"first_capture\"] = first\n", " row[\"last_capture\"] = last\n", " # This can slow things down a bit, so make it optional\n", " if add_current_status:\n", " row[\"current_status\"] = check_if_exists(row[\"url\"])\n", " row[\"file_path\"] = str(file_path)\n", " row[\"memento\"] = f'https://web.archive.org/web/{row[\"timestamp\"]}/{row[\"url\"]}'\n", " return metadata" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "metadata = save_files(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's save the enriched data as a CSV file." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "df_md = pd.DataFrame(metadata)" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "tags": [ "nbval-skip" ] }, "outputs": [ { "data": { "text/html": [ "domains/defence-gov-au/defence-gov-au-powerpoints.csvView Datasette (Click on the stop button in the top menu bar to close the Datasette server)
'\n", " )\n", ")\n", "\n", "# Launch Datasette\n", "!datasette -- {str(db_file)} --port 8001 --config base_url:{proxy_url} --config truncate_cells_html:100 --plugins-dir plugins" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Viewing or downloading files\n", "\n", "If you're using Jupyter Lab, you can browse the results of this notebook by just looking inside the `domains` folder. I've also enabled the `jupyter-archive` extension which adds a download option to the right-click menu. Just right click on a folder and you'll see an option to 'Download as an Archive'. This will zip up and download the folder. Remember, however, that some of these folders will contain a LOT of data, so downloading everything might not always work.\n", "\n", "The cells below provide a couple of alternative ways of viewing and downloading the results." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "# Display all the files under the current domain folder (this could be a long list)\n", "display(FileLinks(str(domain_dir)))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "# Tar/gzip the current domain folder\n", "!tar -czf {str(domain_dir)}.tar.gz {str(domain_dir)}" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "# Display a link to the gzipped data\n", "# In JupyterLab you'll need to Shift+right-click on the link and choose 'Download link'\n", "display(FileLink(f\"{str(domain_dir)}.tar.gz\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "Created by [Tim Sherratt](https://timsherratt.org) for the [GLAM Workbench](https://glam-workbench.github.io). Support me by becoming a [GitHub sponsor](https://github.com/sponsors/wragge)!\n", "\n", "Work on this notebook was supported by the [IIPC Discretionary Funding Programme 2019-2020](http://netpreserve.org/projects/).\n", "\n", "The Web Archives section of the GLAM Workbench is sponsored by the [British Library](https://www.bl.uk/)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.12" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 4 }