{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Find and explore Powerpoint presentations from a specific domain\n", "\n", "![Image from Powerpoint](images/au-gov-defence-army-hna-docs-hna-20general-20presentation-ppt-20070911062046.png) ![Image from Powerpoint](images/au-gov-defence-army-hna-docs-ncw-20model-ppt-20070911061846.png)\n", "\n", "
New to Jupyter notebooks? Try Using Jupyter notebooks for a quick introduction.
\n", "\n", "Web archives don't just contain HTML pages! Using the `filter` parameter in CDX queries we can limit our results to particular types of files, for example Powerpoint presentations.\n", "\n", "This notebook helps you find, download, and explore all the presentation files captured from a particular domain, like `defence.gov.au`. It uses the Internet Archive by default, as their CDX API allows domain level queries and pagination, however, you could try using the UKWA or the National Library of Australia (prefix queries only).\n", "\n", "This notebook includes a series of processing steps:\n", "\n", "1. Harvest capture data\n", "2. Remove duplicates from capture data and download files\n", "3. Convert Powerpoint files to PDFs\n", "4. Extract screenshots and text from the PDFs\n", "5. Save metadata, screenshots, and text into an SQLite database for exploration\n", "6. Open the SQLite db in Datasette for exploration\n", "\n", "Here's an [example of the SQLite database](https://defencegovau-powerpoints.glitch.me/) created by harvesting Powerpoint files from the `defence.gov.au` domain, running in Datasette on Glitch.\n", "\n", "Moving large files around and extracting useful data from proprietary formats is not a straightforward process. While this notebook has been tested and will work running on Binder, you'll probably want to shift across to a local machine if you're doing any large-scale harvesting. That'll make it easier for you to deal with corrupted files, broken downloads etc.\n", "\n", "For more examples of harvesting domain-level data see [Harvesting data about a domain using the IA CDX API](harvesting_domain_data.ipynb)." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import io\n", "import math\n", "import time\n", "from pathlib import Path\n", "from urllib.parse import urlparse\n", "\n", "import arrow\n", "\n", "# pyMuPDF (aka Fitz) seems to do a better job of converting PDFs to images than pdf2image\n", "import fitz\n", "import pandas as pd\n", "import PIL\n", "import requests\n", "import sqlite_utils\n", "from IPython.display import HTML, FileLink, FileLinks, display\n", "from jupyter_server import serverapp\n", "from notebook import notebookapp\n", "from PIL import Image\n", "from requests.adapters import HTTPAdapter\n", "from requests.packages.urllib3.exceptions import MaxRetryError\n", "from requests.packages.urllib3.util.retry import Retry\n", "from slugify import slugify\n", "from tqdm.auto import tqdm\n", "\n", "# Also need LibreOffice installed by whatever means is needed for local system\n", "\n", "s = requests.Session()\n", "retries = Retry(total=3, backoff_factor=1, status_forcelist=[502, 503, 504])\n", "s.mount(\"https://\", HTTPAdapter(max_retries=retries))\n", "s.mount(\"http://\", HTTPAdapter(max_retries=retries))" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "def convert_lists_to_dicts(results):\n", " \"\"\"\n", " Converts IA style timemap (a JSON array of arrays) to a list of dictionaries.\n", " Renames keys to standardise IA with other Timemaps.\n", " \"\"\"\n", " if results:\n", " keys = results[0]\n", " results_as_dicts = [dict(zip(keys, v)) for v in results[1:]]\n", " else:\n", " results_as_dicts = results\n", " for d in results_as_dicts:\n", " d[\"status\"] = d.pop(\"statuscode\")\n", " d[\"mime\"] = d.pop(\"mimetype\")\n", " d[\"url\"] = d.pop(\"original\")\n", " return results_as_dicts\n", "\n", "\n", "def get_total_pages(params):\n", " these_params = 
params.copy()\n", " these_params[\"showNumPages\"] = \"true\"\n", " response = s.get(\n", " \"http://web.archive.org/cdx/search/cdx\",\n", " params=these_params,\n", " headers={\"User-Agent\": \"\"},\n", " )\n", " return int(response.text)\n", "\n", "\n", "def query_cdx_ia(url, **kwargs):\n", " results = []\n", " page = 0\n", " params = kwargs\n", " params[\"url\"] = url\n", " params[\"output\"] = \"json\"\n", " total_pages = get_total_pages(params)\n", " # print(total_pages)\n", " with tqdm(total=total_pages - page) as pbar:\n", " while page < total_pages:\n", " params[\"page\"] = page\n", " response = s.get(\n", " \"http://web.archive.org/cdx/search/cdx\",\n", " params=params,\n", " headers={\"User-Agent\": \"\"},\n", " )\n", " # print(response.url)\n", " response.raise_for_status()\n", " results += convert_lists_to_dicts(response.json())\n", " page += 1\n", " pbar.update(1)\n", " time.sleep(0.2)\n", " return results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Harvest capture data\n", "\n", "See [Exploring the Internet Archive's CDX API](exploring_cdx_api.ipynb) and [Comparing CDX APIs](comparing_cdx_apis.ipynb) for more information on getting CDX data. \n", "\n", "Setting `filter=mimetype:application/vnd.ms-powerpoint` would get standard MS `.ppt` files, but what about `.pptx` or Open Office files? Using a regular expression like `filter=mimetype:.*(powerpoint|presentation).*` should get most presentation software variants." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "# Set the domain that we're interested in\n", "domain = \"defence.gov.au\"\n", "\n", "# Create output directory for data\n", "domain_dir = Path(\"domains\", slugify(domain))\n", "domain_dir.mkdir(parents=True, exist_ok=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "# Domain or prefix search? Domain...\n", "# Collapse on digest? Only removes adjacent captures with the same digest, so probably won't make much difference\n", "# Note the use of regex in the mimetype filter -- should capture all(?) presentations.\n", "results = query_cdx_ia(f\"*.{domain}\", filter=\"mimetype:.*(powerpoint|presentation).*\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's convert the harvested data into a dataframe." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "df = pd.DataFrame(results)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How many captures are there?" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "tags": [ "nbval-skip" ] }, "outputs": [ { "data": { "text/plain": [ "(541, 7)" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How many unique files are there? This drops duplicates based on the `digest` field." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": { "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "df_unique = df.drop_duplicates(subset=[\"digest\"], keep=\"first\")" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "tags": [ "nbval-skip" ] }, "outputs": [ { "data": { "text/plain": [ "(269, 7)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_unique.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What `mimetype` values are there?" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "tags": [ "nbval-skip" ] }, "outputs": [ { "data": { "text/plain": [ "application/vnd.ms-powerpoint 267\n", "application/vnd.openxmlformats-officedocument.presentationml.presentation 2\n", "Name: mime, dtype: int64" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_unique[\"mime\"].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Download all the Powerpoint files and save some metadata about them\n", "\n", "In this step we'll work through the capture data and attempt to download each unique file. We'll also add in some extra metadata:\n", "\n", "* `first_capture` – the first time this file appears in the archive (ISO formatted date, YYYY-MM-DD)\n", "* `last_capture` – the last time this file appears in the archive (ISO formatted date, YYYY-MM-DD)\n", "* `current_status` – HTTP status code returned now by the original url\n", "\n", "The downloaded Powerpoint files will be stored in `/domains/[this domain]/powerpoints/original`. The script won't overwrite any existing files, so if it fails and you need to restart, it'll just pick up from where it left off." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "def format_timestamp_as_iso(timestamp):\n", " return arrow.get(timestamp, \"YYYYMMDDHHmmss\").format(\"YYYY-MM-DD\")\n", "\n", "\n", "def get_date_range(df, digest):\n", " \"\"\"\n", " Find the first and last captures identified by digest.\n", " Return the dates in ISO format.\n", " \"\"\"\n", " captures = df.loc[df[\"digest\"] == digest]\n", " from_date = format_timestamp_as_iso(captures[\"timestamp\"].min())\n", " to_date = format_timestamp_as_iso(captures[\"timestamp\"].max())\n", " return (from_date, to_date)\n", "\n", "\n", "def check_if_exists(url):\n", " \"\"\"\n", " Check to see if a file still exists on the web.\n", " Return the current status code.\n", " \"\"\"\n", " try:\n", " response = s.head(url, allow_redirects=True)\n", " # If there are any problems that don't generate codes\n", " except (\n", " requests.exceptions.ConnectionError,\n", " requests.exceptions.RetryError,\n", " MaxRetryError,\n", " ):\n", " return \"unreachable\"\n", " return response.status_code\n", "\n", "\n", "def save_files(df, add_current_status=True):\n", " \"\"\"\n", " Attempts to download each unique file.\n", " Adds first_capture, and last_capture, and file_path to the file metadata.\n", " Optionally checks to see if file is still accessible on the live web, adding current status code to metadata.\n", " \"\"\"\n", " metadata = df.drop_duplicates(subset=[\"digest\"], keep=\"first\").to_dict(\"records\")\n", " for row in tqdm(metadata):\n", " url = f'https://web.archive.org/web/{row[\"timestamp\"]}id_/{row[\"url\"]}'\n", " parsed = urlparse(row[\"url\"])\n", " suffix = Path(parsed.path).suffix\n", " # This should give a sortable and unique filename if there are multiple versions of a file\n", 
" file_name = f'{slugify(row[\"urlkey\"])}-{row[\"timestamp\"]}{suffix}'\n", " # print(filename)\n", " # Create the output directory for the downloaded files\n", " output_dir = Path(domain_dir, \"powerpoints\", \"original\")\n", " output_dir.mkdir(parents=True, exist_ok=True)\n", " file_path = Path(output_dir, file_name)\n", " # Don't re-get files we already got\n", " if not file_path.exists():\n", " response = s.get(url=url)\n", " file_path.write_bytes(response.content)\n", " time.sleep(1)\n", " # Get first and last cpature dates\n", " first, last = get_date_range(df, row[\"digest\"])\n", " row[\"first_capture\"] = first\n", " row[\"last_capture\"] = last\n", " # This can slow things down a bit, so make it optional\n", " if add_current_status:\n", " row[\"current_status\"] = check_if_exists(row[\"url\"])\n", " row[\"file_path\"] = str(file_path)\n", " row[\"memento\"] = f'https://web.archive.org/web/{row[\"timestamp\"]}/{row[\"url\"]}'\n", " return metadata" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "metadata = save_files(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's save the enriched data as a CSV file." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "df_md = pd.DataFrame(metadata)" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "tags": [ "nbval-skip" ] }, "outputs": [ { "data": { "text/html": [ "domains/defence-gov-au/defence-gov-au-powerpoints.csvView Datasette (Click on the stop button in the top menu bar to close the Datasette server)
'\n", " )\n", ")\n", "\n", "# Launch Datasette\n", "!datasette -- {str(db_file)} --port 8001 --config base_url:{proxy_url} --config truncate_cells_html:100 --plugins-dir plugins" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Viewing or downloading files\n", "\n", "If you're using Jupyter Lab, you can browse the results of this notebook by just looking inside the `domains` folder. I've also enabled the `jupyter-archive` extension which adds a download option to the right-click menu. Just right click on a folder and you'll see an option to 'Download as an Archive'. This will zip up and download the folder. Remember, however, that some of these folders will contain a LOT of data, so downloading everything might not always work.\n", "\n", "The cells below provide a couple of alternative ways of viewing and downloading the results." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "# Display all the files under the current domain folder (this could be a long list)\n", "display(FileLinks(str(domain_dir)))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "# Tar/gzip the current domain folder\n", "!tar -czf {str(domain_dir)}.tar.gz {str(domain_dir)}" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "# Display a link to the gzipped data\n", "# In JupyterLab you'll need to Shift+right-click on the link and choose 'Download link'\n", "display(FileLink(f\"{str(domain_dir)}.tar.gz\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "Created by [Tim Sherratt](https://timsherratt.org) for the [GLAM Workbench](https://glam-workbench.github.io). Support me by becoming a [GitHub sponsor](https://github.com/sponsors/wragge)!\n", "\n", "Work on this notebook was supported by the [IIPC Discretionary Funding Programme 2019-2020](http://netpreserve.org/projects/).\n", "\n", "The Web Archives section of the GLAM Workbench is sponsored by the [British Library](https://www.bl.uk/)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.12" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 4 }