{ "cells": [ { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "# Display the results of a harvest as a searchable database using Datasette\n", "\n", "You've harvested lots of newspaper articles from Trove using the [Newspaper Harvester](https://glam-workbench.github.io/trove-harvester/). But how do you examine the results? Before you fire up Excel (which is likely to weird things with you dates), give Datasette a try!\n", "\n", "[Datasette](https://github.com/simonw/datasette) is 'a tool for exploring and publishing data'. Give it a CSV file and it turns it into a fully-searchable database, running in your browser. It supports facets, full-text search, and, with a bit of tweaking, can even present images. Although Datasette is a command-line tool, we can run from within a Jupyter notebook, and open a new window to display the results. This notebook shows you how to load the newspaper data you've harvested into Datasette, and start it up. If you've also harvested full-text and images from the newspaper articles, you can add these to your database as well!\n", "\n", "![Screenshot of Datasette](images/datasette-row.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Import what we need" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "import json\n", "import os\n", "from pathlib import Path\n", "\n", "import pandas as pd\n", "import sqlite_utils\n", "from IPython.display import HTML, display\n", "from jupyter_server import serverapp" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load harvested data into Datasette\n", "\n", "By default, the cells below load the most recently completed newspaper harvest into Datasette. If you want to load a different harvest, simply supply the harvest's `timestamp` when you run `open_datasette()`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "def get_latest_harvest():\n", " \"\"\"\n", " Get the timestamp of the most recent harvest.\n", " \"\"\"\n", " harvests = sorted(\n", " [d for d in os.listdir(\"data\") if os.path.isdir(os.path.join(\"data\", d))]\n", " )\n", " return harvests[-1]\n", "\n", "\n", "def open_harvest_data(timestamp=None):\n", " \"\"\"\n", " Open the results of the specified harvest (most recent by default).\n", " Returns a list of records and a timestamp.\n", " \"\"\"\n", " if not timestamp:\n", " timestamp = get_latest_harvest()\n", " df = pd.read_csv(Path(\"data\", timestamp, \"results.csv\"))\n", " return df.to_dict(\"records\")\n", "\n", "\n", "def create_db(timestamp=None):\n", " \"\"\"\n", " Create a db named with the supplied timestamp (or the timestamp of the latest harvest).\n", " Load the CSV data from the harvest.\n", " \"\"\"\n", " if not timestamp:\n", " timestamp = get_latest_harvest()\n", " db_path = Path(\"data\", timestamp, \"results.db\")\n", " # Delete an existing db\n", " db_path.unlink(missing_ok=True)\n", " # Create a new db\n", " # Get the harvest data\n", " data = open_harvest_data(timestamp)\n", " # Create the db\n", " db = sqlite_utils.Database(db_path)\n", " # Load the data, specifying `article_id` as the primary key\n", " db[\"records\"].insert_all(data, pk=\"article_id\")\n", " return db_path\n", "\n", "\n", "def get_db(timestamp=None):\n", " if not timestamp:\n", " timestamp = get_latest_harvest()\n", " db_path = Path(\"data\", timestamp, \"results.db\")\n", " if not db_path.exists:\n", " db_path = create_db(timestamp)\n", " return db_path\n", "\n", "\n", "def get_proxy_url():\n", " # Get current running servers\n", " servers = serverapp.list_running_servers()\n", " base_url = next(servers)[\"base_url\"]\n", " \"\"\"\n", " try:\n", " # Get the current base url\n", " base_url = next(servers)[\"base_url\"]\n", " except StopIteration:\n", " # Binder uses notebook server\n", " servers = notebookapp.list_running_servers()\n", " base_url = next(servers)[\"base_url\"]\n", " \"\"\"\n", " # Create a base url for Datasette using the proxy path\n", " proxy_url = f\"{base_url}proxy/8001/\"\n", " return proxy_url\n", "\n", "\n", "def open_datasette(timestamp=None):\n", " \"\"\"\n", " This gets the base url of the currently running notebook. It then uses this url to\n", " construct a link to your Datasette instance, using jupyter-server-proxy.\n", " Finally it creates a button to open up a new tab to view your database.\n", " \"\"\"\n", " if not timestamp:\n", " timestamp = get_latest_harvest()\n", " db_path = str(get_db(timestamp)) # noqa: F841\n", " proxy_url = get_proxy_url()\n", " # Display a link to Datasette\n", " display(\n", " HTML(\n", " f'

View in Datasette (Click on the stop button in the top menu bar to close the Datasette server)

'\n", " )\n", " )\n", " # Launch Datasette\n", " !datasette -- {db_path} --port 8001 --config base_url:{proxy_url} --config truncate_cells_html:100 --setting facet_suggest_time_limit_ms 100 --setting facet_time_limit_ms 1000 --setting sql_time_limit_ms 5000 --metadata metadata.yml" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Add OCRd text and make it searchable\n", "\n", "The Trove newspaper harvester saves the OCRd text of each article into a separate text file. This is to make the harvest more manageable. But you can easily insert the text into Datasette and make it fully-searchable. You might remember that the text files are named using the article id, so we can just grab a file, extract the id, look up the corresponding record in our database, and add the text to the record. Then we just tell Datasette to add a full-text index so it can be easily searched." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "def add_text_to_db(timestamp=None):\n", " \"\"\"\n", " Add harvested text files to a db.\n", " \"\"\"\n", " if not timestamp:\n", " timestamp = get_latest_harvest()\n", " db_path = get_db(timestamp)\n", " db = sqlite_utils.Database(db_path)\n", " # Add a text column to the db\n", " db[\"records\"].add_column(\"fulltext\", str)\n", " for row in db[\"records\"].rows:\n", " db[\"records\"].update(\n", " row[\"article_id\"],\n", " {\"fulltext\": Path(\"data\", timestamp, row[\"text\"]).read_text()},\n", " )\n", " # Make the text column full text searchable\n", " db[\"records\"].enable_fts([\"fulltext\"])\n", " db[\"records\"].optimize()" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "## Add image links to database\n", "\n", "Our aim here is to display thumbnails of the article images alongside the article metadata. To do this we're making use of two Datasette plugins: [datasette-media](https://github.com/simonw/datasette-media) to serve the images, and [datasette-json-html](https://github.com/simonw/datasette-json-html) to insert image metadata into our database that will be automatically rendered as a HTML `img` tag. You just install the plugins via `pip`, and Datasette will automatically use them. As with the text files, the images file names include the article id, so we'll use it to link each image to its record. We'll even add a link on each thumbnail to open up the full-sized image.\n", "\n", "Note that the configuration for the `datasette-media` plugin is contained in the `metadata.yml` file in this repository." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "def add_images_to_db(timestamp=None):\n", " \"\"\"\n", " Add harvested images to a db.\n", " \"\"\"\n", " if not timestamp:\n", " timestamp = get_latest_harvest()\n", " db_path = get_db(timestamp)\n", " db = sqlite_utils.Database(db_path)\n", " db[\"records\"].add_column(\"image_preview\", str)\n", " db[\"records\"].add_column(\"image_path\", str)\n", " proxy_url = get_proxy_url()\n", "\n", " for row in db[\"records\"].rows:\n", " full_path = Path(\"data\", timestamp, row[\"images\"].split(\"|\")[0])\n", " # Add the image file path\n", " db[\"records\"].update(row[\"article_id\"], {\"image_path\": str(full_path)})\n", " # Add some JSON with the url to the image (via the media server and Jupyter proxy)\n", " db[\"records\"].update(\n", " row[\"article_id\"],\n", " {\n", " \"image_preview\": json.dumps(\n", " {\n", " \"img_src\": f\"{proxy_url}-/media/thumbnail/{row['article_id']}\",\n", " \"href\": f\"{proxy_url}-/media/large/{row['article_id']}\",\n", " }\n", " )\n", " },\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Open Datasette\n", "\n", "Run the cell below to start up Datasette. When you see the message saying 'Uvicorn running...', click on the blue button to open Datasette in a new tab." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "# Open Datasette\n", "create_db()\n", "add_text_to_db()\n", "add_images_to_db()\n", "open_datasette()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Now what?" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "Datasette provides a number of other plugins you might use to explore or visualise your data. If you'd like to make your database public, look at [Share your CSVs online using Datasette and Glitch](https://101dhhacks.net/share-searchable-csvs/)." ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "----\n", "\n", "Created by [Tim Sherratt](https://timsherratt.org) ([@wragge](https://twitter.com/wragge)) for the [GLAM Workbench project](https://github.com/glam-workbench/). Support this project by [becoming a GitHub sponsor](https://github.com/sponsors/wragge?o=esb).\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" }, "rocrate": { "author": [ { "name": "Sherratt, Tim", "orcid": "https://orcid.org/0000-0001-7956-4498" } ], "name": "Display the results of a harvest as a searchable database using Datasette" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 4 }