{ "cells": [ { "cell_type": "markdown", "id": "7a17409d-6978-4940-94d4-38583fa343c2", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "# Create a searchable database from issues of a NED periodical\n", "\n", "Trove contains thousands of publications submitted through the [National eDeposit Service](https://ned.gov.au/ned/) (NED). When I last checked, this included 8,572 periodicals comprising a total of 179,510 issues! Amongst the periodicals are many local and community newsletters that provide a valuable record of everyday life – often filling the gap left by the demise of local newspapers. Some of these periodicals have access constraints, but most can be viewed online in Trove. However, unlike Trove's own digitised periodicals or newspapers, the contents of these publications don't appear in Trove search results. If the NED publications are born-digital PDFs containing a text layer, the content of each issue can be searched individually using the built-in PDF viewer. But there's no way of searching for terms across every issue of a NED periodical in Trove. \n", "\n", "This greatly limits the potential research uses of NED periodicals. This notebook helps to open the content of these periodicals by creating a workflow with the following steps:\n", "\n", "- download the PDFs of every issue in a NED periodical\n", "- extract the text for each page in the PDFs\n", "- create an SQLite database containing metadata and text for each page\n", "- build a full-text index on the text content to allow easy searching\n", "- explore the database using Datasette" ] }, { "cell_type": "code", "execution_count": 18, "id": "6476a9e3-0954-473a-b58d-e52db0eef1d7", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "# Let's import the libraries we need.\n", "import json\n", "import re\n", "import shutil\n", "import time\n", "from io import BytesIO\n", "from pathlib import Path\n", "from zipfile import ZipFile\n", "\n", "import arrow\n", "import fitz\n", "import pandas as pd\n", "import requests\n", "from bs4 import BeautifulSoup\n", "from IPython.display import HTML, display\n", "from jupyter_server import serverapp\n", "from requests.adapters import HTTPAdapter\n", "from requests.packages.urllib3.util.retry import Retry\n", "from slugify import slugify\n", "from sqlite_utils import Database\n", "\n", "s = requests.Session()\n", "retries = Retry(total=5, backoff_factor=1, status_forcelist=[502, 503, 504])\n", "s.mount(\"https://\", HTTPAdapter(max_retries=retries))\n", "s.mount(\"http://\", HTTPAdapter(max_retries=retries))" ] }, { "cell_type": "markdown", "id": "0d065707-17e6-4fae-a3a1-cb7815176ad8", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "## Harvest title and issue metadata\n", "\n", "The first step is to harvest information about all the issues of a periodical. The Trove API doesn't provide this data, so you have to scrape it from the web interface. There are two stages:\n", "\n", "- scrape a list of issues from the periodicals 'Browse this collection' pages\n", "- supplement the individual metadata for each issue by scraping data from the Trove digital resource viewer\n", "\n", "These methods are documented in the *Trove Data Guide*. 
"\n", "- [Get a list of items from a digitised collection](https://tdg.glam-workbench.net/other-digitised-resources/how-to/get-collection-items.html)\n", "- [Extract additional metadata from the digitised resource viewer](https://tdg.glam-workbench.net/other-digitised-resources/how-to/extract-embedded-metadata.html)\n", "\n", "To get started, you need the `nla.obj` identifier of the publication. This is in the url of the periodical's 'collection' page in the Trove digital resource viewer. For example, here's the collection page of [The Triangle](https://nla.gov.au/nla.obj-3121636851). The url of this page is: `https://nla.gov.au/nla.obj-3121636851`, so the identifier is `nla.obj-3121636851`.\n", "\n", "You can set the identifier in the cell below." ] }, { "cell_type": "code", "execution_count": 23, "id": "7cf0070b-b1ff-4698-905b-71616548e7de", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "title_id = \"nla.obj-3121636851\"" ] }, { "cell_type": "code", "execution_count": 24, "id": "77ba2e12-9fae-4f74-ba42-e92eb2fc2e69", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "def get_metadata(id):\n", "    \"\"\"\n", "    Extract work data in a JSON string from the work's HTML page.\n", "    \"\"\"\n", "    if not id.startswith(\"http\"):\n", "        id = \"https://nla.gov.au/\" + id\n", "    response = s.get(id)\n", "    try:\n", "        work_data = re.search(\n", "            r\"var work = JSON\\.parse\\(JSON\\.stringify\\((\\{.*\\})\", response.text\n", "        ).group(1)\n", "    except AttributeError:\n", "        work_data = \"{}\"\n", "    return json.loads(work_data)\n", "\n", "\n", "def get_iso_date(date):\n", "    \"\"\"\n", "    Convert the issue date to a year value for use in facets.\n", "    \"\"\"\n", "    if date:\n", "        iso_date = arrow.get(date, \"ddd, D MMM YYYY\").format(\"YYYY\")\n", "    else:\n", "        # So we can use this field in facets\n", "        iso_date = \"0\"\n", "    return iso_date\n", "\n", "\n", "def get_issues(parent_id):\n", "    \"\"\"\n", "    Get the ids of issues that are children of the current record.\n", "    \"\"\"\n", "    start_url = \"https://nla.gov.au/{}/browse?startIdx={}&rows=20&op=c\"\n", "    # The initial startIdx value\n", "    start = 0\n", "    # Number of results per page\n", "    n = 20\n", "    parts = []\n", "    # If there aren't 20 results on the page then we've reached the end, so continue harvesting until that happens.\n", "    while n == 20:\n", "        # Get the browse page\n", "        response = s.get(start_url.format(parent_id, start))\n", "        # Beautifulsoup turns the HTML into an easily navigable structure\n", "        soup = BeautifulSoup(response.text, \"lxml\")\n", "        # Find all the divs containing issue details and loop through them\n", "        details = soup.find_all(class_=\"l-item-info\")\n", "        for detail in details:\n", "            title = detail.find(\"h3\")\n", "            if title:\n", "                issue_id = title.parent[\"href\"].strip(\"/\")\n", "            else:\n", "                issue_id = detail.find(\"a\")[\"href\"].strip(\"/\")\n", "            # Add the issue id to the list\n", "            parts.append(issue_id)\n", "        # Increment the startIdx\n", "        start += n\n", "        # Set n to the number of results on the current page\n", "        n = len(details)\n", "    return parts\n", "\n", "\n", "def harvest_issues(title_id):\n", "    \"\"\"\n", "    Save metadata for every issue of the periodical to an ndjson file.\n", "    \"\"\"\n", "    data_dir = Path(\"downloads\", title_id)\n", "    data_dir.mkdir(exist_ok=True, parents=True)\n", "    issues = get_issues(title_id)\n", "    with Path(data_dir, f\"{title_id}-issues.ndjson\").open(\"w\") as ndjson_file:\n", "        for issue_id in issues:\n", "            metadata = get_metadata(issue_id)\n", "            try:\n", "                issue = {\n", "                    \"id\": metadata[\"pid\"],\n", "                    \"title_id\": title_id,\n", "                    \"title\": metadata[\"title\"],\n",
metadata[\"title\"],\n", " \"description\": metadata.get(\"subUnitNo\", \"\"),\n", " \"date\": get_iso_date(metadata.get(\"issueDate\", None)),\n", " \"url\": f\"https://nla.gov.au/{metadata['pid']}\",\n", " \"ebook_type\": metadata.get(\"ebookType\", \"\"),\n", " \"access_conditions\": metadata.get(\"accessConditions\", \"\"),\n", " \"copyright_policy\": metadata.get(\"copyrightPolicy\", \"\"),\n", " }\n", " except KeyError:\n", " print(title_id)\n", " else:\n", " ndjson_file.write(f\"{json.dumps(issue)}\\n\")" ] }, { "cell_type": "code", "execution_count": 25, "id": "5078ef9f-ea60-4a96-b7d9-784e79e11677", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "harvest_issues(title_id)" ] }, { "cell_type": "markdown", "id": "4dc7208b-9f97-44fe-8d0f-e7a6ce5083db", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "The harvesting process creates an `ndjson` data file in the `downloads/[title id]` directory. This file includes details of every issue." ] }, { "cell_type": "markdown", "id": "42565db0-6319-471b-95bb-2e84d8968203", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "## Download the PDFs\n", "\n", "The metadata harvested above includes the `nla.obj` identifiers for every issue. You can use these identifiers to download the PDFs.\n", "\n", "The download method is documented in the Trove Data Guide's [HOW TO Get text, images, and PDFs using Trove’s download link](https://tdg.glam-workbench.net/other-digitised-resources/how-to/get-downloads.html). However, the parameters used for NED publications are a bit different to those used with digitised resources. Instead of `pdf` you need to set the `downloadOption` to `eBook`. The `firstPage` and `lastPage` parameters are both set to `-1`. Using these settings you can download a zip file that contains the original PDF file.\n", "\n", "The PDFs are saved in the `downloads/[title id]/pdfs` directory." 
] }, { "cell_type": "code", "execution_count": 26, "id": "ab3e248c-bd56-4c97-9012-6e1e134e2ded", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "def download_pdfs(title_id):\n", " data_dir = Path(\"downloads\", title_id)\n", " pdf_dir = Path(data_dir, \"pdfs\")\n", " pdf_dir.mkdir(exist_ok=True, parents=True)\n", " for id in pd.read_json(Path(data_dir, f\"{title_id}-issues.ndjson\"), lines=True)[\n", " \"id\"\n", " ].to_list():\n", " pdf_file = f\"{id}-1.pdf\"\n", " pdf_path = Path(pdf_dir, pdf_file)\n", " if not pdf_path.exists():\n", " # This url downloads a zip file containing the PDF,\n", " # note that downloadOption needs to be set to eBook\n", " download_url = f\"https://nla.gov.au/{id}/download?downloadOption=eBook&firstPage=-1&lastPage=-1\"\n", " response = s.get(download_url)\n", " zipped = ZipFile(BytesIO(response.content))\n", " zipped.extract(pdf_file, path=pdf_dir)\n", " time.sleep(2)" ] }, { "cell_type": "code", "execution_count": 27, "id": "afb48c11-4494-4f7c-93de-cf61fea085b0", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "download_pdfs(title_id)" ] }, { "cell_type": "markdown", "id": "7d033417-2d41-4ae4-aac5-39035ffbe972", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "## Extract text from the PDFs\n", "\n", "Once you have the PDFs you can use [PyMuPDF](https://github.com/pymupdf/pymupdf) to loop through each page, extracting the text content, and saving it to a separate file.\n", "\n", "The text files are saved in the `downloads/[title id]/text` directory." ] }, { "cell_type": "code", "execution_count": 28, "id": "88536d4f-632c-4f12-8fff-f4bbe084c7d2", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "def clean_text(text):\n", " \"\"\"\n", " Remove linebreaks and extra whitespace from text.\n", " \"\"\"\n", " text = text.replace(\"\\n\", \" \")\n", " text = re.sub(r\"\\s+\", \" \", text)\n", " return text.encode()\n", "\n", "\n", "def extract_text(title_id):\n", " \"\"\"\n", " Extract text from each page of the PDFs and save as separate text files.\n", " \"\"\"\n", " data_dir = Path(\"downloads\", title_id)\n", " pdf_dir = Path(data_dir, \"pdfs\")\n", " text_dir = Path(data_dir, \"text\")\n", " text_dir.mkdir(exist_ok=True, parents=True)\n", " for pdf_file in Path(pdf_dir).glob(\"*.pdf\"):\n", " pid = pdf_file.stem[:-2]\n", " doc = fitz.open(pdf_file)\n", " for i, page in enumerate(doc):\n", " text_path = Path(text_dir, f\"{pid}-p{i+1}.txt\")\n", " text = page.get_text()\n", " Path(text_path).write_bytes(clean_text(text))" ] }, { "cell_type": "code", "execution_count": 29, "id": "cb7952d0-3df1-4eb6-b3b8-97207cdfbbc8", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "extract_text(title_id)" ] }, { "cell_type": "markdown", "id": "7393a948-498d-49a8-a956-5fed9915f48e", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "## Load text and metadata into a SQLite database\n", "\n", "Finally, you can use [sqlite-utils](https://sqlite-utils.datasette.io/en/stable/) to create a SQLite database and load the metadata and text from each page. 
"The code below also creates a `metadata.json` file that can be used to configure Datasette to help you explore the data.\n", "\n", "The database and `metadata.json` files are saved in the `downloads/[title id]/datasette` directory." ] }, { "cell_type": "code", "execution_count": 30, "id": "c3e8bc2c-80ac-4442-9d4b-97a21d519cc6", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "def save_metadata(title_id, df, db_dir):\n", "    \"\"\"\n", "    Create a metadata.json file for use with Datasette.\n", "    \"\"\"\n", "    title = df[\"title\"].iloc[0]\n", "    description = f\"{df.shape[0]} issues from {df['date'].min()} to {df['date'].max()}\"\n", "    sql = \"SELECT issue_id, issue_details, issue_date, page_number, snippet(pages_fts, -1, '<span class=\\\"has-text-warning\\\">', '</span>', '...', 50) AS snippet, bm25(pages_fts) AS rank FROM pages JOIN pages_fts ON pages.rowid=pages_fts.rowid WHERE pages_fts match :query AND issue_date >= :start_date AND issue_date <= :end_date ORDER BY rank\"\n", "    metadata = {\n", "        \"title\": \"Search NED periodicals\",\n", "        \"databases\": {\n", "            slugify(title): {\n", "                \"title\": title.strip(\".\"),\n", "                \"description\": description,\n", "                \"source_url\": f\"https://nla.gov.au/{title_id}\",\n", "                \"queries\": {\n", "                    \"search\": {\n", "                        \"sql\": sql,\n", "                        \"hide_sql\": True,\n", "                        \"searchmode\": \"raw\",\n", "                        \"title\": f\"Search {title.strip('.')}\",\n", "                        \"description\": description,\n", "                        \"source_url\": f\"https://nla.gov.au/{title_id}\",\n", "                    }\n", "                },\n", "            }\n", "        },\n", "    }\n", "    Path(db_dir, \"metadata.json\").write_text(json.dumps(metadata, indent=4))\n", "    return title\n", "\n", "\n", "def create_db(title_id):\n", "    \"\"\"\n", "    Load the metadata and text for each page into a SQLite database and index the text.\n", "    \"\"\"\n", "    data_dir = Path(\"downloads\", title_id)\n", "    db_dir = Path(data_dir, \"datasette\")\n", "    db_dir.mkdir(exist_ok=True)\n", "    df = pd.read_json(Path(data_dir, f\"{title_id}-issues.ndjson\"), lines=True)\n", "    title = save_metadata(title_id, df, db_dir)\n", "    text_dir = Path(data_dir, \"text\")\n", "    db = Database(Path(db_dir, f\"{slugify(title)}.db\"), recreate=True)\n", "    for row in df.itertuples():\n", "        for page in sorted(Path(text_dir).glob(f\"{row.id}-p*.txt\")):\n", "            page_num = re.search(r\"-p(\\d+).txt\", page.name).group(1)\n", "            data = {\n", "                \"issue_id\": row.id,\n", "                \"issue_details\": row.description,\n", "                \"issue_date\": row.date,\n", "                \"page_number\": page_num,\n", "                \"text\": page.read_text(),\n", "            }\n", "            db[\"pages\"].insert(data, pk=[\"issue_id\", \"page_number\"])\n", "    # Index the text\n", "    db[\"pages\"].enable_fts([\"text\"])" ] }, { "cell_type": "code", "execution_count": 31, "id": "a52900be-0fb7-4b1c-9008-7e1bc77014b0", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "create_db(title_id)" ] }, { "cell_type": "markdown", "id": "619dcdf9-6be9-4506-bd07-f42ab5fba73b", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "## Exploring the SQLite database\n", "\n", "You can open the database created above with any SQLite client, but I think the easiest option for data exploration is Datasette.\n", "\n", "The `metadata.json` file created above defines a 'canned query' that generates a custom search page within Datasette. This page works best with a custom template I've developed that adds a few enhancements such as date facets and result highlighting.\n", "\n",
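"Before starting up Datasette, you can also check the contents of the database directly from Python. The snippet below is just an illustration, not part of the workflow: it assumes you've built the example database for *The Triangle* above (adjust the path to match your own title) and searches the full-text index for an arbitrary sample term.\n", "\n", "```python\n", "from sqlite_utils import Database\n", "\n", "# Open the database built by create_db() above (this path assumes The Triangle example)\n", "db = Database(\"downloads/nla.obj-3121636851/datasette/the-triangle.db\")\n", "\n", "# Query the full-text index on the page text\n", "for row in db[\"pages\"].search(\"school\"):\n", "    print(row[\"issue_date\"], row[\"issue_details\"], row[\"page_number\"])\n", "```\n", "\n",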
"To start up Datasette with this custom search page, there are two options.\n", "\n", "### Run Datasette within the current environment\n", "\n", "Datasette is already installed within the current Jupyter Lab environment, but starting it up requires a bit of fiddling with proxies and urls. The cell below handles all of that, and generates a big, blue 'View in Datasette' button to click. It also downloads the custom template to your data directory." ] }, { "cell_type": "code", "execution_count": 37, "id": "387d782b-8974-4b0a-80d8-28664d78cbc4", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "def download_template(title_id):\n", "    db_dir = Path(\"downloads\", title_id, \"datasette\")\n", "    if not Path(db_dir, \"templates\").exists():\n", "        response = requests.get(\n", "            \"https://github.com/GLAM-Workbench/datasette-lite-search/raw/refs/heads/main/templates.zip\"\n", "        )\n", "        zipped = ZipFile(BytesIO(response.content))\n", "        zipped.extractall(path=db_dir)\n", "\n", "\n", "def get_proxy_url():\n", "    # Get current running servers\n", "    servers = serverapp.list_running_servers()\n", "    base_url = next(servers)[\"base_url\"]\n", "    # Create a base url for Datasette using the proxy path\n", "    proxy_url = f\"{base_url}proxy/8001/\"\n", "    return proxy_url\n", "\n", "\n", "def open_datasette(timestamp=None):\n", "    \"\"\"\n", "    This gets the base url of the currently running notebook. It then uses this url to\n", "    construct a link to your Datasette instance. Finally it creates a button to open up a new tab to view your database.\n", "    \"\"\"\n", "    download_template(title_id)\n", "    data_dir = Path(\"downloads\", title_id)\n", "    db_dir = Path(data_dir, \"datasette\")\n", "    db_path = next(Path(db_dir).glob(\"*.db\"))\n", "    proxy_url = get_proxy_url()\n", "    # Display a link to Datasette\n", "    display(\n", "        HTML(\n", "            f'<p><a style=\"width: 200px; display: block; border: 1px solid #307fc1; background-color: #1976d2; color: #ffffff; padding: 10px; text-align: center; font-weight: bold;\" href=\"{proxy_url}\">View in Datasette</a> (Click on the stop button in the top menu bar to close the Datasette server)</p>'\n", "        )\n", "    )\n", "    # Launch Datasette\n", "\n", "\n", "    !datasette {db_path} --port 8001 --setting base_url {proxy_url} --template-dir {db_dir}/templates --metadata {db_dir}/metadata.json --config truncate_cells_html:100" ] }, { "cell_type": "code", "execution_count": null, "id": "e13539f1-47b3-407f-9eaa-c59a40c7c4fd", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "open_datasette()" ] }, { "cell_type": "markdown", "id": "f0459b31-2750-41ca-a26e-ee9db0810591", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "### Use Datasette-Lite\n", "\n", "If you don't want to install new software or run Jupyter Lab to explore your data, you can use Datasette-Lite. Datasette-Lite is a version of Datasette that doesn't require a special server. You just point your browser at a static web page and Datasette is installed and run within your browser.\n", "\n", "I've created a [custom Datasette-Lite repository](https://github.com/GLAM-Workbench/datasette-lite-search) with everything you need, but first you have to save your database and `metadata.json` file somewhere on the web, so that Datasette-Lite can access them. I'd suggest creating a GitHub repository and uploading them there:\n",
"\n", "- First download the files from Jupyter Lab. As noted above, the database and `metadata.json` files are saved in the `downloads/[title id]/datasette` directory. Use Jupyter Lab's file browser to find the files, then right-click on them and select 'Download'.\n", "- If you don’t have one already, you’ll need to [create a GitHub user account](https://docs.github.com/en/get-started/start-your-journey/creating-an-account-on-github) – the standard free, personal account is fine.\n", "- Once you're logged into GitHub, [create a new repository](https://docs.github.com/en/repositories/creating-and-managing-repositories/creating-a-new-repository). Make sure that you set the 'visibility' to public.\n", "- Go to the new repository and [upload](https://docs.github.com/en/repositories/working-with-files/managing-files/adding-a-file-to-a-repository) the database and `metadata.json` files.\n", "\n", "Once your files are publicly available on the web, you can construct a url to load them in Datasette-Lite:\n", "\n", "- The base url for the Datasette-Lite page is `https://glam-workbench.net/datasette-lite-search/`.\n", "- In your new GitHub repository, right-click on the database file name and select 'copy link'. Add `?url=[url of database]` to the base url.\n", "- Now, right-click on the metadata file name and select 'copy link'. Add `&metadata=[url of metadata]` to the url.\n", "- That's it! Copy the finished url and load it in your browser.\n", "\n", "Here's an example using [The Triangle](https://nla.gov.au/nla.obj-3121636851):\n", "\n", "- The files are saved in [this repository](https://github.com/GLAM-Workbench/trove-ned-periodicals).\n", "- The url of the database file is: `https://github.com/GLAM-Workbench/trove-ned-periodicals/blob/main/dbs/the-triangle/the-triangle.db`\n", "- The url of the metadata file is: `https://github.com/GLAM-Workbench/trove-ned-periodicals/blob/main/dbs/the-triangle/metadata.json`\n", "- Put it all together and you get: <https://glam-workbench.net/datasette-lite-search/?url=https://github.com/GLAM-Workbench/trove-ned-periodicals/blob/main/dbs/the-triangle/the-triangle.db&metadata=https://github.com/GLAM-Workbench/trove-ned-periodicals/blob/main/dbs/the-triangle/metadata.json>. Click on this link to search *The Triangle* using Datasette-Lite.\n" ] }, { "cell_type": "code", "execution_count": 19, "id": "81584161-fe32-4a8c-91ed-9700bb4a92c9", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "# IGNORE THIS CELL -- FOR TESTING ONLY\n", "\n", "title_id = \"nla.obj-971998493\"\n", "harvest_issues(title_id)\n", "download_pdfs(title_id)\n", "extract_text(title_id)\n", "create_db(title_id)\n", "db_dir = Path(\"downloads\", title_id, \"datasette\")\n", "db_path = next(Path(db_dir).glob(\"*.db\"))\n", "db = Database(db_path)\n", "shutil.rmtree(Path(\"downloads\", title_id))" ] }, { "cell_type": "markdown", "id": "7a67270b-7389-4e6e-919c-cf98b69a75fa", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "----\n", "\n", "Created by [Tim Sherratt](https://timsherratt.org/) for the [GLAM Workbench](https://glam-workbench.github.io/)."
] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" }, "rocrate": { "action": [ { "description": "Examples of SQLite databases created by extracting and indexing the text content of NED periodicals. Includes custom metadata files for Datasette configuration.", "isPartOf": "https://github.com/GLAM-Workbench/trove-ned-periodicals", "local_path": "downloads/nla.obj-3121636851/datasette", "mainEntityOfPage": "https://glam-workbench.net/trove-ned/the-triangle/", "name": "Searchable database of content from The Triangle", "query": "nla.obj-3121636851", "result": [ { "url": "https://github.com/GLAM-Workbench/trove-ned-periodicals/blob/main/dbs/the-triangle/the-triangle.db" }, { "url": "https://github.com/GLAM-Workbench/trove-ned-periodicals/blob/main/dbs/the-triangle/metadata.json" } ], "workExample": [ { "name": "Explore using Datasette", "url": "https://glam-workbench.net/datasette-lite-search/?url=https%3A%2F%2Fgithub.com%2FGLAM-Workbench%2Ftrove-ned-periodicals%2Fblob%2Fmain%2Fdbs%2Fthe-triangle%2Fthe-triangle.db&install=datasette-template-sql&metadata=https://github.com/GLAM-Workbench/trove-ned-periodicals/blob/main/dbs/the-triangle/metadata.json" } ] } ], "author": [ { "mainEntityOfPage": "https://timsherratt.au", "name": "Sherratt, Tim", "orcid": "https://orcid.org/0000-0001-7956-4498" } ], "description": "Trove contains thousands of publications submitted through the National eDeposit Service (NED). When I last checked, this included 8,572 periodicals comprising a total of 179,510 issues! Amongst the periodicals are many local and community newsletters that provide a valuable record of everyday life – often filling the gap left by the demise of local newspapers. Some of these periodicals have access constraints, but most can be viewed online in Trove. However, unlike Trove's own digitised periodicals or newspapers, the contents of these publications don't appear in Trove search results. This notebook provides a workflow through which you can extract text from all the issues of a NED publication and build a fulltext-search-enabled database for exploration of its contents.", "mainEntityOfPage": "https://glam-workbench.net/trove-ned/create-searchable-database/", "name": "Create a searchable database from issues of a NED periodical" } }, "nbformat": 4, "nbformat_minor": 5 }