{ "cells": [ { "cell_type": "markdown", "id": "3ca11da0", "metadata": {}, "source": [ "# Creating a 19th c. Notes & Queries Index Database\n", "\n", "Although putting textual content into a database allows us to create full text search tools over that content, a lot of work and effort went into creating the original indexes. So can we scrape the text data from the indexes and generate add the index data to a database `original_index` table to create a comprehensive searchable index?\n", "\n", "To start with, what columns might such a table need? Let's review an example index issue of *Notes & Queries*. We can get the ID for such a page by querying the metadata database table:" ] }, { "cell_type": "code", "execution_count": 1, "id": "7c91266a", "metadata": {}, "outputs": [], "source": [ "from sqlite_utils import Database\n", "\n", "db_name = \"nq_demo.db\"\n", "db = Database(db_name)" ] }, { "cell_type": "code", "execution_count": 2, "id": "e20c692e", "metadata": {}, "outputs": [], "source": [ "from pathlib import Path\n", "\n", "# And the default download dir file path\n", "dirname = 'ia-downloads'\n", "\n", "p = Path(dirname)" ] }, { "cell_type": "code", "execution_count": 3, "id": "95069bf1", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'sim_notes-and-queries_1849-1850_1_index'" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from pandas import read_sql\n", "q = \"SELECT id FROM metadata WHERE is_index=1 AND date LIKE '1849%' LIMIT 1\"\n", "\n", "sample_index_id = read_sql(q, db.conn)[\"id\"].iloc[0]\n", "sample_index_id" ] }, { "cell_type": "markdown", "id": "34c7a1bc", "metadata": {}, "source": [ "*Really we should ensure we have downloaded a copy of that index document, although as we assume here, it should be in the download cache already from when we created the monolithic index PDF.*" ] }, { "cell_type": "markdown", "id": "4bb28067", "metadata": {}, "source": [ "## Cleaning the Text\n", "\n", "If we preview the PDF of an index issue, we see it has a regular two column structure. We can also see structure in the way that the index terms, and the subsidiary index terms, are organised:" ] }, { "cell_type": "code", "execution_count": 4, "id": "f463f587", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from IPython.display import IFrame\n", "\n", "IFrame( (p / sample_index_id / f'{sample_index_id}.pdf').as_posix(), width=600, height=500)" ] }, { "cell_type": "markdown", "id": "fc95841e", "metadata": {}, "source": [ "Let's also have a look at some of the raw search text for an index issue:" ] }, { "cell_type": "code", "execution_count": 5, "id": "8a1cd232", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " \n", " \n", " \n", " \n", "A.\n", "Apsey of St. Wandrille, 382. 486.\n", "Abdication of James I1., 39. 489.\n", "Aberdeen, Burnet prize at, 91.\n", "Aboriginal chambers near Tilbury, 462.\n", "A. (B.) on emancipation of the Jews, 475.\n", "Accuracy of references, 170.\n", "Addison's books, 212.\n", "Adolphus on a recent novel, 231.\n", "Advent bells, 121.\n", "Adversaria, 73. 86.\n", "Elfric’s colloquy, 168. 197, 232. 248. 278.\n", "Elian, translation of, 267. 284.\n", "A. (F. RB.) on Sterne’s Koran, 418.\n", "—— on a passage in Goldsmith, 83.\n", "— Queen of Hearts, 320.\n", "Agricola(C.), Propugnaculum anti-Pistori- anum, 203.\n", "A. (J. D.) 
on swords worn in public, 415.\n", "Alban’s (St.) Day, 399.\n", "—._, law courts at, 306.\n", "Albert (Le Petit), 474.\n", "Alchemy, metrical writings on, 60.\n", "Ale »xandria a (Ptolemy of), 142. 170.\n", "Alfred’s (King) Geography of Europe, 257. 313.\n", "— works, 93.\n", "Alicui on Bec ket’s grace-cup, 143.\n", "—— on Bishop Barnaby, 132.\n", "All Angels and St. Michael’s, feast of, 235.\n", "* All to-broke,” 490.\n", "Allusion in Friar Brackley’s sermon, 35).\n", "Almanack (Poor Robia’s), 470.\n", "Alms-basins\n" ] } ], "source": [ "from ia_utils.download_and_extract_text import download_and_extract_text\n", "\n", "simple_index_text = download_and_extract_text(sample_index_id)\n", "\n", "print(simple_index_text[:1000])" ] }, { "cell_type": "code", "execution_count": 6, "id": "8922ccf1", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "A. \n", "\n", "\n", "Apsey of St. Wandrille, 382. 486. \n", "\n", "Abdication of James I1., 39. 489. \n", "\n", "Aberdeen, Burnet prize at, 91. \n", "\n", "Aboriginal chambers near Tilbury, 462. \n", "\n", "A. (B.) on emancipation of the Jews, 475. \n", "\n", "Accuracy of references, 170. \n", "\n", "Addison's books, 212. \n", "\n", "Adolphus on a recent novel, 231. \n", "\n", "Advent bells, 121. \n", "\n", "Adversaria, 73. 86. \n", "\n", "Elfric’s colloquy, 168. 197, 232. 248. 278. \n", "\n", "Elian, translation of, 267. 284. \n", "\n", "A. (F. RB.) on Sterne’s Koran, 418. \n", "\n", "—— on a passage in Goldsmith, 83. \n", "\n", "— Queen of Hearts, 320. \n", "\n", "Agricola(C.), Propugnaculum anti-Pistori- \n", "anum, 203. \n", "\n", "A. (J. D.) on swords worn in public, 415. \n", "\n", "Alban’s (St.) Day, 399. \n", "\n", "—._, law courts at, 306. \n", "\n", "Albert (Le Petit), 474. \n", "\n", "Alchemy, metrical writings on, 60. \n", "\n", "Ale »xandria a (Ptolemy of), 142. 170. \n", "\n", "Alfred’s (King) Geography of Europe, 257. \n", "313. \n", "\n", "\n", "— works, 93. \n", "\n", "Alicui on Bec ket’s grace-cup, 143. \n", "\n", "—— on Bishop Barnaby, 132. \n", "\n", "All Angels and St. Michael’s, feast of, 235. \n", "\n", "* All to-broke,” 490. \n", "\n", "Allusion in Friar\n" ] } ], "source": [ "sample_index_text = download_and_extract_text(sample_index_id, typ=\"djvutxt\")\n", "\n", "print(sample_index_text[:1000])" ] }, { "cell_type": "code", "execution_count": 7, "id": "06be6a4f", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'sim_notes-and-queries_1849-1850_1_index'" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample_index_id" ] }, { "cell_type": "markdown", "id": "9cb605e7", "metadata": {}, "source": [ "Inspecting some of the documents shows that there is no guarantee that the search text correctly represents index items on a new line, although in certain documents it appears as if line breaks after each entry are provided (as in the original scanned image).\n", "\n", "There are also \"sub-elements\" on separate lines that relate to a major heading that we really need to \"fill down\" on, although there may be no indication in the text (e.g. no series of dashes or tab characters) to indicate the subsidiary nature of a reference. (Note that there may be further clues in the original XML, for example, from the location of the text.) However, subsidiary entries do often appear to start with a lower case letter, so let's use that as a heuristic: *if the line starts with a lower case letter, it's a subsidiary entry*.\n", "\n", "
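As a quick sanity check of that heuristic, here is a minimal sketch (the sample lines are taken from the raw text above):\n", "\n", "```python\n", "lines = [\n", "    \"Advent bells, 121.\",\n", "    \"on a passage in Goldsmith, 83.\",\n", "]\n", "\n", "# classify a line as subsidiary if its first character is lower case\n", "for line in lines:\n", "    print(line, '->', 'subsidiary' if line[:1].islower() else 'primary')\n", "```\n", "\n", "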
More detailed inspection of the index search text also suggests that in some cases `-` separator characters may appear in the search text.\n", "\n", "To create a complete index, one possible approach is to:\n", "\n", "- normalise a single entry and all its subsidiary entries onto a single line;\n", "- parse a single entry and all its subsidiary entries into appropriate database records.\n", "\n", "Rather than consider the XML and all the additional processing that incurs, let's try to \"repair\" the document as best we can. Another thing we *could* try to exploit is the alphabetical order of entries, but let's leave that as an open question and only return to it if we find issues occurring that alphabetisation might help us address.\n", "\n", "So let's start by repairing the text and normalising the lines before considering how to parse the entries." ] }, { "cell_type": "markdown", "id": "03f00b5b", "metadata": {}, "source": [ "### Reinserting Line Breaks\n", "\n", "If we can identify where line breaks are likely to be missing, we should be able to reinsert them.\n", "\n", "By inspection of the raw search text, it seems that we have a page number (digits), a space character, and then, typically, the next entry starting with a capital letter (subsidiary lines seem to start with a lower case character). We can perform a regular expression substitution to match this pattern and replace the space after the final page number with an end-of-line character.\n", "\n", "Some lines also start with opening quotes of various flavours (`‘` or `“` for example), or incorrectly recognised quotes rendered as a `*` character. We can also insert line breaks in advance of these:" ] }, { "cell_type": "code", "execution_count": 8, "id": "b7e62572", "metadata": {}, "outputs": [], "source": [ "import re\n", "\n", "def repair_index_missing_line_breaks(text):\n", " \"\"\"Attempt to repair missing line breaks.\"\"\"\n", " # Add line break after page number\n", " # allowing a single optional grace character at end for incorrect OCR\n", " repaired_text = re.sub(r\"([0-9].?\\s*\\.?)[\\s]+([\\(‘“\\\"'\\*A-Z])\", r'\\1\\n\\2', text)\n", " \n", " return repaired_text" ] }, { "cell_type": "markdown", "id": "181bec9d", "metadata": {}, "source": [ "Let's see how that looks:" ] }, { "cell_type": "code", "execution_count": 9, "id": "dca94c65", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "A. \n", "\n", "\n", "Apsey of St. Wandrille, 382. 486.\n", "Abdication of James I1., 39. 489.\n", "Aberdeen, Burnet prize at, 91.\n", "Aboriginal chambers near Tilbury, 462.\n", "A. (B.) on emancipation of the Jews, 475.\n", "Accuracy of references, 170.\n", "Addison's books, 212.\n", "Adolphus on a recent novel, 231.\n", "Advent bells, 121.\n", "Adversaria, 73. 86.\n", "Elfric’s colloquy, 168. 197, 232. 248. 278.\n", "Elian, translation of, 267. 284.\n", "A. (F. RB.) on Sterne’s Koran, 418. \n", "\n", "—— on a passage in Goldsmith, 83. \n", "\n", "— Queen of Hearts, 320.\n", "Agricola(C.), Propugnaculum anti-Pistori- \n", "anum, 203.\n", "A. (J. D.) on swords worn in public, 415.\n", "Alban’s (St.) Day, 399. \n", "\n", "—._, law courts at, 306.\n", "Albert (Le Petit), 474.\n", "Alchemy, metrical writings on, 60.\n", "Ale »xandria a (Ptolemy of), 142. 170.\n", "Alfred’s (King) Geography of Europe, 257. \n", "313. \n", "\n", "\n", "— works, 93.\n", "Alicui on Bec ket’s grace-cup, 143. \n", "\n", "—— on Bishop Barnaby, 132.\n", "All Angels and St. Michael’s, feast of, 235.\n", "* All to-broke,” 490.\n", "Allusion in Friar\n" ] } ], "source": [ "repaired_sample_index = repair_index_missing_line_breaks( sample_index_text[:1000] )\n", "\n", "print(repaired_sample_index)" ] },
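 { "cell_type": "markdown", "id": "4f3a9c21", "metadata": {}, "source": [ "The same repair can be seen on a minimal one-line sketch (the input is adapted from the raw text above):\n", "\n", "```python\n", "# two entries that the OCR search text ran together on one line\n", "repair_index_missing_line_breaks(\"Advent bells, 121. Adversaria, 73. 86.\")\n", "# expected: 'Advent bells, 121.\\nAdversaria, 73. 86.'\n", "```" ] },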
170.\n", "Alfred’s (King) Geography of Europe, 257. \n", "313. \n", "\n", "\n", "— works, 93.\n", "Alicui on Bec ket’s grace-cup, 143. \n", "\n", "—— on Bishop Barnaby, 132.\n", "All Angels and St. Michael’s, feast of, 235.\n", "* All to-broke,” 490.\n", "Allusion in Friar\n" ] } ], "source": [ "repaired_sample_index = repair_index_missing_line_breaks( sample_index_text[:1000] )\n", "\n", "print(repaired_sample_index)" ] }, { "cell_type": "markdown", "id": "905e3fe6", "metadata": {}, "source": [ "### Removing Unwanted Line Breaks\n", "\n", "If what appear to be page numbers appear on the their own line, they should presumably appear as page numbers for the previous reference.\n", "\n", "In other cases, a subsidiary reference might incorrectly be place on one line, or a line might end on a comma. In such cases, we might assume the associated line breaks to be unwanted.\n", "\n", "So let's replace the line breaks in those locations with spaces, and then also replace any double spaces we might have introduced (or that were present withing the original scanned text) with a single space:" ] }, { "cell_type": "code", "execution_count": 10, "id": "af87571b", "metadata": {}, "outputs": [], "source": [ "def repair_index_unwanted_line_breaks(text):\n", " \"\"\"Attempt to repair extraneous line breaks.\"\"\"\n", " # Fix unwanted line end before page number\n", " repaired_text = re.sub(r\"\\n([0-9].*)\", r' \\1', text)\n", " # Fix unwanted line end before subsidiary entry (initial lower case character)\n", " # Identify subsidiary split with a ::: separator\n", " repaired_text = re.sub(r\"\\n([a-z].*)\", r' ::: \\1', repaired_text)\n", " # Fix unwanted line break after comma\n", " #repaired_text = re.sub(r\",\\s*\\n\", r', ZZ', repaired_text)\n", " \n", " # Remove duplicate spaces\n", " repaired_text = re.sub(r\" \", r' ', repaired_text)\n", " \n", " return repaired_text" ] }, { "cell_type": "markdown", "id": "c49de6d1", "metadata": {}, "source": [ "How do things look now?" ] }, { "cell_type": "code", "execution_count": 11, "id": "add1b4bd", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "A. \n", "\n", "\n", "Apsey of St. Wandrille, 382. 486.\n", "Abdication of James I1., 39. 489.\n", "Aberdeen, Burnet prize at, 91.\n", "Aboriginal chambers near Tilbury, 462.\n", "A. (B.) on emancipation of the Jews, 475.\n", "Accuracy of references, 170.\n", "Addison's books, 212.\n", "Adolphus on a recent novel, 231.\n", "Advent bells, 121.\n", "Adversaria, 73. 86.\n", "Elfric’s colloquy, 168. 197, 232. 248. 278.\n", "Elian, translation of, 267. 284.\n", "A. (F. RB.) on Sterne’s Koran, 418. \n", "\n", "—— on a passage in Goldsmith, 83. \n", "\n", "— Queen of Hearts, 320.\n", "Agricola(C.), Propugnaculum anti-Pistori- ::: anum, 203.\n", "A. (J. D.) on swords worn in public, 415.\n", "Alban’s (St.) Day, 399. \n", "\n", "—._, law courts at, 306.\n", "Albert (Le Petit), 474.\n", "Alchemy, metrical writings on, 60.\n", "Ale »xandria a (Ptolemy of), 142. 170.\n", "Alfred’s (King) Geography of Europe, 257. 313. \n", "\n", "\n", "— works, 93.\n", "Alicui on Bec ket’s grace-cup, 143. \n", "\n", "—— on Bishop Barnaby, 132.\n", "All Angels and St. 
 { "cell_type": "markdown", "id": "c49de6d1", "metadata": {}, "source": [ "How do things look now?" ] }, { "cell_type": "code", "execution_count": 11, "id": "add1b4bd", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "A. \n", "\n", "\n", "Apsey of St. Wandrille, 382. 486.\n", "Abdication of James I1., 39. 489.\n", "Aberdeen, Burnet prize at, 91.\n", "Aboriginal chambers near Tilbury, 462.\n", "A. (B.) on emancipation of the Jews, 475.\n", "Accuracy of references, 170.\n", "Addison's books, 212.\n", "Adolphus on a recent novel, 231.\n", "Advent bells, 121.\n", "Adversaria, 73. 86.\n", "Elfric’s colloquy, 168. 197, 232. 248. 278.\n", "Elian, translation of, 267. 284.\n", "A. (F. RB.) on Sterne’s Koran, 418. \n", "\n", "—— on a passage in Goldsmith, 83. \n", "\n", "— Queen of Hearts, 320.\n", "Agricola(C.), Propugnaculum anti-Pistori- ::: anum, 203.\n", "A. (J. D.) on swords worn in public, 415.\n", "Alban’s (St.) Day, 399. \n", "\n", "—._, law courts at, 306.\n", "Albert (Le Petit), 474.\n", "Alchemy, metrical writings on, 60.\n", "Ale »xandria a (Ptolemy of), 142. 170.\n", "Alfred’s (King) Geography of Europe, 257. 313. \n", "\n", "\n", "— works, 93.\n", "Alicui on Bec ket’s grace-cup, 143. \n", "\n", "—— on Bishop Barnaby, 132.\n", "All Angels and St. Michael’s, feast of, 235.\n", "* All to-broke,” 490.\n", "Allusion in Friar\n" ] } ], "source": [ "repaired_sample_index = repair_index_missing_line_breaks( sample_index_text[:1000] )\n", "repaired_sample_index = repair_index_unwanted_line_breaks( repaired_sample_index )\n", "\n", "print(repaired_sample_index)" ] }, { "cell_type": "markdown", "id": "54f34679", "metadata": {}, "source": [ "Inspecting the above, we see there are \"issues\" that we might be able to address, such as line entries that should be separated, based on a closer inspection of the XML returned from the scan that includes the position on the page.\n", "\n", "But at least we have something to work with." ] }, { "cell_type": "markdown", "id": "ad894987", "metadata": {}, "source": [ "### Parsing Entries and Adding Them to a Database\n", "\n", "Let's now consider how we might structure our database entries.\n", "\n", "First, we have simple \"primary\" entries, such as *Agincourt, Sir Hilary charged at, 158. 190.*\n", "\n", "We might put this into a record of the form:\n", "\n", "```json\n", "[{\"source_id\": id_val, \"index_term\": \"Agincourt, Sir Hilary charged at\", \"page\": 158},\n", "{\"source_id\": id_val, \"index_term\": \"Agincourt, Sir Hilary charged at\", \"page\": 190}]\n", "```\n", "\n", "The page numbers are relative to a particular volume, so we also need to be able to capture information to identify what the page numbers are with reference to. The index document filenames take the form *Notes and Queries 1875: Vol 3 Index*, so we can parse out the year and volume and add these to the record too. " ] }, { "cell_type": "code", "execution_count": 12, "id": "e68e23cc", "metadata": {}, "outputs": [], "source": [ "from parse import parse\n", "\n", "def get_index_metadata_from_title(title):\n", " \"\"\"Get year and volume from title.\"\"\"\n", " metadata = parse(\"Notes and Queries {year}: Vol {vol} Index\", title)\n", " if metadata:\n", " metadata = {\"year\": metadata[\"year\"], \"vol\": metadata[\"vol\"]}\n", " else:\n", " metadata = {\"year\": None, \"vol\": None}\n", " return metadata" ] }, { "cell_type": "markdown", "id": "d3a2562e", "metadata": {}, "source": [ "Here's how it works:" ] }, { "cell_type": "code", "execution_count": 13, "id": "39613aca", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'year': '1875', 'vol': '3'}" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample_index_page_title = \"Notes and Queries 1875: Vol 3 Index\"\n", "\n", "get_index_metadata_from_title(sample_index_page_title)" ] }, { "cell_type": "markdown", "id": "6ffcd831", "metadata": {}, "source": [ "In the table, we might also provide a `typ` column to distinguish between primary (`P`) and subsidiary (`S`) entries, along with a `subsidiary` column which should be empty in simple cases.\n", "\n", "For a line entry such as *A. (E. H.) on baptismal superstition, 197. on curfew, at Morpeth, 312. on Duresme and Dunelm, 206.* we note that the first entry is actually a subsidiary entry, the `on` keyword identifying the subsidiarity to the main term `A. (E. H.)`.\n", "\n", "We might then desire to have partial records of the form:\n", "\n", "```json\n", "[{\"index_term\": \"A. (E. H.)\", \"typ\": \"S\", \"page\": 197, \"subsidiary\": \"on baptismal superstition\"},\n", "{\"index_term\": \"A. (E. H.)\", \"typ\": \"S\", \"page\": 312, \"subsidiary\": \"on curfew, at Morpeth\"},\n", "{\"index_term\": \"A. (E. H.)\", \"typ\": \"S\", \"page\": 206, \"subsidiary\": \"on Duresme and Dunelm,\"}\n", "]\n", "```\n", "\n", "
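To make the intended split concrete, here is a minimal sketch of the kind of parse we are after (a naive split on the ` on ` keyword, assuming each clause ends with its page number):\n", "\n", "```python\n", "import re\n", "\n", "line = \"A. (E. H.) on baptismal superstition, 197. on curfew, at Morpeth, 312. on Duresme and Dunelm, 206.\"\n", "\n", "# the main term is everything before the first \" on \"; each \"on\" clause\n", "# then carries its own trailing page number\n", "term, *clauses = [p.strip() for p in line.split(\" on \")]\n", "records = []\n", "for clause in clauses:\n", "    page = re.search(r\"(\\d+)\\.?\\s*$\", clause)\n", "    records.append({\"index_term\": term, \"typ\": \"S\",\n", "                    \"subsidiary\": \"on \" + re.sub(r\"[\\s,]*\\d+\\.?\\s*$\", \"\", clause),\n", "                    \"page\": int(page.group(1)) if page else None})\n", "records\n", "```\n", "\n", "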
H.)\", \"typ\": \"S\", \"page\": 206, \"subsidiary\": \"on Duresme and Dunelm,\"}\n", "]\n", "```\n", "\n", "Inspection of other records with subsidiary terms suggests that a comma may also be used as to denote initial subsidiarity, as or example illustrated here:\n", "\n", "`Berkeley (Bishop), adventures of Gau- dentio di Lucca, 247.successful experiments, 217.`\n", "\n", "In this case, the multiple items are based on the original term before the initial comma (this might be a hasty assumption if the key term itself includes a comma, but the we might hope for an \"on\" separator to clarify the position.\n", "\n", "*We also note in that example a possible repair we could make to the original text: removing the `word- split` hyphenation.*" ] }, { "cell_type": "code", "execution_count": 14, "id": "2e97fca3", "metadata": {}, "outputs": [], "source": [ "def clean_text_remove_word_split_hyphenation(text):\n", " \"\"\"Remove word split hyphenation.\"\"\"\n", " cleaned_text = re.sub(r\"([a-z])[-—–][\\n]([a-z])\", r'\\1\\2', text)\n", " \n", " return cleaned_text" ] }, { "cell_type": "markdown", "id": "12bb0ed0", "metadata": {}, "source": [ "Let's do a quick test of that:" ] }, { "cell_type": "code", "execution_count": 15, "id": "3e1099e6", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Berkeley (Bishop), adventures of Gaudentio di Lucca, 247.successful experiments, 217.'" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test_eol_hyphenation = \"Berkeley (Bishop), adventures of Gau-\\ndentio di Lucca, 247.successful experiments, 217.\"\n", "\n", "clean_text_remove_word_split_hyphenation(test_eol_hyphenation)" ] }, { "cell_type": "markdown", "id": "af7b56b8", "metadata": {}, "source": [ "So let's start by suggesting the following database record structure as something to work towards:" ] }, { "cell_type": "code", "execution_count": 16, "id": "fa064a78", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Overwriting ia_utils/create_db_table_index_entries.py\n" ] } ], "source": [ "%%writefile ia_utils/create_db_table_index_entries.py\n", "def create_db_table_index_entries(db, drop=True):\n", " \"\"\"Create an index_entries database table and an associated full-text search table.\"\"\"\n", " # If we want to remove the table completely, we can drop it\n", " table_name = \"index_entries\"\n", " db[table_name].drop(ignore=True)\n", " db[f\"{table_name}_fts\"].drop(ignore=True)\n", "\n", " db[table_name].create({\n", " \"source_id\": str, \n", " \"year\": str,\n", " \"vol\": str,\n", " \"index_term\": str, \n", " \"typ\": str,\n", " \"subsidiary\": str,\n", " \"page_num\": int\n", " })\n", "\n", " # Enable full text search\n", " # This creates an extra virtual table ({table_name}_fts) to support the full text search\n", " db[table_name].enable_fts([\"source_id\", \"index_term\", \"subsidiary\", \"year\", \"vol\", \"page_num\"],\n", " create_triggers=True, tokenize=\"porter\")" ] }, { "cell_type": "markdown", "id": "b8f0e5a4", "metadata": {}, "source": [ "Load in the package and create the index entries database table:" ] }, { "cell_type": "code", "execution_count": 17, "id": "0788041b", "metadata": {}, "outputs": [], "source": [ "from ia_utils.create_db_table_index_entries import create_db_table_index_entries\n", "\n", "create_db_table_index_entries(db)" ] }, { "cell_type": "markdown", "id": "87669346", "metadata": {}, "source": [ "We now need to consider various ways of parsing line items, including:\n", " \n", "- 
 { "cell_type": "markdown", "id": "87669346", "metadata": {}, "source": [ "We now need to consider various ways of parsing line items, including:\n", " \n", "- extracting multiple page numbers for a single entry;\n", "- identifying entries that mask subsidiary terms.\n", "\n", "We have already adopted a convention of using `:::` to separate subsidiary items, so let's apply that a bit further to separate out \"on\" terms and comma separated terms. We might also have a catch-all in case there are elements appearing after a page number that are perhaps rightly new entries but that we shall treat as subsidiaries.\n", "\n", "We could possibly also try to \"fudge\" page numbers that only look almost like numbers, for example, if there is a set of numbers that ends with an `s` or a `z`, where we might guess (possibly incorrectly) at a `5` or `2`." ] }, { "cell_type": "code", "execution_count": 18, "id": "e508b1e1", "metadata": {}, "outputs": [], "source": [ "def _repair_index_subsidiary_separator_line(text):\n", " \"\"\"Repair entries at line level.\"\"\"\n", " \n", " # Very risky number substitutions\n", " # The group reference is followed by a digit, so we need the alternative \\g<1> syntax\n", " repaired_text = re.sub(r\"([0-9])[sS]\\.?\", r'\\g<1>5', text)\n", " repaired_text = re.sub(r\"([0-9])[zZ]\\.?\", r'\\g<1>2', repaired_text)\n", " \n", " # Subsidiary terms based on \"on\" - this may be overly aggressive to start with\n", " repaired_text = re.sub(r\"([^(on)]*)( on .*)\", r'\\1 ::: \\2', repaired_text)\n", " # Subsidiary terms based on dashes at start of line\n", " repaired_text = re.sub(r'^[-—–]+', r' ::: ', repaired_text)\n", " # Subsidiary terms based on multiple dashes within line (unlikely to be hyphen)\n", " repaired_text = re.sub(r'[-—–]{2,}', r' ::: ', repaired_text)\n", " # Subsidiary terms based on dash after a number\n", " repaired_text = re.sub(r'([0-9\\.,]+\\s*)[-—–]+', r'\\1 :::', repaired_text)\n", " \n", " # Subsidiary terms based on page numbers\n", " repaired_text = re.sub(r\"([0-9]\\.) *([‘“\\\"'\\*A-Za-z])\", r'\\1 ::: \\2', repaired_text)\n", " # Subsidiary terms based on \"on\" - this may be overly aggressive\n", " #repaired_text = re.sub(r\"^([^:]*)( on .*)\", r'\\1 ::: \\2', repaired_text)\n", " # Or only apply after a number\n", " #repaired_text = re.sub(r\"([0-9]\\.)\\s*(on)\", r'\\1 ::: \\2', repaired_text)\n", " if \"::: on\" in repaired_text:\n", " # Also split at start\n", " repaired_text = re.sub(r\"^([^(on)]*) (on)\", r\"\\1 ::: \\2\", repaired_text)\n", " # Subsidiary terms based on \",\"\n", " #elif \":::\" in repaired_text:\n", " # If we have numbers separated by commas, replace the commas with a .\n",
 " repaired_text = re.sub(r'(\\s+[0-9]+)\\s*,\\s*([0-9]+)',r'\\1. \\2', repaired_text)\n", " # If we have a comma before a number, separate after the number\n", " # Allow a grace character\n", " # (left disabled: the result of this substitution was never assigned, so it had no effect)\n", " #repaired_text = re.sub(r\"^([^:,]*),\\s*([0-9][0-9\\.\\s]+[A-Za-z]?)[^\\n]\", r'\\1 \\2:::', repaired_text)\n", " # If we have a comma appear before a separator, separate on it\n", " repaired_text = re.sub(r\"^([^:,]*),\\s*([^0-9]+)\", r'\\1 :::\\2', repaired_text)\n", "\n", " # Provide a catch-all to add separators after what look like page numbers\n", " repaired_text = re.sub(r\"([0-9]\\s*[^:].?)\\s*([A-Za-z].*)$\", r'\\1 ::: \\2', repaired_text)\n", " \n", " # Remove uncaught dashes at start and end of phrase\n", " repaired_text = \":::\".join([p.strip(\"-—– \") for p in repaired_text.split(\":::\")])\n", " \n", " return repaired_text\n", "\n", "\n", "def repair_index_subsidiary_separator(text):\n", " \"\"\"Attempt to identify where subsidiary splits occur.\"\"\"\n", " # These are applied at the line level\n", " repaired_lines = [_repair_index_subsidiary_separator_line(line.strip()) for line in text.split(\"\\n\") if line]\n", " \n", " # Patch back any overly aggressively split lines\n", " return \"\\n\".join(repaired_lines).replace(\"\\n:::\", \":::\")" ] }, { "cell_type": "markdown", "id": "0694401f", "metadata": {}, "source": [ "Let's see how that works:" ] }, { "cell_type": "code", "execution_count": 19, "id": "8aae8b29", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "A.\n", "Apsey of St. Wandrille:::382. 486.\n", "Abdication of James I1.:::39. 489.\n", "Aberdeen:::Burnet prize at, 91.\n", "Aboriginal chambers near Tilbury:::462.\n", "A. (B.):::on emancipation of the Jews, 475.\n", "Accuracy of references:::170.\n", "Addison's books:::212.\n", "Adolphus:::on a recent novel, 231.\n", "Advent bells:::121.\n", "Adversaria:::73. 86.\n", "Elfric’s colloquy:::168. 197. 232. 248. 278.\n", "Elian:::translation of, 267. 284.\n", "A. (F. RB.):::on Sterne’s Koran, 418.::::::on a passage in Goldsmith, 83.:::Queen of Hearts, 320.\n", "Agricola(C.):::Propugnaculum anti-Pistori:::anum, 203.\n", "A. (J. D.):::on swords worn in public, 415.\n", "Alban’s (St.) Day:::399.:::._, law courts at, 306.\n", "Albert (Le Petit):::474.\n", "Alchemy:::metrical writings on, 60.\n", "Ale »xandria a (Ptolemy of):::142. 170.\n", "Alfred’s (King) Geography of Europe:::257. 313.:::works, 93.\n", "Alicui:::on Bec ket’s grace-cup, 143.::::::on Bishop Barnaby, 132.\n", "All Angels and St. Michael’s:::feast of, 235.\n", "* All to-broke:::” 490.\n", "Allusion in Friar\n" ] } ], "source": [ "repaired_sample_index2 = repaired_sample_index\n", "repaired_sample_index2 = repair_index_subsidiary_separator(repaired_sample_index2)\n", "\n", "print(repaired_sample_index2)" ] }, { "cell_type": "markdown", "id": "2aaddd3c", "metadata": {}, "source": [ "And for the comma separator:" ] }, { "cell_type": "code", "execution_count": 20, "id": "b0db99f6", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Berkeley (Bishop):::adventures of Gau- dentio di Lucca, 247.:::successful experiments, 217.'" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text_comma_subsidiary = \"Berkeley (Bishop), adventures of Gau- dentio di Lucca, 247.successful experiments, 217.\"\n", "\n", "repair_index_subsidiary_separator(text_comma_subsidiary)" ] },
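 { "cell_type": "markdown", "id": "5c6d7e82", "metadata": {}, "source": [ "The \"risky\" digit substitutions can be seen at the line level too. A minimal sketch, using a made-up OCR slip in which a trailing `s` stands in for a `5`:\n", "\n", "```python\n", "_repair_index_subsidiary_separator_line(\"Advent bells, 12s.\")\n", "# expected: 'Advent bells:::125'\n", "```" ] },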
 { "cell_type": "markdown", "id": "1396eeae", "metadata": {}, "source": [ "Having made an attempt at some subsidiary separators, we can now try to parse out the various components. At the start of the line we have the primary entry, then we may have one or more page numbers or one or more subsidiary phrases.\n", "\n", "Let's look at how to parse out page numbers. There may be one or more page numbers separated by spaces or by `.` characters." ] }, { "cell_type": "code", "execution_count": 21, "id": "ca70f50c", "metadata": {}, "outputs": [], "source": [ "# This is a rather crude approach that just grabs all the numbers we can find\n", "def extract_page_numbers_from_line(text):\n", " \"\"\"Extract one or more page numbers from text.\"\"\"\n", " # Try to nudge things towards finding numbers at the end of the phrase\n", " end_of_text = re.sub(r'^[^0-9]*([0-9\\.,\\s]*$)', r'\\1', text)\n", " start_of_text = text.replace(end_of_text, '')\n", " # Then just bludgeon out all the possible page numbers\n", " page_numbers = re.findall(r'\\d+', end_of_text)\n", " return start_of_text, page_numbers" ] }, { "cell_type": "markdown", "id": "d2851234", "metadata": {}, "source": [ "Let's see how that works:" ] }, { "cell_type": "code", "execution_count": 22, "id": "ad479d49", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('adventures of Gau- dentio di Lucca, ', ['247']),\n", " ('successful experiments, ', ['217'])]" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Use a test example of subsidiary elements; there is no page number in the first part\n", "[extract_page_numbers_from_line(t) for t in repair_index_subsidiary_separator(text_comma_subsidiary).split(\":::\")[1:]]" ] }, { "cell_type": "markdown", "id": "ce9037e7", "metadata": {}, "source": [ "And if there are no numbers?" ] }, { "cell_type": "code", "execution_count": 23, "id": "05914123", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "('No numbers here', [])" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "extract_page_numbers_from_line(\"No numbers here\")" ] }, { "cell_type": "code", "execution_count": 24, "id": "a73bb67f", "metadata": {}, "outputs": [], "source": [ "def parse_index_line(text):\n", " \"\"\"Parse out elements of the index entry.\"\"\"\n", " \n", " # Split the entry in subsidiary parts and clean white space\n", " parts = [p.strip() for p in text.split(\":::\")]\n", "\n", " # Do we have one entry or many?\n", " if len(parts) == 1:\n", " # There are no subsidiary parts\n", " # The first part is the main index entry\n", " # from which we need to separate one or more page references\n", " entry_text, page_numbers = extract_page_numbers_from_line(parts[0])\n", " index_entries = [{\"index_term\": entry_text, \"typ\": \"P\",\n", " \"page_numbers\": page_numbers}]\n", " else:\n", " # There are subsidiary parts\n", " # In this case, we get each subsidiary part and its page references\n", " # Get the subsidiary parts\n", " index_entries = []\n", " for p in parts[1:]:\n", " entry_text, page_numbers = extract_page_numbers_from_line(p)\n", " subsidiary_entry = {\"index_term\": parts[0],\n", " \"subsidiary\": entry_text, \"typ\": \"S\",\n", " \"page_numbers\": page_numbers}\n", " index_entries.append(subsidiary_entry)\n", "\n", " return index_entries" ] }, { "cell_type": "code", "execution_count": 25, "id": "8dd96268", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'index_term': '“ Noise\" derivations of ',\n", " 'typ': 'P',\n", " 'page_numbers': ['81', '106', '138', '218', '35']}]" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "parse_index_line('“ Noise\" derivations of 81. 106. 138. 218. 35')" ] }, { "cell_type": "markdown", "id": "05c4b260", "metadata": {}, "source": [ "So does that work?!" ] }, { "cell_type": "code", "execution_count": 26, "id": "ac2eca1f", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'index_term': 'Aboriginal chambers near Tilbury',\n", " 'subsidiary': '',\n", " 'typ': 'S',\n", " 'page_numbers': ['462']}]" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "parse_index_line(repaired_sample_index2.split(\"\\n\")[4])" ] }, { "cell_type": "markdown", "id": "7cb8ed49", "metadata": {}, "source": [ "In the above case, we have an error in that we have rolled an index entry into an empty subsidiary record because there is no page number attached to the first part of the split entry.\n", "\n", "*In this case, alphabetic sorting checks across several index entries (and subsidiaries) might help us detect this error; for example, if a subsidiary term sorts between the index term and the next index term, we might guess that the subsidiary is actually a main index term.*\n", "\n", "Note that if we construct a full text search across the `index_term` and `subsidiary` columns, we are likely to get false positives but we shouldn't miss anything..." ] },
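 { "cell_type": "markdown", "id": "2b9c0d41", "metadata": {}, "source": [ "As a rough sketch of the kind of alphabetical check we have in mind (purely illustrative, and not used in what follows):\n", "\n", "```python\n", "def probably_misfiled(term, subsidiary, next_term):\n", "    # if a \"subsidiary\" sorts between its parent term and the next\n", "    # primary term, it may really be a primary entry in its own right\n", "    return term.lower() < subsidiary.lower() < next_term.lower()\n", "\n", "probably_misfiled(\"Aboriginal chambers near Tilbury\",\n", "                  \"Accuracy of references\",\n", "                  \"Addison's books\")\n", "# expected: True\n", "```" ] },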
Noise\" derivations of 81. 106. 138. 218. 35')" ] }, { "cell_type": "markdown", "id": "05c4b260", "metadata": {}, "source": [ "So does that work?!" ] }, { "cell_type": "code", "execution_count": 26, "id": "ac2eca1f", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'index_term': 'Aboriginal chambers near Tilbury',\n", " 'subsidiary': '',\n", " 'typ': 'S',\n", " 'page_numbers': ['462']}]" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "parse_index_line(repaired_sample_index2.split(\"\\n\")[4])" ] }, { "cell_type": "markdown", "id": "7cb8ed49", "metadata": {}, "source": [ "In the above case, we have an error in that we have rolled one index entry as a subsidiary to an initial index entry because of a missing page number for the first entry.\n", "\n", "*In this case, alphabetic sorting checks across several index entries (and subsidiaries) might help us detect this error; for example, if a subsidiary term sorts between the index term and the next index term, we might guess that the subsidiary is actually a main index term.*\n", "\n", "Note that if we construct a full text search across the `index_term` and `subsidiary` columns, we are likely to get false positives but we shouldn't miss anything..." ] }, { "cell_type": "markdown", "id": "8c18d13f", "metadata": {}, "source": [ "We can now try to create a complete set of records that we could upload to out database.\n", "\n", "To start with, we need the metadata, which means we need the title." ] }, { "cell_type": "code", "execution_count": 27, "id": "a38c9854", "metadata": {}, "outputs": [], "source": [ "def get_title_from_id(db, id_val):\n", " \"\"\"get the title of the issue from the database.\"\"\"\n", " q = f'SELECT title FROM metadata WHERE id=\"{id_val}\"'\n", "\n", " return read_sql(q, db.conn)[\"title\"][0]" ] }, { "cell_type": "markdown", "id": "077c48d6", "metadata": {}, "source": [ "For example:" ] }, { "cell_type": "code", "execution_count": 28, "id": "04dc3b56", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'year': ' 1849 - 1850', 'vol': '1'}" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "index_base_data = get_index_metadata_from_title(get_title_from_id(db, sample_index_id))\n", "index_base_data" ] }, { "cell_type": "markdown", "id": "e5a35718", "metadata": {}, "source": [ "Now we need to separate each line item into multiple items. The `pandas` dataframe can come to out aid here, with its ability to easily split out listed items in one cell onto multiple rows:" ] }, { "cell_type": "code", "execution_count": 29, "id": "528b392f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
index_termsubsidiarytyppage_numbers
0Aboriginal chambers near TilburyS[462]
\n", "
" ], "text/plain": [ " index_term subsidiary typ page_numbers\n", "0 Aboriginal chambers near Tilbury S [462]" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "example_subsidiary_df = pd.DataFrame(parse_index_line(repaired_sample_index2.split(\"\\n\")[4]))\n", "example_subsidiary_df" ] }, { "cell_type": "markdown", "id": "300a9ac3", "metadata": {}, "source": [ "We can now \"explode\" that dataframe against the lists of page numbers to get one row per item:" ] }, { "cell_type": "code", "execution_count": 30, "id": "2cdea110", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
index_termsubsidiarytyppage_num
0Aboriginal chambers near TilburyS462
\n", "
" ], "text/plain": [ " index_term subsidiary typ page_num\n", "0 Aboriginal chambers near Tilbury S 462" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example_subsidiary_df.explode('page_numbers').rename(columns={\"page_numbers\": \"page_num\"})" ] }, { "cell_type": "markdown", "id": "ebc59925", "metadata": {}, "source": [ "Let's see if we can now put all those pieces together. Essentially, for each index line, we need to generate the complete set of records we want to add to the database." ] }, { "cell_type": "code", "execution_count": 31, "id": "ff1ab252", "metadata": {}, "outputs": [], "source": [ "def construct_index_records(id_val=None, text=None, metadata=None, retval=\"explode\"):\n", " \"\"\"Generate a complete set of index records from original search text document.\"\"\"\n", " if id_val is None and text is None:\n", " return []\n", "\n", " text = download_and_extract_text(id_val, typ=\"djvutxt\") if text is None else text\n", "\n", " records = []\n", " # Repair the text\n", " repaired_text = repair_index_missing_line_breaks( text )\n", " repaired_text = repair_index_unwanted_line_breaks( repaired_text )\n", " repaired_text = repair_index_subsidiary_separator( repaired_text )\n", "\n", " for line in repaired_text.split(\"\\n\"):\n", " if line:\n", " new_line = parse_index_line(line)\n", " records.extend(new_line)\n", " \n", " if retval not in [\"df\", \"explode\"] or id_val is None:\n", " # Return the list of dicts, without the metadata\n", " return records\n", "\n", " # WARNING - if we used provided text, the id_val and the text may actually be inconsistent\n", " index_base_data = get_index_metadata_from_title(get_title_from_id(db, id_val))\n", " # Generate a dataframe\n", " records_df = pd.DataFrame(records)\n", " \n", " records_df[\"source_id\"] = id_val\n", " records_df[\"year\"] = index_base_data[\"year\"]\n", " records_df[\"vol\"] = index_base_data[\"vol\"]\n", " \n", " if retval==\"explode\":\n", " return records_df.explode('page_numbers').rename(columns={\"page_numbers\": \"page_num\"})\n", " elif retval==\"df\":\n", " return records_df" ] }, { "cell_type": "markdown", "id": "7d16a9a4", "metadata": {}, "source": [ "And when we run it:" ] }, { "cell_type": "code", "execution_count": 32, "id": "79607bb7", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
index_termtyppage_numsubsidiarysource_idyearvol
844Complutensian PolygiotS402sim_notes-and-queries_1849-1850_1_index1849 - 18501
844Complutensian PolygiotS431sim_notes-and-queries_1849-1850_1_index1849 - 18501
845Compton StreetS228Soho,sim_notes-and-queries_1849-1850_1_index1849 - 18501
846Conrad of Salisbury’P8NaNsim_notes-and-queries_1849-1850_1_index1849 - 18501
847Descriptio utriusquePNaNNaNsim_notes-and-queries_1849-1850_1_index1849 - 18501
........................
1015D.PNaNNaNsim_notes-and-queries_1849-1850_1_index1849 - 18501
1016D.SNaNon Lord Chatham's speech on Americansim_notes-and-queries_1849-1850_1_index1849 - 18501
1017D.S12stamp act,sim_notes-and-queries_1849-1850_1_index1849 - 18501
1018D.S214iden frog,sim_notes-and-queries_1849-1850_1_index1849 - 18501
1019D.SNaNsim_notes-and-queries_1849-1850_1_index1849 - 18501
\n", "

200 rows × 7 columns

\n", "
" ], "text/plain": [ " index_term typ page_num \\\n", "844 Complutensian Polygiot S 402 \n", "844 Complutensian Polygiot S 431 \n", "845 Compton Street S 228 \n", "846 Conrad of Salisbury’ P 8 \n", "847 Descriptio utriusque P NaN \n", "... ... .. ... \n", "1015 D. P NaN \n", "1016 D. S NaN \n", "1017 D. S 12 \n", "1018 D. S 214 \n", "1019 D. S NaN \n", "\n", " subsidiary \\\n", "844 \n", "844 \n", "845 Soho, \n", "846 NaN \n", "847 NaN \n", "... ... \n", "1015 NaN \n", "1016 on Lord Chatham's speech on American \n", "1017 stamp act, \n", "1018 iden frog, \n", "1019 \n", "\n", " source_id year vol \n", "844 sim_notes-and-queries_1849-1850_1_index 1849 - 1850 1 \n", "844 sim_notes-and-queries_1849-1850_1_index 1849 - 1850 1 \n", "845 sim_notes-and-queries_1849-1850_1_index 1849 - 1850 1 \n", "846 sim_notes-and-queries_1849-1850_1_index 1849 - 1850 1 \n", "847 sim_notes-and-queries_1849-1850_1_index 1849 - 1850 1 \n", "... ... ... .. \n", "1015 sim_notes-and-queries_1849-1850_1_index 1849 - 1850 1 \n", "1016 sim_notes-and-queries_1849-1850_1_index 1849 - 1850 1 \n", "1017 sim_notes-and-queries_1849-1850_1_index 1849 - 1850 1 \n", "1018 sim_notes-and-queries_1849-1850_1_index 1849 - 1850 1 \n", "1019 sim_notes-and-queries_1849-1850_1_index 1849 - 1850 1 \n", "\n", "[200 rows x 7 columns]" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "construct_index_records(sample_index_id)[1000: 1200]" ] }, { "cell_type": "markdown", "id": "3221649b", "metadata": {}, "source": [ "It's far from ideal, but at least gives us something to work with. So let's add it to the database, and see how a search feels." ] }, { "cell_type": "code", "execution_count": 33, "id": "0107ddcc", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "db[\"index_entries\"].insert_all(construct_index_records(sample_index_id).to_dict(orient=\"records\"))" ] }, { "cell_type": "markdown", "id": "d839be04", "metadata": {}, "source": [ "Let's try a search:" ] }, { "cell_type": "code", "execution_count": 34, "id": "0b3e82c0", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
source_idindex_termsubsidiaryyearvolpage_num
0sim_notes-and-queries_1849-1850_1_indexCure for the hooping-cough1849 - 18501397
1sim_notes-and-queries_1849-1850_1_indexG. (J.)charm to cure the murrain in cows,1849 - 18501349
2sim_notes-and-queries_1849-1850_1_indexWartscharms for cure of,1849 - 18501349
3sim_notes-and-queries_1849-1850_1_indexWartscharms for cure of,1849 - 18501482
\n", "" ], "text/plain": [ " source_id index_term \\\n", "0 sim_notes-and-queries_1849-1850_1_index Cure for the hooping-cough \n", "1 sim_notes-and-queries_1849-1850_1_index G. (J.) \n", "2 sim_notes-and-queries_1849-1850_1_index Warts \n", "3 sim_notes-and-queries_1849-1850_1_index Warts \n", "\n", " subsidiary year vol page_num \n", "0 1849 - 1850 1 397 \n", "1 charm to cure the murrain in cows, 1849 - 1850 1 349 \n", "2 charms for cure of, 1849 - 1850 1 349 \n", "3 charms for cure of, 1849 - 1850 1 482 " ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "search_term = \"cure\"\n", "\n", "q = f\"\"\"\n", "SELECT * FROM index_entries_fts\n", "WHERE index_entries_fts MATCH {db.quote(search_term)};\n", "\"\"\"\n", "\n", "read_sql(q, db.conn)" ] }, { "cell_type": "markdown", "id": "1d47af4a", "metadata": {}, "source": [ "Let's create a search index over all the index issues up to 1900 excluding the cumulative indexes.\n", "\n", "First, grab all the indexes from the database and then filter to just the years we are interested in.\n", "\n", "*We really should support convenient year searching by adding a year column to the table, or creating a convenient, custom query function to handle years.*" ] }, { "cell_type": "code", "execution_count": 35, "id": "71952d3b", "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/plain": [ "[{'year': 1850,\n", " 'id': 'sim_notes-and-queries_1849-1850_1_index',\n", " 'date': '1849 - 1850',\n", " 'datetime': '1850-03-20T00:00:00',\n", " 'series': None,\n", " 'vol': '1',\n", " 'iss': 'Index',\n", " 'title': 'Notes and Queries 1849 - 1850: Vol 1 Index',\n", " 'next_id': 'sim_notes-and-queries_1849-11-03_1_1',\n", " 'prev_id': '',\n", " 'is_index': 1,\n", " 'restricted': ''},\n", " {'year': 1850,\n", " 'id': 'sim_notes-and-queries_1850_2_index',\n", " 'date': '1850',\n", " 'datetime': '1850-03-20T00:00:00',\n", " 'series': None,\n", " 'vol': '2',\n", " 'iss': 'Index',\n", " 'title': 'Notes and Queries 1850: Vol 2 Index',\n", " 'next_id': 'sim_notes-and-queries_1850-06-01_2_31',\n", " 'prev_id': 'sim_notes-and-queries_1850-05-25_1_30',\n", " 'is_index': 1,\n", " 'restricted': ''},\n", " {'year': 1851,\n", " 'id': 'sim_notes-and-queries_1851_3_index',\n", " 'date': '1851',\n", " 'datetime': '1851-03-20T00:00:00',\n", " 'series': None,\n", " 'vol': '3',\n", " 'iss': 'Index',\n", " 'title': 'Notes and Queries 1851: Vol 3 Index',\n", " 'next_id': 'sim_notes-and-queries_1851-01-04_3_62',\n", " 'prev_id': 'sim_notes-and-queries_1850-12-28_2_61',\n", " 'is_index': 1,\n", " 'restricted': ''}]" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "q = \"\"\"\n", "SELECT * FROM (SELECT CAST(strftime(\"%Y\", datetime) AS INT) AS year, *\n", " FROM metadata\n", " WHERE is_index=1 AND id NOT LIKE \"%cumulative%\") WHERE year < 1900;\n", "\"\"\"\n", "\n", "indexes = read_sql(q, db.conn)\n", "\n", "indexes = indexes.to_dict(orient=\"records\")\n", "indexes[:3]" ] }, { "cell_type": "code", "execution_count": 36, "id": "71d82b9c", "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "6e2e84776c9842b8b7b9aca4b102ed87", "version_major": 2, "version_minor": 0 }, "text/plain": [ " 0%| | 0/99 [00:00\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
source_idindex_termsubsidiaryyearvolpage_num
0sim_notes-and-queries_1851_4_indexB. (E. H.)on mazer-wood and sin-eaters,18514211
1sim_notes-and-queries_1851_4_indexSin-eatersnotices respecting,18514211
2sim_notes-and-queries_1852_6_indexE .)on the sin-eater,18526541
3sim_notes-and-queries_1852_6_indexI eeper ( (Alex.)on the sin-eater,18526541
4sim_notes-and-queries_1852_6_indexSymons Jelinger C.)origin of sin-eater,18526300
5sim_notes-and-queries_1870_6_indexSin-eaterorigin of the,18706450
6sim_notes-and-queries_1876_6_indexSin-eater18766505
7sim_notes-and-queries_1876_6_indexSin-eater18766505
8sim_notes-and-queries_1877_7_indexB. (A. C.)on the Sin-eater,1877714
9sim_notes-and-queries_1877_7_indexSin-eater1877714
10sim_notes-and-queries_1883_7_indexSin-eater1883725
11sim_notes-and-queries_1883_7_indexSin-eater18837334
12sim_notes-and-queries_1883_8_indexSin-eater18838255
13sim_notes-and-queries_1895_8_indexSin-eaters18958288
14sim_notes-and-queries_1895_8_indexSin-eaters18958288
15sim_notes-and-queries_1895_8_indexSin-eaters18958332
16sim_notes-and-queries_1896_9_indexSin-eaters18969169
17sim_notes-and-queries_1896_9_indexSin-eaters18969296
18sim_notes-and-queries_1896_9_indexSin-eaters18969110
19sim_notes-and-queries_1896_9_indexSin-eaters18969111
20sim_notes-and-queries_1896_9_indexSin-eaters18969109
21sim_notes-and-queries_1896_9_indexSin-eaters18969169
22sim_notes-and-queries_1896_9_indexSin-eaters18969236
23sim_notes-and-queries_1896_9_indexSin-eaters18969296
24sim_notes-and-queries_1896_9_indexSin-eaters18969111
25sim_notes-and-queries_1896_9_indexSin-eaters18969110
26sim_notes-and-queries_1896_9_indexOwen (J. P.)on sin-eaters,18969109
27sim_notes-and-queries_1896_9_indexOwen (J. P.)on sin-eaters,18969236
28sim_notes-and-queries_1896_9_indexSin-eaters18969110
29sim_notes-and-queries_1896_9_indexSin-eaters18969109
30sim_notes-and-queries_1896_9_indexSin-eaters18969169
31sim_notes-and-queries_1896_9_indexSin-eaters18969236
32sim_notes-and-queries_1896_9_indexSin-eaters18969296
33sim_notes-and-queries_1896_9_indexThomas (N. W.)on sin-eaters,18969169
\n", "" ], "text/plain": [ " source_id index_term \\\n", "0 sim_notes-and-queries_1851_4_index B. (E. H.) \n", "1 sim_notes-and-queries_1851_4_index Sin-eaters \n", "2 sim_notes-and-queries_1852_6_index E .) \n", "3 sim_notes-and-queries_1852_6_index I eeper ( (Alex.) \n", "4 sim_notes-and-queries_1852_6_index Symons Jelinger C.) \n", "5 sim_notes-and-queries_1870_6_index Sin-eater \n", "6 sim_notes-and-queries_1876_6_index Sin-eater \n", "7 sim_notes-and-queries_1876_6_index Sin-eater \n", "8 sim_notes-and-queries_1877_7_index B. (A. C.) \n", "9 sim_notes-and-queries_1877_7_index Sin-eater \n", "10 sim_notes-and-queries_1883_7_index Sin-eater \n", "11 sim_notes-and-queries_1883_7_index Sin-eater \n", "12 sim_notes-and-queries_1883_8_index Sin-eater \n", "13 sim_notes-and-queries_1895_8_index Sin-eaters \n", "14 sim_notes-and-queries_1895_8_index Sin-eaters \n", "15 sim_notes-and-queries_1895_8_index Sin-eaters \n", "16 sim_notes-and-queries_1896_9_index Sin-eaters \n", "17 sim_notes-and-queries_1896_9_index Sin-eaters \n", "18 sim_notes-and-queries_1896_9_index Sin-eaters \n", "19 sim_notes-and-queries_1896_9_index Sin-eaters \n", "20 sim_notes-and-queries_1896_9_index Sin-eaters \n", "21 sim_notes-and-queries_1896_9_index Sin-eaters \n", "22 sim_notes-and-queries_1896_9_index Sin-eaters \n", "23 sim_notes-and-queries_1896_9_index Sin-eaters \n", "24 sim_notes-and-queries_1896_9_index Sin-eaters \n", "25 sim_notes-and-queries_1896_9_index Sin-eaters \n", "26 sim_notes-and-queries_1896_9_index Owen (J. P.) \n", "27 sim_notes-and-queries_1896_9_index Owen (J. P.) \n", "28 sim_notes-and-queries_1896_9_index Sin-eaters \n", "29 sim_notes-and-queries_1896_9_index Sin-eaters \n", "30 sim_notes-and-queries_1896_9_index Sin-eaters \n", "31 sim_notes-and-queries_1896_9_index Sin-eaters \n", "32 sim_notes-and-queries_1896_9_index Sin-eaters \n", "33 sim_notes-and-queries_1896_9_index Thomas (N. W.) \n", "\n", " subsidiary year vol page_num \n", "0 on mazer-wood and sin-eaters, 1851 4 211 \n", "1 notices respecting, 1851 4 211 \n", "2 on the sin-eater, 1852 6 541 \n", "3 on the sin-eater, 1852 6 541 \n", "4 origin of sin-eater, 1852 6 300 \n", "5 origin of the, 1870 6 450 \n", "6 1876 6 505 \n", "7 1876 6 505 \n", "8 on the Sin-eater, 1877 7 14 \n", "9 1877 7 14 \n", "10 1883 7 25 \n", "11 1883 7 334 \n", "12 1883 8 255 \n", "13 1895 8 288 \n", "14 1895 8 288 \n", "15 1895 8 332 \n", "16 1896 9 169 \n", "17 1896 9 296 \n", "18 1896 9 110 \n", "19 1896 9 111 \n", "20 1896 9 109 \n", "21 1896 9 169 \n", "22 1896 9 236 \n", "23 1896 9 296 \n", "24 1896 9 111 \n", "25 1896 9 110 \n", "26 on sin-eaters, 1896 9 109 \n", "27 on sin-eaters, 1896 9 236 \n", "28 1896 9 110 \n", "29 1896 9 109 \n", "30 1896 9 169 \n", "31 1896 9 236 \n", "32 1896 9 296 \n", "33 on sin-eaters, 1896 9 169 " ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "search_term = \"sin eater\"\n", "\n", "q = f\"\"\"\n", "SELECT * FROM index_entries_fts\n", "WHERE index_entries_fts MATCH {db.quote(search_term)};\n", "\"\"\"\n", "\n", "read_sql(q, db.conn)[:50]" ] }, { "cell_type": "code", "execution_count": 38, "id": "87dd7b1a", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
source_idindex_termsubsidiaryyearvolpage_num
0sim_notes-and-queries_1869_4_indexBoggarts and Feorin18694508
1sim_notes-and-queries_1869_4_indexBoggarts and Feorin18694508
2sim_notes-and-queries_1869_4_indexHigson (John)on Boggarts and Feorin,186945
3sim_notes-and-queries_1870_5_indexBoggarts and Feorin1870523
4sim_notes-and-queries_1870_5_indexBoggarts and Feorin18705156
5sim_notes-and-queries_1870_5_indexBoggarts and Feorin18705216
6sim_notes-and-queries_1870_5_indexBoggarts and Feorin18705287
7sim_notes-and-queries_1870_5_indexBoggarts and Feorin18705365
8sim_notes-and-queries_1870_5_indexBoggarts and Feorin18705517
9sim_notes-and-queries_1870_5_indexBowker (James)on Boggarts, Feorin, &c.18705365
10sim_notes-and-queries_1870_5_indexBoggartsFeorin,18705287
11sim_notes-and-queries_1870_5_indexDavies (Wm.)on Boggarts, Feorin, &c.,18705216
12sim_notes-and-queries_1870_5_indexHermentrudeon Boggarts,1870523
13sim_notes-and-queries_1870_5_indexBoggartsFeorin, &c.,18705156
14sim_notes-and-queries_1870_5_indexRiley (H. T.)on Boggarts, Feorin, &c.,18705216
15sim_notes-and-queries_1870_5_indexSmith (W. J. B.)on Boggarts, Feorin, &e.,18705317
16sim_notes-and-queries_1895_8_index: Boggart, 85ah by a sheep,18958170
17sim_notes-and-queries_1895_8_indexBoggart=ghost1895885
18sim_notes-and-queries_1895_8_indexBoggart=ghost18958255
\n", "
" ], "text/plain": [ " source_id index_term \\\n", "0 sim_notes-and-queries_1869_4_index Boggarts and Feorin \n", "1 sim_notes-and-queries_1869_4_index Boggarts and Feorin \n", "2 sim_notes-and-queries_1869_4_index Higson (John) \n", "3 sim_notes-and-queries_1870_5_index Boggarts and Feorin \n", "4 sim_notes-and-queries_1870_5_index Boggarts and Feorin \n", "5 sim_notes-and-queries_1870_5_index Boggarts and Feorin \n", "6 sim_notes-and-queries_1870_5_index Boggarts and Feorin \n", "7 sim_notes-and-queries_1870_5_index Boggarts and Feorin \n", "8 sim_notes-and-queries_1870_5_index Boggarts and Feorin \n", "9 sim_notes-and-queries_1870_5_index Bowker (James) \n", "10 sim_notes-and-queries_1870_5_index Boggarts \n", "11 sim_notes-and-queries_1870_5_index Davies (Wm.) \n", "12 sim_notes-and-queries_1870_5_index Hermentrude \n", "13 sim_notes-and-queries_1870_5_index Boggarts \n", "14 sim_notes-and-queries_1870_5_index Riley (H. T.) \n", "15 sim_notes-and-queries_1870_5_index Smith (W. J. B.) \n", "16 sim_notes-and-queries_1895_8_index : Boggart, 85 \n", "17 sim_notes-and-queries_1895_8_index Boggart=ghost \n", "18 sim_notes-and-queries_1895_8_index Boggart=ghost \n", "\n", " subsidiary year vol page_num \n", "0 1869 4 508 \n", "1 1869 4 508 \n", "2 on Boggarts and Feorin, 1869 4 5 \n", "3 1870 5 23 \n", "4 1870 5 156 \n", "5 1870 5 216 \n", "6 1870 5 287 \n", "7 1870 5 365 \n", "8 1870 5 517 \n", "9 on Boggarts, Feorin, &c. 1870 5 365 \n", "10 Feorin, 1870 5 287 \n", "11 on Boggarts, Feorin, &c., 1870 5 216 \n", "12 on Boggarts, 1870 5 23 \n", "13 Feorin, &c., 1870 5 156 \n", "14 on Boggarts, Feorin, &c., 1870 5 216 \n", "15 on Boggarts, Feorin, &e., 1870 5 317 \n", "16 ah by a sheep, 1895 8 170 \n", "17 1895 8 85 \n", "18 1895 8 255 " ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "search_term = \"boggart\"\n", "\n", "q = f\"\"\"\n", "SELECT * FROM index_entries_fts\n", "WHERE index_entries_fts MATCH {db.quote(search_term)};\n", "\"\"\"\n", "\n", "read_sql(q, db.conn)[:50]" ] }, { "cell_type": "markdown", "id": "e082de29", "metadata": {}, "source": [ "## Add the Index Tables to the Full Database\n", "\n", "Let's also add the index tables to our full database." ] }, { "cell_type": "code", "execution_count": 40, "id": "9adb9e8a", "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "7a7dada260b94a78acff4fb89aa165d9", "version_major": 2, "version_minor": 0 }, "text/plain": [ " 0%| | 0/99 [00:00