{ "cells": [ { "cell_type": "markdown", "id": "3ca11da0", "metadata": {}, "source": [ "# Creating a 19th c. Notes & Queries Index Database\n", "\n", "Although putting textual content into a database allows us to create full text search tools over that content, a lot of work and effort went into creating the original indexes. So can we scrape the text data from the indexes and generate add the index data to a database `original_index` table to create a comprehensive searchable index?\n", "\n", "To start with, what columns might such a table need? Let's review an example index issue of *Notes & Queries*. We can get the ID for such a page by querying the metadata database table:" ] }, { "cell_type": "code", "execution_count": 1, "id": "7c91266a", "metadata": {}, "outputs": [], "source": [ "from sqlite_utils import Database\n", "\n", "db_name = \"nq_demo.db\"\n", "db = Database(db_name)" ] }, { "cell_type": "code", "execution_count": 2, "id": "e20c692e", "metadata": {}, "outputs": [], "source": [ "from pathlib import Path\n", "\n", "# And the default download dir file path\n", "dirname = 'ia-downloads'\n", "\n", "p = Path(dirname)" ] }, { "cell_type": "code", "execution_count": 3, "id": "95069bf1", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'sim_notes-and-queries_1849-1850_1_index'" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from pandas import read_sql\n", "q = \"SELECT id FROM metadata WHERE is_index=1 AND date LIKE '1849%' LIMIT 1\"\n", "\n", "sample_index_id = read_sql(q, db.conn)[\"id\"].iloc[0]\n", "sample_index_id" ] }, { "cell_type": "markdown", "id": "34c7a1bc", "metadata": {}, "source": [ "*Really we should ensure we have downloaded a copy of that index document, although as we assume here, it should be in the download cache already from when we created the monolithic index PDF.*" ] }, { "cell_type": "markdown", "id": "4bb28067", "metadata": {}, "source": [ "## Cleaning the Text\n", "\n", "If we preview the PDF of an index issue, we see it has a regular two column structure. We can also see structure in the way that the index terms, and the subsidiary index terms, are organised:" ] }, { "cell_type": "code", "execution_count": 4, "id": "f463f587", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from IPython.display import IFrame\n", "\n", "IFrame( (p / sample_index_id / f'{sample_index_id}.pdf').as_posix(), width=600, height=500)" ] }, { "cell_type": "markdown", "id": "fc95841e", "metadata": {}, "source": [ "Let's also have a look at some of the raw search text for an index issue:" ] }, { "cell_type": "code", "execution_count": 5, "id": "8a1cd232", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " \n", " \n", " \n", " \n", "A.\n", "Apsey of St. Wandrille, 382. 486.\n", "Abdication of James I1., 39. 489.\n", "Aberdeen, Burnet prize at, 91.\n", "Aboriginal chambers near Tilbury, 462.\n", "A. (B.) on emancipation of the Jews, 475.\n", "Accuracy of references, 170.\n", "Addison's books, 212.\n", "Adolphus on a recent novel, 231.\n", "Advent bells, 121.\n", "Adversaria, 73. 86.\n", "Elfric’s colloquy, 168. 197, 232. 248. 278.\n", "Elian, translation of, 267. 284.\n", "A. (F. RB.) on Sterne’s Koran, 418.\n", "—— on a passage in Goldsmith, 83.\n", "— Queen of Hearts, 320.\n", "Agricola(C.), Propugnaculum anti-Pistori- anum, 203.\n", "A. (J. D.) 
on swords worn in public, 415.\n", "Alban’s (St.) Day, 399.\n", "—._, law courts at, 306.\n", "Albert (Le Petit), 474.\n", "Alchemy, metrical writings on, 60.\n", "Ale »xandria a (Ptolemy of), 142. 170.\n", "Alfred’s (King) Geography of Europe, 257. 313.\n", "— works, 93.\n", "Alicui on Bec ket’s grace-cup, 143.\n", "—— on Bishop Barnaby, 132.\n", "All Angels and St. Michael’s, feast of, 235.\n", "* All to-broke,” 490.\n", "Allusion in Friar Brackley’s sermon, 35).\n", "Almanack (Poor Robia’s), 470.\n", "Alms-basins\n" ] } ], "source": [ "from ia_utils.download_and_extract_text import download_and_extract_text\n", "\n", "simple_index_text = download_and_extract_text(sample_index_id)\n", "\n", "print(simple_index_text[:1000])" ] }, { "cell_type": "code", "execution_count": 6, "id": "8922ccf1", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "A. \n", "\n", "\n", "Apsey of St. Wandrille, 382. 486. \n", "\n", "Abdication of James I1., 39. 489. \n", "\n", "Aberdeen, Burnet prize at, 91. \n", "\n", "Aboriginal chambers near Tilbury, 462. \n", "\n", "A. (B.) on emancipation of the Jews, 475. \n", "\n", "Accuracy of references, 170. \n", "\n", "Addison's books, 212. \n", "\n", "Adolphus on a recent novel, 231. \n", "\n", "Advent bells, 121. \n", "\n", "Adversaria, 73. 86. \n", "\n", "Elfric’s colloquy, 168. 197, 232. 248. 278. \n", "\n", "Elian, translation of, 267. 284. \n", "\n", "A. (F. RB.) on Sterne’s Koran, 418. \n", "\n", "—— on a passage in Goldsmith, 83. \n", "\n", "— Queen of Hearts, 320. \n", "\n", "Agricola(C.), Propugnaculum anti-Pistori- \n", "anum, 203. \n", "\n", "A. (J. D.) on swords worn in public, 415. \n", "\n", "Alban’s (St.) Day, 399. \n", "\n", "—._, law courts at, 306. \n", "\n", "Albert (Le Petit), 474. \n", "\n", "Alchemy, metrical writings on, 60. \n", "\n", "Ale »xandria a (Ptolemy of), 142. 170. \n", "\n", "Alfred’s (King) Geography of Europe, 257. \n", "313. \n", "\n", "\n", "— works, 93. \n", "\n", "Alicui on Bec ket’s grace-cup, 143. \n", "\n", "—— on Bishop Barnaby, 132. \n", "\n", "All Angels and St. Michael’s, feast of, 235. \n", "\n", "* All to-broke,” 490. \n", "\n", "Allusion in Friar\n" ] } ], "source": [ "sample_index_text = download_and_extract_text(sample_index_id, typ=\"djvutxt\")\n", "\n", "print(sample_index_text[:1000])" ] }, { "cell_type": "code", "execution_count": 7, "id": "06be6a4f", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'sim_notes-and-queries_1849-1850_1_index'" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample_index_id" ] }, { "cell_type": "markdown", "id": "9cb605e7", "metadata": {}, "source": [ "Inspecting some of the documents shows that there is no guarantee that the search text correctly represents index items on a new line, although in certain documents it appears as if line breaks after each entry are provided (as in the original scanned image).\n", "\n", "There are also \"sub-elements\" on separate lines that relate to a major heading that we really need to \"fill down\" on, although there may be no indication in the text (e.g. no series of dashes or tab characters) to indicate the subsidiary nature of a reference. (Note that there may be further clues in the original XML, for example, from the location of the text.) However, subsidiary entries do often appear to start with a lower case letter, so let's use that as a heuristic: *if the line starts with a lower case letter, it's a subsidiary entry*.\n", "\n", "
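As a quick sanity check of that heuristic, here is a minimal sketch (the sample lines are taken from the raw text above):\n", "\n", "```python\n", "lines = [\n", "    \"Advent bells, 121.\",\n", "    \"on a passage in Goldsmith, 83.\",\n", "]\n", "\n", "# classify a line as subsidiary if its first character is lower case\n", "for line in lines:\n", "    print(line, '->', 'subsidiary' if line[:1].islower() else 'primary')\n", "```\n", "\n", "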
More detailed inspection of the index search text also suggests that in some cases `-` separator characters may appear in the search text.\n", "\n", "To create a complete index, one possible approach is to:\n", "\n", "- normalise a single entry and all its subsidiary entries onto a single line;\n", "- parse a single entry and all its subsidiary entries into appropriate database records.\n", "\n", "Rather than consider the XML and all the additional processing that incurs, let's try to \"repair\" the document as best we can. Another thing we *could* try to exploit is the alphabetical order of entries, but let's leave that as an open question and only return to it if we find issues occurring that alphabetisation might help us address.\n", "\n", "So let's start by repairing the text and normalising the lines before considering how to parse the entries." ] }, { "cell_type": "markdown", "id": "03f00b5b", "metadata": {}, "source": [ "### Reinserting Line Breaks\n", "\n", "If we can identify where line breaks are likely to be missing, we should be able to reinsert them.\n", "\n", "By inspection of the raw search text, it seems that we have a page number (digits), a space character, and then, typically, the next entry starting with a capital letter (subsidiary lines seem to start with a lower case character). We can perform a regular expression substitution to match this pattern and replace the space after the final page number with an end-of-line character.\n", "\n", "Some lines also start with opening quotes of various flavours (`‘` or `“` for example), or incorrectly recognised quotes rendered as a `*` character. We can also insert line breaks in advance of these:" ] }, { "cell_type": "code", "execution_count": 8, "id": "b7e62572", "metadata": {}, "outputs": [], "source": [ "import re\n", "\n", "def repair_index_missing_line_breaks(text):\n", " \"\"\"Attempt to repair missing line breaks.\"\"\"\n", " # Add line break after page number\n", " # allowing a single optional grace character at end for incorrect OCR\n", " repaired_text = re.sub(r\"([0-9].?\\s*\\.?)[\\s]+([\\(‘“\\\"'\\*A-Z])\", r'\\1\\n\\2', text)\n", " \n", " return repaired_text" ] }, { "cell_type": "markdown", "id": "181bec9d", "metadata": {}, "source": [ "Let's see how that looks:" ] }, { "cell_type": "code", "execution_count": 9, "id": "dca94c65", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "A. \n", "\n", "\n", "Apsey of St. Wandrille, 382. 486.\n", "Abdication of James I1., 39. 489.\n", "Aberdeen, Burnet prize at, 91.\n", "Aboriginal chambers near Tilbury, 462.\n", "A. (B.) on emancipation of the Jews, 475.\n", "Accuracy of references, 170.\n", "Addison's books, 212.\n", "Adolphus on a recent novel, 231.\n", "Advent bells, 121.\n", "Adversaria, 73. 86.\n", "Elfric’s colloquy, 168. 197, 232. 248. 278.\n", "Elian, translation of, 267. 284.\n", "A. (F. RB.) on Sterne’s Koran, 418. \n", "\n", "—— on a passage in Goldsmith, 83. \n", "\n", "— Queen of Hearts, 320.\n", "Agricola(C.), Propugnaculum anti-Pistori- \n", "anum, 203.\n", "A. (J. D.) on swords worn in public, 415.\n", "Alban’s (St.) Day, 399. \n", "\n", "—._, law courts at, 306.\n", "Albert (Le Petit), 474.\n", "Alchemy, metrical writings on, 60.\n", "Ale »xandria a (Ptolemy of), 142. 170.\n", "Alfred’s (King) Geography of Europe, 257. \n", "313. \n", "\n", "\n", "— works, 93.\n", "Alicui on Bec ket’s grace-cup, 143. \n", "\n", "—— on Bishop Barnaby, 132.\n", "All Angels and St. Michael’s, feast of, 235.\n", "* All to-broke,” 490.\n", "Allusion in Friar\n" ] } ], "source": [ "repaired_sample_index = repair_index_missing_line_breaks( sample_index_text[:1000] )\n", "\n", "print(repaired_sample_index)" ] },
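 { "cell_type": "markdown", "id": "4f3a9c21", "metadata": {}, "source": [ "The same repair can be seen on a minimal one-line sketch (the input is adapted from the raw text above):\n", "\n", "```python\n", "# two entries that the OCR search text ran together on one line\n", "repair_index_missing_line_breaks(\"Advent bells, 121. Adversaria, 73. 86.\")\n", "# expected: 'Advent bells, 121.\\nAdversaria, 73. 86.'\n", "```" ] },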
170.\n", "Alfred’s (King) Geography of Europe, 257. \n", "313. \n", "\n", "\n", "— works, 93.\n", "Alicui on Bec ket’s grace-cup, 143. \n", "\n", "—— on Bishop Barnaby, 132.\n", "All Angels and St. Michael’s, feast of, 235.\n", "* All to-broke,” 490.\n", "Allusion in Friar\n" ] } ], "source": [ "repaired_sample_index = repair_index_missing_line_breaks( sample_index_text[:1000] )\n", "\n", "print(repaired_sample_index)" ] }, { "cell_type": "markdown", "id": "905e3fe6", "metadata": {}, "source": [ "### Removing Unwanted Line Breaks\n", "\n", "If what appear to be page numbers appear on the their own line, they should presumably appear as page numbers for the previous reference.\n", "\n", "In other cases, a subsidiary reference might incorrectly be place on one line, or a line might end on a comma. In such cases, we might assume the associated line breaks to be unwanted.\n", "\n", "So let's replace the line breaks in those locations with spaces, and then also replace any double spaces we might have introduced (or that were present withing the original scanned text) with a single space:" ] }, { "cell_type": "code", "execution_count": 10, "id": "af87571b", "metadata": {}, "outputs": [], "source": [ "def repair_index_unwanted_line_breaks(text):\n", " \"\"\"Attempt to repair extraneous line breaks.\"\"\"\n", " # Fix unwanted line end before page number\n", " repaired_text = re.sub(r\"\\n([0-9].*)\", r' \\1', text)\n", " # Fix unwanted line end before subsidiary entry (initial lower case character)\n", " # Identify subsidiary split with a ::: separator\n", " repaired_text = re.sub(r\"\\n([a-z].*)\", r' ::: \\1', repaired_text)\n", " # Fix unwanted line break after comma\n", " #repaired_text = re.sub(r\",\\s*\\n\", r', ZZ', repaired_text)\n", " \n", " # Remove duplicate spaces\n", " repaired_text = re.sub(r\" \", r' ', repaired_text)\n", " \n", " return repaired_text" ] }, { "cell_type": "markdown", "id": "c49de6d1", "metadata": {}, "source": [ "How do things look now?" ] }, { "cell_type": "code", "execution_count": 11, "id": "add1b4bd", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "A. \n", "\n", "\n", "Apsey of St. Wandrille, 382. 486.\n", "Abdication of James I1., 39. 489.\n", "Aberdeen, Burnet prize at, 91.\n", "Aboriginal chambers near Tilbury, 462.\n", "A. (B.) on emancipation of the Jews, 475.\n", "Accuracy of references, 170.\n", "Addison's books, 212.\n", "Adolphus on a recent novel, 231.\n", "Advent bells, 121.\n", "Adversaria, 73. 86.\n", "Elfric’s colloquy, 168. 197, 232. 248. 278.\n", "Elian, translation of, 267. 284.\n", "A. (F. RB.) on Sterne’s Koran, 418. \n", "\n", "—— on a passage in Goldsmith, 83. \n", "\n", "— Queen of Hearts, 320.\n", "Agricola(C.), Propugnaculum anti-Pistori- ::: anum, 203.\n", "A. (J. D.) on swords worn in public, 415.\n", "Alban’s (St.) Day, 399. \n", "\n", "—._, law courts at, 306.\n", "Albert (Le Petit), 474.\n", "Alchemy, metrical writings on, 60.\n", "Ale »xandria a (Ptolemy of), 142. 170.\n", "Alfred’s (King) Geography of Europe, 257. 313. \n", "\n", "\n", "— works, 93.\n", "Alicui on Bec ket’s grace-cup, 143. \n", "\n", "—— on Bishop Barnaby, 132.\n", "All Angels and St. 
 { "cell_type": "markdown", "id": "c49de6d1", "metadata": {}, "source": [ "How do things look now?" ] }, { "cell_type": "code", "execution_count": 11, "id": "add1b4bd", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "A. \n", "\n", "\n", "Apsey of St. Wandrille, 382. 486.\n", "Abdication of James I1., 39. 489.\n", "Aberdeen, Burnet prize at, 91.\n", "Aboriginal chambers near Tilbury, 462.\n", "A. (B.) on emancipation of the Jews, 475.\n", "Accuracy of references, 170.\n", "Addison's books, 212.\n", "Adolphus on a recent novel, 231.\n", "Advent bells, 121.\n", "Adversaria, 73. 86.\n", "Elfric’s colloquy, 168. 197, 232. 248. 278.\n", "Elian, translation of, 267. 284.\n", "A. (F. RB.) on Sterne’s Koran, 418. \n", "\n", "—— on a passage in Goldsmith, 83. \n", "\n", "— Queen of Hearts, 320.\n", "Agricola(C.), Propugnaculum anti-Pistori- ::: anum, 203.\n", "A. (J. D.) on swords worn in public, 415.\n", "Alban’s (St.) Day, 399. \n", "\n", "—._, law courts at, 306.\n", "Albert (Le Petit), 474.\n", "Alchemy, metrical writings on, 60.\n", "Ale »xandria a (Ptolemy of), 142. 170.\n", "Alfred’s (King) Geography of Europe, 257. 313. \n", "\n", "\n", "— works, 93.\n", "Alicui on Bec ket’s grace-cup, 143. \n", "\n", "—— on Bishop Barnaby, 132.\n", "All Angels and St. Michael’s, feast of, 235.\n", "* All to-broke,” 490.\n", "Allusion in Friar\n" ] } ], "source": [ "repaired_sample_index = repair_index_missing_line_breaks( sample_index_text[:1000] )\n", "repaired_sample_index = repair_index_unwanted_line_breaks( repaired_sample_index )\n", "\n", "print(repaired_sample_index)" ] }, { "cell_type": "markdown", "id": "54f34679", "metadata": {}, "source": [ "Inspecting the above, we see there are \"issues\" that we might be able to address, such as line entries that should be separated, based on a closer inspection of the XML returned from the scan that includes the position on the page.\n", "\n", "But at least we have something to work with." ] }, { "cell_type": "markdown", "id": "ad894987", "metadata": {}, "source": [ "### Parsing Entries and Adding Them to a Database\n", "\n", "Let's now consider how we might structure our database entries.\n", "\n", "First, we have simple \"primary\" entries, such as *Agincourt, Sir Hilary charged at, 158. 190.*\n", "\n", "We might put this into a record of the form:\n", "\n", "```json\n", "[{\"source_id\": id_val, \"index_term\": \"Agincourt, Sir Hilary charged at\", \"page\": 158},\n", "{\"source_id\": id_val, \"index_term\": \"Agincourt, Sir Hilary charged at\", \"page\": 190}]\n", "```\n", "\n", "The page numbers are relative to a particular volume, so we also need to be able to capture information to identify what the page numbers are with reference to. The index document filenames take the form *Notes and Queries 1875: Vol 3 Index*, so we can parse out the year and volume and add these to the record too. " ] }, { "cell_type": "code", "execution_count": 12, "id": "e68e23cc", "metadata": {}, "outputs": [], "source": [ "from parse import parse\n", "\n", "def get_index_metadata_from_title(title):\n", " \"\"\"Get year and volume from title.\"\"\"\n", " metadata = parse(\"Notes and Queries {year}: Vol {vol} Index\", title)\n", " if metadata:\n", " metadata = {\"year\": metadata[\"year\"], \"vol\": metadata[\"vol\"]}\n", " else:\n", " metadata = {\"year\": None, \"vol\": None}\n", " return metadata" ] }, { "cell_type": "markdown", "id": "d3a2562e", "metadata": {}, "source": [ "Here's how it works:" ] }, { "cell_type": "code", "execution_count": 13, "id": "39613aca", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'year': '1875', 'vol': '3'}" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample_index_page_title = \"Notes and Queries 1875: Vol 3 Index\"\n", "\n", "get_index_metadata_from_title(sample_index_page_title)" ] }, { "cell_type": "markdown", "id": "6ffcd831", "metadata": {}, "source": [ "In the table, we might also provide a `typ` column to distinguish between primary (`P`) and subsidiary (`S`) entries, along with a `subsidiary` column which should be empty in simple cases.\n", "\n", "For a line entry such as *A. (E. H.) on baptismal superstition, 197. on curfew, at Morpeth, 312. on Duresme and Dunelm, 206.* we note that the first entry is actually a subsidiary entry, the `on` keyword identifying the subsidiarity to the main term `A. (E. H.)`.\n", "\n", "We might then desire to have partial records of the form:\n", "\n", "```json\n", "[{\"index_term\": \"A. (E. H.)\", \"typ\": \"S\", \"page\": 197, \"subsidiary\": \"on baptismal superstition\"},\n", "{\"index_term\": \"A. (E. H.)\", \"typ\": \"S\", \"page\": 312, \"subsidiary\": \"on curfew, at Morpeth\"},\n", "{\"index_term\": \"A. (E. H.)\", \"typ\": \"S\", \"page\": 206, \"subsidiary\": \"on Duresme and Dunelm,\"}\n", "]\n", "```\n", "\n", "
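To make the intended split concrete, here is a minimal sketch of the kind of parse we are after (a naive split on the ` on ` keyword, assuming each clause ends with its page number):\n", "\n", "```python\n", "import re\n", "\n", "line = \"A. (E. H.) on baptismal superstition, 197. on curfew, at Morpeth, 312. on Duresme and Dunelm, 206.\"\n", "\n", "# the main term is everything before the first \" on \"; each \"on\" clause\n", "# then carries its own trailing page number\n", "term, *clauses = [p.strip() for p in line.split(\" on \")]\n", "records = []\n", "for clause in clauses:\n", "    page = re.search(r\"(\\d+)\\.?\\s*$\", clause)\n", "    records.append({\"index_term\": term, \"typ\": \"S\",\n", "                    \"subsidiary\": \"on \" + re.sub(r\"[\\s,]*\\d+\\.?\\s*$\", \"\", clause),\n", "                    \"page\": int(page.group(1)) if page else None})\n", "records\n", "```\n", "\n", "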
H.)\", \"typ\": \"S\", \"page\": 206, \"subsidiary\": \"on Duresme and Dunelm,\"}\n", "]\n", "```\n", "\n", "Inspection of other records with subsidiary terms suggests that a comma may also be used as to denote initial subsidiarity, as or example illustrated here:\n", "\n", "`Berkeley (Bishop), adventures of Gau- dentio di Lucca, 247.successful experiments, 217.`\n", "\n", "In this case, the multiple items are based on the original term before the initial comma (this might be a hasty assumption if the key term itself includes a comma, but the we might hope for an \"on\" separator to clarify the position.\n", "\n", "*We also note in that example a possible repair we could make to the original text: removing the `word- split` hyphenation.*" ] }, { "cell_type": "code", "execution_count": 14, "id": "2e97fca3", "metadata": {}, "outputs": [], "source": [ "def clean_text_remove_word_split_hyphenation(text):\n", " \"\"\"Remove word split hyphenation.\"\"\"\n", " cleaned_text = re.sub(r\"([a-z])[-—–][\\n]([a-z])\", r'\\1\\2', text)\n", " \n", " return cleaned_text" ] }, { "cell_type": "markdown", "id": "12bb0ed0", "metadata": {}, "source": [ "Let's do a quick test of that:" ] }, { "cell_type": "code", "execution_count": 15, "id": "3e1099e6", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Berkeley (Bishop), adventures of Gaudentio di Lucca, 247.successful experiments, 217.'" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test_eol_hyphenation = \"Berkeley (Bishop), adventures of Gau-\\ndentio di Lucca, 247.successful experiments, 217.\"\n", "\n", "clean_text_remove_word_split_hyphenation(test_eol_hyphenation)" ] }, { "cell_type": "markdown", "id": "af7b56b8", "metadata": {}, "source": [ "So let's start by suggesting the following database record structure as something to work towards:" ] }, { "cell_type": "code", "execution_count": 16, "id": "fa064a78", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Overwriting ia_utils/create_db_table_index_entries.py\n" ] } ], "source": [ "%%writefile ia_utils/create_db_table_index_entries.py\n", "def create_db_table_index_entries(db, drop=True):\n", " \"\"\"Create an index_entries database table and an associated full-text search table.\"\"\"\n", " # If we want to remove the table completely, we can drop it\n", " table_name = \"index_entries\"\n", " db[table_name].drop(ignore=True)\n", " db[f\"{table_name}_fts\"].drop(ignore=True)\n", "\n", " db[table_name].create({\n", " \"source_id\": str, \n", " \"year\": str,\n", " \"vol\": str,\n", " \"index_term\": str, \n", " \"typ\": str,\n", " \"subsidiary\": str,\n", " \"page_num\": int\n", " })\n", "\n", " # Enable full text search\n", " # This creates an extra virtual table ({table_name}_fts) to support the full text search\n", " db[table_name].enable_fts([\"source_id\", \"index_term\", \"subsidiary\", \"year\", \"vol\", \"page_num\"],\n", " create_triggers=True, tokenize=\"porter\")" ] }, { "cell_type": "markdown", "id": "b8f0e5a4", "metadata": {}, "source": [ "Load in the package and create the index entries database table:" ] }, { "cell_type": "code", "execution_count": 17, "id": "0788041b", "metadata": {}, "outputs": [], "source": [ "from ia_utils.create_db_table_index_entries import create_db_table_index_entries\n", "\n", "create_db_table_index_entries(db)" ] }, { "cell_type": "markdown", "id": "87669346", "metadata": {}, "source": [ "We now need to consider various ways of parsing line items, including:\n", " \n", "- 
 { "cell_type": "markdown", "id": "87669346", "metadata": {}, "source": [ "We now need to consider various ways of parsing line items, including:\n", " \n", "- extracting multiple page numbers for a single entry;\n", "- identifying entries that mask subsidiary terms.\n", "\n", "We have already adopted a convention of using `:::` to separate subsidiary items, so let's apply that a bit further to separate out \"on\" terms and comma separated terms. We might also have a catch-all in case there are elements appearing after a page number that are perhaps rightly new entries but that we shall treat as subsidiaries.\n", "\n", "We could possibly also try to \"fudge\" page numbers that only look almost like numbers, for example, if there is a set of numbers that ends with an `s` or a `z`, where we might guess (possibly incorrectly) at a `5` or `2`." ] }, { "cell_type": "code", "execution_count": 18, "id": "e508b1e1", "metadata": {}, "outputs": [], "source": [ "def _repair_index_subsidiary_separator_line(text):\n", " \"\"\"Repair entries at line level.\"\"\"\n", " \n", " # Very risky number substitutions\n", " # The group reference is followed by a digit, so we need the alternative \\g<1> syntax\n", " repaired_text = re.sub(r\"([0-9])[sS]\\.?\", r'\\g<1>5', text)\n", " repaired_text = re.sub(r\"([0-9])[zZ]\\.?\", r'\\g<1>2', repaired_text)\n", " \n", " # Subsidiary terms based on \"on\" - this may be overly aggressive to start with\n", " repaired_text = re.sub(r\"([^(on)]*)( on .*)\", r'\\1 ::: \\2', repaired_text)\n", " # Subsidiary terms based on dashes at start of line\n", " repaired_text = re.sub(r'^[-—–]+', r' ::: ', repaired_text)\n", " # Subsidiary terms based on multiple dashes within line (unlikely to be hyphen)\n", " repaired_text = re.sub(r'[-—–]{2,}', r' ::: ', repaired_text)\n", " # Subsidiary terms based on dash after a number\n", " repaired_text = re.sub(r'([0-9\\.,]+\\s*)[-—–]+', r'\\1 :::', repaired_text)\n", " \n", " # Subsidiary terms based on page numbers\n", " repaired_text = re.sub(r\"([0-9]\\.) *([‘“\\\"'\\*A-Za-z])\", r'\\1 ::: \\2', repaired_text)\n", " # Subsidiary terms based on \"on\" - this may be overly aggressive\n", " #repaired_text = re.sub(r\"^([^:]*)( on .*)\", r'\\1 ::: \\2', repaired_text)\n", " # Or only apply after a number\n", " #repaired_text = re.sub(r\"([0-9]\\.)\\s*(on)\", r'\\1 ::: \\2', repaired_text)\n", " if \"::: on\" in repaired_text:\n", " # Also split at start\n", " repaired_text = re.sub(r\"^([^(on)]*) (on)\", r\"\\1 ::: \\2\", repaired_text)\n", " # Subsidiary terms based on \",\"\n", " #elif \":::\" in repaired_text:\n", " # If we have numbers separated by commas, replace the commas with a .\n",
 " repaired_text = re.sub(r'(\\s+[0-9]+)\\s*,\\s*([0-9]+)',r'\\1. \\2', repaired_text)\n", " # If we have a comma before a number, separate after the number\n", " # Allow a grace character\n", " # (left disabled: the result of this substitution was never assigned, so it had no effect)\n", " #repaired_text = re.sub(r\"^([^:,]*),\\s*([0-9][0-9\\.\\s]+[A-Za-z]?)[^\\n]\", r'\\1 \\2:::', repaired_text)\n", " # If we have a comma appear before a separator, separate on it\n", " repaired_text = re.sub(r\"^([^:,]*),\\s*([^0-9]+)\", r'\\1 :::\\2', repaired_text)\n", "\n", " # Provide a catch-all to add separators after what look like page numbers\n", " repaired_text = re.sub(r\"([0-9]\\s*[^:].?)\\s*([A-Za-z].*)$\", r'\\1 ::: \\2', repaired_text)\n", " \n", " # Remove uncaught dashes at start and end of phrase\n", " repaired_text = \":::\".join([p.strip(\"-—– \") for p in repaired_text.split(\":::\")])\n", " \n", " return repaired_text\n", "\n", "\n", "def repair_index_subsidiary_separator(text):\n", " \"\"\"Attempt to identify where subsidiary splits occur.\"\"\"\n", " # These are applied at the line level\n", " repaired_lines = [_repair_index_subsidiary_separator_line(line.strip()) for line in text.split(\"\\n\") if line]\n", " \n", " # Patch back any overly aggressively split lines\n", " return \"\\n\".join(repaired_lines).replace(\"\\n:::\", \":::\")" ] }, { "cell_type": "markdown", "id": "0694401f", "metadata": {}, "source": [ "Let's see how that works:" ] }, { "cell_type": "code", "execution_count": 19, "id": "8aae8b29", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "A.\n", "Apsey of St. Wandrille:::382. 486.\n", "Abdication of James I1.:::39. 489.\n", "Aberdeen:::Burnet prize at, 91.\n", "Aboriginal chambers near Tilbury:::462.\n", "A. (B.):::on emancipation of the Jews, 475.\n", "Accuracy of references:::170.\n", "Addison's books:::212.\n", "Adolphus:::on a recent novel, 231.\n", "Advent bells:::121.\n", "Adversaria:::73. 86.\n", "Elfric’s colloquy:::168. 197. 232. 248. 278.\n", "Elian:::translation of, 267. 284.\n", "A. (F. RB.):::on Sterne’s Koran, 418.::::::on a passage in Goldsmith, 83.:::Queen of Hearts, 320.\n", "Agricola(C.):::Propugnaculum anti-Pistori:::anum, 203.\n", "A. (J. D.):::on swords worn in public, 415.\n", "Alban’s (St.) Day:::399.:::._, law courts at, 306.\n", "Albert (Le Petit):::474.\n", "Alchemy:::metrical writings on, 60.\n", "Ale »xandria a (Ptolemy of):::142. 170.\n", "Alfred’s (King) Geography of Europe:::257. 313.:::works, 93.\n", "Alicui:::on Bec ket’s grace-cup, 143.::::::on Bishop Barnaby, 132.\n", "All Angels and St. Michael’s:::feast of, 235.\n", "* All to-broke:::” 490.\n", "Allusion in Friar\n" ] } ], "source": [ "repaired_sample_index2 = repaired_sample_index\n", "repaired_sample_index2 = repair_index_subsidiary_separator(repaired_sample_index2)\n", "\n", "print(repaired_sample_index2)" ] }, { "cell_type": "markdown", "id": "2aaddd3c", "metadata": {}, "source": [ "And for the comma separator:" ] }, { "cell_type": "code", "execution_count": 20, "id": "b0db99f6", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Berkeley (Bishop):::adventures of Gau- dentio di Lucca, 247.:::successful experiments, 217.'" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text_comma_subsidiary = \"Berkeley (Bishop), adventures of Gau- dentio di Lucca, 247.successful experiments, 217.\"\n", "\n", "repair_index_subsidiary_separator(text_comma_subsidiary)" ] },
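 { "cell_type": "markdown", "id": "5c6d7e82", "metadata": {}, "source": [ "The \"risky\" digit substitutions can be seen at the line level too. A minimal sketch, using a made-up OCR slip in which a trailing `s` stands in for a `5`:\n", "\n", "```python\n", "_repair_index_subsidiary_separator_line(\"Advent bells, 12s.\")\n", "# expected: 'Advent bells:::125'\n", "```" ] },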
 { "cell_type": "markdown", "id": "1396eeae", "metadata": {}, "source": [ "Having made an attempt at some subsidiary separators, we can now try to parse out the various components. At the start of the line we have the primary entry, then we may have one or more page numbers or one or more subsidiary phrases.\n", "\n", "Let's look at how to parse out page numbers. There may be one or more page numbers separated by spaces or by `.` characters." ] }, { "cell_type": "code", "execution_count": 21, "id": "ca70f50c", "metadata": {}, "outputs": [], "source": [ "# This is a rather crude approach that just grabs all the numbers we can find\n", "def extract_page_numbers_from_line(text):\n", " \"\"\"Extract one or more page numbers from text.\"\"\"\n", " # Try to nudge things towards finding numbers at the end of the phrase\n", " end_of_text = re.sub(r'^[^0-9]*([0-9\\.,\\s]*$)', r'\\1', text)\n", " start_of_text = text.replace(end_of_text, '')\n", " # Then just bludgeon out all the possible page numbers\n", " page_numbers = re.findall(r'\\d+', end_of_text)\n", " return start_of_text, page_numbers" ] }, { "cell_type": "markdown", "id": "d2851234", "metadata": {}, "source": [ "Let's see how that works:" ] }, { "cell_type": "code", "execution_count": 22, "id": "ad479d49", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('adventures of Gau- dentio di Lucca, ', ['247']),\n", " ('successful experiments, ', ['217'])]" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Use a test example of subsidiary elements; there is no page number in the first part\n", "[extract_page_numbers_from_line(t) for t in repair_index_subsidiary_separator(text_comma_subsidiary).split(\":::\")[1:]]" ] }, { "cell_type": "markdown", "id": "ce9037e7", "metadata": {}, "source": [ "And if there are no numbers?" ] }, { "cell_type": "code", "execution_count": 23, "id": "05914123", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "('No numbers here', [])" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "extract_page_numbers_from_line(\"No numbers here\")" ] }, { "cell_type": "code", "execution_count": 24, "id": "a73bb67f", "metadata": {}, "outputs": [], "source": [ "def parse_index_line(text):\n", " \"\"\"Parse out elements of the index entry.\"\"\"\n", " \n", " # Split the entry in subsidiary parts and clean white space\n", " parts = [p.strip() for p in text.split(\":::\")]\n", "\n", " # Do we have one entry or many?\n", " if len(parts) == 1:\n", " # There are no subsidiary parts\n", " # The first part is the main index entry\n", " # from which we need to separate one or more page references\n", " entry_text, page_numbers = extract_page_numbers_from_line(parts[0])\n", " index_entries = [{\"index_term\": entry_text, \"typ\": \"P\",\n", " \"page_numbers\": page_numbers}]\n", " else:\n", " # There are subsidiary parts\n", " # In this case, we get each subsidiary part and its page references\n", " # Get the subsidiary parts\n", " index_entries = []\n", " for p in parts[1:]:\n", " entry_text, page_numbers = extract_page_numbers_from_line(p)\n", " subsidiary_entry = {\"index_term\": parts[0],\n", " \"subsidiary\": entry_text, \"typ\": \"S\",\n", " \"page_numbers\": page_numbers}\n", " index_entries.append(subsidiary_entry)\n", "\n", " return index_entries" ] }, { "cell_type": "code", "execution_count": 25, "id": "8dd96268", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'index_term': '“ Noise\" derivations of ',\n", " 'typ': 'P',\n", " 'page_numbers': ['81', '106', '138', '218', '35']}]" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "parse_index_line('“ Noise\" derivations of 81. 106. 138. 218. 35')" ] }, { "cell_type": "markdown", "id": "05c4b260", "metadata": {}, "source": [ "So does that work?!" ] }, { "cell_type": "code", "execution_count": 26, "id": "ac2eca1f", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'index_term': 'Aboriginal chambers near Tilbury',\n", " 'subsidiary': '',\n", " 'typ': 'S',\n", " 'page_numbers': ['462']}]" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "parse_index_line(repaired_sample_index2.split(\"\\n\")[4])" ] }, { "cell_type": "markdown", "id": "7cb8ed49", "metadata": {}, "source": [ "In the above case, we have an error in that we have rolled an index entry into an empty subsidiary record because there is no page number attached to the first part of the split entry.\n", "\n", "*In this case, alphabetic sorting checks across several index entries (and subsidiaries) might help us detect this error; for example, if a subsidiary term sorts between the index term and the next index term, we might guess that the subsidiary is actually a main index term.*\n", "\n", "Note that if we construct a full text search across the `index_term` and `subsidiary` columns, we are likely to get false positives but we shouldn't miss anything..." ] },
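 { "cell_type": "markdown", "id": "2b9c0d41", "metadata": {}, "source": [ "As a rough sketch of the kind of alphabetical check we have in mind (purely illustrative, and not used in what follows):\n", "\n", "```python\n", "def probably_misfiled(term, subsidiary, next_term):\n", "    # if a \"subsidiary\" sorts between its parent term and the next\n", "    # primary term, it may really be a primary entry in its own right\n", "    return term.lower() < subsidiary.lower() < next_term.lower()\n", "\n", "probably_misfiled(\"Aboriginal chambers near Tilbury\",\n", "                  \"Accuracy of references\",\n", "                  \"Addison's books\")\n", "# expected: True\n", "```" ] },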
Noise\" derivations of 81. 106. 138. 218. 35')" ] }, { "cell_type": "markdown", "id": "05c4b260", "metadata": {}, "source": [ "So does that work?!" ] }, { "cell_type": "code", "execution_count": 26, "id": "ac2eca1f", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'index_term': 'Aboriginal chambers near Tilbury',\n", " 'subsidiary': '',\n", " 'typ': 'S',\n", " 'page_numbers': ['462']}]" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "parse_index_line(repaired_sample_index2.split(\"\\n\")[4])" ] }, { "cell_type": "markdown", "id": "7cb8ed49", "metadata": {}, "source": [ "In the above case, we have an error in that we have rolled one index entry as a subsidiary to an initial index entry because of a missing page number for the first entry.\n", "\n", "*In this case, alphabetic sorting checks across several index entries (and subsidiaries) might help us detect this error; for example, if a subsidiary term sorts between the index term and the next index term, we might guess that the subsidiary is actually a main index term.*\n", "\n", "Note that if we construct a full text search across the `index_term` and `subsidiary` columns, we are likely to get false positives but we shouldn't miss anything..." ] }, { "cell_type": "markdown", "id": "8c18d13f", "metadata": {}, "source": [ "We can now try to create a complete set of records that we could upload to out database.\n", "\n", "To start with, we need the metadata, which means we need the title." ] }, { "cell_type": "code", "execution_count": 27, "id": "a38c9854", "metadata": {}, "outputs": [], "source": [ "def get_title_from_id(db, id_val):\n", " \"\"\"get the title of the issue from the database.\"\"\"\n", " q = f'SELECT title FROM metadata WHERE id=\"{id_val}\"'\n", "\n", " return read_sql(q, db.conn)[\"title\"][0]" ] }, { "cell_type": "markdown", "id": "077c48d6", "metadata": {}, "source": [ "For example:" ] }, { "cell_type": "code", "execution_count": 28, "id": "04dc3b56", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'year': ' 1849 - 1850', 'vol': '1'}" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "index_base_data = get_index_metadata_from_title(get_title_from_id(db, sample_index_id))\n", "index_base_data" ] }, { "cell_type": "markdown", "id": "e5a35718", "metadata": {}, "source": [ "Now we need to separate each line item into multiple items. The `pandas` dataframe can come to out aid here, with its ability to easily split out listed items in one cell onto multiple rows:" ] }, { "cell_type": "code", "execution_count": 29, "id": "528b392f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
index_termsubsidiarytyppage_numbers
0Aboriginal chambers near TilburyS[462]
\n", "
" ], "text/plain": [ " index_term subsidiary typ page_numbers\n", "0 Aboriginal chambers near Tilbury S [462]" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "example_subsidiary_df = pd.DataFrame(parse_index_line(repaired_sample_index2.split(\"\\n\")[4]))\n", "example_subsidiary_df" ] }, { "cell_type": "markdown", "id": "300a9ac3", "metadata": {}, "source": [ "We can now \"explode\" that dataframe against the lists of page numbers to get one row per item:" ] }, { "cell_type": "code", "execution_count": 30, "id": "2cdea110", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
index_termsubsidiarytyppage_num
0Aboriginal chambers near TilburyS462
\n", "
" ], "text/plain": [ " index_term subsidiary typ page_num\n", "0 Aboriginal chambers near Tilbury S 462" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example_subsidiary_df.explode('page_numbers').rename(columns={\"page_numbers\": \"page_num\"})" ] }, { "cell_type": "markdown", "id": "ebc59925", "metadata": {}, "source": [ "Let's see if we can now put all those pieces together. Essentially, for each index line, we need to generate the complete set of records we want to add to the database." ] }, { "cell_type": "code", "execution_count": 31, "id": "ff1ab252", "metadata": {}, "outputs": [], "source": [ "def construct_index_records(id_val=None, text=None, metadata=None, retval=\"explode\"):\n", " \"\"\"Generate a complete set of index records from original search text document.\"\"\"\n", " if id_val is None and text is None:\n", " return []\n", "\n", " text = download_and_extract_text(id_val, typ=\"djvutxt\") if text is None else text\n", "\n", " records = []\n", " # Repair the text\n", " repaired_text = repair_index_missing_line_breaks( text )\n", " repaired_text = repair_index_unwanted_line_breaks( repaired_text )\n", " repaired_text = repair_index_subsidiary_separator( repaired_text )\n", "\n", " for line in repaired_text.split(\"\\n\"):\n", " if line:\n", " new_line = parse_index_line(line)\n", " records.extend(new_line)\n", " \n", " if retval not in [\"df\", \"explode\"] or id_val is None:\n", " # Return the list of dicts, without the metadata\n", " return records\n", "\n", " # WARNING - if we used provided text, the id_val and the text may actually be inconsistent\n", " index_base_data = get_index_metadata_from_title(get_title_from_id(db, id_val))\n", " # Generate a dataframe\n", " records_df = pd.DataFrame(records)\n", " \n", " records_df[\"source_id\"] = id_val\n", " records_df[\"year\"] = index_base_data[\"year\"]\n", " records_df[\"vol\"] = index_base_data[\"vol\"]\n", " \n", " if retval==\"explode\":\n", " return records_df.explode('page_numbers').rename(columns={\"page_numbers\": \"page_num\"})\n", " elif retval==\"df\":\n", " return records_df" ] }, { "cell_type": "markdown", "id": "7d16a9a4", "metadata": {}, "source": [ "And when we run it:" ] }, { "cell_type": "code", "execution_count": 32, "id": "79607bb7", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
index_termtyppage_numsubsidiarysource_idyearvol
844Complutensian PolygiotS402sim_notes-and-queries_1849-1850_1_index1849 - 18501
844Complutensian PolygiotS431sim_notes-and-queries_1849-1850_1_index1849 - 18501
845Compton StreetS228Soho,sim_notes-and-queries_1849-1850_1_index1849 - 18501
846Conrad of Salisbury’P8NaNsim_notes-and-queries_1849-1850_1_index1849 - 18501
847Descriptio utriusquePNaNNaNsim_notes-and-queries_1849-1850_1_index1849 - 18501
........................
1015D.PNaNNaNsim_notes-and-queries_1849-1850_1_index1849 - 18501
1016D.SNaNon Lord Chatham's speech on Americansim_notes-and-queries_1849-1850_1_index1849 - 18501
1017D.S12stamp act,sim_notes-and-queries_1849-1850_1_index1849 - 18501
1018D.S214iden frog,sim_notes-and-queries_1849-1850_1_index1849 - 18501
1019D.SNaNsim_notes-and-queries_1849-1850_1_index1849 - 18501
\n", "

200 rows × 7 columns

\n", "
" ], "text/plain": [ " index_term typ page_num \\\n", "844 Complutensian Polygiot S 402 \n", "844 Complutensian Polygiot S 431 \n", "845 Compton Street S 228 \n", "846 Conrad of Salisbury’ P 8 \n", "847 Descriptio utriusque P NaN \n", "... ... .. ... \n", "1015 D. P NaN \n", "1016 D. S NaN \n", "1017 D. S 12 \n", "1018 D. S 214 \n", "1019 D. S NaN \n", "\n", " subsidiary \\\n", "844 \n", "844 \n", "845 Soho, \n", "846 NaN \n", "847 NaN \n", "... ... \n", "1015 NaN \n", "1016 on Lord Chatham's speech on American \n", "1017 stamp act, \n", "1018 iden frog, \n", "1019 \n", "\n", " source_id year vol \n", "844 sim_notes-and-queries_1849-1850_1_index 1849 - 1850 1 \n", "844 sim_notes-and-queries_1849-1850_1_index 1849 - 1850 1 \n", "845 sim_notes-and-queries_1849-1850_1_index 1849 - 1850 1 \n", "846 sim_notes-and-queries_1849-1850_1_index 1849 - 1850 1 \n", "847 sim_notes-and-queries_1849-1850_1_index 1849 - 1850 1 \n", "... ... ... .. \n", "1015 sim_notes-and-queries_1849-1850_1_index 1849 - 1850 1 \n", "1016 sim_notes-and-queries_1849-1850_1_index 1849 - 1850 1 \n", "1017 sim_notes-and-queries_1849-1850_1_index 1849 - 1850 1 \n", "1018 sim_notes-and-queries_1849-1850_1_index 1849 - 1850 1 \n", "1019 sim_notes-and-queries_1849-1850_1_index 1849 - 1850 1 \n", "\n", "[200 rows x 7 columns]" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "construct_index_records(sample_index_id)[1000: 1200]" ] }, { "cell_type": "markdown", "id": "3221649b", "metadata": {}, "source": [ "It's far from ideal, but at least gives us something to work with. So let's add it to the database, and see how a search feels." ] }, { "cell_type": "code", "execution_count": 33, "id": "0107ddcc", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "db[\"index_entries\"].insert_all(construct_index_records(sample_index_id).to_dict(orient=\"records\"))" ] }, { "cell_type": "markdown", "id": "d839be04", "metadata": {}, "source": [ "Let's try a search:" ] }, { "cell_type": "code", "execution_count": 34, "id": "0b3e82c0", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
source_idindex_termsubsidiaryyearvolpage_num
0sim_notes-and-queries_1849-1850_1_indexCure for the hooping-cough1849 - 18501397
1sim_notes-and-queries_1849-1850_1_indexG. (J.)charm to cure the murrain in cows,1849 - 18501349
2sim_notes-and-queries_1849-1850_1_indexWartscharms for cure of,1849 - 18501349
3sim_notes-and-queries_1849-1850_1_indexWartscharms for cure of,1849 - 18501482
\n", "" ], "text/plain": [ " source_id index_term \\\n", "0 sim_notes-and-queries_1849-1850_1_index Cure for the hooping-cough \n", "1 sim_notes-and-queries_1849-1850_1_index G. (J.) \n", "2 sim_notes-and-queries_1849-1850_1_index Warts \n", "3 sim_notes-and-queries_1849-1850_1_index Warts \n", "\n", " subsidiary year vol page_num \n", "0 1849 - 1850 1 397 \n", "1 charm to cure the murrain in cows, 1849 - 1850 1 349 \n", "2 charms for cure of, 1849 - 1850 1 349 \n", "3 charms for cure of, 1849 - 1850 1 482 " ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "search_term = \"cure\"\n", "\n", "q = f\"\"\"\n", "SELECT * FROM index_entries_fts\n", "WHERE index_entries_fts MATCH {db.quote(search_term)};\n", "\"\"\"\n", "\n", "read_sql(q, db.conn)" ] }, { "cell_type": "markdown", "id": "1d47af4a", "metadata": {}, "source": [ "Let's create a search index over all the index issues up to 1900 excluding the cumulative indexes.\n", "\n", "First, grab all the indexes from the database and then filter to just the years we are interested in.\n", "\n", "*We really should support convenient year searching by adding a year column to the table, or creating a convenient, custom query function to handle years.*" ] }, { "cell_type": "code", "execution_count": 35, "id": "71952d3b", "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/plain": [ "[{'year': 1850,\n", " 'id': 'sim_notes-and-queries_1849-1850_1_index',\n", " 'date': '1849 - 1850',\n", " 'datetime': '1850-03-20T00:00:00',\n", " 'series': None,\n", " 'vol': '1',\n", " 'iss': 'Index',\n", " 'title': 'Notes and Queries 1849 - 1850: Vol 1 Index',\n", " 'next_id': 'sim_notes-and-queries_1849-11-03_1_1',\n", " 'prev_id': '',\n", " 'is_index': 1,\n", " 'restricted': ''},\n", " {'year': 1850,\n", " 'id': 'sim_notes-and-queries_1850_2_index',\n", " 'date': '1850',\n", " 'datetime': '1850-03-20T00:00:00',\n", " 'series': None,\n", " 'vol': '2',\n", " 'iss': 'Index',\n", " 'title': 'Notes and Queries 1850: Vol 2 Index',\n", " 'next_id': 'sim_notes-and-queries_1850-06-01_2_31',\n", " 'prev_id': 'sim_notes-and-queries_1850-05-25_1_30',\n", " 'is_index': 1,\n", " 'restricted': ''},\n", " {'year': 1851,\n", " 'id': 'sim_notes-and-queries_1851_3_index',\n", " 'date': '1851',\n", " 'datetime': '1851-03-20T00:00:00',\n", " 'series': None,\n", " 'vol': '3',\n", " 'iss': 'Index',\n", " 'title': 'Notes and Queries 1851: Vol 3 Index',\n", " 'next_id': 'sim_notes-and-queries_1851-01-04_3_62',\n", " 'prev_id': 'sim_notes-and-queries_1850-12-28_2_61',\n", " 'is_index': 1,\n", " 'restricted': ''}]" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "q = \"\"\"\n", "SELECT * FROM (SELECT CAST(strftime(\"%Y\", datetime) AS INT) AS year, *\n", " FROM metadata\n", " WHERE is_index=1 AND id NOT LIKE \"%cumulative%\") WHERE year < 1900;\n", "\"\"\"\n", "\n", "indexes = read_sql(q, db.conn)\n", "\n", "indexes = indexes.to_dict(orient=\"records\")\n", "indexes[:3]" ] }, { "cell_type": "code", "execution_count": 36, "id": "71d82b9c", "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "6e2e84776c9842b8b7b9aca4b102ed87", "version_major": 2, "version_minor": 0 }, "text/plain": [ " 0%| | 0/99 [00:00\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
source_idindex_termsubsidiaryyearvolpage_num
0sim_notes-and-queries_1851_4_indexB. (E. H.)on mazer-wood and sin-eaters,18514211
1sim_notes-and-queries_1851_4_indexSin-eatersnotices respecting,18514211
2sim_notes-and-queries_1852_6_indexE .)on the sin-eater,18526541
3sim_notes-and-queries_1852_6_indexI eeper ( (Alex.)on the sin-eater,18526541
4sim_notes-and-queries_1852_6_indexSymons Jelinger C.)origin of sin-eater,18526300
5sim_notes-and-queries_1870_6_indexSin-eaterorigin of the,18706450
6sim_notes-and-queries_1876_6_indexSin-eater18766505
7sim_notes-and-queries_1876_6_indexSin-eater18766505
8sim_notes-and-queries_1877_7_indexB. (A. C.)on the Sin-eater,1877714
9sim_notes-and-queries_1877_7_indexSin-eater1877714
10sim_notes-and-queries_1883_7_indexSin-eater1883725
11sim_notes-and-queries_1883_7_indexSin-eater18837334
12sim_notes-and-queries_1883_8_indexSin-eater18838255
13sim_notes-and-queries_1895_8_indexSin-eaters18958288
14sim_notes-and-queries_1895_8_indexSin-eaters18958288
15sim_notes-and-queries_1895_8_indexSin-eaters18958332
16sim_notes-and-queries_1896_9_indexSin-eaters18969169
17sim_notes-and-queries_1896_9_indexSin-eaters18969296
18sim_notes-and-queries_1896_9_indexSin-eaters18969110
19sim_notes-and-queries_1896_9_indexSin-eaters18969111
20sim_notes-and-queries_1896_9_indexSin-eaters18969109
21sim_notes-and-queries_1896_9_indexSin-eaters18969169
22sim_notes-and-queries_1896_9_indexSin-eaters18969236
23sim_notes-and-queries_1896_9_indexSin-eaters18969296
24sim_notes-and-queries_1896_9_indexSin-eaters18969111
25sim_notes-and-queries_1896_9_indexSin-eaters18969110
26sim_notes-and-queries_1896_9_indexOwen (J. P.)on sin-eaters,18969109
27sim_notes-and-queries_1896_9_indexOwen (J. P.)on sin-eaters,18969236
28sim_notes-and-queries_1896_9_indexSin-eaters18969110
29sim_notes-and-queries_1896_9_indexSin-eaters18969109
30sim_notes-and-queries_1896_9_indexSin-eaters18969169
31sim_notes-and-queries_1896_9_indexSin-eaters18969236
32sim_notes-and-queries_1896_9_indexSin-eaters18969296
33sim_notes-and-queries_1896_9_indexThomas (N. W.)on sin-eaters,18969169
\n", "" ], "text/plain": [ " source_id index_term \\\n", "0 sim_notes-and-queries_1851_4_index B. (E. H.) \n", "1 sim_notes-and-queries_1851_4_index Sin-eaters \n", "2 sim_notes-and-queries_1852_6_index E .) \n", "3 sim_notes-and-queries_1852_6_index I eeper ( (Alex.) \n", "4 sim_notes-and-queries_1852_6_index Symons Jelinger C.) \n", "5 sim_notes-and-queries_1870_6_index Sin-eater \n", "6 sim_notes-and-queries_1876_6_index Sin-eater \n", "7 sim_notes-and-queries_1876_6_index Sin-eater \n", "8 sim_notes-and-queries_1877_7_index B. (A. C.) \n", "9 sim_notes-and-queries_1877_7_index Sin-eater \n", "10 sim_notes-and-queries_1883_7_index Sin-eater \n", "11 sim_notes-and-queries_1883_7_index Sin-eater \n", "12 sim_notes-and-queries_1883_8_index Sin-eater \n", "13 sim_notes-and-queries_1895_8_index Sin-eaters \n", "14 sim_notes-and-queries_1895_8_index Sin-eaters \n", "15 sim_notes-and-queries_1895_8_index Sin-eaters \n", "16 sim_notes-and-queries_1896_9_index Sin-eaters \n", "17 sim_notes-and-queries_1896_9_index Sin-eaters \n", "18 sim_notes-and-queries_1896_9_index Sin-eaters \n", "19 sim_notes-and-queries_1896_9_index Sin-eaters \n", "20 sim_notes-and-queries_1896_9_index Sin-eaters \n", "21 sim_notes-and-queries_1896_9_index Sin-eaters \n", "22 sim_notes-and-queries_1896_9_index Sin-eaters \n", "23 sim_notes-and-queries_1896_9_index Sin-eaters \n", "24 sim_notes-and-queries_1896_9_index Sin-eaters \n", "25 sim_notes-and-queries_1896_9_index Sin-eaters \n", "26 sim_notes-and-queries_1896_9_index Owen (J. P.) \n", "27 sim_notes-and-queries_1896_9_index Owen (J. P.) \n", "28 sim_notes-and-queries_1896_9_index Sin-eaters \n", "29 sim_notes-and-queries_1896_9_index Sin-eaters \n", "30 sim_notes-and-queries_1896_9_index Sin-eaters \n", "31 sim_notes-and-queries_1896_9_index Sin-eaters \n", "32 sim_notes-and-queries_1896_9_index Sin-eaters \n", "33 sim_notes-and-queries_1896_9_index Thomas (N. W.) \n", "\n", " subsidiary year vol page_num \n", "0 on mazer-wood and sin-eaters, 1851 4 211 \n", "1 notices respecting, 1851 4 211 \n", "2 on the sin-eater, 1852 6 541 \n", "3 on the sin-eater, 1852 6 541 \n", "4 origin of sin-eater, 1852 6 300 \n", "5 origin of the, 1870 6 450 \n", "6 1876 6 505 \n", "7 1876 6 505 \n", "8 on the Sin-eater, 1877 7 14 \n", "9 1877 7 14 \n", "10 1883 7 25 \n", "11 1883 7 334 \n", "12 1883 8 255 \n", "13 1895 8 288 \n", "14 1895 8 288 \n", "15 1895 8 332 \n", "16 1896 9 169 \n", "17 1896 9 296 \n", "18 1896 9 110 \n", "19 1896 9 111 \n", "20 1896 9 109 \n", "21 1896 9 169 \n", "22 1896 9 236 \n", "23 1896 9 296 \n", "24 1896 9 111 \n", "25 1896 9 110 \n", "26 on sin-eaters, 1896 9 109 \n", "27 on sin-eaters, 1896 9 236 \n", "28 1896 9 110 \n", "29 1896 9 109 \n", "30 1896 9 169 \n", "31 1896 9 236 \n", "32 1896 9 296 \n", "33 on sin-eaters, 1896 9 169 " ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "search_term = \"sin eater\"\n", "\n", "q = f\"\"\"\n", "SELECT * FROM index_entries_fts\n", "WHERE index_entries_fts MATCH {db.quote(search_term)};\n", "\"\"\"\n", "\n", "read_sql(q, db.conn)[:50]" ] }, { "cell_type": "code", "execution_count": 38, "id": "87dd7b1a", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
source_idindex_termsubsidiaryyearvolpage_num
0sim_notes-and-queries_1869_4_indexBoggarts and Feorin18694508
1sim_notes-and-queries_1869_4_indexBoggarts and Feorin18694508
2sim_notes-and-queries_1869_4_indexHigson (John)on Boggarts and Feorin,186945
3sim_notes-and-queries_1870_5_indexBoggarts and Feorin1870523
4sim_notes-and-queries_1870_5_indexBoggarts and Feorin18705156
5sim_notes-and-queries_1870_5_indexBoggarts and Feorin18705216
6sim_notes-and-queries_1870_5_indexBoggarts and Feorin18705287
7sim_notes-and-queries_1870_5_indexBoggarts and Feorin18705365
8sim_notes-and-queries_1870_5_indexBoggarts and Feorin18705517
9sim_notes-and-queries_1870_5_indexBowker (James)on Boggarts, Feorin, &c.18705365
10sim_notes-and-queries_1870_5_indexBoggartsFeorin,18705287
11sim_notes-and-queries_1870_5_indexDavies (Wm.)on Boggarts, Feorin, &c.,18705216
12sim_notes-and-queries_1870_5_indexHermentrudeon Boggarts,1870523
13sim_notes-and-queries_1870_5_indexBoggartsFeorin, &c.,18705156
14sim_notes-and-queries_1870_5_indexRiley (H. T.)on Boggarts, Feorin, &c.,18705216
15sim_notes-and-queries_1870_5_indexSmith (W. J. B.)on Boggarts, Feorin, &e.,18705317
16sim_notes-and-queries_1895_8_index: Boggart, 85ah by a sheep,18958170
17sim_notes-and-queries_1895_8_indexBoggart=ghost1895885
18sim_notes-and-queries_1895_8_indexBoggart=ghost18958255
\n", "
" ], "text/plain": [ " source_id index_term \\\n", "0 sim_notes-and-queries_1869_4_index Boggarts and Feorin \n", "1 sim_notes-and-queries_1869_4_index Boggarts and Feorin \n", "2 sim_notes-and-queries_1869_4_index Higson (John) \n", "3 sim_notes-and-queries_1870_5_index Boggarts and Feorin \n", "4 sim_notes-and-queries_1870_5_index Boggarts and Feorin \n", "5 sim_notes-and-queries_1870_5_index Boggarts and Feorin \n", "6 sim_notes-and-queries_1870_5_index Boggarts and Feorin \n", "7 sim_notes-and-queries_1870_5_index Boggarts and Feorin \n", "8 sim_notes-and-queries_1870_5_index Boggarts and Feorin \n", "9 sim_notes-and-queries_1870_5_index Bowker (James) \n", "10 sim_notes-and-queries_1870_5_index Boggarts \n", "11 sim_notes-and-queries_1870_5_index Davies (Wm.) \n", "12 sim_notes-and-queries_1870_5_index Hermentrude \n", "13 sim_notes-and-queries_1870_5_index Boggarts \n", "14 sim_notes-and-queries_1870_5_index Riley (H. T.) \n", "15 sim_notes-and-queries_1870_5_index Smith (W. J. B.) \n", "16 sim_notes-and-queries_1895_8_index : Boggart, 85 \n", "17 sim_notes-and-queries_1895_8_index Boggart=ghost \n", "18 sim_notes-and-queries_1895_8_index Boggart=ghost \n", "\n", " subsidiary year vol page_num \n", "0 1869 4 508 \n", "1 1869 4 508 \n", "2 on Boggarts and Feorin, 1869 4 5 \n", "3 1870 5 23 \n", "4 1870 5 156 \n", "5 1870 5 216 \n", "6 1870 5 287 \n", "7 1870 5 365 \n", "8 1870 5 517 \n", "9 on Boggarts, Feorin, &c. 1870 5 365 \n", "10 Feorin, 1870 5 287 \n", "11 on Boggarts, Feorin, &c., 1870 5 216 \n", "12 on Boggarts, 1870 5 23 \n", "13 Feorin, &c., 1870 5 156 \n", "14 on Boggarts, Feorin, &c., 1870 5 216 \n", "15 on Boggarts, Feorin, &e., 1870 5 317 \n", "16 ah by a sheep, 1895 8 170 \n", "17 1895 8 85 \n", "18 1895 8 255 " ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "search_term = \"boggart\"\n", "\n", "q = f\"\"\"\n", "SELECT * FROM index_entries_fts\n", "WHERE index_entries_fts MATCH {db.quote(search_term)};\n", "\"\"\"\n", "\n", "read_sql(q, db.conn)[:50]" ] }, { "cell_type": "markdown", "id": "e082de29", "metadata": {}, "source": [ "## Add the Index Tables to the Full Database\n", "\n", "Let's also add the index tables to our full database." ] }, { "cell_type": "code", "execution_count": 40, "id": "9adb9e8a", "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "7a7dada260b94a78acff4fb89aa165d9", "version_major": 2, "version_minor": 0 }, "text/plain": [ " 0%| | 0/99 [00:00