{ "cells": [ { "cell_type": "markdown", "id": "cb10925c-cccb-420e-be3b-b5ca49ad5cf5", "metadata": { "tags": [] }, "source": [ "# Identifying 'odd' characters for feature 'after' (N1904LFT)" ] }, { "cell_type": "markdown", "id": "b9e178c9-7abb-46cf-b4d4-8d38a5985bf2", "metadata": { "tags": [] }, "source": [ "## Table of content \n", "* 1 - Introduction]\n", "* 2 - Load Text-Fabric app and data\n", "* 3 - Performing the queries\n", " * 3.1 - Showing the issue\n", " * 3.2 - Setting up a query to find them\n", " * 3.3 - Explanation of the regular expression\n", " * 3.4 - Bug" ] }, { "cell_type": "markdown", "id": "549033e8-b844-4504-8017-bd4a389e1164", "metadata": {}, "source": [ "# 1 - Introduction \n", "##### [Back to TOC](#TOC)\n", "\n", "This Jupyter Notebook investigates the pressense of 'odd' values for feature 'after'. " ] }, { "cell_type": "markdown", "id": "01f65e07-00ae-4099-892e-6dcfeecd6663", "metadata": {}, "source": [ "# 2 - Load Text-Fabric app and data \n", "##### [Back to TOC](#TOC)" ] }, { "cell_type": "code", "execution_count": 2, "id": "51782023-07ce-4923-b46a-3ed0fd2b8a12", "metadata": { "tags": [] }, "outputs": [], "source": [ "%load_ext autoreload\n", "%autoreload 2" ] }, { "cell_type": "code", "execution_count": 3, "id": "a1afe711-fc3a-49c7-a3a8-6d889c0adc0e", "metadata": {}, "outputs": [], "source": [ "# Loading the New Testament TextFabric code\n", "# Note: it is assumed Text-Fabric is installed in your environment.\n", "\n", "from tf.fabric import Fabric\n", "from tf.app import use" ] }, { "cell_type": "code", "execution_count": 4, "id": "29ff4a94-c84d-4011-9dd8-4acfd3f4a845", "metadata": { "scrolled": true, "tags": [] }, "outputs": [ { "data": { "text/markdown": [ "**Locating corpus resources ...**" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "Status: latest release online v03 versus None locally" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { 
"text/html": [ "downloading app, main data and requested additions ..." ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "app: ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/app" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "The requested data is not available offline\n", "\t~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3 not found\n" ] }, { "data": { "text/html": [ "Status: latest release online v03 versus None locally" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "downloading app, main data and requested additions ..." ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "data: ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ " | 0.30s T otype from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3\n", " | 3.07s T oslots from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3\n", " | 0.01s T book from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3\n", " | 0.58s T chapter from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3\n", " | 0.70s T word from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3\n", " | 0.57s T after from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3\n", " | 0.57s T verse from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3\n", " | | 0.08s C __levels__ from otype, oslots, otext\n", " | | 1.79s C __order__ from otype, oslots, __levels__\n", " | | 0.08s C __rank__ from otype, __order__\n", " | | 4.63s C __levUp__ from otype, oslots, __rank__\n", " | | 2.70s C __levDown__ from otype, __levUp__, __rank__\n", " | | 0.06s C __characters__ from otext\n", " | | 1.19s C __boundary__ from otype, oslots, 
__rank__\n", " | | 0.05s C __sections__ from otype, oslots, otext, __levUp__, __levels__, book, chapter, verse\n", " | | 0.26s C __structure__ from otype, oslots, otext, __rank__, __levUp__, book, chapter, verse\n", " | 0.54s T appos from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3\n", " | 0.58s T book_long from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3\n", " | 0.50s T booknumber from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3\n", " | 0.57s T bookshort from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3\n", " | 0.55s T case from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3\n", " | 0.47s T clausetype from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3\n", " | 0.66s T containedclause from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3\n", " | 0.50s T degree from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3\n", " | 0.65s T gloss from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3\n", " | 0.54s T gn from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3\n", " | 0.72s T id from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3\n", " | 0.48s T junction from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3\n", " | 0.63s T lemma from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3\n", " | 0.58s T lex_dom from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3\n", " | 0.60s T ln from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3\n", " | 0.52s T monad from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3\n", " | 0.51s T mood from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3\n", " | 0.59s T morph from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3\n", " | 0.61s T nodeID from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3\n", " | 0.68s T normalized from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3\n", " | 0.56s T nu from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3\n", " | 0.56s T 
number from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3\n", " | 0.50s T orig_order from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3\n", " | 0.51s T person from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3\n", " | 0.75s T ref from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3\n", " | 0.57s T roleclausedistance from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3\n", " | 0.55s T rule from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3\n", " | 0.51s T sentence from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3\n", " | 0.58s T sp from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3\n", " | 0.57s T sp_full from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3\n", " | 0.61s T strongs from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3\n", " | 0.51s T subj_ref from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3\n", " | 0.51s T tense from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3\n", " | 0.53s T type from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3\n", " | 0.71s T unicode from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3\n", " | 0.52s T voice from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3\n", " | 0.55s T wgclass from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3\n", " | 0.48s T wglevel from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3\n", " | 0.51s T wgnum from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3\n", " | 0.50s T wgrole from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3\n", " | 0.50s T wgrolelong from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3\n", " | 0.53s T wgtype from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3\n", " | 0.07s T wordgroup from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3\n", " | 0.57s T wordlevel from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3\n", " | 0.56s T wordrole from 
~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3\n", " | 0.58s T wordrolelong from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3\n" ] }, { "data": { "text/html": [ "\n", " Text-Fabric: Text-Fabric API 11.4.10, tonyjurg/Nestle1904LFT/app v3, Search Reference
\n", " Data: tonyjurg - Nestle1904LFT 0.3, Character table, Feature docs
\n", "
Node types\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "
Name# of nodes# slots/node% coverage
book275102.93100
chapter260529.92100
verse794317.35100
sentence1216011.33100
wg1324606.59633
word1377791.00100
\n", " Sets: no custom sets
\n", " Features:
\n", "
Nestle 1904 (LowFat Tree)\n", "
\n", "\n", "
\n", "
\n", "after\n", "
\n", "
str
\n", "\n", " Characters (eg. punctuations) following the word\n", "\n", "
\n", "\n", "
\n", "
\n", "appos\n", "
\n", "
str
\n", "\n", " Apposition details\n", "\n", "
\n", "\n", "
\n", "
\n", "book\n", "
\n", "
str
\n", "\n", " Book\n", "\n", "
\n", "\n", "
\n", "
\n", "book_long\n", "
\n", "
str
\n", "\n", " Book name (fully spelled out)\n", "\n", "
\n", "\n", "
\n", "
\n", "booknumber\n", "
\n", "
int
\n", "\n", " NT book number (Matthew=1, Mark=2, ..., Revelation=27)\n", "\n", "
\n", "\n", "
\n", "
\n", "bookshort\n", "
\n", "
str
\n", "\n", " Book name (abbreviated)\n", "\n", "
\n", "\n", "
\n", "
\n", "case\n", "
\n", "
str
\n", "\n", " Gramatical case (Nominative, Genitive, Dative, Accusative, Vocative)\n", "\n", "
\n", "\n", "
\n", "
\n", "chapter\n", "
\n", "
int
\n", "\n", " Chapter number inside book\n", "\n", "
\n", "\n", "
\n", "
\n", "clausetype\n", "
\n", "
str
\n", "\n", " Clause type details\n", "\n", "
\n", "\n", "
\n", "
\n", "containedclause\n", "
\n", "
str
\n", "\n", " Contained clause (WG number)\n", "\n", "
\n", "\n", "
\n", "
\n", "degree\n", "
\n", "
str
\n", "\n", " Degree (e.g. Comparitative, Superlative)\n", "\n", "
\n", "\n", "
\n", "
\n", "gloss\n", "
\n", "
str
\n", "\n", " English gloss\n", "\n", "
\n", "\n", "
\n", "
\n", "gn\n", "
\n", "
str
\n", "\n", " Gramatical gender (Masculine, Feminine, Neuter)\n", "\n", "
\n", "\n", "
\n", "
\n", "id\n", "
\n", "
str
\n", "\n", " id of the word\n", "\n", "
\n", "\n", "
\n", "
\n", "junction\n", "
\n", "
str
\n", "\n", " Junction data related to a wordgroup\n", "\n", "
\n", "\n", "
\n", "
\n", "lemma\n", "
\n", "
str
\n", "\n", " Lexeme (lemma)\n", "\n", "
\n", "\n", "
\n", "
\n", "lex_dom\n", "
\n", "
str
\n", "\n", " Lexical domain according to Semantic Dictionary of Biblical Greek, SDBG (not present everywhere?)\n", "\n", "
\n", "\n", "
\n", "
\n", "ln\n", "
\n", "
str
\n", "\n", " Lauw-Nida lexical classification (not present everywhere?)\n", "\n", "
\n", "\n", "
\n", "
\n", "monad\n", "
\n", "
int
\n", "\n", " Monad (currently: order of words in XML tree file!)\n", "\n", "
\n", "\n", "
\n", "
\n", "mood\n", "
\n", "
str
\n", "\n", " Gramatical mood of the verb (passive, etc)\n", "\n", "
\n", "\n", "
\n", "
\n", "morph\n", "
\n", "
str
\n", "\n", " Morphological tag (Sandborg-Petersen morphology)\n", "\n", "
\n", "\n", "
\n", "
\n", "nodeID\n", "
\n", "
str
\n", "\n", " Node ID (as in the XML source data, not yet post-processes)\n", "\n", "
\n", "\n", "
\n", "
\n", "normalized\n", "
\n", "
str
\n", "\n", " Surface word stripped of punctations\n", "\n", "
\n", "\n", "
\n", "
\n", "nu\n", "
\n", "
str
\n", "\n", " Gramatical number (Singular, Plural)\n", "\n", "
\n", "\n", "
\n", "
\n", "number\n", "
\n", "
str
\n", "\n", " Gramatical number of the verb\n", "\n", "
\n", "\n", "
\n", "
\n", "orig_order\n", "
\n", "
int
\n", "\n", " Word order within corpus (per book)\n", "\n", "
\n", "\n", "
\n", "
\n", "otype\n", "
\n", "
str
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "
\n", "person\n", "
\n", "
str
\n", "\n", " Gramatical person of the verb (first, second, third)\n", "\n", "
\n", "\n", "
\n", "
\n", "ref\n", "
\n", "
str
\n", "\n", " ref Id\n", "\n", "
\n", "\n", "
\n", "
\n", "roleclausedistance\n", "
\n", "
str
\n", "\n", " distance to wordgroup defining the role of this word\n", "\n", "
\n", "\n", "
\n", "
\n", "rule\n", "
\n", "
str
\n", "\n", " Wordgroup rule information \n", "\n", "
\n", "\n", "
\n", "
\n", "sentence\n", "
\n", "
int
\n", "\n", " Sentence number (counted per chapter)\n", "\n", "
\n", "\n", "
\n", "
\n", "sp\n", "
\n", "
str
\n", "\n", " Part of Speech (abbreviated)\n", "\n", "
\n", "\n", "
\n", "
\n", "sp_full\n", "
\n", "
str
\n", "\n", " Part of Speech (long description)\n", "\n", "
\n", "\n", "
\n", "
\n", "strongs\n", "
\n", "
str
\n", "\n", " Strongs number\n", "\n", "
\n", "\n", "
\n", "
\n", "subj_ref\n", "
\n", "
str
\n", "\n", " Subject reference (to nodeID in XML source data, not yet post-processes)\n", "\n", "
\n", "\n", "
\n", "
\n", "tense\n", "
\n", "
str
\n", "\n", " Gramatical tense of the verb (e.g. Present, Aorist)\n", "\n", "
\n", "\n", "
\n", "
\n", "type\n", "
\n", "
str
\n", "\n", " Gramatical type of noun or pronoun (e.g. Common, Personal)\n", "\n", "
\n", "\n", "
\n", "
\n", "unicode\n", "
\n", "
str
\n", "\n", " Word as it arears in the text in Unicode (incl. punctuations)\n", "\n", "
\n", "\n", "
\n", "
\n", "verse\n", "
\n", "
int
\n", "\n", " Verse number inside chapter\n", "\n", "
\n", "\n", "
\n", "
\n", "voice\n", "
\n", "
str
\n", "\n", " Gramatical voice of the verb\n", "\n", "
\n", "\n", "
\n", "
\n", "wgclass\n", "
\n", "
str
\n", "\n", " Class of the wordgroup ()\n", "\n", "
\n", "\n", "
\n", "
\n", "wglevel\n", "
\n", "
int
\n", "\n", " number of parent wordgroups for a wordgroup\n", "\n", "
\n", "\n", "
\n", "
\n", "wgnum\n", "
\n", "
int
\n", "\n", " Wordgroup number (counted per book)\n", "\n", "
\n", "\n", "
\n", "
\n", "wgrole\n", "
\n", "
str
\n", "\n", " Role of the wordgroup (abbreviated)\n", "\n", "
\n", "\n", "
\n", "
\n", "wgrolelong\n", "
\n", "
str
\n", "\n", " Role of the wordgroup (full)\n", "\n", "
\n", "\n", "
\n", "
\n", "wgtype\n", "
\n", "
str
\n", "\n", " Wordgroup type details\n", "\n", "
\n", "\n", "
\n", "
\n", "word\n", "
\n", "
str
\n", "\n", " Word as it appears in the text (excl. punctuations)\n", "\n", "
\n", "\n", "
\n", "
\n", "wordgroup\n", "
\n", "
int
\n", "\n", " Wordgroup number (counted per book)\n", "\n", "
\n", "\n", "
\n", "
\n", "wordlevel\n", "
\n", "
str
\n", "\n", " number of parent wordgroups for a word\n", "\n", "
\n", "\n", "
\n", "
\n", "wordrole\n", "
\n", "
str
\n", "\n", " Role of the word (abbreviated)\n", "\n", "
\n", "\n", "
\n", "
\n", "wordrolelong\n", "
\n", "
str
\n", "\n", " Role of the word (full)\n", "\n", "
\n", "\n", "
\n", "
\n", "oslots\n", "
\n", "
none
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "
\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Text-Fabric API: names N F E L T S C TF directly usable

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# load the app and data\n", "N1904 = use (\"tonyjurg/Nestle1904LFT:latest\", hoist=globals())" ] }, { "cell_type": "markdown", "id": "4b1bf471-6511-4fd9-8bb8-116379da307f", "metadata": { "tags": [] }, "source": [ "# 3 - Performing the queries \n", "##### [Back to TOC](#TOC)" ] }, { "cell_type": "markdown", "id": "62bc2c34-bb57-46bf-96f3-3ff6f43c9eee", "metadata": {}, "source": [ "## 3.1 - Showing the issue \n", "##### [Back to TOC](#TOC)" ] }, { "cell_type": "markdown", "id": "754e51a0-0f07-4262-ac18-e96f81160d48", "metadata": {}, "source": [ "The following shows the pressence of a few 'odd' cases for feature 'after':" ] }, { "cell_type": "code", "execution_count": 31, "id": "748067ec-15ac-4080-9fbc-65c97b8cce2b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "frequency: ((' ', 119271), (',', 9443), ('.', 5717), ('·', 2355), (';', 970), ('—', 7), ('ε', 3), ('ς', 3), ('ὶ', 2), ('ί', 1), ('α', 1), ('ι', 1), ('χ', 1), ('ἱ', 1), ('ὁ', 1), ('ὰ', 1), ('ὸ', 1))\n" ] } ], "source": [ "result = F.after.freqList()\n", "print ('frequency: {0}'.format(result))" ] }, { "cell_type": "markdown", "id": "86138709-f6c0-433a-aaab-db6aa079e33a", "metadata": {}, "source": [ "## 3.2 - Setting up a query to find them \n", "##### [Back to TOC](#TOC)" ] }, { "cell_type": "code", "execution_count": 51, "id": "15d731b8-ac0e-4427-a998-29a2997d72b6", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.11s 16 results\n", "╒═════════════════════╤══════════════╤═════════╕\n", "│ location │ word │ after │\n", "╞═════════════════════╪══════════════╪═════════╡\n", "│ Luke 23:51 │ —οὗτο │ ς │\n", "├─────────────────────┼──────────────┼─────────┤\n", "│ Luke 2:35 │ —κα │ ὶ │\n", "├─────────────────────┼──────────────┼─────────┤\n", "│ John 4:2 │ —καίτοιγ │ ε │\n", "├─────────────────────┼──────────────┼─────────┤\n", "│ John 
7:22 │ —οὐ │ χ │\n", "├─────────────────────┼──────────────┼─────────┤\n", "│ Acts 22:2 │ —ἀκούσαντε │ ς │\n", "├─────────────────────┼──────────────┼─────────┤\n", "│ Romans 15:25 │ —νυν │ ὶ │\n", "├─────────────────────┼──────────────┼─────────┤\n", "│ I_Corinthians 9:15 │ —τ │ ὸ │\n", "├─────────────────────┼──────────────┼─────────┤\n", "│ II_Corinthians 12:2 │ —ἁρπαγέντ │ α │\n", "├─────────────────────┼──────────────┼─────────┤\n", "│ II_Corinthians 12:2 │ —εἴτ │ ε │\n", "├─────────────────────┼──────────────┼─────────┤\n", "│ II_Corinthians 12:3 │ —εἴτ │ ε │\n", "├─────────────────────┼──────────────┼─────────┤\n", "│ II_Corinthians 6:2 │ —λέγε │ ι │\n", "├─────────────────────┼──────────────┼─────────┤\n", "│ Galatians 2:6 │ —ὁποῖο │ ί │\n", "├─────────────────────┼──────────────┼─────────┤\n", "│ Ephesians 5:10 │ —δοκιμάζοντε │ ς │\n", "├─────────────────────┼──────────────┼─────────┤\n", "│ Ephesians 5:9 │ — │ ὁ │\n", "├─────────────────────┼──────────────┼─────────┤\n", "│ Hebrews 7:20 │ —ο │ ἱ │\n", "├─────────────────────┼──────────────┼─────────┤\n", "│ Hebrews 7:22 │ —κατ │ ὰ │\n", "╘═════════════════════╧══════════════╧═════════╛\n" ] } ], "source": [ "# Library to format the output as a table\n", "from tabulate import tabulate\n", "\n", "# The actual query: match any 'after' value that does not start\n", "# with expected whitespace or punctuation\n", "SearchOddAfters = '''\n", "word after~^(?!([\s\.·—,;]))\n", " '''\n", "OddAfterList = N1904.search(SearchOddAfters)\n", "\n", "# Postprocess the query results\n", "# (note: avoid 'tuple' as a loop variable; it shadows the Python built-in)\n", "Results = []\n", "for foundTuple in OddAfterList:\n", "    node = foundTuple[0]\n", "    location = \"{} {}:{}\".format(F.book.v(node), F.chapter.v(node), F.verse.v(node))\n", "    Results.append((location, F.word.v(node), F.after.v(node)))\n", "\n", "# Produce the table\n", "headers = [\"location\", \"word\", \"after\"]\n", "print(tabulate(Results, headers=headers, tablefmt='fancy_grid'))" ] }, { "cell_type": "markdown", "id": "d1a89bf7-1aa6-4762-99ab-cf2e677946ec", "metadata": {}, "source": [ "## 3.3 - Explanation of the regular expression \n", "##### 
[Back to TOC](#TOC)\n", "\n", "The regular expression broken down into its components:\n", "\n", "`^`: The caret represents the start of a string. It ensures that the following pattern is applied at the beginning of the string.\n", "\n", "`(?!...)`: This is a negative lookahead assertion. It succeeds only if the pattern inside the parentheses does not match at the current position.\n", "\n", "`[…]`: This denotes a character class, which matches any single character listed within the brackets.\n", "\n", "`[\s\.·—,;]`: This character class contains the following characters:\n", "\n", "* `\s`: This is a shorthand character class that matches any whitespace character, including spaces, tabs, and newlines.\n", "* `\.`: This matches a literal period (the backslash escapes the dot's special meaning).\n", "* `·`: This matches a specific Unicode character, a middle dot.\n", "* `—`: This matches an em dash character.\n", "* `,`: This matches a comma.\n", "* `;`: This matches a semicolon.\n", "\n", "In summary, the character class `[\s\.·—,;]` matches any single character that is either a whitespace character, a period, a middle dot, an em dash, a comma, or a semicolon.\n", "\n", "The full regular expression therefore matches any string that does not start with a whitespace character, period, middle dot, em dash, comma, or semicolon." ] }, { "cell_type": "markdown", "id": "a5e4bdf3-b108-4c6a-b99b-a4bf4723afee", "metadata": {}, "source": [ "The following site can be used to build and verify a regular expression: [regex101.com](https://regex101.com/) (choose the 'Python' flavor). " ] }, { "cell_type": "markdown", "id": "3854f61c", "metadata": {}, "source": [ "## 3.4 - Bug \n", "##### [Back to TOC](#TOC)\n", "\n", "The observed behaviour was due to a bug. [Issue tracker #76](https://github.com/Clear-Bible/macula-greek/issues/76) was opened. 
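As an aside, the same negative-lookahead check can be reproduced outside Text-Fabric with Python's built-in `re` module. This is a minimal sketch that applies the pattern from section 3.2 to a few hand-picked values from the `freqList` output in section 3.1 (the sample list is illustrative, not the full corpus):

```python
import re

# Same negative-lookahead pattern as the Text-Fabric query in section 3.2:
# match any 'after' value that does NOT start with whitespace, a period,
# a middle dot, an em dash, a comma, or a semicolon.
oddAfterPattern = re.compile(r"^(?!([\s\.·—,;]))")

# A few sample values taken from the freqList output in section 3.1
samples = [" ", ",", ".", "·", ";", "—", "ς", "ὶ", "ε"]

# Keep only the values the pattern flags as 'odd'
oddValues = [value for value in samples if oddAfterPattern.match(value)]
print(oddValues)  # → ['ς', 'ὶ', 'ε']
```

Only the stray Greek letters survive the filter; the expected whitespace and punctuation values are all rejected by the lookahead.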
When the text of a node starts with punctuation, the @after attribute contains the last character of the word. This is a bug in the transformation to XML LowFat Tree data." ] }, { "cell_type": "code", "execution_count": null, "id": "118862e2", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.5" } }, "nbformat": 4, "nbformat_minor": 5 }