{ "cells": [ { "cell_type": "markdown", "id": "e8b09e8e-7ffb-4fb8-86ee-16b1f2bf0f80", "metadata": {}, "source": [ "# Verify 'BOL' against 'LFT'" ] }, { "cell_type": "markdown", "id": "1a22bdfd-2f51-4383-b2a9-a78818e969ae", "metadata": {}, "source": [ "For the BOL features to be compatible with the LFT Text-Fabric dataset, the node numbers for node type 'word' need to match exactly. This notebook checks that by comparing the feature `normalized` between the two datasets." ] }, { "cell_type": "code", "execution_count": 71, "id": "93a208bc-058f-45ae-81cf-2371e9158f4f", "metadata": {}, "outputs": [], "source": [ "# The following variables hold the relative paths of the two files to compare\n", "LFTFile=\"../tf/0.5/normalized.tf\"\n", "BOLFile=\"BOL/normalized.tf\"\n", "targetWord=\"Βιβλος\" # unaccented word on which both files are synchronized\n", "# Maximum number of differences to show\n", "NumberExamples = 10" ] }, { "cell_type": "code", "execution_count": 68, "id": "b1005f2a-dd1c-45a0-ab4c-6282b9486a5a", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Comparing file ../tf/0.5/normalized.tf with BOL/normalized.tf \n", "\n", "Result:\n", "\n", "Starting at line 20 in file 1 at: 'Βίβλος\\n'\n", "Starting at line 14 in file 2 at: 'Βίβλος\\n'\n", "mismatch at monad 83369 : 'θεός\\n' versus 'Θεός\\n'\n", "Finished.\n" ] } ], "source": [ "import os\n", "import unicodedata\n", "# The last mismatching pair is kept here for inspection in later cells\n", "item1=item2=''\n", "\n", "def remove_accents(text):\n", "    # Decompose to NFD and drop all combining marks (category 'Mn')\n", "    return ''.join(c for c in unicodedata.normalize('NFD', text) if unicodedata.category(c) != 'Mn')\n", "\n", "def compare_files(file1_path, file2_path):\n", "    global targetWord\n", "    global NumberExamples\n", "    global item1\n", "    global item2\n", "    FoundDifferences=0\n", "    with open(file1_path, 'r', encoding='utf-8') as file1, open(file2_path, 'r', encoding='utf-8') as file2:\n", "\n", "        # Skip part of file1 until target word is found\n", "        lineNumber1=0\n", "        for line1 in file1:\n", "            lineNumber1+=1\n", "            unaccentedWord=remove_accents(line1.strip())\n", "            if targetWord in unaccentedWord:\n", "                print ('Starting at line ',lineNumber1,' in file 1 at:',repr(line1))\n", "                break\n", "\n", "        # Skip part of file2 until target word is found\n", "        lineNumber2=0\n", "        for line2 in file2:\n", "            lineNumber2+=1\n", "            unaccentedWord=remove_accents(line2.strip())\n", "            if targetWord in unaccentedWord:\n", "                print ('Starting at line ',lineNumber2,' in file 2 at:',repr(line2))\n", "                break\n", "\n", "        # Counter for the lines compared after the sync line\n", "        monad=0\n", "\n", "        # Compare the remaining contents of both files\n", "        for line1, line2 in zip(file1, file2):\n", "            monad+=1\n", "            if remove_accents(line1.strip()) != remove_accents(line2.strip()):\n", "                FoundDifferences+=1\n", "                # Print at most NumberExamples mismatches\n", "                if FoundDifferences<=NumberExamples:\n", "                    print ('mismatch at monad', monad, ':',repr(line1), ' versus ', repr(line2))\n", "                # store the most recent mismatch\n", "                item1=line1\n", "                item2=line2\n", "\n", "    print(\"Finished.\")\n", "\n", "# main part\n", "# First check that both files exist, then compare their contents\n", "if os.path.exists(LFTFile):\n", "    if os.path.exists(BOLFile):\n", "        print (\"Comparing file \",LFTFile,\" with \",BOLFile,\"\\n\\nResult:\\n\\n\",end=\"\") \n", "        compare_files(LFTFile, BOLFile)\n", "    else:\n", "        print (f\"Could not find file {BOLFile}.\")\n", "else:\n", "    print(f\"Could not find file {LFTFile}.\")\n" ] }, { "cell_type": "markdown", "id": "b85a8698-6461-472e-a892-44c05fe1ca0f", "metadata": {}, "source": [ "## Check where this difference is found" ] }, { "cell_type": "code", "execution_count": 59, "id": "17739aa7-f21a-406c-847d-dec3c7566bd2", "metadata": { "tags": [] }, "outputs": [], "source": [ "%load_ext autoreload\n", "%autoreload 2" ] }, { "cell_type": "code", "execution_count": 60, "id": "c9baf06a-d4e7-4410-899d-f849f7bf8511", "metadata": {}, "outputs": [], "source": [ "# Loading the Text-Fabric code\n", "# Note: it is assumed Text-Fabric is installed in your environment.\n", "from tf.fabric import Fabric\n", "from tf.app import use" ] }, { "cell_type": "code", "execution_count": 63, "id": 
"0228ef2f-20d0-4ae0-b3b3-8d38a63ce4c6", "metadata": { "scrolled": true, "tags": [] }, "outputs": [ { "data": { "text/markdown": [ "**Locating corpus resources ...**" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "app: ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/app" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "data: ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", " Text-Fabric: Text-Fabric API 11.4.10, tonyjurg/Nestle1904LFT/app v3, Search Reference
\n", " Data: tonyjurg - Nestle1904LFT 0.5, Character table, Feature docs
\n", "
Node types\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "
Name | # of nodes | # slots/node | % coverage
book | 27 | 5102.93 | 100
chapter | 260 | 529.92 | 100
verse | 7943 | 17.35 | 100
sentence | 8011 | 17.20 | 100
wg | 113447 | 7.58 | 624
word | 137779 | 1.00 | 100
\n", " Sets: no custom sets
\n", " Features:
\n", "
Nestle 1904\n", "
\n", "\n", "
\n", "
\n", "after\n", "
\n", "
str
\n", "\n", " Characters (eg. punctuations) following the word\n", "\n", "
\n", "\n", "
\n", "
\n", "appos\n", "
\n", "
str
\n", "\n", " Apposition details\n", "\n", "
\n", "\n", "
\n", "
\n", "book\n", "
\n", "
str
\n", "\n", " Book name\n", "\n", "
\n", "\n", "
\n", "
\n", "booknumber\n", "
\n", "
int
\n", "\n", " NT book number (Matthew=1, Mark=2, ..., Revelation=27)\n", "\n", "
\n", "\n", "
\n", "
\n", "bookshort\n", "
\n", "
str
\n", "\n", " Book name (abbreviated)\n", "\n", "
\n", "\n", "
\n", "
\n", "case\n", "
\n", "
str
\n", "\n", " Gramatical case (Nominative, Genitive, Dative, Accusative, Vocative)\n", "\n", "
\n", "\n", "
\n", "
\n", "chapter\n", "
\n", "
int
\n", "\n", " Chapter number inside book\n", "\n", "
\n", "\n", "
\n", "
\n", "clausetype\n", "
\n", "
str
\n", "\n", " Clause type details\n", "\n", "
\n", "\n", "
\n", "
\n", "containedclause\n", "
\n", "
str
\n", "\n", " Contained clause (WG number)\n", "\n", "
\n", "\n", "
\n", "
\n", "degree\n", "
\n", "
str
\n", "\n", " Degree (e.g. Comparitative, Superlative)\n", "\n", "
\n", "\n", "
\n", "
\n", "gloss\n", "
\n", "
str
\n", "\n", " English gloss\n", "\n", "
\n", "\n", "
\n", "
\n", "gn\n", "
\n", "
str
\n", "\n", " Gramatical gender (Masculine, Feminine, Neuter)\n", "\n", "
\n", "\n", "
\n", "
\n", "junction\n", "
\n", "
str
\n", "\n", " Junction data related to a wordgroup\n", "\n", "
\n", "\n", "
\n", "
\n", "lemma\n", "
\n", "
str
\n", "\n", " Lexeme (lemma)\n", "\n", "
\n", "\n", "
\n", "
\n", "lex_dom\n", "
\n", "
str
\n", "\n", " Lexical domain according to Semantic Dictionary of Biblical Greek, SDBG (not present everywhere?)\n", "\n", "
\n", "\n", "
\n", "
\n", "ln\n", "
\n", "
str
\n", "\n", " Lauw-Nida lexical classification (not present everywhere?)\n", "\n", "
\n", "\n", "
\n", "
\n", "markafter\n", "
\n", "
str
\n", "\n", " Text critical marker after word\n", "\n", "
\n", "\n", "
\n", "
\n", "markbefore\n", "
\n", "
str
\n", "\n", " Text critical marker before word\n", "\n", "
\n", "\n", "
\n", "
\n", "markorder\n", "
\n", "
str
\n", "\n", " Order of punctuation and text critical marker\n", "\n", "
\n", "\n", "
\n", "
\n", "monad\n", "
\n", "
int
\n", "\n", " Monad (word order in the corpus)\n", "\n", "
\n", "\n", "
\n", "
\n", "mood\n", "
\n", "
str
\n", "\n", " Gramatical mood of the verb (passive, etc)\n", "\n", "
\n", "\n", "
\n", "
\n", "morph\n", "
\n", "
str
\n", "\n", " Morphological tag (Sandborg-Petersen morphology)\n", "\n", "
\n", "\n", "
\n", "
\n", "nodeID\n", "
\n", "
str
\n", "\n", " Node ID (as in the XML source data, not yet post-processes)\n", "\n", "
\n", "\n", "
\n", "
\n", "normalized\n", "
\n", "
str
\n", "\n", " Surface word with accents normalized and trailing punctuations removed\n", "\n", "
\n", "\n", "
\n", "
\n", "nu\n", "
\n", "
str
\n", "\n", " Gramatical number (Singular, Plural)\n", "\n", "
\n", "\n", "
\n", "
\n", "number\n", "
\n", "
str
\n", "\n", " Gramatical number of the verb\n", "\n", "
\n", "\n", "
\n", "
\n", "orig_order\n", "
\n", "
int
\n", "\n", " Word order (in source XML file)\n", "\n", "
\n", "\n", "
\n", "
\n", "otype\n", "
\n", "
str
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "
\n", "person\n", "
\n", "
str
\n", "\n", " Gramatical person of the verb (first, second, third)\n", "\n", "
\n", "\n", "
\n", "
\n", "punctuation\n", "
\n", "
str
\n", "\n", " Punctuation after word\n", "\n", "
\n", "\n", "
\n", "
\n", "ref\n", "
\n", "
str
\n", "\n", " ref ID\n", "\n", "
\n", "\n", "
\n", "
\n", "roleclausedistance\n", "
\n", "
str
\n", "\n", " Distance to wordgroup defining the role of this word\n", "\n", "
\n", "\n", "
\n", "
\n", "sentence\n", "
\n", "
int
\n", "\n", " Sentence number (counted per chapter)\n", "\n", "
\n", "\n", "
\n", "
\n", "sp\n", "
\n", "
str
\n", "\n", " Part of Speech (abbreviated)\n", "\n", "
\n", "\n", "
\n", "
\n", "sp_full\n", "
\n", "
str
\n", "\n", " Part of Speech (long description)\n", "\n", "
\n", "\n", "
\n", "
\n", "strongs\n", "
\n", "
str
\n", "\n", " Strongs number\n", "\n", "
\n", "\n", "
\n", "
\n", "subj_ref\n", "
\n", "
str
\n", "\n", " Subject reference (to nodeID in XML source data, not yet post-processes)\n", "\n", "
\n", "\n", "
\n", "
\n", "tense\n", "
\n", "
str
\n", "\n", " Gramatical tense of the verb (e.g. Present, Aorist)\n", "\n", "
\n", "\n", "
\n", "
\n", "type\n", "
\n", "
str
\n", "\n", " Gramatical type of noun or pronoun (e.g. Common, Personal)\n", "\n", "
\n", "\n", "
\n", "
\n", "unicode\n", "
\n", "
str
\n", "\n", " Word as it arears in the text in Unicode (incl. punctuations)\n", "\n", "
\n", "\n", "
\n", "
\n", "verse\n", "
\n", "
int
\n", "\n", " Verse number inside chapter\n", "\n", "
\n", "\n", "
\n", "
\n", "voice\n", "
\n", "
str
\n", "\n", " Gramatical voice of the verb\n", "\n", "
\n", "\n", "
\n", "
\n", "wgclass\n", "
\n", "
str
\n", "\n", " Class of the wordgroup ()\n", "\n", "
\n", "\n", "
\n", "
\n", "wglevel\n", "
\n", "
int
\n", "\n", " Number of parent wordgroups for a wordgroup\n", "\n", "
\n", "\n", "
\n", "
\n", "wgnum\n", "
\n", "
int
\n", "\n", " Wordgroup number (counted per book)\n", "\n", "
\n", "\n", "
\n", "
\n", "wgrole\n", "
\n", "
str
\n", "\n", " Role of the wordgroup (abbreviated)\n", "\n", "
\n", "\n", "
\n", "
\n", "wgrolelong\n", "
\n", "
str
\n", "\n", " Role of the wordgroup (full)\n", "\n", "
\n", "\n", "
\n", "
\n", "wgrule\n", "
\n", "
str
\n", "\n", " Wordgroup rule information\n", "\n", "
\n", "\n", "
\n", "
\n", "wgtype\n", "
\n", "
str
\n", "\n", " Wordgroup type details\n", "\n", "
\n", "\n", "
\n", "
\n", "word\n", "
\n", "
str
\n", "\n", " Word as it appears in the text (excl. punctuations)\n", "\n", "
\n", "\n", "
\n", "
\n", "wordlevel\n", "
\n", "
str
\n", "\n", " Number of parent wordgroups for a word\n", "\n", "
\n", "\n", "
\n", "
\n", "wordrole\n", "
\n", "
str
\n", "\n", " Role of the word (abbreviated)\n", "\n", "
\n", "\n", "
\n", "
\n", "wordrolelong\n", "
\n", "
str
\n", "\n", " Role of the word (full)\n", "\n", "
\n", "\n", "
\n", "
\n", "wordtranslit\n", "
\n", "
str
\n", "\n", " Transliteration of the text (in latin letters, excl. punctuations)\n", "\n", "
\n", "\n", "
\n", "
\n", "wordunacc\n", "
\n", "
str
\n", "\n", " Word without accents (excl. punctuations)\n", "\n", "
\n", "\n", "
\n", "
\n", "oslots\n", "
\n", "
none
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "
\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Text-Fabric API: names N F E L T S C TF directly usable

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Load the Nestle1904LFT app and data\n", "# Only one Text-Fabric dataset is loaded in this notebook, so hoist=globals() is used to expose T, F, L, etc.; avoid it when two distinct datasets are loaded.\n", "N1904GBI = use (\"tonyjurg/Nestle1904LFT\",version='0.5', hoist=globals())" ] }, { "cell_type": "code", "execution_count": 67, "id": "3c173115-34ee-4e50-9883-61b3896cc585", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "('Romans', 1, 19)" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "T.sectionFromNode(83369)" ] }, { "cell_type": "code", "execution_count": 65, "id": "51c93830-af66-4fa3-a7ef-45d36f74ac21", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(137785, 137924, 150868)" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "T.sectionTuple(83369)" ] }, { "cell_type": "code", "execution_count": 66, "id": "4dbc295e-b838-417b-a3d8-515686102b6b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'διότι τὸ γνωστὸν τοῦ Θεοῦ φανερόν ἐστιν ἐν αὐτοῖς· ὁ θεὸς γὰρ αὐτοῖς ἐφανέρωσεν. 
'" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "T.text(150868)" ] }, { "cell_type": "markdown", "id": "f5f76bbf-03f2-43c7-af46-879270d57c0e", "metadata": {}, "source": [ "## Dig a little deeper" ] }, { "cell_type": "code", "execution_count": 69, "id": "41617fda-c747-4ffe-ae91-bc14d028733c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Character: 'θ'\tUnicode Code Point: 952\n", "Character: 'ε'\tUnicode Code Point: 949\n", "Character: 'ό'\tUnicode Code Point: 972\n", "Character: 'ς'\tUnicode Code Point: 962\n", "Character: '\n", "'\tUnicode Code Point: 10\n" ] } ], "source": [ "for char in item1:\n", "    print(f\"Character: '{char}'\\tUnicode Code Point: {ord(char)}\")" ] }, { "cell_type": "code", "execution_count": 70, "id": "4a4fafbd-71c9-45e2-be58-773afa988a0b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Character: 'Θ'\tUnicode Code Point: 920\n", "Character: 'ε'\tUnicode Code Point: 949\n", "Character: 'ό'\tUnicode Code Point: 8057\n", "Character: 'ς'\tUnicode Code Point: 962\n", "Character: '\n", "'\tUnicode Code Point: 10\n" ] } ], "source": [ "for char in item2:\n", "    print(f\"Character: '{char}'\\tUnicode Code Point: {ord(char)}\")" ] }, { "cell_type": "markdown", "id": "e586ae62-4667-4bf0-a241-e02522121fa3", "metadata": {}, "source": [ "Since the comparison is performed on the unaccented word, the differing code points for the accented omicron (972 versus 8057) are not the cause: both reduce to the same base letter once the accents are removed. The mismatch comes from the first letter, which is a lowercase θ (code point 952) in the LFT file and an uppercase Θ (code point 920) in the BOL file." 
] }, { "cell_type": "markdown", "id": "969eb3af-999a-4f6e-9943-4df38d923fa6", "metadata": {}, "source": [ "## Other invisible differences between the tf files\n", "\n", "The tf files were also found to differ in their use of special characters:" ] }, { "cell_type": "markdown", "id": "f49d65e0-62c1-4556-a43c-d756b8d8fa1a", "metadata": {}, "source": [ "" ] }, { "cell_type": "code", "execution_count": null, "id": "83e9120b-d22d-491a-9353-8716e8b44aa0", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" } }, "nbformat": 4, "nbformat_minor": 5 }