{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Checks\n", "\n", "We check the correctness of the conversion of Abegg's data files to TF.\n", "\n", "In this notebook we concentrate on the main fields in the data files:\n", "\n", "* transcription `fullo`\n", "* language/lexeme `lang` and `lexo`\n", "* morphology `morpho`\n", "\n", "and we'll keep track of the source location: biblical or non-biblical file, line number in the file.\n", "\n", "We show that all this material has been transferred to TF completely and faithfully." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%load_ext autoreload\n", "%autoreload 2" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import os\n", "import re\n", "import yaml\n", "\n", "from tf.app import use\n", "\n", "from checksLib import Compare\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Using TF-app in /Users/dirk/github/annotation/app-dss/code:\n", "\trepo clone offline under ~/github (local github)\n", "Using data in /Users/dirk/github/etcbc/dss/tf/0.5:\n", "\trepo clone offline under ~/github (local github)\n", "Using data in /Users/dirk/github/etcbc/dss/parallels/tf/0.5:\n", "\trepo clone offline under ~/github (local github)\n" ] }, { "data": { "text/html": [ "Documentation: DSS Character table Feature docs dss API Text-Fabric API 7.7.5 Search Reference
"Loaded features:\n", "\n", "Dead Sea Scrolls: after alt biblical book chapter cl cl2 cor fragment full fulle fullo glex glexe glexo glyph glyphe glypho gn gn2 gn3 halfverse intl lang lex lexe lexo line md merr morpho nu nu2 nu3 otype ps ps2 ps3 punc punce punco rec rem script scroll sp srcLn st type unc vac verse vs vt occ oslots\n", "\n", "Parallel Passages: sim\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "API members:\n", "C Computed, Call AllComputeds, Cs ComputedString\n", "E Edge, Eall AllEdges, Es EdgeString\n", "ensureLoaded, TF, ignored, loadLog\n", "L Locality\n", "cache, error, indent, info, reset\n", "N Nodes, sortKey, sortKeyTuple, otypeRank, sortNodes\n", "F Feature, Fall AllFeatures, Fs FeatureString\n", "S Search\n", "T Text\n"
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "A = use(\"ETCBC/dss:clone\", checkout=\"clone\", hoist=globals(), silent=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Overview\n", "\n", "We compare the material in the source files with the `o`-style features of the TF dataset.\n", "The `o`-style features `fullo`, `lexo`, `morpho` contain the unmodified strings corresponding to\n", "fields in the lines of the source files. we add the `lang` feature to the mix.\n", "\n", "We'll compile two lists of this material, one based directly on the source files, and one based on the TF\n", "features.\n", "\n", "Both lists consist of tuples, one for each word, and inside each tuple we also\n", "store whether the word comes from the biblical or non-biblical file and what the line number is.\n", "\n", "Then we'll compare the tuples of both lists one by one." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 1\n", "We determine the node of the first word in the biblical source file." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1889878" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ln = T.nodeFromSection((\"1Q1\", \"f1\", \"1\"))\n", "words = L.d(ln, otype=\"word\")\n", "firstBibWord = words[0]\n", "firstBibWord" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Step 2\n", "\n", "We determine the words for which the feature `biblical` is 2. These are the words\n", "that occur in both source files.\n", "\n", "We have chosen to retain the biblical entries of these words, and ignore the non biblical entries.\n", "\n", "So, when we are going to compare the source material and the TF material, we have to leave out these\n", "words from the non-biblical part of the source material.\n", "The non-biblical version turned out\n", "to be either equal to the biblical version, or it had no material and the biblical version has a reconstruction.\n", "\n", "In order to do that, we make a set of the lines involved, marked by their scroll, fragment and line number." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'2Q29 f1:1',\n", " '2Q29 f1:2',\n", " '2Q29 f1:3',\n", " '4Q249j f1:1',\n", " '4Q249j f1:2',\n", " '4Q249j f1:3',\n", " '4Q249j f1:4',\n", " '4Q249j f1:5',\n", " '4Q249j f1:6',\n", " '4Q483 f1:1',\n", " '4Q483 f1:2',\n", " '4Q483 f1:3',\n", " '4Q483 f2:1',\n", " '4Q483 f2:2'}" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bib2Lines = {\n", " \"{} {}:{}\".format(*T.sectionFromNode(ln))\n", " for ln in F.otype.s(\"line\")\n", " if F.biblical.v(ln) == 2\n", "}\n", "bib2Lines" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Step 3\n", "\n", "Build the list based on TF: `wordsTf`." 
] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "wordsTf = []\n", "\n", "for w in F.otype.s(\"word\"):\n", " biblical = F.biblical.v(w)\n", " bib = biblical in {1, 2}\n", " wordsTf.append(\n", " (\n", " bib,\n", " F.srcLn.v(w),\n", " F.fullo.v(w),\n", " F.lang.v(w) or \"\",\n", " F.lexo.v(w) or \"\",\n", " F.morpho.v(w) or \"\",\n", " )\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We sort the words by source file first and then by source line numbers" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "500995" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wordsTf.sort(key=lambda x: (x[0], x[1]))\n", "len(wordsTf)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(False, 4, 'w', '', 'w◊', 'Pc'),\n", " (False, 5, 'oth', '', 'oAt;Dh', 'Pd'),\n", " (False, 6, 'Cmow', '', 'vmo', 'vqvmp'),\n", " (False, 7, 'kl', '', 'k;Ol', 'ncmsc'),\n", " (False, 8, 'ywdoy', '', 'ydo', 'vqPmpc')]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wordsTf[0:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Step 4\n", "\n", "Build the list according to the source files.\n", "\n", "We have applied fixes during conversion. We should apply the same fixes here." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "FIXES_DECL = os.path.expanduser(\"~/github/etcbc/dss/yaml/fixes.yaml\")\n", "\n", "\n", "def readYaml(fileName):\n", " with open(fileName) as y:\n", " y = yaml.load(y)\n", " return y\n", "\n", "\n", "fixesDecl = readYaml(FIXES_DECL)\n", "\n", "lineFixes = fixesDecl[\"lineFixes\"]\n", "fieldFixes = fixesDecl[\"fieldFixes\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We read the source files and apply line fixes." 
] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "nonbib line 256841 fixed:\n", "\t4Q491 f36:2,4.1 [\\\\] \\\\\\@0\n", "\t4Q491 f36:2,4.1 [\\\\] \\\\\\@0\n", "\n", "nonbib line 348565 fixed:\n", "\t11Q19 2:1,2.1 -- \\0\n", "\t11Q19 2:1,2.1 -- \\@0\n", "\n", "nonbib line 348900 fixed:\n", "\t11Q19 3:13,3,1 -- \\@0\n", "\t11Q19 3:13,3.1 -- \\@0\n", "\n", "bib line 36238 fixed:\n", "\tIs 44:21\t1Q8 19:1\t[\\ \\\\\\@0\t\t21829\n", "\tIs 44:21\t1Q8 19:1\t[\\\t\\\\\\@0\t21829\n", "\n", "bib line 99010 fixed:\n", "\tDeut 33:29\t4Q29 f10:2\t--\t\t2895\n", "\tDeut 33:29\t4Q29 f10:2\t--\t\\@0\t2895\n", "\n", "bib line 143765 fixed:\n", "\tIs 56:2\t4Q56 f48:3\t--\t\t30427\n", "\tIs 56:2\t4Q56 f48:3\t--\t\\@0\t30427\n", "\n", "bib line 186962 fixed:\n", "\tDan 2:10\t4Q112 f1ii:3\tl|]\tl\\\\%@0\t516\n", "\tDan 2:10\t4Q112 f1ii:3\tl|]\tl\\\\%0\t516\n", "\n", "bib line 208179 fixed:\n", "\t8Q3 f12_16:17\t8Q3 f12_16:17\t--\t\\@0\t\t949\n", "\t8Q3 f12_16:17\t8Q3 f12_16:17\t--\t\\@0\t949\n", "\n", "bib line 217582 fixed:\n", "\tPs 135:9\t11Q5 14:17\t--\t\t11023\n", "\tPs 135:9\t11Q5 14:17\t--\t\\@0\t11023\n", "\n" ] } ], "source": [ "sourceDir = os.path.expanduser(\"~/local/dss/sanitized\")\n", "bibSource = \"dss_bib\"\n", "nonbibSource = \"dss_nonbib\"\n", "sources = (\"nonbib\", \"bib\")\n", "sourceLines = {}\n", "for src in sources:\n", " biblical = src == \"bib\"\n", " lineFix = lineFixes[biblical]\n", "\n", " srcPath = f\"{sourceDir}/dss_{src}.txt\"\n", " with open(srcPath) as fh:\n", " sourceLines[src] = list(fh)\n", " for (i, line) in enumerate(sourceLines[src]):\n", " ln = i + 1\n", " if ln in lineFix:\n", " (fr, to, expl) = lineFix[ln]\n", " if fr in line:\n", " oline = line\n", " line = line.replace(fr, to)\n", " sourceLines[src][i] = line\n", " print(f\"{src} line {ln} fixed:\\n\\t{oline}\\t{line}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Step 5\n", "\n", "We split the lines into fields and apply the field fixes.\n", "\n", "Not all lines in the source correspond to words.\n", "\n", "If a line does not have word material, it is not a word.\n", "We skip these lines.\n", "\n", "We remember whether a material is in Greek.\n", "\n", "Some source lines contain an escape character.\n", "We call those lines control lines.\n", "If the line contains `(f0)`, it is in Greek, together with subsequent lines.\n", "Greek terminates at `(fy)`.\n", "\n", "We also skip the words from the non-biblical file that also have an entry in the biblical file.\n", "These are the words occurring in the lines\n", "we collected in `bib2Lines` in step 2.\n", "\n", "Furthermore, we must treat a transcription of the form `]`*d*`[` as a line number, not a real transcription,\n", "so we have to skip these lines as well. Here *d* is any decimal number." 
] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "nonbib line 38512 field trans fixed:\n", "\t≤]\t≥≤\n", "nonbib line 48129 field morph fixed:\n", "\tvhp3cpX3mp{2}\tvhp3cp{2}X3mp\n", "nonbib line 59593 field trans fixed:\n", "\t ± \t±\n", "nonbib line 127763 field morph fixed:\n", "\tvhp3cpX3ms{2}\tvhp3cp{2}X3ms\n", "nonbib line 153845 field trans fixed:\n", "\tb]\tb\n", "nonbib line 153970 field trans fixed:\n", "\tb]\tb\n", "nonbib line 154026 field trans fixed:\n", "\tb]\tb\n", "nonbib line 173512 field trans fixed:\n", "\t^b\t^b^\n", "nonbib line 211343 field trans fixed:\n", "\ty»tkwØ_nw\ty»tkwØnw\n", "nonbib line 248844 field trans fixed:\n", "\tt_onh]\ttonh]\n", "nonbib line 263123 field lex fixed:\n", "\t82\tkj\n", "nonbib line 287243 field trans fixed:\n", "\toyN_\toyN\n", "nonbib line 290592 field trans fixed:\n", "\ta\tA\n", "nonbib line 291886 field trans fixed:\n", "\ta\tA\n", "nonbib line 324473 field trans fixed:\n", "\t[˝w»b|a|]\t[w»b|a|]\n", "nonbib line 335846 field trans fixed:\n", "\t3\t\n", "bib line 48768 field morph fixed:\n", "\tvp12ms\tvp1ms\n", "bib line 109489 field morph fixed:\n", "\t0ncfp\tncfp\n", "bib line 115544 field morph fixed:\n", "\t\\\t0\n", "bib line 124566 field lex fixed:\n", "\tjll-2\tjll_2\n", "bib line 146637 field morph fixed:\n", "\t0ncfs\tncfs\n", "bib line 147953 field trans fixed:\n", "\t[^≥\t[≥\n", "bib line 154933 field trans fixed:\n", "\t≥1a≤\t≥a≤\n", "bib line 154949 field trans fixed:\n", "\t≥2a≤\t≥a≤\n", "bib line 157840 field morph fixed:\n", "\t2\t0\n", "bib line 158371 field morph fixed:\n", "\t4\t0\n", "bib line 158401 field morph fixed:\n", "\t3\t0\n", "bib line 158493 field trans fixed:\n", "\t[\\\\]^\t[\\\\]\n", "bib line 185650 field trans fixed:\n", "\th«\\\\wØ(\th«\\\\wØ\n", "bib line 186373 field morph fixed:\n", "\tPp@0\tPp\n", "bib line 202206 field trans fixed:\n", "\talwhiM\talwhyM\n", "500995 lines, 113 word lines skipped\n" ] } ], "source": [ "wordlessRe = re.compile(r\"^[\\\\\\[\\]≤≥?{}<>()\\^]*$\")\n", "isNumber = re.compile(r\"\\][0-9]+\\[$\")\n", "\n", "wordsSrc = []\n", "\n", "skippedWordLines = []\n", "\n", "for src in sources:\n", " bib = src == \"bib\"\n", " fieldFix = fieldFixes[bib]\n", " sep = \"\\t\" if bib else \" \"\n", " greek = False\n", " for (i, line) in enumerate(sourceLines[src]):\n", " if \"\\u001b\" in line:\n", " if \"(f0)\" in line:\n", " greek = True\n", " elif \"(fy)\" in line:\n", " greek = False\n", " continue\n", " fields = line.rstrip(\"\\n\").split(sep)\n", " nFields = len(fields)\n", " ln = i + 1\n", " if nFields < 3:\n", " continue\n", " if not bib:\n", " scroll = fields[0]\n", " label = fields[1].split(\",\")[0]\n", " passage = f\"{scroll} {label}\"\n", " if passage in bib2Lines:\n", " skippedWordLines.append(ln)\n", " continue\n", " word = fields[2]\n", " lex = fields[3] if nFields >= 4 else \"\"\n", " lang = \"\"\n", " parts = lex.split(\"@\", maxsplit=1)\n", " if len(parts) > 1:\n", " (lex, morph) = parts\n", " else:\n", " parts = lex.split(\"%\", maxsplit=1)\n", " if len(parts) > 1:\n", " (lex, morph) = parts\n", " lang = \"a\"\n", " else:\n", " morph = \"\"\n", "\n", " if ln in fieldFix:\n", " for (field, (fr, to, expl)) in fieldFix[ln].items():\n", " iVal = (\n", " word\n", " if field == \"trans\"\n", " else lex\n", " if field == \"lex\"\n", " else morph\n", " if field == \"morph\"\n", " else None\n", " )\n", " if iVal == fr:\n", " if field == \"trans\":\n", " word = to\n", " elif field == 
\"lex\":\n", " lex = to\n", " elif field == \"morph\":\n", " morph = to\n", " print(f\"{src} line {ln} field {field} fixed:\\n\\t{iVal}\\t{to}\")\n", "\n", " if (\n", " word == \"/\" or wordlessRe.match(word) or isNumber.match(word)\n", " ) and lex == \"\":\n", " continue\n", " theLang = \"g\" if greek else lang\n", " wordsSrc.append((bib, i + 1, word, theLang, lex, morph))\n", "print(f\"{len(wordsSrc)} lines, {len(skippedWordLines)} word lines skipped\")" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(False, 4, 'w', '', 'w◊', 'Pc'),\n", " (False, 5, 'oth', '', 'oAt;Dh', 'Pd'),\n", " (False, 6, 'Cmow', '', 'vmo', 'vqvmp'),\n", " (False, 7, 'kl', '', 'k;Ol', 'ncmsc'),\n", " (False, 8, 'ywdoy', '', 'ydo', 'vqPmpc')]" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wordsSrc[0:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Step 6\n", "\n", "The comparison.\n", "In the companion module `checksLib.py` we have defined a few handy functions." ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "CC = Compare(sourceLines, wordsSrc, A.api, wordsTf)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We demonstrate a few functions that help with the comparison.\n", "\n", "We need to peek into the source files, at a line number with some context." ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " B16: Gen 1:20 ┃1Q1 f1:1 ┃w ┃w◊@Pc ┃41.5 \n", " B17: Gen 1:20 ┃1Q1 f1:1 ┃yamr[ ┃amr_1@vqw3ms ┃42 \n", ">>> B18: Gen 1:20 ┃1Q1 f1:1 ┃/ ┃ ┃54 \n", " B19: Gen 1:20 ┃1Q1 f1:2 ┃]alhyM ┃aTløhIyM@ncmp ┃55 \n", " B20: Gen 1:20 ┃1Q1 f1:2 ┃yC[rwxw ┃vrX@vqi3mp ┃56 \n" ] } ], "source": [ "CC.showSrc(True, 18)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The function `showTf` looks up a line number in TF." ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " B16: Gen 1:20 ┃1Q1 f1:1 ┃w ┃ ┃w◊ ┃Pc┃1889893┃\n", " B17: Gen 1:20 ┃1Q1 f1:1 ┃yamr[ ┃ ┃amr_1 ┃vqw3ms┃1889894┃\n", ">>> B18: no nodes\n", " B19: Gen 1:20 ┃1Q1 f1:2 ┃]alhyM ┃ ┃aTløhIyM ┃ncmp┃1889895┃\n", " B20: Gen 1:20 ┃1Q1 f1:2 ┃yC[rwxw ┃ ┃vrX ┃vqi3mp┃1889896┃\n" ] } ], "source": [ "CC.showTf(True, 18)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And `showDiff` combines `firstDiff` and `showSrc` and `showTf` to get a meaningful display of the first difference,\n", "as we'll see later." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Step 7\n", "\n", "Now we can go comparing!" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "EQUAL\n" ] } ], "source": [ "CC.showDiff()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Step 8\n", "\n", "That's easily said. We can compare the two lists very transparently as follows:" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wordsSrc == wordsTf" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's consciously distort something, and run the comparison again." 
] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[False, 258361, 'm|\\\\]', '', 'm\\\\\\\\', '0']" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nr = 200000\n", "item = list(wordsSrc[nr])\n", "item" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": [ "item[3] = \"a\"\n", "wordsSrc[nr] = tuple(item)" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "item 200000:\n", "TF N258361 m|\\] ┃ ┃m\\\\ ┃0 \n", "SRC N258361 m|\\] ┃a ┃m\\\\ ┃0 \n", "TF:\n", " N258360: 4Q496 f20:2 ┃[\\\\ ┃ ┃\\\\\\ ┃0┃1807071┃\n", ">>> N258361: 4Q496 f20:2 ┃m|\\] ┃ ┃m\\\\ ┃0┃1807072┃\n", " N258362: 4Q496 f20:2 ┃-- ┃ ┃\\ ┃0┃1807073┃\n", "SRC:\n", " N258360: 4Q496 ┃f20:2,3.1 ┃[\\\\ ┃\\\\\\@0 \n", ">>> N258361: 4Q496 ┃f20:2,4.1 ┃m|\\] ┃m\\\\@0 \n", " N258362: 4Q496 ┃f20:2,5.1 ┃-- ┃\\@0 \n" ] } ], "source": [ "CC.showDiff()" ] } ], "metadata": { "jupytext": { "encoding": "# -*- coding: utf-8 -*-" }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3" } }, "nbformat": 4, "nbformat_minor": 4 }