# Checks

We check the correctness of the conversion of Abegg's data files to TF.

In this notebook we concentrate on the main fields in the data files:

* transcription `fullo`
* language/lexeme `lang` and `lexo`
* morphology `morpho`

and we'll keep track of the source location: biblical or non-biblical file, line number in the file.

We show that all this material has been transferred to TF completely and faithfully.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import os
import re
import yaml

from tf.app import use

from checksLib import Compare


In [3]:
A = use("ETCBC/dss:clone", checkout="clone", hoist=globals(), silent=False)

Using TF-app in /Users/dirk/github/annotation/app-dss/code:
	repo clone offline under ~/github (local github)
Using data in /Users/dirk/github/etcbc/dss/tf/0.5:
	repo clone offline under ~/github (local github)
Using data in /Users/dirk/github/etcbc/dss/parallels/tf/0.5:
	repo clone offline under ~/github (local github)


# Overview

We compare the material in the source files with the `o`-style features of the TF dataset.
The `o`-style features `fullo`, `lexo`, `morpho` contain the unmodified strings corresponding to
fields in the lines of the source files. We also add the `lang` feature to the mix.

We'll compile two lists of this material, one based directly on the source files, and one based on the TF
features.

Both lists consist of tuples, one for each word, and inside each tuple we also
store whether the word comes from the biblical or non-biblical file and what the line number is.

Then we'll compare the tuples of both lists one by one.
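
A minimal sketch of the record layout shared by both lists, assuming the six fields in the order used by the later cells. The `WordRecord` type and its field names are illustrative only; the notebook itself uses plain tuples.

```python
from typing import NamedTuple


class WordRecord(NamedTuple):
    """One word, as stored in both lists (hypothetical helper type)."""

    bib: bool     # True: biblical source file, False: non-biblical
    src_ln: int   # 1-based line number in that source file
    fullo: str    # transcription
    lang: str     # language marker, "" when absent
    lexo: str     # lexeme
    morpho: str   # morphology tag


# The first record shown later in this notebook, written as a named tuple.
rec = WordRecord(False, 4, "w", "", "w◊", "Pc")

# A NamedTuple compares equal to a plain tuple with the same values,
# so the plain tuples built in the notebook behave the same way.
same = rec == (False, 4, "w", "", "w◊", "Pc")
```

Because tuples compare field by field, a single differing field is enough to make two records unequal.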

# Step 1

We determine the node of the first word in the biblical source file.

In [4]:
ln = T.nodeFromSection(("1Q1", "f1", "1"))
words = L.d(ln, otype="word")
firstBibWord = words[0]
firstBibWord

1889878

# Step 2

We determine the words for which the feature `biblical` is 2. These are the words
that occur in both source files.

We have chosen to retain the biblical entries of these words and to ignore the non-biblical entries.
The non-biblical version turned out to be either equal to the biblical version,
or it had no material where the biblical version has a reconstruction.

So, when we compare the source material with the TF material, we have to leave these
words out of the non-biblical part of the source material.

To do that, we make a set of the lines involved, each identified by its scroll, fragment and line number.

In [5]:
bib2Lines = {
    "{} {}:{}".format(*T.sectionFromNode(ln))
    for ln in F.otype.s("line")
    if F.biblical.v(ln) == 2
}
bib2Lines

{'2Q29 f1:1',
 '2Q29 f1:2',
 '2Q29 f1:3',
 '4Q249j f1:1',
 '4Q249j f1:2',
 '4Q249j f1:3',
 '4Q249j f1:4',
 '4Q249j f1:5',
 '4Q249j f1:6',
 '4Q483 f1:1',
 '4Q483 f1:2',
 '4Q483 f1:3',
 '4Q483 f2:1',
 '4Q483 f2:2'}
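
The keys in this set have the shape `scroll fragment:line`. The sketch below shows how a non-biblical source line could be mapped to the same kind of key, assuming the space-separated layout `scroll label,word.sub ...` that appears in the source excerpts further on. The name `passage_key` and the two-entry set are illustrative only.

```python
def passage_key(nonbib_line: str) -> str:
    """Build a 'scroll fragment:line' key from a non-biblical source line.

    Assumes space-separated fields where the second field looks like
    'f20:2,4.1': the part before the comma is the fragment:line label.
    """
    fields = nonbib_line.rstrip("\n").split(" ")
    scroll = fields[0]
    label = fields[1].split(",")[0]
    return f"{scroll} {label}"


# Hypothetical two-entry stand-in for the real set computed above.
bib2_lines = {"2Q29 f1:1", "4Q483 f2:2"}

key = passage_key("4Q496 f20:2,4.1 m|\\] m\\\\@0\n")
skip = key in bib2_lines
```

With such a key, a membership test against the set decides whether a non-biblical line has to be left out.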

# Step 3

Build the list based on TF: `wordsTf`.

In [6]:
wordsTf = []

for w in F.otype.s("word"):
    biblical = F.biblical.v(w)
    bib = biblical in {1, 2}
    wordsTf.append(
        (
            bib,
            F.srcLn.v(w),
            F.fullo.v(w),
            F.lang.v(w) or "",
            F.lexo.v(w) or "",
            F.morpho.v(w) or "",
        )
    )

We sort the words by source file first and then by source line number.

In [7]:
wordsTf.sort(key=lambda x: (x[0], x[1]))
len(wordsTf)

500995

In [8]:
wordsTf[0:5]

[(False, 4, 'w', '', 'w◊', 'Pc'),
 (False, 5, 'oth', '', 'oAt;Dh', 'Pd'),
 (False, 6, 'Cmow', '', 'vmo', 'vqvmp'),
 (False, 7, 'kl', '', 'k;Ol', 'ncmsc'),
 (False, 8, 'ywdoy', '', 'ydo', 'vqPmpc')]
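
The sort key works because Python orders `False` before `True`: all non-biblical records come first, then the biblical ones, each group ordered by source line number. A small illustration with made-up records:

```python
# Made-up records (bib, srcLn, fullo); not taken from the dataset.
records = [
    (True, 2, "b2"),
    (False, 7, "n7"),
    (True, 1, "b1"),
    (False, 3, "n3"),
]

# False sorts before True, so the non-biblical file comes first,
# and within each file the records follow the line numbers.
records.sort(key=lambda x: (x[0], x[1]))

order = [r[2] for r in records]
```

This is the same order in which the source files are read later on: `nonbib` first, then `bib`.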

# Step 4

Build the list according to the source files.

We have applied fixes during conversion. We should apply the same fixes here.

In [24]:
FIXES_DECL = os.path.expanduser("~/github/etcbc/dss/yaml/fixes.yaml")


def readYaml(fileName):
    with open(fileName) as fh:
        # PyYAML 6 requires an explicit Loader argument
        data = yaml.load(fh, Loader=yaml.FullLoader)
    return data


fixesDecl = readYaml(FIXES_DECL)

lineFixes = fixesDecl["lineFixes"]
fieldFixes = fixesDecl["fieldFixes"]

We read the source files and apply line fixes.

In [25]:
sourceDir = os.path.expanduser("~/local/dss/sanitized")
bibSource = "dss_bib"
nonbibSource = "dss_nonbib"
sources = ("nonbib", "bib")
sourceLines = {}
for src in sources:
    biblical = src == "bib"
    lineFix = lineFixes[biblical]

    srcPath = f"{sourceDir}/dss_{src}.txt"
    with open(srcPath) as fh:
        sourceLines[src] = list(fh)
    for (i, line) in enumerate(sourceLines[src]):
        ln = i + 1
        if ln in lineFix:
            (fr, to, expl) = lineFix[ln]
            if fr in line:
                oline = line
                line = line.replace(fr, to)
                sourceLines[src][i] = line
                print(f"{src} line {ln} fixed:\n\t{oline}\t{line}")

nonbib line 256841 fixed:
	4Q491 f36:2,4.1 [\\]  \\\@0
	4Q491 f36:2,4.1 [\\] \\\@0

nonbib line 348565 fixed:
	11Q19 2:1,2.1 -- \0
	11Q19 2:1,2.1 -- \@0

nonbib line 348900 fixed:
	11Q19 3:13,3,1 -- \@0
	11Q19 3:13,3.1 -- \@0

bib line 36238 fixed:
	Is 44:21	1Q8 19:1	[\ \\\@0		21829
	Is 44:21	1Q8 19:1	[\	\\\@0	21829

bib line 99010 fixed:
	Deut 33:29	4Q29 f10:2	--		2895
	Deut 33:29	4Q29 f10:2	--	\@0	2895

bib line 143765 fixed:
	Is 56:2	4Q56 f48:3	--		30427
	Is 56:2	4Q56 f48:3	--	\@0	30427

bib line 186962 fixed:
	Dan 2:10	4Q112 f1ii:3	l|]	l\\%@0	516
	Dan 2:10	4Q112 f1ii:3	l|]	l\\%0	516

bib line 208179 fixed:
	8Q3 f12_16:17	8Q3 f12_16:17	--	\@0		949
	8Q3 f12_16:17	8Q3 f12_16:17	--	\@0	949

bib line 217582 fixed:
	Ps 135:9	11Q5 14:17	--		11023
	Ps 135:9	11Q5 14:17	--	\@0	11023
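
The loop above implies a particular shape for `lineFixes`: a mapping keyed first by the boolean `biblical`, then by line number, with a `(from, to, explanation)` triple as value. Below is a sketch of that shape and of the replacement step. It uses a plain dict with invented entries instead of the real `fixes.yaml`, whose contents are not shown in this notebook.

```python
# Invented stand-in for the "lineFixes" part of fixes.yaml.
line_fixes = {
    False: {2: ("3,1", "3.1", "comma instead of period (invented example)")},
    True: {},
}

# Invented stand-in for a source file, as a list of lines.
lines = ["11Q19 3:13,2.1 w w@Pc\n", "11Q19 3:13,3,1 -- \\@0\n"]

fixed = []
for i, line in enumerate(lines):
    ln = i + 1  # source line numbers are 1-based
    if ln in line_fixes[False]:
        fr, to, _expl = line_fixes[False][ln]
        # Only touch the line if the expected faulty text is really there.
        if fr in line:
            line = line.replace(fr, to)
    fixed.append(line)
```

Note that `str.replace` changes every occurrence of `fr` in the line, so the `from` string should be specific enough to match only the faulty spot.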



# Step 5

We split the lines into fields and apply the field fixes.

Not all lines in the source correspond to words.
A line without word material is not a word, so we skip it.

We keep track of whether the material is in Greek.
Some source lines contain an escape character; we call those lines control lines.
If a control line contains `(f0)`, the subsequent lines are in Greek,
until a control line with `(fy)` ends the Greek passage.
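
The Greek bookkeeping described above can be sketched as a small generator. The name `tag_greek` and the sample lines are illustrative only; real control lines may contain more than the markers used here.

```python
ESC = "\u001b"  # the escape character that marks a control line


def tag_greek(lines):
    """Yield (line, is_greek) for every non-control line."""
    greek = False
    for line in lines:
        if ESC in line:
            if "(f0)" in line:
                greek = True
            elif "(fy)" in line:
                greek = False
            continue  # control lines carry no word material
        yield line, greek


# Invented sample input.
sample = ["w1", ESC + "(f0)", "w2", "w3", ESC + "(fy)", "w4"]
tagged = list(tag_greek(sample))
```

The control lines themselves never show up in the result; they only switch the state for the lines that follow.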

We also skip the words from the non-biblical file that also have an entry in the biblical file.
These are the words occurring in the lines
we collected in `bib2Lines` in step 2.

Furthermore, we must treat a transcription of the form `]`*d*`[` as a line number, not a real transcription,
so we have to skip these lines as well. Here *d* is any decimal number.
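
A sketch of the skip test on its own, using the same two patterns as the cell that follows. The helper name `is_skipped` is illustrative. Note that a line is only skipped when its lexeme is empty as well.

```python
import re

# Same patterns as in the notebook cell that follows.
wordless_re = re.compile(r"^[\\\[\]≤≥?{}<>()\^]*$")
is_number = re.compile(r"\][0-9]+\[$")


def is_skipped(word: str, lex: str) -> bool:
    """True if the line carries no word: no real transcription and no lexeme."""
    no_material = word == "/" or wordless_re.match(word) or is_number.match(word)
    return bool(no_material) and lex == ""


checks = {
    "slash": is_skipped("/", ""),
    "number": is_skipped("]12[", ""),
    "brackets": is_skipped("[\\", ""),
    "brackets_with_lex": is_skipped("[\\\\", "\\\\\\"),
    "word": is_skipped("yamr[", "amr_1"),
}
```

A transcription that consists of brackets and backslashes only is kept as soon as it comes with a lexeme, which is why such items still appear in the lists.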

In [26]:
wordlessRe = re.compile(r"^[\\\[\]≤≥?{}<>()\^]*$")
isNumber = re.compile(r"\][0-9]+\[$")

wordsSrc = []

skippedWordLines = []

for src in sources:
    bib = src == "bib"
    fieldFix = fieldFixes[bib]
    sep = "\t" if bib else " "
    greek = False
    for (i, line) in enumerate(sourceLines[src]):
        if "\u001b" in line:
            if "(f0)" in line:
                greek = True
            elif "(fy)" in line:
                greek = False
            continue
        fields = line.rstrip("\n").split(sep)
        nFields = len(fields)
        ln = i + 1
        if nFields < 3:
            continue
        if not bib:
            scroll = fields[0]
            label = fields[1].split(",")[0]
            passage = f"{scroll} {label}"
            if passage in bib2Lines:
                skippedWordLines.append(ln)
                continue
        word = fields[2]
        lex = fields[3] if nFields >= 4 else ""
        lang = ""
        parts = lex.split("@", maxsplit=1)
        if len(parts) > 1:
            (lex, morph) = parts
        else:
            parts = lex.split("%", maxsplit=1)
            if len(parts) > 1:
                (lex, morph) = parts
                lang = "a"
            else:
                morph = ""

        if ln in fieldFix:
            for (field, (fr, to, expl)) in fieldFix[ln].items():
                iVal = (
                    word
                    if field == "trans"
                    else lex
                    if field == "lex"
                    else morph
                    if field == "morph"
                    else None
                )
                if iVal == fr:
                    if field == "trans":
                        word = to
                    elif field == "lex":
                        lex = to
                    elif field == "morph":
                        morph = to
                    print(f"{src} line {ln} field {field} fixed:\n\t{iVal}\t{to}")

        if (
            word == "/" or wordlessRe.match(word) or isNumber.match(word)
        ) and lex == "":
            continue
        theLang = "g" if greek else lang
        wordsSrc.append((bib, i + 1, word, theLang, lex, morph))
print(f"{len(wordsSrc)} lines, {len(skippedWordLines)} word lines skipped")

nonbib line 38512 field trans fixed:
	≤]	≥≤
nonbib line 48129 field morph fixed:
	vhp3cpX3mp{2}	vhp3cp{2}X3mp
nonbib line 59593 field trans fixed:
	 ± 	±
nonbib line 127763 field morph fixed:
	vhp3cpX3ms{2}	vhp3cp{2}X3ms
nonbib line 153845 field trans fixed:
	b]	b
nonbib line 153970 field trans fixed:
	b]	b
nonbib line 154026 field trans fixed:
	b]	b
nonbib line 173512 field trans fixed:
	^b	^b^
nonbib line 211343 field trans fixed:
	y»tkwØ_nw	y»tkwØnw
nonbib line 248844 field trans fixed:
	t_onh]	tonh]
nonbib line 263123 field lex fixed:
	82	kj
nonbib line 287243 field trans fixed:
	oyN_	oyN
nonbib line 290592 field trans fixed:
	a	A
nonbib line 291886 field trans fixed:
	a	A
nonbib line 324473 field trans fixed:
	[˝w»b|a|]	[w»b|a|]
nonbib line 335846 field trans fixed:
	3	
bib line 48768 field morph fixed:
	vp12ms	vp1ms
bib line 109489 field morph fixed:
	0ncfp	ncfp
bib line 115544 field morph fixed:
	\	0
bib line 124566 field lex fixed:
	jll-2	jll_2
bib line 146637 field morph fixed

In [27]:
wordsSrc[0:5]

[(False, 4, 'w', '', 'w◊', 'Pc'),
 (False, 5, 'oth', '', 'oAt;Dh', 'Pd'),
 (False, 6, 'Cmow', '', 'vmo', 'vqvmp'),
 (False, 7, 'kl', '', 'k;Ol', 'ncmsc'),
 (False, 8, 'ywdoy', '', 'ydo', 'vqPmpc')]
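
The splitting of the raw lexeme field into lexeme, morphology and language, as done in the cell for this step, can be isolated in a small function. This is a sketch that mirrors that logic; the name `split_lex` is illustrative.

```python
def split_lex(field: str):
    """Split a raw lexeme field into (lex, morph, lang).

    '@' separates lexeme and morphology; if there is no '@', a '%'
    separator is tried, and in that case lang is set to 'a'.
    """
    lang = ""
    parts = field.split("@", maxsplit=1)
    if len(parts) > 1:
        lex, morph = parts
    else:
        parts = field.split("%", maxsplit=1)
        if len(parts) > 1:
            lex, morph = parts
            lang = "a"
        else:
            lex, morph = field, ""
    return lex, morph, lang


examples = [split_lex(f) for f in ("amr_1@vqw3ms", "l\\\\%0", "")]
```

Since `@` is tried first, a field that contains both separators is split at the `@` and the `%` stays inside the lexeme.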

# Step 6

The comparison.
In the companion module `checksLib.py` we have defined a few handy functions.

In [28]:
CC = Compare(sourceLines, wordsSrc, A.api, wordsTf)

We demonstrate a few functions that help with the comparison.

We need to peek into the source files, at a line number with some context.

In [29]:
CC.showSrc(True, 18)

    B16: Gen 1:20       ┃1Q1 f1:1       ┃w              ┃w◊@Pc          ┃41.5           
    B17: Gen 1:20       ┃1Q1 f1:1       ┃yamr[          ┃amr_1@vqw3ms   ┃42             
>>> B18: Gen 1:20       ┃1Q1 f1:1       ┃/              ┃               ┃54             
    B19: Gen 1:20       ┃1Q1 f1:2       ┃]alhyM         ┃aTløhIyM@ncmp  ┃55             
    B20: Gen 1:20       ┃1Q1 f1:2       ┃yC[rwxw        ┃vrX@vqi3mp     ┃56             


The function `showTf` looks up a line number in TF.

In [30]:
CC.showTf(True, 18)

    B16: Gen 1:20       ┃1Q1 f1:1       ┃w              ┃ ┃w◊             ┃Pc┃1889893┃
    B17: Gen 1:20       ┃1Q1 f1:1       ┃yamr[          ┃ ┃amr_1          ┃vqw3ms┃1889894┃
>>> B18: no nodes
    B19: Gen 1:20       ┃1Q1 f1:2       ┃]alhyM         ┃ ┃aTløhIyM       ┃ncmp┃1889895┃
    B20: Gen 1:20       ┃1Q1 f1:2       ┃yC[rwxw        ┃ ┃vrX            ┃vqi3mp┃1889896┃


The function `showDiff` combines `firstDiff`, `showSrc` and `showTf` to give a meaningful display of the first difference,
as we'll see later.
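
The function `firstDiff` lives in `checksLib.py` and is not shown in this notebook. The sketch below shows what such a function could look like; it is an assumption for illustration, not the actual implementation.

```python
def first_diff(src, tf):
    """Return the index of the first differing item, or None if equal.

    If one list is a proper prefix of the other, the index just past the
    shorter list is returned.
    """
    for i, (s, t) in enumerate(zip(src, tf)):
        if s != t:
            return i
    if len(src) != len(tf):
        return min(len(src), len(tf))
    return None


# Made-up records, shortened to three fields.
a = [(False, 4, "w"), (False, 5, "oth"), (True, 1, "x")]
b = [(False, 4, "w"), (False, 5, "oth!"), (True, 1, "x")]

same = first_diff(a, a)
changed = first_diff(a, b)
shorter = first_diff(a, a[:2])
```

Returning an index rather than a boolean makes it possible to look up the surrounding lines in both the source and TF.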

# Step 7

Now we can run the comparison.

In [31]:
CC.showDiff()

EQUAL


# Step 8

That is easily said, but we can also compare the two lists directly and transparently:

In [32]:
wordsSrc == wordsTf

True

Let's deliberately distort something and run the comparison again.

In [33]:
nr = 200000
item = list(wordsSrc[nr])
item

[False, 258361, 'm|\\]', '', 'm\\\\', '0']

In [34]:
item[3] = "a"
wordsSrc[nr] = tuple(item)

In [35]:
CC.showDiff()

item 200000:
TF  N258361 m|\]           ┃               ┃m\\            ┃0              
SRC N258361 m|\]           ┃a              ┃m\\            ┃0              
TF:
    N258360: 4Q496 f20:2    ┃[\\            ┃ ┃\\\            ┃0┃1807071┃
>>> N258361: 4Q496 f20:2    ┃m|\]           ┃ ┃m\\            ┃0┃1807072┃
    N258362: 4Q496 f20:2    ┃--             ┃ ┃\              ┃0┃1807073┃
SRC:
    N258360: 4Q496          ┃f20:2,3.1      ┃[\\            ┃\\\@0          
>>> N258361: 4Q496          ┃f20:2,4.1      ┃m|\]           ┃m\\@0          
    N258362: 4Q496          ┃f20:2,5.1      ┃--             ┃\@0            
