# Tweak the Afifi cleaned file

At the end of the Fusus workflow the OCRed Afifi edition is produced as a `.tsv` file.

Cornelis has imported this file into Pandas, cleaned it, and saved it as a `.csv` file.
In that process, the punctuation after words has gone missing.

We reinsert it, based on the `.tsv` file.
However, the two files do not have the same number of rows: the cleaned file has roughly 8,600 fewer rows.

So we have to decide carefully which punctuation to reinsert.

We should not insert punctuation of the form `(` and `)`.

## Actions

1. The rows in the cleaned file are authoritative. Rows have been deleted from the original file for a reason.
2. We will not add rows to the cleaned file.
3. We compare rows in both files on the basis of their first eight columns: page, stripe, column, line, left, top, right, bottom;
   we join these fields with a `,` and call the result the *key* of the row.
4. We perform sanity checks on the keys to guarantee a 1-1 correspondence between keys and rows.
5. We write diagnostic files containing the keys that are in one file but not in the other (cleaned and original).
6. We write a file listing the non-empty, non-space punctuation, with a column containing the word `added` if we have
   added the punctuation or `ignored` if we have ignored it.
7. We produce a tweaked cleaned file, `AfifiTweaked.csv`, based on `AfifiCleaned.csv`:
   * It has the same rows, but each row has an extra field
     at the end with the punctuation from the corresponding row in the original.
   * The value of the `column` field in the original is quite often the empty string,
     and has been converted to `0` in the cleaned file.
     We restore those zeroes to empty strings in the resulting tweaked file.
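
Steps 3 and 4 can be sketched on a few rows. This is only an illustration: `row_key` and `find_duplicate_keys` are hypothetical helper names, the sample rows are invented, and the key width of 8 is taken from the `fields[0:8]` slice used by `makeIndex` in this notebook.

```python
from collections import Counter


def row_key(fields, width=8):
    """Join the first `width` fields with commas to form the row key."""
    return ",".join(fields[:width])


def find_duplicate_keys(rows, width=8):
    """Return, in sorted order, the keys that occur in more than one row."""
    counts = Counter(row_key(fields, width) for fields in rows)
    return sorted(key for key, n in counts.items() if n > 1)


# Invented sample rows: 8 key fields followed by confidence and letters.
rows = [
    ["1", "1", "", "1", "100", "20", "150", "40", "96", "abc"],
    ["1", "1", "", "1", "160", "20", "210", "40", "93", "def"],
    ["1", "1", "", "1", "100", "20", "150", "40", "90", "ghi"],  # same key as row 0
]

print(row_key(rows[0]))
print(find_duplicate_keys(rows))
```

An empty result from `find_duplicate_keys` means that keys and rows are in 1-1 correspondence, which is the property the later steps rely on.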

In [1]:
import os
import collections

In [2]:
BASE_DIR = os.path.expanduser("~/github/among/fusus")

CLEAN_FILE = f"{BASE_DIR}/fusust-text-laboratory/AfifiCleaned.csv"
ORIG_FILE = f"{BASE_DIR}/ur/Afifi/allpages.tsv"
TWEAK_FILE = f"{BASE_DIR}/fusust-text-laboratory/AfifiTweaked.csv"

CLEAN_MISSING = f"{BASE_DIR}/fusust-text-laboratory/AfifiDeletedRows.tsv"
ORIG_MISSING = f"{BASE_DIR}/fusust-text-laboratory/AfifiNotFoundRows.tsv"

PUNC_ADDED = f"{BASE_DIR}/fusust-text-laboratory/AfifiAddedPunc.tsv"

# Read the original file and make an index

We build an index whose keys are the combination of the page, stripe, column, line, left, top, right, bottom fields,
and whose values are tuples of the remaining fields.

We detect when multiple rows have the same key.

We do this for both the original and the cleaned file.

The two files use different field separators, so we pass the separator as an argument.

We pass `correct=2` to replace zeroes in the field at index 2 (`column`) by empty strings; we need this for the cleaned file.
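
To see why the correction matters for matching, here is a small sketch with two invented lines, one tab-separated as in the original and one comma-separated as in the cleaned file. `key_of` is a hypothetical helper that mirrors only the key-building part of the function below.

```python
def key_of(line, sep, correct=None, width=8):
    """Build the key of one line, optionally mapping "0" to "" in one field."""
    fields = line.rstrip("\n").split(sep)
    if correct is not None and fields[correct] == "0":
        fields[correct] = ""
    return ",".join(fields[:width])


# Invented lines: the original has an empty column field, the cleaned file has 0.
orig_line = "3\t2\t\t5\t100\t20\t150\t40\t96\tabc\t \n"
clean_line = "3,2,0,5,100,20,150,40,96,abc\n"

print(key_of(orig_line, "\t"))
print(key_of(clean_line, ","))
print(key_of(clean_line, ",", correct=2))
```

Without the correction the two keys differ in the third field, so the cleaned row would not be found in the original; with `correct=2` they are identical.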


In [3]:
def makeIndex(path, label, sep, correct=None):
    rowIndex = {}
    duplicateKeys = {}

    with open(path, encoding="utf8") as fh:
        # skip the header line
        next(fh)
        for line in fh:
            fields = line.rstrip("\n").split(sep)
            # optionally restore a "0" in the given field to the empty string
            if correct is not None:
                if fields[correct] == "0":
                    fields[correct] = ""
            key = ",".join(fields[0:8])
            value = tuple(fields[8:])

            # remember all values of keys that occur more than once
            if key in rowIndex:
                if key in duplicateKeys:
                    duplicateKeys[key].append(value)
                else:
                    duplicateKeys[key] = [rowIndex[key], value]
            rowIndex[key] = value

    print(f"INFO: {label}: There are {len(rowIndex)} keys")

    if duplicateKeys:
        print(f"WARNING: {label}: There are {len(duplicateKeys)} keys with multiple rows")
    else:
        print(f"OK: {label}: No keys with multiple rows")

    return rowIndex
 

origRowIndex = makeIndex(ORIG_FILE, "original file", "\t")
cleanRowIndex = makeIndex(CLEAN_FILE, "cleaned file", ",", correct=2)

INFO: original file: There are 48871 keys
OK: original file: No keys with multiple rows
INFO: cleaned file: There are 40271 keys
OK: cleaned file: No keys with multiple rows


So far so good.

# Check for missing keys

Report the cases where a key of the cleaned file cannot be found in the original file, and vice versa.

In [4]:
def checkIndex(sourceIndex, sourceLabel, targetIndex, targetLabel, path):
    """Report the keys in targetIndex that are not in sourceIndex.

    Write the offending keys to path.
    """

    n = 0

    with open(path, "w", encoding="utf8") as fh:
        fh.write("key\tvalue\n")

        for key in targetIndex:
            if key not in sourceIndex:
                value = ",".join(targetIndex[key])
                fh.write(f"{key}\t{value}\n")
                n += 1

    if n == 0:
        print(f"OK: all {targetLabel} keys are also {sourceLabel} keys")
    else:
        print(f"WARNING: {n} {targetLabel} keys are not a {sourceLabel} key")
        print(f"See {path}\n")
 
 
checkIndex(origRowIndex, "original", cleanRowIndex, "cleaned", ORIG_MISSING)
checkIndex(cleanRowIndex, "cleaned", origRowIndex, "original", CLEAN_MISSING)

OK: all cleaned keys are also original keys
WARNING: 8600 original keys are not a cleaned key
See /Users/dirk/github/among/fusus/fusust-text-laboratory/AfifiDeletedRows.tsv



Good.

We now know a lot of things:

* We can identify rows by their keys, both in the original and in the cleaned files.
* We find exactly one matching original row for each cleaned row. 
* We have an overview of all original rows that did not make it to the cleaned file, in `AfifiDeletedRows.tsv`.
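
The comparison made by `checkIndex` amounts to two set differences on the index keys. A minimal sketch, using invented toy indexes with shortened keys rather than the real data:

```python
# Invented toy indexes: key -> tuple of the remaining fields.
orig_index = {
    "1,1,,1": ("96", "abc", " "),
    "1,1,,2": ("90", "def", "."),
    "1,1,,3": ("40", "x", ""),
}
clean_index = {
    "1,1,,1": ("96", "abc"),
    "1,1,,2": ("90", "def"),
}

# Dictionary key views support set operations directly.
deleted = orig_index.keys() - clean_index.keys()    # rows dropped during cleaning
not_found = clean_index.keys() - orig_index.keys()  # cleaned rows without an original

print(sorted(deleted))
print(sorted(not_found))
```

An empty `not_found` corresponds to the `OK` message above: every cleaned row can be looked up in the original index.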

# Produce the tweaked file

We can now reliably take the punctuation field from the original row, add it to the matching cleaned row, and write the result to the tweaked file.

We also produce a file that lists the non-empty added punctuation.
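
The rule for parentheses can be isolated in a small function. This is a sketch only: `filter_punc` is a hypothetical name, and it reproduces the decision made in the cell below on a few invented punctuation values.

```python
def filter_punc(punc):
    """Return (text to insert, status) for one punctuation field.

    Parentheses are never reinserted; the rest of the field is kept.
    """
    if "(" in punc or ")" in punc:
        return punc.replace("(", "").replace(")", ""), "ignored"
    return punc, "added"


# Invented sample values of the punctuation field.
for sample in [". ", "(", ") ", " ", ""]:
    print(repr(sample), filter_punc(sample))
```

Note that the notebook only records a status for punctuation that is neither empty nor a single space; the sketch returns a status for every value to keep the function simple.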

In [7]:
nTotal = 0
nNonEmpty = 0
nNonWhite = 0
nIgnored = 0

with open(TWEAK_FILE, "w", encoding="utf8") as twf, \
        open(PUNC_ADDED, "w", encoding="utf8") as taf:
    twf.write("page,stripe,column,line,left,top,right,bottom,confidence,letters,punc\n")
    # the status column holds "added" or "ignored"
    taf.write("key\tstatus\tpunc\tletters\torigletters\n")

    for (key, value) in cleanRowIndex.items():
        # every cleaned key also occurs in the original (checked above)
        origValue = origRowIndex[key]

        (confidence, letters) = value[0:2]
        (origConfidence, origLetters, punc) = origValue[0:3]

        # do not reinsert parentheses: strip them from the punctuation
        ignored = "(" in punc or ")" in punc
        puncRep = punc.replace("(", "").replace(")", "") if ignored else punc

        twf.write(f"{key},{confidence},{letters},{puncRep}\n")
        nTotal += 1

        if punc:
            nNonEmpty += 1
            if punc != " ":
                nNonWhite += 1
                if ignored:
                    nIgnored += 1
                    ignoredRep = "ignored"
                else:
                    ignoredRep = "added"
                # log only the non-empty, non-space punctuation
                taf.write(f"{key}\t{ignoredRep}\t{punc}\t{letters}\t{origLetters}\n")

print(f"""Punc fields

{nTotal} rows with:
{nNonEmpty} times non-empty punctuation of which:
{nNonWhite} times not a space of which:
{nIgnored} times ignored and replaced by a space
""")

Punc fields

40271 rows with:
37872 times non-empty punctuation of which:
924 times not a space of which:
436 times ignored and replaced by a space

