{ "cells": [ { "cell_type": "markdown", "id": "77c9ff0f-3524-4db5-ba2f-2e668e815a4d", "metadata": { "tags": [] }, "source": [ "# Tweak the Afifi cleaned file\n", "\n", "At the end of the Fusus workflow the OCRed Afifi edition is produced as a `.tsv` file.\n", "\n", "Cornelis has imported this file into Pandas, cleaned it, and saved it as a `.csv` file.\n", "In that process, the punctuation after words has gone missing.\n", "\n", "We reinsert that, based on the `.tsv` file.\n", "However, the amount of rows in both files is not equal, the cleaned file has ca. 8000 rows less.\n", "\n", "So, we have to look carefully which punctuation we are going to reinsert.\n", "\n", "We should not insert punctuation of the form `(` and `)`.\n", "\n", "## Actions\n", "\n", "1. The rows in the cleaned file are authoritative. Rows have been deleted from the original file for a reason.\n", "2. We will not add rows to the cleaned file.\n", "3. We compare rows in both files on the bases of the first columns: page, stripe, column, line, direction, left, top, right, bottom;\n", " we join these fields with a `,` and call that the *key* of the row.\n", "4. We perform sanity checks on the keys to guarantee that there is a 1-1 correspondence between keys and rows\n", "5. We write diagnostic files containing the keys that are in one file but not in the other (cleaned and original)\n", "6. We write a file with the non-empty, non-space punctuation in it, with an column with the word `added` if we have\n", " added the punctuation or `ignored` if we have ignored it\n", "7. We produce a tweaked cleaned file, AfifiTweaked.csv, based on AfifiCleaned.csv, with\n", " * the same rows, but each row having an extra field\n", " at the end with punctuation from a corresponding row in the original.\n", " * The value for column in the original is quite often the empty string,\n", " and has been converted to `0` in the cleaned file.\n", " We restore those zeroes to empty strings in the resulting tweaked file." ] }, { "cell_type": "code", "execution_count": 1, "id": "fc7f14dc-2fbb-4bb9-9f60-c734e77f0120", "metadata": {}, "outputs": [], "source": [ "import os\n", "import collections" ] }, { "cell_type": "code", "execution_count": 2, "id": "d5a50a29-a381-436e-9f3a-de3dda04f948", "metadata": {}, "outputs": [], "source": [ "BASE_DIR = os.path.expanduser(\"~/github/among/fusus\")\n", "\n", "CLEAN_FILE = f\"{BASE_DIR}/fusust-text-laboratory/AfifiCleaned.csv\"\n", "ORIG_FILE = f\"{BASE_DIR}/ur/Afifi/allpages.tsv\"\n", "TWEAK_FILE = f\"{BASE_DIR}/fusust-text-laboratory/AfifiTweaked.csv\"\n", "\n", "CLEAN_MISSING = f\"{BASE_DIR}/fusust-text-laboratory/AfifiDeletedRows.tsv\"\n", "ORIG_MISSING = f\"{BASE_DIR}/fusust-text-laboratory/AfifiNotFoundRows.tsv\"\n", "\n", "PUNC_ADDED = f\"{BASE_DIR}/fusust-text-laboratory/AfifiAddedPunc.tsv\"" ] }, { "cell_type": "markdown", "id": "d3d03df0-9011-46c1-ab5e-ff351d67cf26", "metadata": {}, "source": [ "# Read the original file and make an index\n", "\n", "We make an index with keys the combination of page, stripe, column, line, left, top, right, bottom fields,\n", "and as values tuples of the rest of the fields.\n", "\n", "We detect when multiple rows have the same keys.\n", "\n", "We do this for both the original and the cleaned file.\n", "\n", "Note that both files have different field separators, so we pass it.\n", "\n", "We pass `correct=2` to replace zeroes in column 2 by empty strings; we need it for the cleaned file.\n" ] }, { "cell_type": "code", "execution_count": 3, "id": "70565427-6566-4e3d-9489-03b64599c90d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "INFO: original file: There are 48871 keys\n", "OK: original file: No keys with multiple rows\n", "INFO: cleaned file: There are 40271 keys\n", "OK: cleaned file: No keys with multiple rows\n" ] } ], "source": [ "def makeIndex(path, label, sep, correct=None):\n", " rowIndex = {}\n", " duplicateKeys = {}\n", "\n", " with open(path) as fh:\n", " next(fh)\n", " for line in fh:\n", " fields = line.rstrip(\"\\n\").split(sep)\n", " if correct is not None:\n", " if fields[correct] == \"0\":\n", " fields[correct] = \"\"\n", " key = \",\".join(fields[0:8])\n", " value = tuple(fields[8:])\n", "\n", " if key in rowIndex:\n", " if key in duplicateKeys:\n", " duplicateKeys[key].append(value)\n", " else:\n", " duplicateKeys[key] = [rowIndex[key], value]\n", " rowIndex[key] = value\n", "\n", " print(f\"INFO: {label}: There are {len(rowIndex)} keys\")\n", "\n", " if duplicateKeys:\n", " print(f\"WARNING: {label}: There are {len(duplicateKeys)} keys with multiple rows\")\n", " else:\n", " print(f\"OK: {label}: No keys with multiple rows\")\n", " \n", " return rowIndex\n", " \n", "\n", "origRowIndex = makeIndex(ORIG_FILE, \"original file\", \"\\t\")\n", "cleanRowIndex = makeIndex(CLEAN_FILE, \"cleaned file\", \",\", correct=2)" ] }, { "cell_type": "markdown", "id": "90f62695-82fe-4a38-bc86-9dc86d028fa3", "metadata": {}, "source": [ "So far so good." ] }, { "cell_type": "markdown", "id": "d90fd295-c13f-42e1-9925-c9a0e730ab47", "metadata": { "tags": [] }, "source": [ "# Check for missing keys\n", "\n", "Report the cases where a key of the cleaned file cannot be found in the original file and vice versa" ] }, { "cell_type": "code", "execution_count": 4, "id": "727832e2-a7e1-4811-849a-fef45f2281d6", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "OK: all cleaned keys are also original keys\n", "WARNING: 8600 original keys are not a cleaned key\n", "See /Users/dirk/github/among/fusus/fusust-text-laboratory/AfifiDeletedRows.tsv\n", "\n" ] } ], "source": [ "def checkIndex(sourceIndex, sourceLabel, targetIndex, targetLabel, path):\n", " \"\"\"Report the keys in targetIndex that are not in sourceIndex.\n", " \n", " Write the offending keys to path.\n", " \"\"\"\n", " \n", " n = 0\n", " \n", " with open(path, \"w\") as fh:\n", " fh.write(\"key\\tvalue\\n\")\n", " \n", " for key in targetIndex:\n", " if key not in sourceIndex:\n", " value = \",\".join(targetIndex[key])\n", " fh.write(f\"{key}\\t{value}\\n\")\n", " n += 1\n", " \n", " if n == 0:\n", " print(f\"OK: all {targetLabel} keys are also {sourceLabel} keys\")\n", " else:\n", " print(f\"WARNING: {n} {targetLabel} keys are not a {sourceLabel} key\")\n", " print(f\"See {path}\\n\")\n", " \n", " \n", "checkIndex(origRowIndex, \"original\", cleanRowIndex, \"cleaned\", ORIG_MISSING)\n", "checkIndex(cleanRowIndex, \"cleaned\", origRowIndex, \"original\", CLEAN_MISSING)" ] }, { "cell_type": "markdown", "id": "067d017c-2b91-476a-abb6-a847b7e1cf09", "metadata": {}, "source": [ "Good.\n", "\n", "We now know a lot of things:\n", "\n", "* We can identify rows by their keys, both in the original and in the cleaned files.\n", "* We find exactly one matching original row for each cleaned row. \n", "* We have an overview of all original rows that did not make it to the cleaned file: " ] }, { "cell_type": "markdown", "id": "4a64d864-0087-4318-a625-dd805d925a0d", "metadata": {}, "source": [ "# Produce the tweaked file\n", "\n", "We can now reliably add the punctuation row from the original to the cleaned row and put it in the tweaked file.\n", "\n", "We also produce a file that lists the non-empty added punctuation." ] }, { "cell_type": "code", "execution_count": 7, "id": "02f14dca-11a7-4c82-be5c-0e44f7765c6d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Punc fields\n", "\n", "40271 rows with:\n", "37872 times non-empty punctuation of which:\n", "924 times not a space of which:\n", "436 times ignored and replaced by a space\n", "\n" ] } ], "source": [ "twf = open(TWEAK_FILE, \"w\")\n", "twf.write(\"page,stripe,column,line,left,top,right,bottom,confidence,letters,punc\\n\")\n", "\n", "taf = open(PUNC_ADDED, \"w\")\n", "taf.write(\"key\\tpunc\\tletters\\torigletters\\n\")\n", "\n", "nTotal = 0\n", "nNonEmpty = 0\n", "nNonWhite = 0\n", "nIgnored = 0\n", "\n", "for (key, value) in cleanRowIndex.items():\n", " origValue = origRowIndex[key]\n", " \n", " (confidence, letters) = value[0:2]\n", " (origConfidence, origLetters, punc) = origValue[0:3]\n", " \n", " if \"(\" in punc or \")\" in punc:\n", " ignored = True\n", " puncRep = punc.replace(\"(\", \"\").replace(\")\", \"\")\n", " else:\n", " ignored = False\n", " puncRep = punc\n", " twf.write(f\"{key},{confidence},{letters},{puncRep}\\n\")\n", " nTotal += 1\n", " \n", " if punc:\n", " nNonEmpty += 1\n", " if punc != \" \":\n", " nNonWhite += 1\n", " if ignored:\n", " nIgnored += 1\n", " ignoredRep = \"ignored\"\n", " else:\n", " ignored = False\n", " ignoredRep = \"added\"\n", " taf.write(f\"{key}\\t{ignoredRep}\\t{punc},{letters},{origLetters}\\n\")\n", " \n", "twf.close()\n", "taf.close()\n", "\n", "print(f\"\"\"Punc fields\n", "\n", "{nTotal} rows with:\n", "{nNonEmpty} times non-empty punctuation of which:\n", "{nNonWhite} times not a space of which:\n", "{nIgnored} times ignored and replaced by a space\n", "\"\"\")" ] }, { "cell_type": "code", "execution_count": null, "id": "906bda07-9b1a-4680-ab2f-2864b522f93e", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 5 }