{ "cells": [ { "cell_type": "code", "execution_count": 20, "id": "ac7c005d-e0d9-493c-80c6-60423d5ae9ed", "metadata": {}, "outputs": [], "source": [ "import collections\n", "import json\n", "from collatex import collate\n", "from tf.app import use" ] }, { "cell_type": "code", "execution_count": 2, "id": "d8934f39-9883-4784-9c3e-89d53acaa4d0", "metadata": {}, "outputs": [], "source": [ "BASE = \"~/github/among/fusus\"\n", "VERSION = \"0.7\"" ] }, { "cell_type": "code", "execution_count": 3, "id": "6d247b52-bac4-4789-a3b2-241e7c76812c", "metadata": {}, "outputs": [], "source": [ "LK = \"LK\"\n", "AF = \"AF\"\n", "\n", "EDITIONS = {\n", " LK: \"Lakhnawi\",\n", " AF: \"Afifi\",\n", "}\n", "\n", "A = {}\n", "F = {}\n", "maxSlot = {}" ] }, { "cell_type": "code", "execution_count": 4, "id": "56918365-efcb-4efa-a22e-7b2ddb29015f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "data: ~/github/among/fusus/tf/Lakhnawi/0.7" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "This is Text-Fabric 9.1.3\n", "Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html\n", "\n", "27 features found and 0 ignored\n" ] }, { "data": { "text/html": [ "Text-Fabric: Text-Fabric API 9.1.3, no app configured
Data: among/fusus/tf/Lakhnawi/0.7
Features:
among/fusus/tf/Lakhnawiboxb
boxl
boxr
boxt
dir
fass
letters
lettersn
lettersp
letterst
ln
lwcvl
n
np
otype
poetrymeter
poetryverse
punc
punca
puncb
puncba
qunawims
quran
raw
title
oslots
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "data: ~/github/among/fusus/tf/Afifi/0.7" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "This is Text-Fabric 9.1.3\n", "Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html\n", "\n", "17 features found and 0 ignored\n" ] }, { "data": { "text/html": [ "Text-Fabric: Text-Fabric API 9.1.3, no app configured
Data: among/fusus/tf/Afifi/0.7
Features:
among/fusus/tf/Afifib
boxb
boxl
boxr
boxt
confidence
letters
lettersn
lettersp
letterst
ln
n
otype
punc
punca
oslots
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "{'LK': 40379, 'AF': 40271}" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "for (acro, name) in EDITIONS.items():\n", " A[acro] = use(f\"among/fusus/tf/{name}:clone\", writing=\"ara\", version=VERSION)\n", " F[acro] = A[acro].api.F\n", " maxSlot[acro] = F[acro].otype.maxSlot\n", "maxSlot" ] }, { "cell_type": "code", "execution_count": 5, "id": "f8c11a47-93fb-4607-a86a-99aeca0208d6", "metadata": {}, "outputs": [], "source": [ "getTextLK = F[LK].lettersn.v\n", "getTextAF = F[AF].lettersn.v\n", "maxLK = maxSlot[LK]\n", "maxAF = maxSlot[AF]" ] }, { "cell_type": "markdown", "id": "b29c08da-2c18-4905-ac40-bf450e2b57ec", "metadata": {}, "source": [ "# Exploring\n", "\n", "First a small example." ] }, { "cell_type": "code", "execution_count": 6, "id": "93ba9c0b-780d-4f76-a202-d2d5489c6428", "metadata": {}, "outputs": [], "source": [ "tokensLK = [dict(t=f\"{getTextLK(slot)} \", s=slot) for slot in range(1, 10)]\n", "tokensAF = [dict(t=f\"{getTextAF(slot)} \", s=slot) for slot in range(1, 10)]\n", "\n", "data = dict(\n", " witnesses=[\n", " dict(id=LK, tokens=tokensLK),\n", " dict(id=AF, tokens=tokensAF),\n", " ],\n", ")" ] }, { "cell_type": "code", "execution_count": 7, "id": "8685fcf3-c604-4f3c-9683-769656287105", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Run collatex\n", " 0.01s Done\n" ] } ], "source": [ "A[LK].indent(reset=True)\n", "A[LK].info(\"Run collatex\")\n", "result = collate(data, output=\"json\", segmentation=False, near_match=True)\n", "resultAscii = collate(data, output=\"table\", segmentation=False, near_match=True)\n", "A[LK].info(\"Done\")" ] }, { "cell_type": "code", "execution_count": 8, "id": "a6e9e4b2-b375-4de6-8533-8a1d58b6f247", "metadata": {}, 
"outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+----+-----------+-------+-------+-----+------+-------+-----+------+-------+--------+--------+\n", "| LK | - | - | ālḥmd | llh | mnzl | ālḥkm | ʿlá | ḳlwb | ālklm | bāḥdyŧ | ālṭryḳ |\n", "| AF | bnzlylālʿ | ylrʿā | ālḥmd | lh | mnzl | ālḥk | ʿlá | ḳlwb | ālklm | - | - |\n", "+----+-----------+-------+-------+-----+------+-------+-----+------+-------+--------+--------+\n" ] } ], "source": [ "print(resultAscii)" ] }, { "cell_type": "code", "execution_count": 9, "id": "772b669f-48b5-49b6-b631-9e60f26413ef", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[None, None, [{'_sigil': 'LK', '_token_array_position': 0, 's': 1, 't': 'ālḥmd '}], [{'_sigil': 'LK', '_token_array_position': 1, 's': 2, 't': 'llh '}], [{'_sigil': 'LK', '_token_array_position': 2, 's': 3, 't': 'mnzl '}], [{'_sigil': 'LK', '_token_array_position': 3, 's': 4, 't': 'ālḥkm '}], [{'_sigil': 'LK', '_token_array_position': 4, 's': 5, 't': 'ʿlá '}], [{'_sigil': 'LK', '_token_array_position': 5, 's': 6, 't': 'ḳlwb '}], [{'_sigil': 'LK', '_token_array_position': 6, 's': 7, 't': 'ālklm '}], [{'_sigil': 'LK', '_token_array_position': 7, 's': 8, 't': 'bāḥdyŧ '}], [{'_sigil': 'LK', '_token_array_position': 8, 's': 9, 't': 'ālṭryḳ '}]]\n", "=========\n", "[[{'_sigil': 'AF', '_token_array_position': 10, 's': 1, 't': 'bnzlylālʿ '}], [{'_sigil': 'AF', '_token_array_position': 11, 's': 2, 't': 'ylrʿā '}], [{'_sigil': 'AF', '_token_array_position': 12, 's': 3, 't': 'ālḥmd '}], [{'_sigil': 'AF', '_token_array_position': 13, 's': 4, 't': 'lh '}], [{'_sigil': 'AF', '_token_array_position': 14, 's': 5, 't': 'mnzl '}], [{'_sigil': 'AF', '_token_array_position': 15, 's': 6, 't': 'ālḥk '}], [{'_sigil': 'AF', '_token_array_position': 16, 's': 7, 't': 'ʿlá '}], [{'_sigil': 'AF', '_token_array_position': 17, 's': 8, 't': 'ḳlwb '}], [{'_sigil': 'AF', '_token_array_position': 18, 's': 9, 't': 'ālklm '}], None, None]\n" ] } ], 
"source": [ "output = json.loads(result)[\"table\"]\n", "outputLK = output[0]\n", "outputAF = output[1]\n", "\n", "print(output[0])\n", "print(\"=========\")\n", "print(output[1])" ] }, { "cell_type": "markdown", "id": "0dafc903-cd5d-42da-8791-513469d1b712", "metadata": {}, "source": [ "# Postprocessing\n", "\n", "We need to turn the output into a clean alignment list." ] }, { "cell_type": "code", "execution_count": 10, "id": "152695e0-0c5d-4113-9925-dd0d45b4d39c", "metadata": {}, "outputs": [], "source": [ "def makeAlignment(result):\n", " output = json.loads(result)[\"table\"]\n", " outputLK = output[0]\n", " outputAF = output[1]\n", " \n", " alignment = []\n", " for (chunkLK, chunkAF) in zip(outputLK, outputAF):\n", " if chunkLK is None:\n", " iLK = \"\"\n", " textLK = \"\"\n", " else:\n", " iLK = chunkLK[0][\"s\"]\n", " textLK = chunkLK[0][\"t\"]\n", " if chunkAF is None:\n", " iAF = \"\"\n", " textAF = \"\"\n", " else:\n", " iAF = chunkAF[0][\"s\"]\n", " textAF = chunkAF[0][\"t\"]\n", " alignment.append((iLK, textLK, textAF, iAF))\n", " \n", " return alignment" ] }, { "cell_type": "code", "execution_count": 11, "id": "fd384dac-aed8-4667-a8ea-5c0446d87753", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('', '', 'bnzlylālʿ ', 1),\n", " ('', '', 'ylrʿā ', 2),\n", " (1, 'ālḥmd ', 'ālḥmd ', 3),\n", " (2, 'llh ', 'lh ', 4),\n", " (3, 'mnzl ', 'mnzl ', 5),\n", " (4, 'ālḥkm ', 'ālḥk ', 6),\n", " (5, 'ʿlá ', 'ʿlá ', 7),\n", " (6, 'ḳlwb ', 'ḳlwb ', 8),\n", " (7, 'ālklm ', 'ālklm ', 9),\n", " (8, 'bāḥdyŧ ', '', ''),\n", " (9, 'ālṭryḳ ', '', '')]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alignment = makeAlignment(result)\n", "alignment" ] }, { "cell_type": "markdown", "id": "bc7efb32-5cc5-46a6-a305-bee85f281ddf", "metadata": {}, "source": [ "# Bigger sizes\n", "\n", "How is the performance?" 
] }, { "cell_type": "code", "execution_count": 35, "id": "496f7e64-dd71-4325-9090-957266b31d9d", "metadata": {}, "outputs": [], "source": [ "def test(size=None):\n", " sizeLK = maxLK if size is None else size\n", " sizeAF = maxAF if size is None else size\n", " tokensLK = [dict(t=f\"{getTextLK(slot)} \", s=slot) for slot in range(1, sizeLK)]\n", " tokensAF = [dict(t=f\"{getTextAF(slot)} \", s=slot) for slot in range(1, sizeAF)]\n", "\n", " data = dict(\n", " witnesses=[\n", " dict(id=LK, tokens=tokensLK),\n", " dict(id=AF, tokens=tokensAF),\n", " ],\n", " )\n", " A[LK].indent(reset=True)\n", " A[LK].info(\"Run collatex\")\n", " result = collate(data, output=\"json\", segmentation=False, near_match=True)\n", " A[LK].info(\"collation done\")\n", " alignment = makeAlignment(result)\n", " A[LK].info(f\"postprocessing done. {len(alignment)} entries in alignment table\")\n", " return alignment" ] }, { "cell_type": "code", "execution_count": 36, "id": "bb75cc81-f471-4c56-8843-1e1f4d294ba2", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Run collatex\n", " 0.00s collation done\n", " 0.00s postprocessing done. 11 entries in alignment table\n" ] } ], "source": [ "alignment = test(10)" ] }, { "cell_type": "code", "execution_count": 37, "id": "fc4344ab-7257-426c-92e9-27ada332a7de", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Run collatex\n", " 0.10s collation done\n", " 0.10s postprocessing done. 102 entries in alignment table\n" ] } ], "source": [ "alignment = test(100)" ] }, { "cell_type": "code", "execution_count": 38, "id": "9c193520-7ae6-42ed-8e3d-363b2f40fb5b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Run collatex\n", " 7.87s collation done\n", " 7.87s postprocessing done. 
1039 entries in alignment table\n" ] } ], "source": [ "alignment = test(1000)" ] }, { "cell_type": "code", "execution_count": 40, "id": "73ddbe56-207c-46ec-9c4a-8d5d9300f661", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Run collatex\n", " 34s collation done\n", " 34s postprocessing done. 2057 entries in alignment table\n" ] } ], "source": [ "alignment = test(2000)" ] }, { "cell_type": "code", "execution_count": 41, "id": "b647973d-7cd2-4674-8e6c-d4ae825e4278", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Run collatex\n", " 2m 44s collation done\n", " 2m 44s postprocessing done. 4095 entries in alignment table\n" ] } ], "source": [ "alignment = test(4000)" ] }, { "cell_type": "markdown", "id": "f387c9ae-4c30-4c1b-8839-f1dbec74ddb5", "metadata": {}, "source": [ "The performance does not scale well.\n", "Our editions are 40,000 words each, so running Collatex on the full input would take roughly 100 times\n", "as long as this, probably over 5 hours.\n", "\n", "In our case, we are sure that we do not have to compare every part of the one edition\n", "with every part of the other edition, so we do not need the quadratic effort\n", "that Collatex apparently spends.\n", "A solution would be to divide the input into 100-word chunks and run Collatex repeatedly\n", "on pairs of chunks.\n", "But that would require quite subtle coding in order to handle variants that occur at\n", "chunk boundaries.\n", "\n", "We also do not get information about the closeness of the variants.\n", "\n", "But what about the quality of the matching?\n", "\n", "We apply the same method as we did after running the algorithm of the\n", "[`compareAfLk` notebook](compareAfLk.ipynb), with minor modifications.\n", "\n", "We can only do this on the first 10% of the input, because we were not willing to wait those 5 hours."
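,
 "\n",
 "\n",
 "As a sketch of the chunking idea (a hypothetical helper `chunkPairs`, not used elsewhere in this notebook): overlapping chunks keep each Collatex run small, and the overlap gives variants near a chunk boundary a chance to align in the next chunk. The subtle part, stitching the per-chunk alignments back together, is not shown:\n",
 "\n",
 "```python\n",
 "def chunkPairs(tokensA, tokensB, size=100, overlap=10):\n",
 "    # step forward by less than the chunk size, so consecutive\n",
 "    # chunks share `overlap` tokens at their boundary\n",
 "    step = size - overlap\n",
 "    for start in range(0, max(len(tokensA), len(tokensB)), step):\n",
 "        yield (tokensA[start:start + size], tokensB[start:start + size])\n",
 "```"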
] }, { "cell_type": "code", "execution_count": 54, "id": "b680cd8a-7f47-4028-b362-67a1fb7477a7", "metadata": {}, "outputs": [], "source": [ "def printLines(start=0, end=None):\n", " if start < 0:\n", " start = 0\n", " if end is None or end > len(alignment):\n", " end = len(alignment)\n", " lines = []\n", " for (iLK, left, right, iAF) in alignment[start:end]:\n", " lines.append(f\"{iLK:>5} {left:>20} @{0 if left == right else 1} {right:<20} {iAF:>5}\")\n", " return \"\\n\".join(lines)\n", " \n", " \n", "def printDiff(before, after):\n", " print(printLines(start=len(alignment) - before))\n", " lastLK = None\n", " lastAF = None\n", " for c in range(len(alignment) - 1, -1, -1):\n", " comp = alignment[c]\n", " if lastLK is None:\n", " if comp[0]:\n", " lastLK = comp[0]\n", " if lastAF is None:\n", " if comp[3]:\n", " lastAF = comp[3]\n", " if lastLK is not None and lastAF is not None:\n", " break\n", " if lastLK is not None and lastAF is not None:\n", " for i in range(after):\n", " iLK = lastLK + 1 + i\n", " iAF = lastAF + 1 + i\n", " textLK = getTextLK(iLK) if iLK <= maxLK else \"\"\n", " textAF = getTextAF(iAF) if iAF <= maxAF else \"\"\n", " print(f\"{iLK:>5} = {textLK:>20} @{0 if textLK == textAF else 1} {textAF:<20} = {iAF:>5}\")" ] }, { "cell_type": "code", "execution_count": 55, "id": "dab9ffa0-d023-47ec-8a38-5c6c22003a01", "metadata": {}, "outputs": [], "source": [ "# this number of good lines between bad lines will not lead to the\n", "# interruption of bad stretches\n", "\n", "LOOKAHEAD = 3\n", "\n", "\n", "def analyseStretch(start, end):\n", " total = 0\n", " onlyLK = 0\n", " onlyAF = 0\n", " \n", " for (iLK, left, right, iAF) in alignment[start:end + 1]:\n", " total += 1\n", " if not iLK:\n", " onlyAF += 1\n", " if not iAF:\n", " onlyLK += 1\n", " \n", " suspect = onlyAF > 1 and onlyLK > 1 and onlyAF + onlyLK > 5\n", " return suspect\n", " \n", "def checkAlignment(lastLK, lastAF):\n", " errors = {}\n", " prevILK = 0\n", " prevIAF = 0\n", " \n", " where = 
collections.Counter()\n", " agreement = collections.Counter()\n", " badStretches = collections.defaultdict(lambda: [])\n", " \n", " startBad = 0\n", " \n", " for (c, (iLK, left, right, iAF)) in enumerate(alignment):\n", " thisBad = not iLK or not iAF\n", " # a good line between bad lines is counted as bad\n", " if not thisBad and startBad:\n", " nextGood = True\n", " for j in range(1, LOOKAHEAD + 1):\n", " if c + j < len(alignment):\n", " compJ = alignment[c + j]\n", " if not compJ[0] or not compJ[-1]:\n", " nextGood = False\n", " break\n", " if not nextGood:\n", " thisBad = True\n", " if startBad:\n", " if not thisBad:\n", " badStretches[c - startBad].append(startBad)\n", " startBad = 0\n", " else:\n", " if thisBad:\n", " startBad = c\n", " \n", " agreement[0 if left == right else 1] += 1\n", " \n", " if iLK:\n", " if iLK != prevILK + 1:\n", " errors.setdefault(\"wrong iLK\", []).append(f\"{c:>5}: Expected {prevILK + 1}, found {iLK}\")\n", " prevILK = iLK\n", " if iAF:\n", " where[\"both\"] += 1\n", " else:\n", " where[AF] += 1\n", " if iAF:\n", " if iAF != prevIAF + 1:\n", " errors.setdefault(\"wrong iAF\", []).append(f\"{c:>5}: Expected {prevIAF + 1}, found {iAF}\")\n", " prevIAF = iAF\n", " else:\n", " where[LK] += 1\n", " \n", " if startBad:\n", " badStretches[len(alignment) - startBad].append(startBad)\n", " \n", " if prevILK < lastLK:\n", " errors.setdefault(\"missing iLKs at the end\", []).append(f\"last is {prevILK}, expected {lastLK}\")\n", " elif prevILK > lastLK:\n", " errors.setdefault(\"too many iLKs at the end\", []).append(f\"last is {prevILK}, expected {lastLK}\")\n", " if prevIAF < lastAF:\n", " errors.setdefault(\"missing iAFs at the end\", []).append(f\"last is {prevIAF}, expected {lastAF}\")\n", " elif prevIAF > lastAF:\n", " errors.setdefault(\"too many iAFs at the end\", []).append(f\"last is {prevIAF}, expected {lastAF}\")\n", " \n", " print(\"\\nSANITY\\n\")\n", " if not errors:\n", " print(\"All OK\")\n", " else:\n", " for (kind, msgs) in 
errors.items():\n", " print(f\"ERROR {kind} ({len(msgs):>5}x):\")\n", " for msg in msgs[0:10]:\n", " print(f\"\\t{msg}\")\n", " if len(msgs) > 10:\n", " print(f\"\\t ... and {len(msgs) - 10} more ...\")\n", " \n", " print(f\"\\nAGREEMENT\\n\")\n", " print(\"Where are the words?\\n\")\n", " print(f\"\\t{LK}-only: {where[LK]:>5} slots\")\n", " print(f\"\\t{AF}-only: {where[AF]:>5} slots\")\n", " print(f\"\\tboth: {where['both']:>5} slots\")\n", " \n", " print(\"\\nHow well is the agreement?\\n\")\n", " for (d, n) in agreement.items():\n", " print(f\"dissimilarity? {d} : {n:>5} words\")\n", " \n", " print(f\"\\nBAD STRETCHES\\n\")\n", " print(\"How many of which size?\\n\")\n", " allSuspects = []\n", " someBenigns = []\n", " for (size, starts) in sorted(badStretches.items(), key=lambda x: (-x[0], x[1])):\n", " suspects = {start: size for start in starts if analyseStretch(start, start + size)}\n", " benigns = {start: size for start in starts if start not in suspects}\n", " allSuspects.extend([(start, start + size) for (start, size) in suspects.items()])\n", " someBenigns.extend([(start, start + size) for (start, size) in list(benigns.items())[0:3]])\n", " examples = \", \".join(str(start) for start in list(suspects.keys())[0:3])\n", " if not suspects:\n", " examples = \", \".join(str(start) for start in list(benigns.keys())[0:3])\n", " print(f\"bad stretches of size {size:>3} : {len(suspects):>4} suspect of total {len(starts):>4} x see e.g. 
{examples}\")\n", " \n", " print(f\"\\nShowing all {len(allSuspects)} inversion suspects\" if len(allSuspects) else \"\\nNo suspect bad stretches\\n\")\n", " for (i, (start, end)) in enumerate(reversed(allSuspects)):\n", " print(f\"\\nSUSPECT {i + 1:>2}\")\n", " print(printLines(max((1, start - 5)), min((len(alignment), end + 5))))\n", " print(f\"\\nShowing some ({len(someBenigns)}) benign examples\" if len(someBenigns) else \"\\nNo bad stretches\\n\")\n", " for (i, (start, end)) in enumerate(someBenigns):\n", " print(f\"\\nBENIGN {i + 1:>2}\")\n", " print(printLines(max((1, start - 2)), min((len(alignment), end + 2))))" ] }, { "cell_type": "code", "execution_count": 56, "id": "af18ec5a-797c-407b-94ba-fbffaf5183e6", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "SANITY\n", "\n", "All OK\n", "\n", "AGREEMENT\n", "\n", "Where are the words?\n", "\n", "\tLK-only: 96 slots\n", "\tAF-only: 96 slots\n", "\tboth: 3903 slots\n", "\n", "How well is the agreement?\n", "\n", "dissimilarity? 1 : 520 words\n", "dissimilarity? 0 : 3575 words\n", "\n", "BAD STRETCHES\n", "\n", "How many of which size?\n", "\n", "bad stretches of size 48 : 0 suspect of total 1 x see e.g. 4047\n", "bad stretches of size 9 : 0 suspect of total 1 x see e.g. 458\n", "bad stretches of size 8 : 0 suspect of total 1 x see e.g. 3103\n", "bad stretches of size 5 : 0 suspect of total 5 x see e.g. 346, 431, 513\n", "bad stretches of size 4 : 0 suspect of total 5 x see e.g. 636, 2897, 3146\n", "bad stretches of size 3 : 0 suspect of total 10 x see e.g. 332, 356, 380\n", "bad stretches of size 2 : 0 suspect of total 8 x see e.g. 501, 553, 799\n", "bad stretches of size 1 : 0 suspect of total 72 x see e.g. 
1, 14, 117\n", "\n", "No suspect bad stretches\n", "\n", "\n", "Showing some (18) benign examples\n", "\n", "BENIGN 1\n", " 3998 mtnāh @0 mtnāh 3950\n", " 3999 wān @1 ān 3951\n", " @1 kānt 3952\n", " @1 trǧʿ 3953\n", " @1 ālá 3954\n", " @1 āṣwl 3955\n", " @1 mtnāhyŧ 3956\n", " @1 hy 3957\n", " @1 āmhāt 3958\n", " @1 ālāsmāʾ 3959\n", " @1 āw 3960\n", " @1 ḥḍrāt 3961\n", " @1 ālāsmāʾ 3962\n", " @1 wʿlá 3963\n", " @1 ālḥḳyḳŧ 3964\n", " @1 fmā 3965\n", " @1 ṯm 3966\n", " @1 ālā 3967\n", " @1 ḥḳyḳŧ 3968\n", " @1 wāḥdŧ 3969\n", " @1 tḳbl 3970\n", " @1 ǧmyʿ 3971\n", " @1 hḏh 3972\n", " @1 ālnsb 3973\n", " @1 wālāḍāfāt 3974\n", " @1 ālty 3975\n", " @1 ykná 3976\n", " @1 ʿnhā 3977\n", " @1 bālāsmāʾ 3978\n", " @1 ālālhyŧ 3979\n", " @1 wālḥḳyḳŧ 3980\n", " @1 tʿṭy 3981\n", " @1 ān 3982\n", " @1 ykwn 3983\n", " @1 lkl 3984\n", " @1 āsm 3985\n", " @1 yẓhr 3986\n", " @1 ālá 3987\n", " @1 mā 3988\n", " @1 lā 3989\n", " @1 ytnāhá 3990\n", " @1 ḥḳyḳŧ 3991\n", " @1 ytmyz 3992\n", " @1 bhā 3993\n", " @1 ʿn 3994\n", " @1 āsm 3995\n", " @1 āḫr 3996\n", " @1 tlk 3997\n", " @1 ālḥḳyḳŧ 3998\n", " @1 ālty 3999\n", "\n", "BENIGN 2\n", " 448 ālʿālm @0 ālʿālm 441\n", " 449 ālmʿbr @1 ālmʿbrʿnh 442\n", " 450 ʿnh @1 \n", " 451 fy @0 fy 443\n", " 452 āṣṭlāḥ @1 āṣṭlāḥālḳwm 444\n", " 453 ālḳwm @1 \n", " 454 bālānsān @1 ālānsānālkbyr 445\n", " 455 ālkbyr @1 \n", " 456 fkānt @0 fkānt 446\n", " 457 ālmlāʾkŧ @0 ālmlāʾkŧ 447\n", " 458 lh @1 \n", " 459 kālḳwá @1 lhkālḳwá 448\n", " 460 ālrwḥānyŧ @0 ālrwḥānyŧ 449\n", "\n", "BENIGN 3\n", " 3066 ykwn @0 ykwn 3027\n", " 3067 ābdā @1 ābdālā 3028\n", " 3068 ālā @1 \n", " 3069 bṣwrŧ @0 bṣwrŧ 3029\n", " 3070 āstʿdād @0 āstʿdād 3030\n", " 3071 ālmtǧlá @1 ālmtǧl 3031\n", " 3072 lh @1 \n", " 3073 ġyr @1 \n", " 3074 ḏlk @1 \n", " 3075 lā @1 \n", " 3076 ykwn @1 ālhwġyrḏlklāykwn 3032\n", " 3077 fāḏā @1 fāḏn 3033\n", "\n", "BENIGN 4\n", " 339 lh @0 lh 337\n", " 340 mn @0 mn 338\n", " 341 ġyr @1 \n", " 342 wǧwd @1 ġyrwǧwd 339\n", " 343 hḏā @0 hḏā 340\n", " 344 ālmḥl @0 
ālmḥl 341\n", " 345 wlā @1 \n", " 346 tǧlyh @1 wlātǧlyh 342\n", " 347 lh @0 lh 343\n", "\n", "BENIGN 5\n", " 422 ābtdā @0 ābtdā 415\n", " 423 mnh @0 mnh 416\n", " @1 ā 417\n", " 424 fāḳtḍá @0 fāḳtḍá 418\n", " 425 ālāmr @0 ālāmr 419\n", " 426 ǧlāʾ @0 ǧlāʾ 420\n", " 427 mrāŧ @1 \n", " 428 ālʿālm @1 mrātālʿālm 421\n", " 429 fkān @0 fkān 422\n", "\n", "BENIGN 6\n", " 503 ḥḳyḳŧ @0 ḥḳyḳŧ 488\n", " 504 ālḥḳāʾḳ @0 ālḥḳāʾḳ 489\n", " @1 w 490\n", " 505 wfy @1 fy 491\n", " 506 ālnšāŧ @0 ālnšāŧ 492\n", " 507 ālḥāmlŧ @1 ālḥāmlŧlhḏh 493\n", " 508 lhḏh @1 \n", " 509 ālāwṣāf @0 ālāwṣāf 494\n", " 510 ālá @1 lá 495\n", "\n", "BENIGN 7\n", " 621 tʿālá @0 tʿālá 604\n", " 622 ālḥāfẓ @0 ālḥāfẓ 605\n", " @1 bh 606\n", " 623 ḫlḳh @0 ḫlḳh 607\n", " 624 kmā @0 kmā 608\n", " 625 yḥfẓ @1 \n", " 626 ālḫtm @1 yḥfẓālḫtm 609\n", " 627 ālḫzāʾn @1 ālḫzān 610\n", "\n", "BENIGN 8\n", " 2861 bāʿlām @0 bāʿlām 2826\n", " 2862 āllh @0 āllh 2827\n", " 2863 āyāh @1 \n", " 2864 bmā @1 \n", " 2865 āʿṭāh @1 āyāmbmāʿṭāhʿynh 2828\n", " 2866 ʿynh @1 \n", " 2867 mn @0 mn 2829\n", " 2868 ālʿlm @0 ālʿlm 2830\n", "\n", "BENIGN 9\n", " 3109 lā @0 lā 3063\n", " 3110 trāhā @1 trāhāmʿ 3064\n", " 3111 mʿ @1 \n", " 3112 ʿlmk @0 ʿlmk 3065\n", " 3113 ānk @1 ānkmā 3066\n", " 3114 mā @1 \n", " 3115 rāyt @0 rāyt 3067\n", " 3116 ālṣwr @0 ālṣwr 3068\n", "\n", "BENIGN 10\n", " 325 fānh @0 fānh 325\n", " 326 tẓhr @1 yẓhrlh 326\n", " 327 lh @1 \n", " 328 nfsh @0 nfsh 327\n", " 329 fy @1 \n", " 330 ṣwrŧ @1 fyṣwrŧ 328\n", " 331 yʿṭyhā @0 yʿṭyhā 329\n", "\n", "BENIGN 11\n", " 349 kān @0 kān 345\n", " 350 ālḥḳ @0 ālḥḳ 346\n", " @1 sbḥānh 347\n", " 351 āwǧd @1 āw 348\n", " @1 ǧd 349\n", " 352 ālʿālm @0 ālʿālm 350\n", " 353 klh @0 klh 351\n", "\n", "BENIGN 12\n", " 371 mḥlā @1 mḥl 369\n", " 372 ālā @0 ālā 370\n", " 373 wlā @1 \n", " 374 bd @1 \n", " 375 ān @1 \n", " 376 yḳbl @1 wyḳbl 371\n", " 377 rwḥā @0 rwḥā 372\n", "\n", "BENIGN 13\n", " 491 ālǧmʿyŧ @0 ālǧmʿyŧ 478\n", " 492 ālālhyŧ @1 ālālhyŧmā 479\n", " 493 byn @1 \n", " 494 mā @1 
\n", " 495 yrǧʿ @0 yrǧʿ 480\n", " 496 mn @0 mn 481\n", "\n", "BENIGN 14\n", " 542 mā @0 mā 524\n", " 543 āṣl @0 āṣl 525\n", " 544 ṣwr @1 \n", " 545 ālʿālm @1 \n", " 546 ālḳāblŧ @1 ṣwrālʿālmālḳāblŧ 526\n", " 547 lārwāḥh @0 lārwāḥh 527\n", "\n", "BENIGN 15\n", " 782 wlā @0 wlā 764\n", " 783 ḳdsth @0 ḳdsth 765\n", " @1 tḳdys 766\n", " @1 ādm 767\n", " 784 fġlb @0 fġlb 768\n", " 785 ʿlyhā @0 ʿlyhā 769\n", "\n", "BENIGN 16\n", " @1 ylrʿā 2\n", " 1 ālḥmd @0 ālḥmd 3\n", " 2 llh @1 lh 4\n", "\n", "BENIGN 17\n", " 11 mn @0 mn 13\n", " 12 ālmḳām @0 ālmḳām 14\n", " @1 ā 15\n", " 13 ālāḳdm @0 ālāḳdm 16\n", " 14 wān @0 wān 17\n", "\n", "BENIGN 18\n", " 113 wsālt @0 wsālt 116\n", " 114 āllh @0 āllh 117\n", " @1 tʿālá 118\n", " 115 ān @0 ān 119\n", " 116 yǧʿlny @0 yǧʿlny 120\n" ] } ], "source": [ "checkAlignment(4000 - 1, 4000 - 1)" ] }, { "cell_type": "markdown", "id": "eb9acaf5-5c33-4ec5-a52e-1c94340570ec", "metadata": {}, "source": [ "# A few comparisons" ] }, { "cell_type": "markdown", "id": "52bc3a2b-003a-4a0b-a2b0-f51bb169f0fc", "metadata": {}, "source": [ "**with Collatex**" ] }, { "cell_type": "code", "execution_count": 58, "id": "3e34458b-550e-4462-b84a-882226f9f50a", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 287 ālāḥṣāʾ @0 ālāḥṣāʾ 288\n", " @1 s 289\n", " 288 ān @0 ān 290\n", " 289 yrá @0 yrá 291\n", " 290 āʿyānhā @0 āʿyānhā 292\n", " 291 wān @1 ān 293\n", " 292 šʾt @1 šʾtḳlt 294\n", " 293 ḳlt @1 \n", " 294 ān @0 ān 295\n", " 295 yrá @0 yrá 296\n", " 296 ʿynh @0 ʿynh 297\n", " 297 fy @1 y 298\n", " 298 kwn @1 kwnǧāmʿ 299\n", " 299 ǧāmʿ @1 yḥṣrālāmr 300\n", " 300 yḥṣr @1 kh 301\n", " 301 ālāmr @1 \n", " 302 lkwnh @0 lkwnh 302\n", " 303 mtṣfā @0 mtṣfā 303\n" ] } ], "source": [ "print(printLines(start=291, end=309))" ] }, { "cell_type": "markdown", "id": "7939828c-fd49-4609-98c4-d1da1ed30d05", "metadata": {}, "source": [ "**with my algorithm**" ] }, { "cell_type": "markdown", "id": "69470c65-fe95-4ff9-84ef-cb03d437de14", 
"metadata": {}, "source": [ "```\n", "287 = ālāḥṣāʾ @0 ālāḥṣāʾ = 288\n", "288 +1 ān @1 s 2+ 289\n", " ^1 @1 ān 2+ 290\n", "289 = yrá @0 yrá = 291\n", "290 = āʿyānhā @0 āʿyānhā = 292\n", "291 = wān @1 ān = 293\n", "292 +2 šʾt @0 šʾtḳlt 1+ 294\n", "293 +2 ḳlt @0 1^ \n", "294 = ān @0 ān = 295\n", "295 = yrá @0 yrá = 296\n", "296 = ʿynh @0 ʿynh = 297\n", "297 = fy @1 y = 298\n", "298 +2 kwn @0 kwnǧāmʿ 1+ 299\n", "299 +2 ǧāmʿ @0 1^ \n", "300 +2 yḥṣr @0 yḥṣrālāmr 1+ 300\n", "301 +2 ālāmr @0 1^ \n", "302 +1 lkwnh @2 kh 2+ 301\n", " ^1 @2 lkwnh 2+ 302\n", "303 = mtṣfā @0 mtṣfā = 303\n", "```" ] }, { "cell_type": "markdown", "id": "f3b04f1c-1f29-4900-bba7-7ee02bd7973a", "metadata": {}, "source": [ "(A) Lines 288 are better handled by Collatex than by my algorithm.\n", "\n", "(B) The lines 298-301 are better handled by my algorithm than by Collatex." ] }, { "cell_type": "markdown", "id": "f32ed950-2a6c-4e5f-ac65-9cc4724791f5", "metadata": {}, "source": [ "**with Collatex**" ] }, { "cell_type": "code", "execution_count": 60, "id": "c7b12c6d-387a-4393-8457-0ce13c142882", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 448 = ālʿālm @0 ālʿālm = 441\n", " 449 +2 ālmʿbr @0 ālmʿbrʿnh 1+ 442\n", " 450 +2 ʿnh @0 1^ \n", " 451 = fy @0 fy = 443\n", " 452 +2 āṣṭlāḥ @0 āṣṭlāḥālḳwm 1+ 444\n", " 453 +2 ālḳwm @0 1^ \n", " 454 +2 bālānsān @1 ālānsānālkbyr 1+ 445\n", " 455 +2 ālkbyr @1 1^ \n", " 456 = fkānt @0 fkānt = 446\n", " 457 = ālmlāʾkŧ @0 ālmlāʾkŧ = 447\n", " 458 +2 lh @0 lhkālḳwá 1+ 448\n", " 459 +2 kālḳwá @0 1^ \n", " 460 = ālrwḥānyŧ @0 ālrwḥānyŧ = 449\n" ] } ], "source": [ "print(printLines(start=457, end=470))" ] }, { "cell_type": "markdown", "id": "1369ec85-96ee-4088-80e3-cee3182c14be", "metadata": {}, "source": [ "**with my algorithm**" ] }, { "cell_type": "markdown", "id": "4ac47166-16ef-4a86-8930-763aac071bb5", "metadata": {}, "source": [ "```\n", "448 ālʿālm @0 ālʿālm 441\n", "449 ālmʿbr @1 ālmʿbrʿnh 442\n", "450 ʿnh @1 \n", "451 fy @0 fy 
443\n", "452 āṣṭlāḥ @1 āṣṭlāḥālḳwm 444\n", "453 ālḳwm @1 \n", "454 bālānsān @1 ālānsānālkbyr 445\n", "455 ālkbyr @1 \n", "456 fkānt @0 fkānt 446\n", "457 ālmlāʾkŧ @0 ālmlāʾkŧ 447\n", "458 lh @1 \n", "459 kālḳwá @1 lhkālḳwá 448\n", "460 ālrwḥānyŧ @0 ālrwḥānyŧ 449\n", "```" ] }, { "cell_type": "markdown", "id": "a0e9031d-3722-4c38-8e80-28dd572ea025", "metadata": {}, "source": [ "(C) Lines 458-459 are handled better by my algorithm than by Collatex.\n", "\n", "Note that A and C are similar cases. Sometimes my algorithm chooses the best fit, sometimes Collatex does.\n", "In any case, this kind of decision is not very important for the dataset we want to build from this table.\n", "\n", "Case B is a bit more involved, and there Collatex fails to see a more obvious alignment.\n", "\n", "# Conclusion\n", "\n", "Performance is the biggest obstacle to using Collatex here.\n", "A rather superficial comparison of the resulting alignments does not show marked differences in quality,\n", "although there is an indication that Collatex deals a bit less gracefully with convoluted situations.\n", "\n", "Closer inspection might still reveal that Collatex gets it right more often than my algorithm does.\n", "\n", "However, because neither is perfect, it is important to be able to intervene when there are glaring\n", "mistakes.\n", "In Collatex we have no obvious means to steer the algorithm further.\n", "\n", "With my algorithm we have the option to define special cases, to tweak a number of parameters, and to change the orchestration\n", "of the comparisons.\n", "\n", "That is why we stick to my algorithm."
] }, { "cell_type": "code", "execution_count": null, "id": "8ff0ba11-f297-438a-946b-17b0fc991ed0", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.0" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 5 }