{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%load_ext autoreload\n", "%autoreload 2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Make node mapping between two versions of the TF dataset\n", "\n", "In this notebook we map the *slot* nodes from version 0.4 (source version) to 1.0 (target version).\n", "\n", "This means that we map all slots from the source version to corresponding slots in the target version.\n", "In the target version there are more slots, because footnotes also occupy slots there, in contrast with the\n", "source version, where footnotes only appear inside feature values of the slot that precedes the footnote mark.\n", "\n", "Some slots have an empty text (most of them contain some punctuation).\n", "\n", "We do not want to be fussy about those slots.\n", "We map them onto corresponding empty slots if possible, otherwise we map them onto the nearest\n", "non-empty slot.\n", "\n", "After establishing the slot mapping, we extend the mapping to all nodes in a generic way.\n", "The code for this is already in the TF library." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from tf.fabric import Fabric\n", "from tf.dataset import Versions\n", "\n", "from lib import TF_DIR\n", "\n", "va = \"0.4\"\n", "# vb = \"0.9.1\"\n", "vb = \"1.0\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We load the two versions of the TF data by means of the lower-level `Fabric` class,\n", "and we only load the features we need." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "TF = {}\n", "api = {}\n", "E = {}\n", "Es = {}\n", "F = {}\n", "Fs = {}\n", "L = {}\n", "T = {}\n", "maxSlot = {}\n", "features = {\n", "    va: \"trans\",\n", "    vb: \"trans isnote\",\n", "}" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "This is Text-Fabric 9.4.3\n", "Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html\n", "\n", "35 features found and 0 ignored\n", " 3.83s All features loaded/computed - for details use TF.isLoaded()\n", "This is Text-Fabric 9.4.3\n", "Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html\n", "\n", "43 features found and 0 ignored\n", " 4.78s All features loaded/computed - for details use TF.isLoaded()\n" ] } ], "source": [ "for v in (va, vb):\n", "    TF[v] = Fabric(locations=TF_DIR, modules=v)\n", "    api[v] = TF[v].load(features[v])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We put the parts of the TF API that we need in the various dictionaries." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "for v in (va, vb):\n", "    E[v] = api[v].E\n", "    Es[v] = api[v].Es\n", "    F[v] = api[v].F\n", "    Fs[v] = api[v].Fs\n", "    L[v] = api[v].L\n", "    T[v] = api[v].T\n", "    maxSlot[v] = F[v].otype.maxSlot" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "## Make the slot mapping\n", "\n", "We walk through the slots of the target version (1.0) and skip its footnote slots.\n", "For each target slot we advance the slot in the source version (0.4), and check whether\n", "source and target slots have the same value for the `trans` feature.\n", "If not, and one of them is empty, we skip the empty slot and try the next one.\n", "But if both are non-empty and unequal, we have a real problem: a mismatch.\n", "\n", "However, the source version separates numbers and words imperfectly.\n", "So, sometimes we have to split words.\n", "\n", "If a real mismatch remains, we stop, and you have to inspect what is happening." 
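, "\n", "\n", "To make the procedure concrete, the next cell sketches the same walk on toy data.\n", "It is a minimal illustration, not the code used for the real mapping: it assumes plain Python lists of strings instead of TF features,\n", "uses 0-based positions instead of slot numbers, and the name `alignToy` and the sample lists are hypothetical.\n", "It covers matching texts, skipping empty slots, and stopping at a real mismatch." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def alignToy(textsA, textsB):\n", "    \"\"\"Toy alignment (hypothetical helper): return (mapping, mismatch).\"\"\"\n", "    mapping = {}\n", "    iA = 0\n", "    for iB, textB in enumerate(textsB):\n", "        textA = textsA[iA] if iA < len(textsA) else \"\"\n", "        if textB == \"\":\n", "            if textA != \"\":\n", "                # empty target text facing a non-empty source text: skip the target\n", "                continue\n", "        else:\n", "            # skip empty source texts until a non-empty one shows up\n", "            while textA == \"\" and iA < len(textsA) - 1:\n", "                iA += 1\n", "                textA = textsA[iA]\n", "            if textA != textB:\n", "                # texts differ: a real mismatch, stop here\n", "                return mapping, (iA, iB)\n", "        if iA < len(textsA):\n", "            mapping[iA] = iB\n", "            iA += 1\n", "    return mapping, None\n", "\n", "\n", "toyMap, toyMismatch = alignToy([\"a\", \"\", \"b\", \"c\"], [\"a\", \"b\", \"\", \"c\"])\n", "print(toyMap, toyMismatch)" 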
] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "def makeSlotMapOld():\n", "    Fa = F[va]\n", "    Fb = F[vb]\n", "    transA = Fa.trans.v\n", "    transB = Fb.trans.v\n", "    isNote = Fb.isnote.v\n", "    maxSlotA = maxSlot[va]\n", "    maxSlotB = maxSlot[vb]\n", "\n", "    print(\n", "        f\"\"\"\\\n", " Computing slotMap between:\n", " {va}: {maxSlotA:>8} slots,\n", " {vb}: {maxSlotB:>8} slots.\\\n", "\"\"\"\n", "    )\n", "\n", "    slotMap = {}\n", "\n", "    good = True\n", "    wA = 1\n", "    emptyA = 0\n", "    emptyB = 0\n", "\n", "    for wB in range(1, maxSlotB + 1):\n", "        if isNote(wB):\n", "            continue\n", "        textA = transA(wA) or \"\"\n", "        textB = transB(wB) or \"\"\n", "\n", "        if textB == \"\":\n", "            if textA != \"\":\n", "                emptyB += 1\n", "                continue\n", "        else:\n", "            while textA == \"\" and wA < maxSlotA:\n", "                wA += 1\n", "                emptyA += 1\n", "                textA = transA(wA) or \"\"\n", "\n", "            if textA != textB:\n", "                print(\"Mismatch:\")\n", "                print(f\"A: {wA:>8} = `{textA}`\")\n", "                print(f\"B: {wB:>8} = `{textB}`\")\n", "                good = False\n", "                break\n", "\n", "        if wA <= maxSlotA:\n", "            slotMap.setdefault(wA, {})[wB] = None\n", "            wA += 1\n", "        else:\n", "            if textB:\n", "                print(f\"No more slots in {va} to match slot {wB} in {vb}\")\n", "            break\n", "\n", "    maxSlotMap = max(slotMap, default=0)\n", "    if maxSlotMap > maxSlotA:\n", "        print(f\"maxSlot in A version {va} exceeded\")\n", "        print(f\"Found {maxSlotMap}, but it should be <= {maxSlot[va]}\")\n", "        good = False\n", "\n", "    if good:\n", "        print(\n", "            f\"\"\"\\\n", "slotMap successfully created: {len(slotMap)} slots mapped.\n", "{va}: {emptyA:>6} empty slots,\n", "{vb}: {emptyB:>6} empty slots.\\\n", "\"\"\"\n", "        )\n", "    return slotMap" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "def makeSlotMap():\n", "    Fa = F[va]\n", "    Fb = F[vb]\n", "    transA = Fa.trans.v\n", "    transB = Fb.trans.v\n", "    isNote = Fb.isnote.v\n", "    maxSlotA = maxSlot[va]\n", "    maxSlotB = maxSlot[vb]\n", "\n", "    print(\n", "        f\"\"\"\\\n", " Computing slotMap between:\n", " {va}: {maxSlotA:>8} slots,\n", " {vb}: {maxSlotB:>8} slots.\\\n", "\"\"\"\n", "    )\n", "\n", "    slotMap = {}\n", "\n", "    good = True\n", "    wA = 1\n", "    wB = 1\n", "\n", "    while wB <= maxSlotB and wA <= maxSlotA:\n", "        if isNote(wB):\n", "            # footnote slots exist only in the target version: skip them\n", "            wB += 1\n", "            continue\n", "\n", "        textA = transA(wA) or \"\"\n", "        textB = transB(wB) or \"\"\n", "\n", "        if textA == textB:\n", "            # same text: map one-to-one and advance both\n", "            slotMap.setdefault(wA, {})[wB] = None\n", "            wA += 1\n", "            wB += 1\n", "\n", "        elif textA.startswith(textB):\n", "            # target text is a prefix of the source text (an empty target text counts too)\n", "            slotMap.setdefault(wA, {})[wB] = None\n", "            wB += 1\n", "        elif textA.endswith(textB):\n", "            # target text is a suffix of the source text: advance both without recording a mapping\n", "            wA += 1\n", "            wB += 1\n", "\n", "        elif textB.startswith(textA):\n", "            # source text is a prefix of the target text (an empty source text counts too)\n", "            slotMap.setdefault(wA, {})[wB] = None\n", "            wA += 1\n", "        elif textB.endswith(textA):\n", "            # source text is a suffix of the target text\n", "            slotMap.setdefault(wA, {})[wB] = None\n", "            wA += 1\n", "            wB += 1\n", "\n", "        else:\n", "            print(\"Mismatch:\")\n", "            print(f\"A: {wA:>8} = `{textA}`\")\n", "            print(f\"B: {wB:>8} = `{textB}`\")\n", "            good = False\n", "            break\n", "\n", "    maxSlotMap = max(slotMap, default=0)\n", "    if maxSlotMap > maxSlotA:\n", "        print(f\"maxSlot in A version {va} exceeded\")\n", "        print(f\"Found {maxSlotMap}, but it should be <= {maxSlot[va]}\")\n", "        good = False\n", "\n", "    if good:\n", "        print(\n", "            f\"\"\"\\\n", "slotMap successfully created: {len(slotMap)} slots mapped.\n", "\"\"\"\n", "        )\n", "    return slotMap" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "lines_to_next_cell": 2 }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Computing slotMap between:\n", " 0.4: 5030444 slots,\n", " 1.0: 5977367 slots.\n", "slotMap successfully created: 5030444 slots mapped.\n", "\n" ] } ], "source": [ "slotMap = makeSlotMap()" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "When we encounter problems, we can do a bit of checking to see what is going on.\n", "\n", "The next function shows the line around a slot node, and can do so in both 
versions." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "def show(v, n):\n", "    lines = L[v].u(n, otype=\"line\")\n", "    if not lines:\n", "        lines = L[v].u(n + 1, otype=\"line\")\n", "    if not lines:\n", "        lines = L[v].u(n - 1, otype=\"line\")\n", "    if not lines:\n", "        print(\"no such line\")\n", "        return\n", "    line = lines[0]\n", "    print(T[v].sectionFromNode(line))\n", "    words = L[v].d(line, otype=\"word\")\n", "    print(\" \".join(f\"[{w}={F[v].trans.v(w)}]\" for w in words))\n", "    print(T[v].text(line))" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(1, None, 2)\n", "[45=961] [46=copie] [47=5] [48=folio] [49=s]\n", "961, copie, 5 folio's.\n", "(1, 3, 4)\n", "[95=„Journaelsgewijse] [96=reisbeschrijving]\n", "„Journaelsgewijse\" reisbeschrijving » \n" ] } ], "source": [ "show(va, 49)\n", "show(vb, 96)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Make the complete node map\n", "\n", "We now extend the `slotMap` to a full node map.\n", "\n", "See [dataset.Versions](https://annotation.github.io/text-fabric/tf/dataset/nodemaps.html#tf.dataset.nodemaps.Versions) in the Text-Fabric documentation." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "V = Versions(api, va, vb, slotMap)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "**********************************************************************************************\n", "* *\n", "* 0.00s Mapping volume nodes 0.4 ==> 1.0 *\n", "* *\n", "**********************************************************************************************\n", "\n", "| 0.00s Extending slot mapping 0.4 ==> 1.0 for volume nodes\n", "| 10s \tDone\n", "..............................................................................................\n", ". 
10s Statistics for 0.4 ==> 1.0 (volume) .\n", "..............................................................................................\n", "| 10s \tTOTAL : 100.00% 13x\n", "| 10s \tunique, imperfect : 100.00% 13x\n", "\n", "**********************************************************************************************\n", "* *\n", "* 10s Mapping letter nodes 0.4 ==> 1.0 *\n", "* *\n", "**********************************************************************************************\n", "\n", "| 10s Extending slot mapping 0.4 ==> 1.0 for letter nodes\n", "| 20s \tDone\n", "..............................................................................................\n", ". 20s Statistics for 0.4 ==> 1.0 (letter) .\n", "..............................................................................................\n", "| 20s \tTOTAL : 100.00% 589x\n", "| 20s \tunique, perfect : 21.05% 124x\n", "| 20s \tunique, imperfect : 77.93% 459x\n", "| 20s \tmultiple, non-perfect : 1.02% 6x\n", "\n", "**********************************************************************************************\n", "* *\n", "* 20s Mapping page nodes 0.4 ==> 1.0 *\n", "* *\n", "**********************************************************************************************\n", "\n", "| 20s Extending slot mapping 0.4 ==> 1.0 for page nodes\n", "| 30s \tDone\n", "..............................................................................................\n", ". 
30s Statistics for 0.4 ==> 1.0 (page) .\n", "..............................................................................................\n", "| 30s \tTOTAL : 100.00% 10149x\n", "| 30s \tunique, perfect : 47.53% 4824x\n", "| 30s \tunique, imperfect : 51.63% 5240x\n", "| 30s \tmultiple, non-perfect : 0.84% 85x\n", "\n", "**********************************************************************************************\n", "* *\n", "* 30s Mapping table nodes 0.4 ==> 1.0 *\n", "* *\n", "**********************************************************************************************\n", "\n", "| 30s Extending slot mapping 0.4 ==> 1.0 for table nodes\n", "| 30s \tDone\n", "..............................................................................................\n", ". 30s Statistics for 0.4 ==> 1.0 (table) .\n", "..............................................................................................\n", "| 30s \tTOTAL : 100.00% 322x\n", "| 30s \tunique, perfect : 92.55% 298x\n", "| 30s \tunique, imperfect : 7.14% 23x\n", "| 30s \tmultiple, non-perfect : 0.31% 1x\n", "\n", "**********************************************************************************************\n", "* *\n", "* 30s Mapping para nodes 0.4 ==> 1.0 *\n", "* *\n", "**********************************************************************************************\n", "\n", "| 30s Extending slot mapping 0.4 ==> 1.0 for para nodes\n", "| 37s \tDone\n", "..............................................................................................\n", ". 
37s Statistics for 0.4 ==> 1.0 (para) .\n", "..............................................................................................\n", "| 37s \tTOTAL : 100.00% 33885x\n", "| 37s \tunique, perfect : 77.44% 26242x\n", "| 37s \tunique, imperfect : 22.50% 7625x\n", "| 37s \tmultiple, non-perfect : 0.01% 5x\n", "| 37s \tnot mapped : 0.04% 13x\n", "\n", "**********************************************************************************************\n", "* *\n", "* 37s Mapping remark nodes 0.4 ==> 1.0 *\n", "* *\n", "**********************************************************************************************\n", "\n", "| 37s Extending slot mapping 0.4 ==> 1.0 for remark nodes\n", "| 40s \tDone\n", "..............................................................................................\n", ". 40s Statistics for 0.4 ==> 1.0 (remark) .\n", "..............................................................................................\n", "| 40s \tTOTAL : 100.00% 22922x\n", "| 40s \tunique, perfect : 97.36% 22318x\n", "| 40s \tunique, imperfect : 2.64% 604x\n", "\n", "**********************************************************************************************\n", "* *\n", "* 40s Mapping head nodes 0.4 ==> 1.0 *\n", "* *\n", "**********************************************************************************************\n", "\n", "| 40s Extending slot mapping 0.4 ==> 1.0 for head nodes\n", "| 40s \tDone\n", "..............................................................................................\n", ". 
40s Statistics for 0.4 ==> 1.0 (head) .\n", "..............................................................................................\n", "| 40s \tTOTAL : 100.00% 589x\n", "| 40s \tunique, perfect : 91.17% 537x\n", "| 40s \tunique, imperfect : 8.83% 52x\n", "\n", "**********************************************************************************************\n", "* *\n", "* 40s Mapping line nodes 0.4 ==> 1.0 *\n", "* *\n", "**********************************************************************************************\n", "\n", "| 40s Extending slot mapping 0.4 ==> 1.0 for line nodes\n", "| 51s \tDone\n", "..............................................................................................\n", ". 51s Statistics for 0.4 ==> 1.0 (line) .\n", "..............................................................................................\n", "| 51s \tTOTAL : 100.00% 444978x\n", "| 51s \tunique, perfect : 97.15% 432317x\n", "| 51s \tunique, imperfect : 0.40% 1770x\n", "| 51s \tmultiple, cleanly composed : 0.31% 1368x\n", "| 51s \tmultiple, non-perfect : 2.14% 9523x\n", "\n", "**********************************************************************************************\n", "* *\n", "* 51s Mapping row nodes 0.4 ==> 1.0 *\n", "* *\n", "**********************************************************************************************\n", "\n", "| 51s Extending slot mapping 0.4 ==> 1.0 for row nodes\n", "| 51s \tDone\n", "..............................................................................................\n", ". 
51s Statistics for 0.4 ==> 1.0 (row) .\n", "..............................................................................................\n", "| 51s \tTOTAL : 100.00% 4566x\n", "| 51s \tunique, perfect : 98.90% 4516x\n", "| 51s \tunique, imperfect : 1.10% 50x\n", "\n", "**********************************************************************************************\n", "* *\n", "* 51s Mapping folio nodes 0.4 ==> 1.0 *\n", "* *\n", "**********************************************************************************************\n", "\n", "| 51s Extending slot mapping 0.4 ==> 1.0 for folio nodes\n", "| 51s \tDone\n", "..............................................................................................\n", ". 51s Statistics for 0.4 ==> 1.0 (folio) .\n", "..............................................................................................\n", "| 51s \tTOTAL : 100.00% 2555x\n", "| 51s \tunique, perfect : 95.58% 2442x\n", "| 51s \tunique, imperfect : 4.11% 105x\n", "| 51s \tnot mapped : 0.31% 8x\n", "\n", "**********************************************************************************************\n", "* *\n", "* 51s Mapping cell nodes 0.4 ==> 1.0 *\n", "* *\n", "**********************************************************************************************\n", "\n", "| 51s Extending slot mapping 0.4 ==> 1.0 for cell nodes\n", "| 52s \tDone\n", "..............................................................................................\n", ". 
52s Statistics for 0.4 ==> 1.0 (cell) .\n", "..............................................................................................\n", "| 52s \tTOTAL : 100.00% 20593x\n", "| 52s \tunique, perfect : 99.71% 20533x\n", "| 52s \tunique, imperfect : 0.28% 58x\n", "| 52s \tmultiple, cleanly composed : 0.00% 1x\n", "| 52s \tnot mapped : 0.00% 1x\n", "\n", "**********************************************************************************************\n", "* *\n", "* 52s Mapping subhead nodes 0.4 ==> 1.0 *\n", "* *\n", "**********************************************************************************************\n", "\n", "| 52s Extending slot mapping 0.4 ==> 1.0 for subhead nodes\n", "| 52s \tDone\n", "..............................................................................................\n", ". 52s Statistics for 0.4 ==> 1.0 (subhead) .\n", "..............................................................................................\n", "| 52s \tTOTAL : 100.00% 1360x\n", "| 52s \tunique, perfect : 99.71% 1356x\n", "| 52s \tunique, imperfect : 0.29% 4x\n", "..............................................................................................\n", ". 
52s Write edge as TF feature omap@0.4-1.0 .\n", "..............................................................................................\n", " 0.00s Exporting 0 node and 1 edge and 0 config features to ~/github/clariah/wp6-missieven/tf/1.0:\n", " | 8.50s T omap@0.4-1.0 to ~/github/clariah/wp6-missieven/tf/1.0\n", " 8.50s Exported 0 node features and 1 edge features and 0 config features to ~/github/clariah/wp6-missieven/tf/1.0\n" ] } ], "source": [ "V.makeVersionMapping()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result is a new feature in the latest version of the dataset:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-rw-r--r-- 1 werk staff 43874572 May 6 15:48 /Users/werk/github/clariah/wp6-missieven/tf/1.0/omap@0.4-1.0.tf\n" ] } ], "source": [ "!ls -l ~/github/clariah/wp6-missieven/tf/{vb}/omap@*.tf" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Observations\n", "\n", "1. in the vast majority of cases, a node in the source version has just one obvious counterpart in the target version\n", "2. 
most cases of ambiguity arise in the node type `line`.\n", "\n", "Maybe we can shed some light on those cases.\n", "\n", "The mapping is available in memory as `V.edge`, so we can inspect it directly.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We are interested in mappings between line nodes which are diagnosed as **multiple, non-perfect**.\n", "First we ask for the list of diagnostic labels:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "b = unique, perfect\n", "d = multiple, one perfect\n", "c = unique, imperfect\n", "f = multiple, cleanly composed\n", "e = multiple, non-perfect\n", "a = not mapped\n" ] } ], "source": [ "V.legend()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We need to inspect line nodes in the source version (0.4) that have label `e`:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "9523\n" ] } ], "source": [ "diags = V.getDiagnosis(node=\"line\", label=\"e\")\n", "print(type(diags))\n", "print(len(diags))" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'VOOR ILE DE MAYO 25 februari 1610.'" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "T[va].text(diags[0])" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{6018784: 18, 6018788: 4}" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "V.edge[diags[0]]" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "dis=18 text=VOOR ILE DE MAYO De eerste drie brieven, door Both op reis naar Indië geschrovon, wijken niet af van \n", "dis= 4 text=25 februari 1610. 
\n" ] } ], "source": [ "for (lnb, dis) in V.edge[diags[0]].items():\n", "    print(f\"dis={dis:>2} text={T[vb].text(lnb)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Explanation\n", "\n", "In the target version footnotes occupy lines of their own, and line breaks inside footnotes count as line breaks in the text as a whole.\n", "So a line in the source version may be split into several parts when it contains a reference to a multiline footnote.\n", "\n", "The mapping then detects the resulting target lines, each of which is an imperfect target of the source line.\n", "We cannot do much about it.\n", "\n", "We could have made another coding decision: treat line breaks in footnotes differently from line breaks in the body text.\n", "Then we would have a good correspondence between the lines in both versions." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.7" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 4 }