<img align="right" src="images/tf.png" width="200"/>
<img align="right" src="images/huc.png" width="200"/>
<img align="right" src="images/logo.png" width="200"/>

---

To get started: consult [start](start.ipynb)

---

# Porting annotations

In the [entities](entities.ipynb) notebook we saw how we could use third-party features
with our corpus. There we promised to show how we can upgrade such features so that they
also work against newer versions of the corpus.

Text-Fabric has machinery to help with that. It turns out that we have
to make a mapping between the slots of both versions, and then Text-Fabric can do the rest.

With that mapping in hand, we can port *all* features, past, present and future, 
automatically from the older version to the newer, and vice versa.

But there will unavoidably be imperfections.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import collections

from tf.app import use

## Make a slot mapping

In this notebook we map the *slot* nodes from version 0.8.1 (source version) to 1.0 (target version).

Basically this means that we map all slots from the source version to corresponding slots in the target version.

Some slots have an empty text (most of them contain some punctuation).

We do not want to be fussy about those slots.
We map them onto corresponding empty slots if possible, otherwise we map them onto the nearest
non-empty slot.

After establishing the slot mapping, we extend the mapping to all nodes in a generic way.
The code for this is already in the TF library.

Note that at this stage we do not need the entity features at all. All we do is
compare two versions of the base corpus.
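
To make the idea of extending a slot mapping to other nodes concrete, here is a minimal sketch on toy data. It is not the code in the TF library: the function `extend_to_nodes`, the toy dictionaries, and the rule "pick the target node of the same type with the largest slot overlap" are all assumptions made for illustration.

```python
def extend_to_nodes(slot_map, nodes_a, nodes_b):
    """Map each source node to the target node(s) of the same type
    that share the most slots with it, as seen through slot_map.

    slot_map: {source slot: {target slot: None}}
    nodes_a, nodes_b: {node: (node type, set of slots)}  (hypothetical layout)
    """
    node_map = {}
    for (na, (type_a, slots_a)) in nodes_a.items():
        # Translate the slots of the source node into target slots.
        image = set()
        for s in slots_a:
            image |= set(slot_map.get(s, {}))
        # Score each target node of the same type by slot overlap.
        best, best_overlap = [], 0
        for (nb, (type_b, slots_b)) in nodes_b.items():
            if type_b != type_a:
                continue
            overlap = len(image & slots_b)
            if overlap > best_overlap:
                best, best_overlap = [nb], overlap
            elif overlap == best_overlap and overlap > 0:
                best.append(nb)
        if best:
            node_map[na] = best
    return node_map


# Toy corpus: the target version has one extra slot (2) inserted.
toy_slot_map = {1: {1: None}, 2: {3: None}, 3: {4: None}, 4: {5: None}}
toy_nodes_a = {10: ("line", {1, 2}), 11: ("line", {3, 4})}
toy_nodes_b = {20: ("line", {1, 2, 3}), 21: ("line", {4, 5})}

toy_node_map = extend_to_nodes(toy_slot_map, toy_nodes_a, toy_nodes_b)
print(toy_node_map)
```

The point of the sketch is only that nothing beyond the slot mapping and the slots of each node is needed: no feature values are involved.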

In [3]:
from tf.dataset import Versions

va = "0.8.1"
vb = "1.0"

We load the data for both versions.

This time, we work in our GitHub clone, because we want to make the resulting
map available to everyone, after pushing the clone to GitHub.

In [4]:
A = {}

In [5]:
for v in (va, vb):
    A[v] = use(
        "CLARIAH/wp6-missieven:clone",
        checkout="clone",
        silent="deep",
        version=v
    )

We walk through the slots of the target version.
For each target slot we advance the slot in the source version, and check whether
source and target slots have the same value for the `trans` feature.
If not, and one of them is empty, we skip the empty word and try the next one;
the code also accepts the case where one text is the start or the end of the other.
But if both are non-empty and do not match, we have a real problem: a mismatch.

In that case we stop, and you have to inspect what is happening.
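
Before turning to the real corpus, the walk can be tried on two short word lists. This is a simplified sketch, not the function used in this notebook: `align` and the toy lists are hypothetical, slots are simply 1-based list positions, and only the "equal" and "one side is empty" cases are handled.

```python
def align(words_a, words_b):
    """Align two word lists; slots are 1-based positions in the lists."""
    slot_map = {}
    a, b = 1, 1
    while a <= len(words_a) and b <= len(words_b):
        text_a = words_a[a - 1]
        text_b = words_b[b - 1]
        if text_a == text_b:
            # Same text: map the slots onto each other and advance both.
            slot_map.setdefault(a, {})[b] = None
            a += 1
            b += 1
        elif text_b == "":
            # Extra empty slot in the target: attach it to the current source slot.
            slot_map.setdefault(a, {})[b] = None
            b += 1
        elif text_a == "":
            # Extra empty slot in the source: map it onto the current target slot.
            slot_map.setdefault(a, {})[b] = None
            a += 1
        else:
            # Two different non-empty words: stop and let the user inspect.
            raise ValueError(f"Mismatch: A {a}={text_a!r} versus B {b}={text_b!r}")
    return slot_map


toy_a = ["Generaal", "", "aan", "boord"]
toy_b = ["Generaal", "aan", "", "boord"]
toy_map = align(toy_a, toy_b)
print(toy_map)
```

Note that a source slot can end up with more than one target slot, which is why the mapping is a dictionary of dictionaries rather than a plain slot-to-slot dictionary.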

In [6]:
def makeSlotMap():
    Fa = A[va].api.F
    Fb = A[vb].api.F
    transA = Fa.trans.v
    transB = Fb.trans.v
    maxSlotA = Fa.otype.maxSlot
    maxSlotB = Fb.otype.maxSlot

    print(
        f"""\
    Computing slotMap between:
    {va}: {maxSlotA:>8} slots,
    {vb}: {maxSlotB:>8} slots.\
"""
    )

    slotMap = {}

    good = True
    wA = 1
    wB = 1

    while wB <= maxSlotB and wA <= maxSlotA:
        textA = transA(wA) or ""
        textB = transB(wB) or ""

        if textA == textB:
            slotMap.setdefault(wA, {})[wB] = None
            wA += 1
            wB += 1

        elif textA.startswith(textB):
            slotMap.setdefault(wA, {})[wB] = None
            wB += 1
        elif textA.endswith(textB):
            wA += 1
            wB += 1

        elif textB.startswith(textA):
            slotMap.setdefault(wA, {})[wB] = None
            wA += 1
        elif textB.endswith(textA):
            slotMap.setdefault(wA, {})[wB] = None
            wA += 1
            wB += 1

        else:
            print("Mismatch:")
            print(f"A: {wA:>8} = `{textA}`")
            print(f"B: {wB:>8} = `{textB}`")
            good = False
            break

    # Sanity check: no mapped source slot may exceed the last slot of version A.
    maxSlotMap = max(slotMap, default=0)
    if maxSlotMap > maxSlotA:
        print(f"maxSlot in A version {va} exceeded")
        print(f"Found {maxSlotMap}, but it should be <= {maxSlotA}")
        good = False

    if good:
        print(
            f"""\
slotMap succesfully created: {len(slotMap)} slots mapped.
"""
        )
    return slotMap

In [7]:
slotMap = makeSlotMap()

    Computing slotMap between:
    0.8.1:  5316429 slots,
    1.0:  5977367 slots.
slotMap succesfully created: 5316429 slots mapped.



Note that as of version 1.0 volume 14 has been included.

So we expect a discrepancy there.

And of course, we will not have entity feature values for volume 14.
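
One way to make such a discrepancy visible is to list the target slots that no source slot maps to. The sketch below does this on a toy mapping; `unmapped_targets` is a hypothetical helper, not part of Text-Fabric, and we have not run it on the real `slotMap`.

```python
def unmapped_targets(slot_map, max_slot_b):
    """Return the target slots (1..max_slot_b) that no source slot maps to."""
    reached = set()
    for targets in slot_map.values():
        reached.update(targets)
    return [s for s in range(1, max_slot_b + 1) if s not in reached]


# Toy data: three source slots, five target slots; the last two are new material.
toy_slot_map = {1: {1: None}, 2: {2: None}, 3: {3: None}}
toy_unmapped = unmapped_targets(toy_slot_map, 5)
print(toy_unmapped)
```

If the new material sits at the end of the target version, the unmapped slots form one contiguous tail, as in this toy case.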

When we encounter problems, we can do a bit of checking to see what is going on.

The next function shows the line around a slot node, and can do so in both versions.

In [8]:
def show(v, n):
    F = A[v].api.F
    L = A[v].api.L
    T = A[v].api.T
    
    lines = L.u(n, otype="line")
    if not lines:
        lines = L.u(n + 1, otype="line")
    if not lines:
        lines = L.u(n - 1, otype="line")
    if not lines:
        print("no such line")
        return
    line = lines[0]
    print(T.sectionFromNode(line))
    words = L.d(line, otype="word")
    print(" ".join(f"[{w}={F.trans.v(w)}]" for w in words))
    print(T.text(line))

In [9]:
show(va, 49)
show(vb, 49)

(1, 3, 4)
[46=Generaal] [47=aan] [48=boord] [49=vertoefde] [50=verandert] [51=daaraan] [52=niets] [53=Ile] [54=de] [55=Mayo] [56=is] [57=een] [58=der] [59=Kaap] [60=Verdische]
Generaal aan boord vertoefde, verandert daaraan niets. Ile de Mayo is een der Kaap-Verdische 
(1, 3, 4)
[45=Generaal] [46=aan] [47=boord] [48=vertoefde] [49=verandert] [50=daaraan] [51=niets] [52=Ile] [53=de] [54=Mayo] [55=is] [56=een] [57=der] [58=Kaap] [59=Verdische]
Generaal aan boord vertoefde, verandert daaraan niets. Ile de Mayo is een der Kaap-Verdische 


## Make the complete node map

We now extend the `slotMap` to a full node map.

See [dataset.Versions](https://annotation.github.io/text-fabric/tf/dataset/nodemaps.html#tf.dataset.nodemaps.Versions) in the Text-Fabric documentation.

In [10]:
V = Versions({v: A[v].api for v in (va, vb)}, va, vb, silent="auto", slotMap=slotMap)

In [11]:
V.makeVersionMapping()

    12s 
    **********************************************************************************************
    *                                                                                            *
    * Mapping volume nodes 0.8.1 ==> 1.0                                                         *
    *                                                                                            *
    **********************************************************************************************
    
    23s ..............................................................................................
    . Statistics for 0.8.1 ==> 1.0 (volume)                                                      .
    ..............................................................................................
    23s | 	TOTAL                          : 100.00%      13x
    23s | 	unique, perfect                :  92.31%      12x
    23s | 	multiple, non-perfect          :   7.69%       1x
    23s

## Migrate the features

Now we return to the entity features.

The node map turns out not to be perfect, but we did not expect it to be.

We migrate the entity features nevertheless.

Remember that they are not in the corpus, but in a third-party module of features.

For the sake of persistence, I have copied the features to this repo in directory `voc-missives/export/tf`.

We load the older version of the corpus again, now with the entity features for that version.
We leave the loaded newer version of the corpus in memory.

In [12]:
# THIRD_PARTY = "cltl/voc-missives/export/tf"
THIRD_PARTY = "CLARIAH/wp6-missieven/voc-missives/export/tf"

api = {}
api[vb] = A[vb].api

A[va] = use(
    "CLARIAH/wp6-missieven:clone",
    checkout="clone",
    mod=f"{THIRD_PARTY}:clone",
    silent="deep",
    version=va,
)
api[va] = A[va].api

We are going to produce the upgraded entities in `voc-missives/migrated/tf`.

In [13]:
V = Versions(api, va, vb, silent="auto")
V.migrateFeatures(
    ("entityId", "entityKind"),
    location="~/github/CLARIAH/wp6-missieven/voc-missives/migrated/tf",
)

    20s start migrating
   |       47s T omap@0.8.1-1.0       from ~/github/CLARIAH/wp6-missieven/tf/1.0
    48s All additional features loaded - for details use TF.isLoaded()
    48s Mapping entityId (node)
    48s Mapping entityKind (node)
  0.00s Exporting 2 node and 0 edge and 0 config features to ~/github/CLARIAH/wp6-missieven/voc-missives/migrated/tf/1.0:
   |     0.03s T entityId             to ~/github/CLARIAH/wp6-missieven/voc-missives/migrated/tf/1.0
   |     0.03s T entityKind           to ~/github/CLARIAH/wp6-missieven/voc-missives/migrated/tf/1.0
  0.05s Exported 2 node features and 0 edge features and 0 config features to ~/github/CLARIAH/wp6-missieven/voc-missives/migrated/tf/1.0
  0.05s Done
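
Conceptually, migrating a node feature amounts to carrying each value along the node map. The following is a minimal sketch of that idea on toy data, under stated assumptions: `migrate_feature` is a hypothetical function, the node numbers are made up, and the "first value wins" rule for conflicts is an illustration only. How `V.migrateFeatures` resolves such cases may well differ.

```python
def migrate_feature(feature_a, node_map):
    """Carry feature values from source nodes to target nodes.

    feature_a: {source node: value}
    node_map: {source node: [target nodes]}  (hypothetical layout)
    """
    feature_b = {}
    for (na, value) in feature_a.items():
        for nb in node_map.get(na, []):
            # Assumption: the first value wins if several source nodes
            # map to the same target node.
            feature_b.setdefault(nb, value)
    return feature_b


toy_kind_a = {101: "LOC", 102: "PER", 103: "SHP"}
toy_map_ab = {101: [201], 102: [202, 203]}  # node 103 has no counterpart
toy_kind_b = migrate_feature(toy_kind_a, toy_map_ab)
print(toy_kind_b)
```

Source nodes without a counterpart simply lose their value in the target version, which is one source of the imperfections mentioned at the start.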


## Check

Now let's check by loading the new version of the corpus with the migrated entities,
and perform the same query as in the [entities](entities.ipynb) notebook.

In [14]:
A = use(
    "CLARIAH/wp6-missieven:clone",
    checkout="clone",
    mod="CLARIAH/wp6-missieven/voc-missives/migrated/tf:clone",
    hoist=globals(),
    version="1.0",
    silent="verbose",
)

This is Text-Fabric 10.2.6
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html

47 features found and 2 ignored
  5.47s All features loaded/computed - for details use TF.isLoaded()
   |     0.17s T entityId             from ~/github/CLARIAH/wp6-missieven/voc-missives/migrated/tf/1.0
   |     0.14s T entityKind           from ~/github/CLARIAH/wp6-missieven/voc-missives/migrated/tf/1.0
  0.75s All additional features loaded - for details use TF.isLoaded()


Note that you can click the triangle before *CLARIAH/wp6-missieven/voc-missives/migrated/tf*,
to see which features are used.
You can then click further on the triangle before the feature data type, to see more information
about that feature, including the fact that it is an upgraded feature.

```
creator: Sophie Arnoult
dateWritten: 2022-10-11T10:42:45Z
upgraded: ‼️ from version 0.8.1 to 1.0
writtenBy: Text-Fabric
```

We are going to do a bit of research into the upgraded features.

In [15]:
F.entityId.freqList()[0:20]

(('e_n12_2_632', 8),
 ('e_n13_15_2306', 8),
 ('e_n7_8_809', 8),
 ('e_n13_15_1302', 7),
 ('e_n7_8_1080', 7),
 ('e_t10_15_108', 7),
 ('e_t10_15_273', 7),
 ('e_n10_11_715', 6),
 ('e_n12_14_130', 6),
 ('e_n12_2_383', 6),
 ('e_n12_2_578', 6),
 ('e_n13_15_154', 6),
 ('e_n13_15_1582', 6),
 ('e_n13_15_1894', 6),
 ('e_n13_15_285', 6),
 ('e_n5_28_103', 6),
 ('e_n5_28_34', 6),
 ('e_n5_28_675', 6),
 ('e_n5_7_515', 6),
 ('e_n8_6_710', 6))

In [16]:
len(F.entityId.freqList())

24500

In [17]:
F.entityKind.freqList()

(('LOC', 12790),
 ('PER', 10393),
 ('LOCderiv', 4279),
 ('ORG', 3841),
 ('SHP', 2922),
 ('GPE', 1153),
 ('RELderiv', 261),
 ('ORGpart', 58),
 ('LOCpart', 45),
 ('RELpart', 28),
 ('REL', 19))

In [18]:
query = """
word entityId entityKind*
"""
results = A.search(query)

  1.96s 32249 results


In [19]:
A.show(results, condensed=True, end=10)

Let's view the distribution of named entities over the volumes.

We run a query looking for words with a named entity within a volume.

In [20]:
query = """
volume
  word entityId
"""
results = A.search(query)

  2.46s 32249 results


Now we process the results, which are tuples consisting of a volume node and a 
word node.

In [21]:
eDist = collections.Counter()

for (vol, word) in results:
    eDist[F.n.v(vol)] += 1
    
eDist

Counter({1: 1451,
         2: 1150,
         3: 1536,
         4: 1758,
         5: 3032,
         6: 2864,
         7: 2343,
         8: 1685,
         9: 1870,
         10: 1933,
         11: 4695,
         12: 1909,
         13: 6023})

It is apparent that there are no entities in volume 14, because in version 0.8.1 there was no volume 14.

So it is preferable that the third party repeats the entity recognition
on the new version of the corpus, so that the entities in volume 14 get recognized too.

This has in fact happened. Sophie Arnoult has run the machinery again.

Let's quickly load that version and compute the distribution of entities there.

In [22]:
A = use(
    "CLARIAH/wp6-missieven:clone",
    checkout="clone",
    mod="CLARIAH/wp6-missieven/voc-missives/export/tf:clone",
    hoist=globals(),
    version="1.0",
    silent="verbose",
)

This is Text-Fabric 10.2.6
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html

47 features found and 2 ignored
  5.66s All features loaded/computed - for details use TF.isLoaded()
  0.54s All additional features loaded - for details use TF.isLoaded()


In [23]:
results = A.search(query)

eDist = collections.Counter()

for (vol, word) in results:
    eDist[F.n.v(vol)] += 1
    
eDist

  2.46s 29159 results


Counter({1: 2208,
         2: 1799,
         3: 2297,
         4: 1987,
         5: 1858,
         6: 4295,
         7: 3068,
         8: 1251,
         9: 2825,
         10: 1745,
         11: 2455,
         12: 1350,
         13: 977,
         14: 1044})

Clearly, additional changes have made version 1.0 very different from 0.8.1,
so at this point in time the migrated features (from 0.8.1 to 1.0) are practically obsolete.
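
To put the two distributions side by side, we can subtract the counters per volume. The numbers below are copied from the two outputs above; the variable names `migrated` and `rerun` are introduced here for illustration.

```python
import collections

# Entity words per volume, copied from the outputs above.
migrated = collections.Counter(
    {1: 1451, 2: 1150, 3: 1536, 4: 1758, 5: 3032, 6: 2864, 7: 2343,
     8: 1685, 9: 1870, 10: 1933, 11: 4695, 12: 1909, 13: 6023}
)
rerun = collections.Counter(
    {1: 2208, 2: 1799, 3: 2297, 4: 1987, 5: 1858, 6: 4295, 7: 3068,
     8: 1251, 9: 2825, 10: 1745, 11: 2455, 12: 1350, 13: 977, 14: 1044}
)

# A Counter returns 0 for missing keys, so volume 14 needs no special case.
for vol in sorted(set(migrated) | set(rerun)):
    diff = rerun[vol] - migrated[vol]
    print(
        f"volume {vol:>2}: migrated {migrated[vol]:>5}, "
        f"rerun {rerun[vol]:>5}, difference {diff:>+6}"
    )

totalMigrated = sum(migrated.values())
totalRerun = sum(rerun.values())
print(f"total: migrated {totalMigrated}, rerun {totalRerun}")
```

The totals agree with the result counts of the two queries, and the per-volume differences show that the changes are not confined to volume 14.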

---

# Contents

* **[start](start.ipynb)** start computing with this corpus
* **[search](search.ipynb)** turbo charge your hand-coding with search templates
* **[compute](compute.ipynb)** sink down a level and compute it yourself
* **[exportExcel](exportExcel.ipynb)** make tailor-made spreadsheets out of your results
* **[annotate](annotate.ipynb)** export text, annotate with BRAT, import annotations
* **[share](share.ipynb)** draw in other people's data and let them use yours
* **[entities](entities.ipynb)** use results of third-party NER (named entity recognition)
* **porting** port features made against an older version to a newer version
* **[volumes](volumes.ipynb)** work with selected volumes only

CC-BY Dirk Roorda