<img align="right" src="images/tf.png" width="200"/>
<img align="right" src="images/huc.png" width="200"/>
<img align="right" src="images/logo.png" width="200"/>

---

To get started: consult [start](start.ipynb)

---

# Sharing data features

## Explore additional data

Once you analyse a corpus, it is likely that you produce data that others can reuse.
Maybe you have defined a set of proper name occurrences, or special numerals, or you have computed part-of-speech assignments.

It is possible to turn these insights into *new features*, i.e. new `.tf` files with values assigned to specific nodes.

## Make your own data

New data is a product of your own methods and computations in the first place.
But how do you turn that data into new TF features?
It turns out that the last step is not that difficult.

If you can shape your data as a mapping (dictionary) from node numbers (integers) to values
(strings or integers), then TF can turn that data into a feature file for you with one command.

## Share your new data
You can then easily share your new features on GitHub, so that your colleagues everywhere
can try it out for themselves.

You can add such data on the fly, by passing a `mod={org}/{repo}/{path}` parameter,
or a bunch of them separated by commas.

If the data is there, it will be auto-downloaded and stored on your machine.

Let's do it.


In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import re
import collections
import os

from tf.app import use

In [3]:
A = use("CLARIAH/wp6-missieven", hoist=globals())
VERSION = A.version

# Making data

We illustrate the data creation part by creating a new feature, `number`.
The idea is that we compute a number value for each word that looks like a number,
but that contains OCR errors.

We keep things simple.

We are interested in words that contain only digits and letters, and where the number of digits is greater than de number of letters.
We exclude words that consist of digits only.

We only work in original letter content.

Let's find them by hand coding.

In [4]:
results = []

digitRe = re.compile(r"[0-9]")

for w in F.otype.s("word"):
    chars = F.transo.v(w)
    if not chars:
        continue
    (letters, nDigits) = digitRe.subn("", chars)
    nLetters = len(chars) - nDigits
    if nLetters and nDigits > nLetters:
        results.append(w)

print(results[0:10])
len(results)

[11761, 28520, 30481, 31702, 36287, 37982, 37988, 106832, 112548, 119347]


4727

It happens quite a bit.

Let's have a quick look at the text of the results

In [5]:
print("\n".join(sorted(F.transo.v(w) for w in results)[0:20]))

0001b
0001b
0001b
0001b
0001b
000©
006½
022H
024½
03£
042V2
051|
052J
053f
053f
062|
0753A
084|
086j
087|


We want to map characters to digits.
To get a feel for that, inventorize the characters that occur in these words.

For each character, count how often it occurs and give at most 10 examples.

In [6]:
inventory = collections.defaultdict(list)

for w in results:
    for c in (trans := F.transo.v(w)):
        if not c.isdigit():
            inventory[c].append(trans)

len(inventory)

61

Quite a bit of different characters.

In [7]:
for c in sorted(inventory):
    examples = inventory[c]
    n = len(examples)
    showExamples = ", ".join(sorted(examples)[0:10])
    print(f"{c} ({n:>4}x) {showExamples}")

? (  15x) 12?, 144?, 1617?, 16?, 18?, 19?, 286?, 29?, 31?, 413?
A (   9x) 0753A, 13A, 273A, 343A, 3933A, 423A, 43A, 4743A, 553A
C (   1x) 540C3
D (   1x) 1685De
E (   1x) 194845En
H (   4x) 022H, 22H, 2328H, 252H
I (   5x) 217IM, I299v, I85, I85, I85
J (  96x) 052J, 1079J, 1092J, 10J, 10J, 110J, 115J, 1191J, 11J, 121378J
M (   3x) 217IM, 4047M, 564M
O (   4x) 1671Op, 27O4508, O86V2, ÏO011
P (   1x) P10
S (   1x) 16S6
U (   1x) 1U8
V (  76x) 042V2, 1014V2, 1019V2, 1062V4, 1062V4, 10V5, 12V2, 1364V2, 13V2, 14V2
a (   5x) 10a, 11a, 11a, 13a, 1684dat
b (  26x) 0001b, 0001b, 0001b, 0001b, 0001b, 1156bls, 121b, 121b, 121b, 121b
c (  59x) 10c, 12c, 12c, 13c, 13c, 13c, 14c, 14c, 14c, 15c
d (  14x) 100d, 14101de, 1684dat, 29d, d08, d08, d08, d08, d08, d08
e (2952x) 10e, 10e, 10e, 10e, 10e, 10e, 10e, 10e, 10e, 10e
f (  58x) 053f, 053f, 09f, 102f, 108f, 121f, 1222f, 137f, 14f, 14f
g (   9x) 16g, 22g, 28g, 36g, 430g, 6000g, 600g, 705g, 74g
h (   2x) 42h, 605h
i (   4x) 302061in, 496159in, 7897io, 

We decide to translate a few characters to numerals:

In [8]:
charMapping = {
    "o": 0,
    "ó": 0,
    "ö": 0,
    "Ö": 0,
    "I": 1,
    "J": 1,
    "ï": 1,
    "è": 6,
}

Now we translate all numerals with this mapping, and if the result is numeric and does not start with a 0,
we save the result in a mapping from nodes to numbers.

In [9]:
def cmap(chars):
    n = "".join(str(charMapping.get(c, c)) for c in chars)
    return int(n) if not n.startswith("0") and n.isdigit() else None


number = {w: n for w in results if (n := cmap(F.transo.v(w)))}
len(number)

114

In [10]:
print(number)

{11761: 1151, 368089: 670, 379197: 94001, 379568: 131, 396613: 141, 396656: 20621, 407164: 121, 430354: 121, 432757: 128181, 432879: 1241, 434920: 141, 462917: 621, 464624: 1241, 465415: 631, 472907: 3191, 473135: 9581, 483858: 8191, 486913: 10791, 498619: 8541, 533953: 261, 533968: 331, 535684: 6121, 557983: 77841, 618358: 261, 618871: 4021, 618877: 501, 627195: 261, 653407: 1741, 667437: 15301, 675324: 65931, 750255: 3231, 750445: 5021, 1019955: 10921, 1047395: 1371, 1068377: 52141, 1070934: 49141, 1079667: 2000, 1080766: 72771, 1118656: 4061, 1173348: 161, 1178433: 101, 1196647: 191, 1200319: 201, 1211567: 660, 1230723: 3501, 1234154: 171, 1237203: 111, 1237391: 141, 1250144: 8421, 1253186: 32091, 1271818: 121, 1282202: 75621, 1327325: 121, 1346403: 131, 1352127: 421, 1352309: 421, 1372543: 371, 1379628: 161, 1393864: 2228491, 1443457: 161, 1443464: 361, 1443641: 361, 1443657: 361, 1443666: 101, 1451420: 2981, 1548082: 1101, 1554393: 421, 1653139: 2501, 1669175: 151, 1682688: 4041, 

# Saving data

In [annotate](annotate.ipynb) we saw how to save features.
We do the same for the `number` feature.

In [11]:
GITHUB = os.path.expanduser("~/github")
ORG = A.context.org
REPO = A.context.repo
PATH = "exercises/numerics"

Later on, we pass this version on, so that users of our data will get the shared data in exactly the same version as their core data.

We have to specify a bit of metadata for this feature:

In [12]:
metaData = {
    "number": dict(
        valueType="int",
        description="numeric value of corrected number-like strings",
        creator="Dirk Roorda",
    ),
}

Now we can give the save command:

In [14]:
location = f"{GITHUB}/{ORG}/{REPO}/{PATH}/tf"
TF.save(
    nodeFeatures=dict(number=number),
    metaData=metaData,
    location=location,
    module=VERSION,
    silent="auto",
)

  0.00s Exporting 1 node and 0 edge and 0 config features to ~/github/CLARIAH/wp6-missieven/exercises/numerics/tf/1.0:
   |     0.00s T number               to ~/github/CLARIAH/wp6-missieven/exercises/numerics/tf/1.0
  0.00s Exported 1 node features and 0 edge features and 0 config features to ~/github/CLARIAH/wp6-missieven/exercises/numerics/tf/1.0


True

Here is the data in text-fabric format: a feature file

In [15]:
with open(f"{location}/{VERSION}/number.tf") as fh:
    print(fh.read())

@node
@creator=Dirk Roorda
@description=numeric value of corrected number-like strings
@valueType=int
@writtenBy=Text-Fabric
@dateWritten=2022-10-11T14:56:42Z

11761	1151
368089	670
379197	94001
379568	131
396613	141
396656	20621
407164	121
430354	121
432757	128181
432879	1241
434920	141
462917	621
464624	1241
465415	631
472907	3191
473135	9581
483858	8191
486913	10791
498619	8541
533953	261
533968	331
535684	6121
557983	77841
618358	261
618871	4021
618877	501
627195	261
653407	1741
667437	15301
675324	65931
750255	3231
750445	5021
1019955	10921
1047395	1371
1068377	52141
1070934	49141
1079667	2000
1080766	72771
1118656	4061
1173348	161
1178433	101
1196647	191
1200319	201
1211567	660
1230723	3501
1234154	171
1237203	111
1237391	141
1250144	8421
1253186	32091
1271818	121
1282202	75621
1327325	121
1346403	131
1352127	421
1352309	421
1372543	371
1379628	161
1393864	2228491
1443457	161
1443464	361
1443641	361
1443657	361
1443666	101
1451420	2981
1548082	1101
1554393	421
1653139	2501
166917

# Sharing data

How to share your own data is explained in the
[documentation](https://annotation.github.io/text-fabric/tf/about/datasharing.html).

Here we show it step by step for the `number` feature.

If you commit your changes to the exercises repo, and have done a `git push origin master`,
you already have shared your data!

**Keep it simple for small datasets:
For small feature datasets, you are done.**

If it gets serious, there is support for releases and efficient data transfer.
Here is how:

**Note (releases)**

If you want to make a stable release, so that you can keep developing, while your users fall back
on the stable data, you can make a new release.

Go to the GitHub website for that, go to your repo, and click *Releases* and follow the nudges.

**Note (release binaries)**

If you want to make it even smoother for your users, you can zip the data and attach it as a binary to the release just created.

We need to zip the data in exactly the right directory structure. Text-Fabric can do that for us.


In [17]:
%%sh

text-fabric-zip CLARIAH/wp6-missieven/exercises/numerics/tf

This is a TF dataset
Create release data for CLARIAH/wp6-missieven/exercises/numerics/tf
Found 2 versions
zip files end up in ~/Downloads/None/CLARIAH-release/wp6-missieven
zipping CLARIAH/wp6-missieven     0.9.1 with   1 features ==> exercises-numerics-tf-0.9.1.zip
zipping CLARIAH/wp6-missieven      1.0 with   1 features ==> exercises-numerics-tf-1.0.zip


All versions have been zipped, but it works OK if you only attach the newest version to the newest release.

If a user asks for an older version in this release, the system can still find it.

# Use the data

We can use the data by calling it up when we say `use('CLARIAH/wp6-missieven', ...)`
where we put in a data module argument on the dots.
We will also call up the entity data we created in the [annotate](annotate.ipynb) chapter.

Note that for each module we can specify flags like `:latest`, `:hot`, `clone`.

If you are the author of the data, and want to test it, use `:clone`: it takes the data from where you saved it.

If you are a new user of the data, use `:hot` (get latest commit) or `:latest` (get latest release)
to download the data.

If you have downloaded the data before, leave out the flag.

In [18]:
A = use(
    f"CLARIAH/wp6-missieven",
    hoist=globals(),
    mod=(
        f"CLARIAH/wp6-missieven/exercises/entities/tf",
        f"CLARIAH/wp6-missieven/exercises/numerics/tf",
    ),
    version=VERSION,
    silent=False,
)

This is Text-Fabric 10.2.6
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html

49 features found and 0 ignored
  4.57s All features loaded/computed - for details use TF.isLoaded()
  0.42s All additional features loaded - for details use TF.isLoaded()


Above you see a new sections in the feature list that you can expand to see
which features that module contributed.

Now, suppose did not know much about these feature, then we would like to do a few basic checks.

A good start it to do inspect a frequency list of the values of the new features,
and then to perform a query looking for the nodes that have these features.

We do that for the entity features and for the number feature.

## Entities

In [19]:
F.entityId.freqList()

(('T11', 6),
 ('T2', 5),
 ('T13', 3),
 ('T16', 3),
 ('T8', 3),
 ('T9', 3),
 ('T10', 2),
 ('T15', 2),
 ('T17', 2),
 ('T3', 2),
 ('T5', 2),
 ('T1', 1),
 ('T12', 1),
 ('T4', 1),
 ('T6', 1),
 ('T7', 1))

In [20]:
F.entityKind.freqList()

(('Person', 18), ('GPE', 15), ('Organization', 5))

In [21]:
F.entityComment.freqList()

(('Ternate', 5), ('Amboina', 2))

Let's query all words that have an entity notation:

In [22]:
query = """
word entityId entityKind* entityComment*
"""
results = A.search(query)

  4.40s 23 results


Here we query all word where the `entityId` is present.
We also mention the `entityKind` and `entityComment` features, but with a `*` behind them.
That is a criterion that is always True, so these mentions do not alter the result list.
But now these features do occur in the query, and when we show results, these features will be shown.

In [23]:
A.show(results, condensed=True)

**Observation**

It's not only words that have entity features, also the lines themselves have gotten such annotations.

It turns out that it is not very useful to annotate *lines* with entities this way.
It would be better to annotate them with the number of entities they contain.
That is our feedback to the creator of these annotations, and because we know the GitHub repo that they are from,
we can file an [issue](https://github.com/annotation/tutorials/issues/3)!

## Numerics

In [24]:
F.number.freqList()

((121, 6),
 (361, 5),
 (421, 4),
 (101, 3),
 (141, 3),
 (161, 3),
 (185, 3),
 (261, 3),
 (131, 2),
 (151, 2),
 (240, 2),
 (360, 2),
 (621, 2),
 (1241, 2),
 (1441, 2),
 (111, 1),
 (171, 1),
 (181, 1),
 (191, 1),
 (201, 1),
 (241, 1),
 (250, 1),
 (281, 1),
 (291, 1),
 (331, 1),
 (371, 1),
 (480, 1),
 (501, 1),
 (541, 1),
 (561, 1),
 (631, 1),
 (660, 1),
 (670, 1),
 (701, 1),
 (721, 1),
 (731, 1),
 (761, 1),
 (814, 1),
 (901, 1),
 (911, 1),
 (1101, 1),
 (1151, 1),
 (1321, 1),
 (1371, 1),
 (1661, 1),
 (1741, 1),
 (2000, 1),
 (2501, 1),
 (2921, 1),
 (2981, 1),
 (2991, 1),
 (3191, 1),
 (3231, 1),
 (3501, 1),
 (4021, 1),
 (4041, 1),
 (4061, 1),
 (5021, 1),
 (6121, 1),
 (8191, 1),
 (8421, 1),
 (8541, 1),
 (9581, 1),
 (9961, 1),
 (10791, 1),
 (10921, 1),
 (11911, 1),
 (15301, 1),
 (15361, 1),
 (20621, 1),
 (32091, 1),
 (49141, 1),
 (52141, 1),
 (65931, 1),
 (72771, 1),
 (75621, 1),
 (77841, 1),
 (94001, 1),
 (98771, 1),
 (128181, 1),
 (167081, 1),
 (925981, 1),
 (977381, 1),
 (1213781, 1),
 (22

We see that the values that we have generated before.

Let's show the original and the number side by side.

In [25]:
results = A.search(
    """
word number transo*
"""
)

  1.87s 114 results


In [26]:
A.show(results, start=1, end=10)

# All together!

If more researchers have shared data modules, you can draw them all in.

Then you can design queries that use features from all these different sources.

In that way, you build your own research on top of the work of others.

Hover over the features to see where they come from, and you'll see they come from your local GitHub repo.

# For real

See the [next tutorial in this series](entities.ipynb) how you can
draw in and make use additional features produced by a serious algorithm to detect
named entities.

---

# Contents

* **[start](start.ipynb)** start computing with this corpus
* **[search](search.ipynb)** turbo charge your hand-coding with search templates
* **[compute](compute.ipynb)** sink down a level and compute it yourself
* **[exportExcel](exportExcel.ipynb)** make tailor-made spreadsheets out of your results
* **[annotate](annotate.ipynb)** export text, annotate with BRAT, import annotations
* **share** draw in other people's data and let them use yours
* **[entities](entities.ipynb)** use results of third-party NER (named entity recognition)
* **[porting](porting.ipynb)** port features made against an older version to a newer version
* **[volumes](volumes.ipynb)** work with selected volumes only

CC-BY Dirk Roorda