In [1]:
%load_ext autoreload
%autoreload 2

# Split

We split a big dataset into volumes.

In [21]:
import os
from tf.fabric import Fabric
from tf.compose import split
from tf.core.helpers import unexpanduser

In [3]:
GH = os.path.expanduser("~/github")
GM = f"{GH}/Dans-labs/clariah-gm"
VERSION = "0.9.1"
SOURCE = f"{GM}/tf/{VERSION}"
TARGET = f"{GM}/_local/tf/{VERSION}"

# Loading

We load the dataset, and pass its API to the `split()` function.

If something goes wrong during the split, we can inspect the dataset without reloading it.

In a normal scenario, we can just leave out this step. The `split()` function will
automatically load the dataset if no `api` argument is passed.

In [4]:
TF = Fabric(locations=SOURCE)
api = TF.loadAll()
api.makeAvailableIn(globals())

This is Text-Fabric 8.5.14
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html

40 features found and 0 ignored
 0.00s loading features ...
 9.51s All features loaded/computed - for details use TF.loadLog()
 0.00s loading features ...
 0.65s All additional features loaded - for details use TF.loadLog()


[('Computed',
 'computed-data',
 ('C Computed', 'Call AllComputeds', 'Cs ComputedString')),
 ('Features', 'edge-features', ('E Edge', 'Eall AllEdges', 'Es EdgeString')),
 ('Fabric', 'loading', ('TF',)),
 ('Locality', 'locality', ('L Locality',)),
 ('Nodes', 'navigating-nodes', ('N Nodes',)),
 ('Features',
 'node-features',
 ('F Feature', 'Fall AllFeatures', 'Fs FeatureString')),
 ('Search', 'search', ('S Search',)),
 ('Text', 'text', ('T Text',))]

# Selective splitting

Splitting happens at top-level sections.

For debugging purposes, we can restrict the processing to a single volume.
Pass the volume you are interested in through the optional `volume=` parameter.
Set it to `None` or leave it out to process all volumes.

In [13]:
V = None

In [18]:
volumes = split(SOURCE, TARGET, api=api, overwrite=True, volume=V)

Splitting dataset in 13 volumes:
 | Volume 1 : with slots 1 - 358356
 | Volume 2 : with slots 358357 - 765208
 | Volume 3 : with slots 765209 - 1213807
 | Volume 4 : with slots 1213808 - 1589004
 | Volume 5 : with slots 1589005 - 2008807
 | Volume 6 : with slots 2008808 - 2450424
 | Volume 7 : with slots 2450425 - 2850492
 | Volume 8 : with slots 2850493 - 2977520
 | Volume 9 : with slots 2977521 - 3447394
 | Volume 10 : with slots 3447395 - 4089689
 | Volume 11 : with slots 4089690 - 4594831
 | Volume 12 : with slots 4594832 - 4941618
 | Volume 13 : with slots 4941619 - 5316429
 1.91s volume splits determined
 1.91s Distribute nodes over volumes ...
 | 0.30s volume 1 with 398465 nodes ...
 | 0.64s volume 2 with 451426 nodes ...
 | 1.03s volume 3 with 500564 nodes ...
 | 1.36s volume 4 with 421357 nodes ...
 | 1.73s volume 5 with 467818 nodes ...
 | 2.10s volume 6 with 490255 nodes ...
 | 2.44s volume 7 with 440890 nodes ...
 | 2.58s volume 8 with 140198 nodes ...
 | 2.93s volume 9 wit
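
The slot ranges reported in the log above are contiguous, so the volume that holds a given slot of the complete dataset can be found with a binary search over the first slots of the volumes. The following is a minimal sketch and not part of Text-Fabric: `volume_of_slot` is a hypothetical helper, and the boundaries are copied by hand from the log.

```python
from bisect import bisect_right

# First slot of volumes 1 to 13, copied from the split log above.
FIRST_SLOTS = [
    1, 358357, 765209, 1213808, 1589005, 2008808, 2450425,
    2850493, 2977521, 3447395, 4089690, 4594832, 4941619,
]
# Last slot of volume 13, also taken from the log.
LAST_SLOT = 5316429


def volume_of_slot(slot):
    """Return the number (1 to 13) of the volume containing this slot.

    Hypothetical helper: `slot` is a slot number in the complete dataset.
    """
    if not 1 <= slot <= LAST_SLOT:
        raise ValueError(f"slot {slot} is outside the range 1 - {LAST_SLOT}")
    # bisect_right counts how many volumes start at or before this slot,
    # which is exactly the 1-based number of the volume that contains it.
    return bisect_right(FIRST_SLOTS, slot)


print(volume_of_slot(358357))
```

A lookup like this only covers slots. For non-slot nodes, the `volumeMap` feature of each volume is the place to look.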

# Check out the volumes

The `split()` function returns basic information about the volumes:

* title
* top node in the original dataset
* top node in the volume dataset
* location of the volume dataset on the file system

In [None]:
for v in volumes:
    print(f"{v[0]:<20} original volume node={v[1]:>8} top node={v[2]:>7} at {unexpanduser(v[3])}")

# Load all volumes

We use the result of the `split()` function to find and load all volumes.

We now get one TF-API handle per volume.

## volumeMap

Note that each volume has an extra feature: `volumeMap`. The value for each node in the volume dataset
is the corresponding node in the complete dataset from which the volume is taken.

If you use the volume dataset to compute annotations, and you want to publish these annotations against the complete dataset, the feature `volumeMap` provides the necessary information to do so.

Suppose `annotvx` is a dict mapping some nodes in the dataset of volume `x` to interesting values. You can then apply them to the big dataset as follows, where `F` belongs to the API of volume `x`:

``` python

{F.volumeMap.v(n): value for (n, value) in annotvx.items()}
```
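
To see the shape of that remapping without loading any data, here is a minimal sketch in which a plain dict, `volume_map`, stands in for `F.volumeMap.v`. The node numbers and annotation values are made-up placeholders and are not taken from the dataset.

```python
# Hypothetical stand-in for F.volumeMap.v:
# node in the volume dataset -> corresponding node in the complete dataset.
volume_map = {1: 1001, 2: 1002, 3: 1003}

# Hypothetical annotations computed against the volume dataset.
annotvx = {1: "noun", 3: "verb"}

# Re-key the annotations so that they address nodes of the complete dataset.
annot_full = {volume_map[n]: value for (n, value) in annotvx.items()}

print(annot_full)
```

Only the keys change; the values are carried over untouched, so the result can be saved as a feature of the complete dataset.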

In [23]:
TFs = {}
apis = {}
for (vt, vn, vl, vloc) in volumes:
    TFs[vt] = Fabric(locations=vloc)
    apis[vt] = TFs[vt].loadAll()

This is Text-Fabric 8.5.14
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html

41 features found and 0 ignored
 0.00s loading features ...
 | 0.13s T otype from ~/github/Dans-labs/clariah-gm/_local/tf/0.9.1/1
 | 0.87s T oslots from ~/github/Dans-labs/clariah-gm/_local/tf/0.9.1/1
 | 0.02s T puncr from ~/github/Dans-labs/clariah-gm/_local/tf/0.9.1/1
 | 0.09s T transn from ~/github/Dans-labs/clariah-gm/_local/tf/0.9.1/1
 | 0.05s T n from ~/github/Dans-labs/clariah-gm/_local/tf/0.9.1/1
 | 0.08s T puncn from ~/github/Dans-labs/clariah-gm/_local/tf/0.9.1/1
 | 0.00s T title from ~/github/Dans-labs/clariah-gm/_local/tf/0.9.1/1
 | 0.73s T trans from ~/github/Dans-labs/clariah-gm/_local/tf/0.9.1/1
 | 0.61s T transo from ~/github/Dans-labs/clariah-gm/_local/tf/0.9.1/1
 | 0.62s T punc from ~/github/Dans-labs/clariah-gm/_local/tf/0.9.1/1
 | 0.51s T punco from ~/github/Dans-labs/clariah-gm/_local/tf/0.9.1/1
 | 0.02s T transr from ~/github/Dans-labs/clariah-gm/_local/tf/0.9.1