{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n", "\n", "---\n", "Start with [convert](https://nbviewer.jupyter.org/github/annotation/banks/blob/master/programs/convert.ipynb)\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Compose\n", "\n", "This is about combining multiple TF datasets into one, and then tweaking it further.\n", "\n", "In the previous chapters of this tutorial you have learned how to add new features to an existing dataset.\n", "\n", "Here you learn how you can combine dozens of slightly heterogeneous TF data sets,\n", "and apply structural tweaks to the node types and features later on.\n", "\n", "The incentive to write these composition functions into Text-Fabric came from Ernst Boogert while he was\n", "converting between 100 and 200 works by the Church Fathers (Patristics).\n", "The conversion did a very good job in getting all the information from TEI files with different structures into TF,\n", "one dataset per work.\n", "\n", "Then the challenge became to combine them into one big dataset, and to merge several node types into one type,\n", "and several features into one.\n", "\n", "See [patristics](https://github.com/pthu/patristics)." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%load_ext autoreload\n", "%autoreload 2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The new functions are `collect()` and `modify()`." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from tf.fabric import Fabric\n", "from tf.dataset import modify\n", "from tf.volumes import collect" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Corpus\n", "\n", "We use two copies of our example corpus Banks, present in this repository." ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "## Collect\n", "\n", "The *collect* function takes any number of directory locations, and considers each location to be the\n", "host of a TF data set.\n", "\n", "You can pass this list straight to the `collect()` function as the `locations` parameter,\n", "or you can add names to the individual corpora.\n", "In that case, you pass an iterable of (`name`, `location`) pairs into the `locations` parameter.\n", "\n", "Here we give the first copy the name `banks`, and the second copy the name `river`.\n", "\n", "We also specify the output location." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "PREFIX = \"combine/input\"\n", "SUFFIX = \"tf/0.2\"\n", "\n", "locations = (\n", " (\"banks\", f\"{PREFIX}/banks1/{SUFFIX}\"),\n", " (\"rivers\", f\"{PREFIX}/banks2/{SUFFIX}\"),\n", ")\n", "\n", "COMBINED = \"combine/_temp/riverbanks\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We are going to call the `collect()` function.\n", "\n", "But first we clear the output location.\n", "\n", "Note how you can mix a bash-shell command with your Python code." 
] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Loading volume banks from combine/input/banks1/tf/0.2 ...\n", "This is Text-Fabric 9.1.3\n", "Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html\n", "\n", "10 features found and 0 ignored\n", " 0.00s loading features ...\n", " 0.01s All features loaded/computed - for details use TF.isLoaded()\n", " | 0.00s Feature overview: 8 for nodes; 1 for edges; 1 configs; 8 computed\n", " 0.00s loading features ...\n", " 0.00s All additional features loaded - for details use TF.isLoaded()\n", " 0.02s Loading volume rivers from combine/input/banks2/tf/0.2 ...\n", "This is Text-Fabric 9.1.3\n", "Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html\n", "\n", "10 features found and 0 ignored\n", " 0.00s loading features ...\n", " 0.01s All features loaded/computed - for details use TF.isLoaded()\n", " | 0.00s Feature overview: 8 for nodes; 1 for edges; 1 configs; 8 computed\n", " 0.00s loading features ...\n", " 0.00s All additional features loaded - for details use TF.isLoaded()\n", " 0.04s inspect metadata ...\n", "WARNING: otext.structureFeatures metadata varies across volumes\n", "WARNING: otext.structureTypes metadata varies across volumes\n", "WARNING: author.compiler metadata varies across volumes\n", "WARNING: author.purpose metadata varies across volumes\n", "WARNING: letters.description metadata varies across volumes\n", " 0.04s metadata sorted out\n", " 0.04s check nodetypes ...\n", " | volume banks\n", " | volume rivers\n", " 0.04s node types ok\n", " 0.04s Collect nodes from volumes ...\n", " | 0.00s Check against overlapping slots ...\n", " | | banks : 99 slots\n", " | | rivers : 99 slots\n", " | 0.00s no overlap\n", " | 0.00s Group non-slot nodes by type\n", " | | banks : 100- 117\n", " | | rivers : 100- 117\n", " | 0.00s Mapping nodes from volume to/from work ...\n", " | | book : 199 - 200\n", " | | chapter : 201 - 204\n", " | | line : 205 - 228\n", " | | sentence : 229 - 234\n", " | 0.01s The new work has 236 nodes of which 198 slots\n", " 0.05s collection done\n", " 0.05s remap features ...\n", " 0.05s remapping done\n", " 0.05s write work as TF data set\n", " 0.07s writing done\n", " 0.07s done\n" ] }, { "data": { "text/plain": [ "True" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "output = COMBINED\n", "\n", "!rm -rf {output}\n", "\n", "collect(\n", " locations,\n", " output,\n", " volumeType=\"volume\",\n", " volumeFeature=\"title\",\n", " featureMeta=dict(\n", " otext=dict(\n", " sectionTypes=\"volume,chapter,line\",\n", " sectionFeatures=\"title,number,number\",\n", " **{\"fmt:text-orig-full\": \"{letters} \"},\n", " ),\n", " ),\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This function is a bit verbose in its output, but a lot happens under the hood, and if your dataset is large,\n", "it may take several minutes. It is pleasant to see the progress under those circumstances.\n", "\n", "But for now, we pass `silent=True`, to make everything a bit more quiet." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "WARNING: otext.structureFeatures metadata varies across volumes\n", "WARNING: otext.structureTypes metadata varies across volumes\n", "WARNING: author.compiler metadata varies across volumes\n", "WARNING: author.purpose metadata varies across volumes\n", "WARNING: letters.description metadata varies across volumes\n" ] }, { "data": { "text/plain": [ "True" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "output = COMBINED\n", "\n", "!rm -rf {output}\n", "\n", "collect(\n", " locations,\n", " output,\n", " volumeType=\"volume\",\n", " volumeFeature=\"title\",\n", " featureMeta=dict(\n", " otext=dict(\n", " sectionTypes=\"volume,chapter,line\",\n", " sectionFeatures=\"title,number,number\",\n", " **{\"fmt:text-orig-full\": \"{letters} \"},\n", " ),\n", " ),\n", " silent=True,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There you are, on your file system you see the combined dataset:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "total 88\n", "-rw-r--r-- 1 dirk staff 559 Nov 4 16:04 author.tf\n", "-rw-r--r-- 1 dirk staff 524 Nov 4 16:04 gap.tf\n", "-rw-r--r-- 1 dirk staff 1619 Nov 4 16:04 letters.tf\n", "-rw-r--r-- 1 dirk staff 548 Nov 4 16:04 number.tf\n", "-rw-r--r-- 1 dirk staff 681 Nov 4 16:04 oslots.tf\n", "-rw-r--r-- 1 dirk staff 1062 Nov 4 16:04 otext.tf\n", "-rw-r--r-- 1 dirk staff 485 Nov 4 16:04 otype.tf\n", "-rw-r--r-- 1 dirk staff 2747 Nov 4 16:04 ovolume.tf\n", "-rw-r--r-- 1 dirk staff 640 Nov 4 16:04 punc.tf\n", "-rw-r--r-- 1 dirk staff 494 Nov 4 16:04 terminator.tf\n", "-rw-r--r-- 1 dirk staff 563 Nov 4 16:04 title.tf\n" ] } ], "source": [ "!ls -l {output}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we compare that with one of the input:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "total 80\n", "-rw-r--r-- 1 dirk staff 359 May 20 2019 author.tf\n", "-rw-r--r-- 1 dirk staff 409 May 20 2019 gap.tf\n", "-rw-r--r-- 1 dirk staff 911 May 20 2019 letters.tf\n", "-rw-r--r-- 1 dirk staff 421 May 20 2019 number.tf\n", "-rw-r--r-- 1 dirk staff 419 May 20 2019 oslots.tf\n", "-rw-r--r-- 1 dirk staff 572 May 20 2019 otext.tf\n", "-rw-r--r-- 1 dirk staff 372 May 30 2019 otype.tf\n", "-rw-r--r-- 1 dirk staff 457 May 20 2019 punc.tf\n", "-rw-r--r-- 1 dirk staff 377 May 20 2019 terminator.tf\n", "-rw-r--r-- 1 dirk staff 361 May 20 2019 title.tf\n" ] } ], "source": [ "!ls -l {PREFIX}/banks1/{SUFFIX}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "then we see the same files (with the addition of *ovolume.tf*\n", "but smaller file sizes.\n", "\n", "## Result\n", "\n", "Let's have a look inside, and note that we use the TF function `loadAll()`\n", "which loads all loadable features." 
] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "This is Text-Fabric 9.1.3\n", "Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html\n", "\n", "11 features found and 0 ignored\n", " 0.00s loading features ...\n", " | 0.00s T otype from combine/_temp/riverbanks\n", " | 0.00s T oslots from combine/_temp/riverbanks\n", " | 0.00s Dataset without structure sections in otext:no structure functions in the T-API\n", " | 0.00s T number from combine/_temp/riverbanks\n", " | 0.00s T punc from combine/_temp/riverbanks\n", " | 0.00s T gap from combine/_temp/riverbanks\n", " | 0.00s T terminator from combine/_temp/riverbanks\n", " | 0.00s T title from combine/_temp/riverbanks\n", " | 0.00s T letters from combine/_temp/riverbanks\n", " | | 0.00s C __levels__ from otype, oslots, otext\n", " | | 0.00s C __order__ from otype, oslots, __levels__\n", " | | 0.00s C __rank__ from otype, __order__\n", " | | 0.00s C __levUp__ from otype, oslots, __rank__\n", " | | 0.00s C __levDown__ from otype, __levUp__, __rank__\n", " | | 0.00s C __boundary__ from otype, oslots, __rank__\n", " | | 0.00s C __sections__ from otype, oslots, otext, __levUp__, __levels__, title, number, number\n", " 0.03s All features loaded/computed - for details use TF.isLoaded()\n", " | 0.00s Feature overview: 9 for nodes; 1 for edges; 1 configs; 8 computed\n", " 0.00s loading features ...\n", " | 0.00s T author from combine/_temp/riverbanks\n", " | 0.00s T ovolume from combine/_temp/riverbanks\n", " 0.01s All additional features loaded - for details use TF.isLoaded()\n" ] } ], "source": [ "TF = Fabric(locations=COMBINED)\n", "api = TF.loadAll(silent=False)\n", "docs = api.makeAvailableIn(globals())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We look up the section of the first word:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "('banks', 1, 1)" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "T.sectionFromNode(1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The component sets had 99 words each. So what is the section of word 100?" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "('rivers', 1, 1)" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "T.sectionFromNode(100)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Right, that's the first word of the second component.\n", "\n", "Here is an overview of all the node types in the combined set.\n", "\n", "The second field is the average length in words for nodes of that type, the remaining fields give\n", "the first and last node of that type." 
] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(('book', 99.0, 199, 200),\n", " ('volume', 99.0, 235, 236),\n", " ('chapter', 49.5, 201, 204),\n", " ('sentence', 33.0, 229, 234),\n", " ('line', 7.666666666666667, 205, 228),\n", " ('word', 1, 1, 198))" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "C.levels.data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The combined data set consists of the concatenation of all slot nodes of the component data sets.\n", "\n", "Note that the individual components have got a top node, of type `volume`.\n", "This is the effect of specifying `componentType='volume'`.\n", "\n", "There is also a feature for volumes, named `title`, that contains their name, or if we haven't passed their names\n", "in the `locations` parameter, their location.\n", "This is the effect of `componentFeature='title'`.\n", "\n", "Let's check.\n", "\n", "We use the new `.items()` method on features." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dict_items([(199, 'Consider Phlebas'), (200, 'Consider Phlebas'), (235, 'banks'), (236, 'rivers')])" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "F.title.items()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see several things:\n", "\n", "* the volume nodes indeed got the component name in the feature `title`\n", "* the other nodes that already had a title, the `book` nodes, still have the same value for `title` as before.\n", "\n", "### The merging principle\n", "\n", "This is a general principle that we see over and over again: when we combine data, we merge as much as possible.\n", "\n", "That means that when you create new features, you may use the names of old features, and the new information for that\n", "feature will be merged with the old information of that feature." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Modify\n", "\n", "Although combining has its complications, the most complex operation is `modify()` because it can do many things.\n", "\n", "It operates on a single TF dataset, and it produces a modified dataset as a fresh \"copy\".\n", "\n", "Despite the name, no actual modification takes place on the input dataset." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "location = f\"{PREFIX}/banks1/{SUFFIX}\"\n", "\n", "MODIFIED = \"_temp/mudbanks\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we take the first local copy of the Banks dataset as our input, for a lot of different operations." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is the list what `modify()` can do.\n", "The order is important, because all operations are executed in this order:\n", "\n", "1. **merge features**: several input features are combined into a single output feature and then deleted;\n", "2. **delete features**: several features are be deleted\n", "3. **add features**: several node/edge features with their data are added to the dataset\n", "4. **merge types**: several input node types are combined into a single output node type;\n", " the input node types are deleted, but not their nodes: they are now part of the output node type;\n", "5. **delete types**: several node types are deleted, *with their nodes*, and all features\n", " will be remapped to accommodate for this;\n", "6. 
**add types**: several new node types with additional feature data for them are added after the last node;\n", " features do not have to be remapped for this; the new node types may be arbitrary intervals of integers and\n", " have no relationship with the existing nodes.\n", "7. **modify metadata**: the metadata of all features can be tweaked, including everything that is in the\n", " `otext` feature, such as text formats and section structure definitions.\n", "\n", "Modify will perform as many sanity checks as possible before it starts working, so that the chances are good that\n", "the modified dataset will load properly.\n", "It will adapt the value type of features to the values encountered, and it will deduce whether edges have values or not.\n", "\n", "If a modified dataset does not load, while the original dataset did load, it is a bug, and I welcome a\n", "[GitHub issue](https://github.com/annotation/text-fabric/issues)\n", "for it." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Only meta data\n", "\n", "We start with the last one, the most simple one." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "otext = dict(\n", " sectionTypes=\"book,chapter\",\n", " sectionFeatures=\"title,number\",\n", " **{\"fmt:text-orig-full\": \"{letters} \"},\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We use `silent=True` from now on, but if you work with larger datasets, it is recommended to set `silent=False` or\n", "to leave it out altogether." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test = \"meta\"\n", "output = f\"{MODIFIED}.{test}\"\n", "\n", "!rm -rf {output}\n", "\n", "modify(\n", " location,\n", " output,\n", " featureMeta=dict(otext=otext),\n", " silent=True,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Result" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "TF = Fabric(locations=f\"{MODIFIED}.{test}\", silent=True)\n", "api = TF.loadAll(silent=True)\n", "docs = api.makeAvailableIn(globals())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have now only 2 section levels. If we ask for some sections, we see that we only get 2 components in the tuple." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "('Consider Phlebas', 1)" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "T.sectionFromNode(1)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "('Consider Phlebas', 2)" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "T.sectionFromNode(99)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Merge features\n", "\n", "We are going to do some tricky mergers on features that are involved in the section structure and the\n", "text formats, so we take care to modify those by means of the `featureMeta` parameter." 
] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "otext = dict(\n", " sectionTypes=\"book,chapter\",\n", " sectionFeatures=\"heading,heading\",\n", " structureTypes=\"book,chapter\",\n", " structureFeatures=\"heading,heading\",\n", " **{\n", " \"fmt:text-orig-full\": \"{content} \",\n", " \"fmt:text-orig-fake\": \"{fake} \",\n", " \"fmt:line-default\": \"{content:XXX}{terminator} \",\n", " },\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We want sectional headings in one feature, `heading`, instead of in `title` for books and `number` for chapters.\n", "\n", "We also make a `content` feature that gives the `letters` of a word unless there is punctuation: then it gives `punc`.\n", "\n", "And we make the opposite: `fake`: it prefers `punc` over `letters`.\n", "\n", "Note that `punc` and `letters` will be deleted after the merge as a whole is completed, so that it is indeed\n", "possible for features to be the input of multiple mergers." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test = \"merge.f\"\n", "output = f\"{MODIFIED}.{test}\"\n", "\n", "!rm -rf {output}\n", "\n", "modify(\n", " location,\n", " output,\n", " mergeFeatures=dict(\n", " heading=(\"title number\"), content=(\"punc letters\"), fake=(\"letters punc\")\n", " ),\n", " featureMeta=dict(\n", " otext=otext,\n", " ),\n", " silent=True,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Result" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "TF = Fabric(locations=f\"{MODIFIED}.{test}\", silent=True)\n", "api = TF.loadAll(silent=True)\n", "docs = api.makeAvailableIn(globals())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We inspect the new `heading` feature for a book and a chapter." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Consider Phlebas'" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b = F.otype.s(\"book\")[0]\n", "F.heading.v(b)" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'1'" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c = F.otype.s(\"chapter\")[0]\n", "F.heading.v(c)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And here is an overview of all node features: `title` and `number` are gone, together with `punc` and `letters`." ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['author', 'content', 'fake', 'gap', 'heading', 'otype', 'terminator']" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Fall()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have modified the standard text format, `text-orig-full`. It now uses the `content` feature,\n", "and indeed, we do not see punctuation anymore." 
] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Everything about us everything around us everything we know '" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "T.text(range(1, 10))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "On the other hand, `text-orig-fake` uses the `fake` feature, and we see that the words in front\n", "of punctuation have disappeared." ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Everything about , everything around , everything we know '" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "T.text(range(1, 10), fmt=\"text-orig-fake\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Delete features\n", "\n", "We just remove two features from the dataset: `author` and `terminator`." ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ " | Missing for text API: features: terminator\n" ] }, { "data": { "text/plain": [ "False" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test = \"delete.f\"\n", "output = f\"{MODIFIED}.{test}\"\n", "\n", "!rm -rf {output}\n", "\n", "modify(\n", " location,\n", " output,\n", " deleteFeatures=\"author terminator\",\n", " silent=True,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Oops. `terminator` is used in a text-format, so if we delete it, the dataset will not load properly.\n", "\n", "Let's not delete `terminator` but `gap`." ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test = \"delete.f\"\n", "output = f\"{MODIFIED}.{test}\"\n", "\n", "modify(\n", " location,\n", " output,\n", " deleteFeatures=\"author gap\",\n", " silent=True,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Result" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": [ "TF = Fabric(locations=f\"{MODIFIED}.{test}\", silent=True)\n", "api = TF.loadAll(silent=True)\n", "docs = api.makeAvailableIn(globals())" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['letters', 'number', 'otype', 'punc', 'terminator', 'title']" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Fall()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Indeed, `gap` is gone." 
] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "ename": "AttributeError", "evalue": "'NodeFeatures' object has no attribute 'gap'", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mAttributeError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mF\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mgap\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfreqList\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;31mAttributeError\u001b[0m: 'NodeFeatures' object has no attribute 'gap'" ] } ], "source": [ "F.gap.freqList()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I told you! Sigh ..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Add features\n", "\n", "We add a bunch of node features and edge features.\n", "\n", "When you add features, you also have to pass their data.\n", "Here we compute that data in place, which results in a lengthy call, but usually you'll get\n", "that data from somewhere in a dictionary, and you only pass the dictionary." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We do not have to explicitly tell the value types of the new features, `modify()` will deduced them.\n", "We can override that by passing a value type explicitly.\n", "\n", "Let's declare `lemma` to be `str`, and `big` `int`:" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ " | Add features: big: feature values are declared to be int but some values are not int\n" ] }, { "data": { "text/plain": [ "False" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test = \"add.f\"\n", "output = f\"{MODIFIED}.{test}\"\n", "\n", "!rm -rf {output}\n", "\n", "modify(\n", " location,\n", " output,\n", " addFeatures=dict(\n", " nodeFeatures=dict(\n", " author={101: \"Banks Jr.\", 102: \"Banks Sr.\"},\n", " lemma={n: 1000 + n for n in range(1, 10)},\n", " small={n: chr(ord(\"a\") + n % 26) for n in range(1, 10)},\n", " big={n: chr(ord(\"A\") + n % 26) for n in range(1, 10)},\n", " ),\n", " edgeFeatures=dict(\n", " link={n: {n + i for i in range(1, 3)} for n in range(1, 10)},\n", " similarity={\n", " n: {n + i: chr(ord(\"a\") + (i + n) % 26) for i in range(1, 3)}\n", " for n in range(1, 10)\n", " },\n", " ),\n", " ),\n", " featureMeta=dict(\n", " lemma=dict(\n", " valueType=\"str\",\n", " ),\n", " big=dict(\n", " valueType=\"int\",\n", " ),\n", " ),\n", " silent=True,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We get away with `lemma` as string, because everything that is written is also a string.\n", "But not all values of `big` are numbers, so: complaint.\n", "\n", "Let's stick to the default:" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test = \"add.f\"\n", "output = f\"{MODIFIED}.{test}\"\n", "\n", "!rm -rf {output}\n", "\n", "modify(\n", " location,\n", " output,\n", " addFeatures=dict(\n", " nodeFeatures=dict(\n", " author={101: \"Banks Jr.\", 102: \"Banks Sr.\"},\n", " lemma={n: 1000 + n for n in range(1, 10)},\n", " small={n: chr(ord(\"a\") + n % 26) for n in 
range(1, 10)},\n", " big={n: chr(ord(\"A\") + n % 26) for n in range(1, 10)},\n", " ),\n", " edgeFeatures=dict(\n", " link={n: {n + i for i in range(1, 3)} for n in range(1, 10)},\n", " similarity={\n", " n: {n + i: chr(ord(\"a\") + (i + n) % 26) for i in range(1, 3)}\n", " for n in range(1, 10)\n", " },\n", " ),\n", " ),\n", " silent=True,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Result" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [], "source": [ "TF = Fabric(locations=f\"{MODIFIED}.{test}\", silent=True)\n", "api = TF.loadAll(silent=True)\n", "docs = api.makeAvailableIn(globals())" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['author',\n", " 'big',\n", " 'gap',\n", " 'lemma',\n", " 'letters',\n", " 'number',\n", " 'otype',\n", " 'punc',\n", " 'small',\n", " 'terminator',\n", " 'title']" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Fall()" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['link', 'oslots', 'similarity']" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Eall()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see the extra features, and let's just enumerate their mappings.\n", "\n", "`link` is an edge feature where edges do not have values.\n", "So for each `n`, the result is a set of nodes." ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dict_items([(1, frozenset({2, 3})), (2, frozenset({3, 4})), (3, frozenset({4, 5})), (4, frozenset({5, 6})), (5, frozenset({6, 7})), (6, frozenset({8, 7})), (7, frozenset({8, 9})), (8, frozenset({9, 10})), (9, frozenset({10, 11}))])" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "E.link.items()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`similarity` assigns values to the edges. So for each `n`, the result is a mapping from nodes to values." ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dict_items([(1, {2: 'c', 3: 'd'}), (2, {3: 'd', 4: 'e'}), (3, {4: 'e', 5: 'f'}), (4, {5: 'f', 6: 'g'}), (5, {6: 'g', 7: 'h'}), (6, {7: 'h', 8: 'i'}), (7, {8: 'i', 9: 'j'}), (8, {9: 'j', 10: 'k'}), (9, {10: 'k', 11: 'l'})])" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "E.similarity.items()" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((2, 'c'), (3, 'd'))" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "E.similarity.f(1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now the node features." ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dict_items([(100, 'Iain M. 
Banks'), (101, 'Banks Jr.'), (102, 'Banks Sr.')])" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "F.author.items()" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dict_items([(1, 'b'), (2, 'c'), (3, 'd'), (4, 'e'), (5, 'f'), (6, 'g'), (7, 'h'), (8, 'i'), (9, 'j')])" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "F.small.items()" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dict_items([(1, 'B'), (2, 'C'), (3, 'D'), (4, 'E'), (5, 'F'), (6, 'G'), (7, 'H'), (8, 'I'), (9, 'J')])" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "F.big.items()" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dict_items([(1, 1001), (2, 1002), (3, 1003), (4, 1004), (5, 1005), (6, 1006), (7, 1007), (8, 1008), (9, 1009)])" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "F.lemma.items()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Merge types\n", "\n", "Manipulating features is relatively easy. But when we fiddle with the node types, we need our wits about us.\n", "\n", "In this example, we first do a feature merge of `title` and `number` into `nm`.\n", "\n", "Then we merge the `line` and `sentence` types into a new type `rule`.\n", "\n", "And `book` and `chapter` will merge into `section`.\n", "\n", "We adapt our section structure so that it makes use of the new features and types." ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test = \"merge.t\"\n", "output = f\"{MODIFIED}.{test}\"\n", "\n", "!rm -rf {output}\n", "\n", "modify(\n", " location,\n", " output,\n", " mergeFeatures=dict(nm=\"title number\"),\n", " mergeTypes=dict(\n", " rule=dict(\n", " line=dict(\n", " type=\"line\",\n", " ),\n", " sentence=dict(\n", " type=\"sentence\",\n", " ),\n", " ),\n", " section=dict(\n", " book=dict(\n", " type=\"book\",\n", " ),\n", " chapter=dict(\n", " type=\"chapter\",\n", " ),\n", " ),\n", " ),\n", " featureMeta=dict(\n", " otext=dict(\n", " sectionTypes=\"section,rule\",\n", " sectionFeatures=\"nm,nm\",\n", " structureTypes=\"section\",\n", " structureFeatures=\"nm\",\n", " ),\n", " ),\n", " silent=True,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Result" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [], "source": [ "TF = Fabric(locations=f\"{MODIFIED}.{test}\", silent=True)\n", "api = TF.loadAll(silent=True)\n", "docs = api.makeAvailableIn(globals())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We expect a severely reduced inventory of node types:" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(('section', 66.0, 100, 102),\n", " ('rule', 12.733333333333333, 103, 117),\n", " ('word', 1, 1, 99))" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "C.levels.data" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['author', 'gap', 'letters', 'nm', 'otype', 'punc', 'terminator', 'type']" ] }, "execution_count": 53, "metadata": {}, 
"output_type": "execute_result" } ], "source": [ "Fall()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Delete types\n", "\n", "We delete the `line` and `sentence` types." ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ " | Missing for text API: types: line, sentence\n" ] }, { "data": { "text/plain": [ "False" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test = \"delete.t\"\n", "output = f\"{MODIFIED}.{test}\"\n", "\n", "!rm -rf {output}\n", "\n", "modify(\n", " location,\n", " output,\n", " deleteTypes=\"sentence line\",\n", " silent=True,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But, again, we can't do that because they are important for the text API.\n", "\n", "This time, we change the text API, so that it does not need them anymore." ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test = \"delete.t\"\n", "output = f\"{MODIFIED}.{test}\"\n", "\n", "modify(\n", " location,\n", " output,\n", " deleteTypes=\"sentence line\",\n", " featureMeta=dict(\n", " otext=dict(\n", " sectionTypes=\"book,chapter\",\n", " sectionFeatures=\"title,number\",\n", " structureTypes=\"book,chapter\",\n", " structureFeatures=\"title,number\",\n", " ),\n", " ),\n", " silent=True,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Result" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [], "source": [ "TF = Fabric(locations=f\"{MODIFIED}.{test}\", silent=True)\n", "api = TF.loadAll(silent=True)\n", "docs = api.makeAvailableIn(globals())" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(('book', 99.0, 100, 100), ('chapter', 49.5, 101, 102), ('word', 1, 1, 99))" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "C.levels.data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As expected." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Add types\n", "\n", "Adding types involves a lot of data, because we do not only add nodes, but also features about those nodes.\n", "\n", "The idea is this:\n", "\n", "Suppose that somewhere in another dataset, you have found lexeme nodes for the words in your data set.\n", "\n", "You just take those lexeme features, which may range from 100,000 to 110,000 say, and you find a way to map them to your\n", "words, by means of a map `nodeSlots`.\n", "\n", "Then you can just grab those lexeme functions *as they are*, and pack them into the `addTypes` argument,\n", "together with the `nodeSlots` and the node boundaries (100,000 - 110,000).\n", "\n", "The new feature data is not able to say something about nodes in the input data set, because the new nodes will be shifted\n", "so that they are past the `maxNode` of your input data set.\n", "And if your feature data accidentally addresses nodes outside the declared range, those assignments will be ignored.\n", "\n", "So all in all, it is a rather clean addition of material.\n", "\n", "Maybe a bit too clean, because it is also impossible to add edge features that link the new nodes to the old nodes.\n", "But then, it would be devilishly hard to make sure that after the necessary remapping of the edge features,\n", "they address the intended nodes.\n", "\n", "If you do want edge features between old and new nodes, it is better to compute them in the new dataset and add them\n", "as an individual feature or by another call to `modify()`.\n", "\n", "Let's have a look at an example where we add a type `bis` consisting of a few bigrams, and a type `tris`,\n", "consisting of a bunch `trigrams`.\n", "\n", "We just furnish a slot mapping for those nodes, and give them a `name` feature." 
] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test = \"add.t\"\n", "output = f\"{MODIFIED}.{test}\"\n", "\n", "!rm -rf {output}\n", "\n", "modify(\n", " location,\n", " output,\n", " addTypes=dict(\n", " bis=dict(\n", " nodeFrom=1,\n", " nodeTo=5,\n", " nodeSlots={\n", " 1: {10, 11},\n", " 2: {20, 21},\n", " 3: {30, 31},\n", " 4: {40, 41},\n", " 5: {50, 51},\n", " },\n", " nodeFeatures=dict(\n", " name={\n", " 1: \"b1\",\n", " 2: \"b2\",\n", " 3: \"b3\",\n", " 4: \"b4\",\n", " 5: \"b5\",\n", " },\n", " ),\n", " edgeFeatures=dict(\n", " link={\n", " 1: {2: 100, 3: 50, 4: 25},\n", " 2: {3: 50, 4: 25, 5: 12},\n", " 3: {4: 25, 5: 12},\n", " 4: {5: 12, 1: 6},\n", " 5: {1: 6, 2: 3, 4: 1},\n", " },\n", " ),\n", " ),\n", " tris=dict(\n", " nodeFrom=1,\n", " nodeTo=4,\n", " nodeSlots={\n", " 1: {60, 61, 62},\n", " 2: {70, 71, 72},\n", " 3: {80, 81, 82},\n", " 4: {90, 91, 94},\n", " },\n", " nodeFeatures=dict(\n", " name={\n", " 1: \"tr1\",\n", " 2: \"tr2\",\n", " 3: \"tr3\",\n", " 4: \"tr4\",\n", " },\n", " ),\n", " edgeFeatures=dict(\n", " sim={\n", " 1: {2, 3, 4},\n", " 2: {3, 4},\n", " 3: {4},\n", " 4: {5, 1},\n", " },\n", " ),\n", " ),\n", " ),\n", " silent=True,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Result" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [], "source": [ "TF = Fabric(locations=f\"{MODIFIED}.{test}\", silent=True)\n", "api = TF.loadAll(silent=True)\n", "docs = api.makeAvailableIn(globals())" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(('book', 99.0, 100, 100),\n", " ('chapter', 49.5, 101, 102),\n", " ('sentence', 33.0, 115, 117),\n", " ('line', 7.666666666666667, 103, 114),\n", " ('tris', 3.0, 123, 126),\n", " ('bis', 2.0, 118, 122),\n", " ('word', 1, 1, 99))" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "C.levels.data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are the `bis` and `tris`!" 
] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['author',\n", " 'gap',\n", " 'letters',\n", " 'name',\n", " 'number',\n", " 'otype',\n", " 'punc',\n", " 'terminator',\n", " 'title']" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Fall()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And there is the new feature `name`:" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(118, 'b1'),\n", " (119, 'b2'),\n", " (120, 'b3'),\n", " (121, 'b4'),\n", " (122, 'b5'),\n", " (123, 'tr1'),\n", " (124, 'tr2'),\n", " (125, 'tr3'),\n", " (126, 'tr4')]" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sorted(F.name.items())" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['link', 'oslots', 'sim']" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Eall()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And the new edge features `link` and `sim`:" ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(118, {121: '25', 120: '50', 119: '100'}),\n", " (119, {122: '12', 121: '25', 120: '50'}),\n", " (120, {122: '12', 121: '25'}),\n", " (121, {118: '6', 122: '12'}),\n", " (122, {121: '1', 119: '3', 118: '6'})]" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sorted(E.link.items())" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(123, frozenset({124, 125, 126})),\n", " (124, frozenset({125, 126})),\n", " (125, frozenset({126})),\n", " (126, frozenset({123}))]" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sorted(E.sim.items())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And that is all for now.\n", "\n", "Incredible that you made it till here!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "All chapters:\n", "\n", "* [use](use.ipynb)\n", "* [share](share.ipynb)\n", "* [app](app.ipynb)\n", "* [repo](repo.ipynb)\n", "* *compose*\n", "\n", "---" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.0" }, "toc-autonumbering": false, "toc-showtags": false, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 4 }