{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n", "\n", "# Turn your corpus into a Text-Fabric dataset\n", "\n", "## Corpus\n", "\n", "We start with a baby corpus in a markdown like format.\n", "\n", "**This is not meant as a preferred format of a corpus.**\n", "\n", "The point of this tutorial is to show what it takes\n", "to turn arbitrary data into TF.\n", "\n", "Below you see a string which contains \n", "1 \"book\", with 2 \"chapters\", each having one or two sentences.\n", "\n", "Here is the complete corpus source material:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "source = '''\n", "# Consider Phlebas\n", "$ author=Iain M. Banks\n", "\n", "## 1\n", "Everything about us,\n", "everything around us,\n", "everything we know [and can know of] is composed ultimately of patterns of nothing;\n", "that’s the bottom line, the final truth.\n", "\n", "So where we find we have any control over those patterns,\n", "why not make the most elegant ones, the most enjoyable and good ones,\n", "in our own terms?\n", "\n", "## 2\n", "Besides,\n", "it left the humans in the Culture free to take care of the things that really mattered in life,\n", "such as [sports, games, romance,] studying dead languages,\n", "barbarian societies and impossible problems,\n", "and climbing high mountains without the aid of a safety harness.\n", "'''" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note a few details:\n", "\n", "* `# ` starts a \"book\" and the line contains its title: section level 1.\n", "* `$` starts a line with key=value metadata: the author.\n", " This line is not part of the text.\n", "* `## ` starts a \"chapter\" and the line contains its number: section level 2.\n", "* *blank lines* separate sentences.\n", " There are 2 sentences in the first chapter and 1 in the second one.\n", "* We will give each sentence a number within its chapter.\n", "* The sentences are divided into lines.\n", "* We will give each line a number within its sentence.\n", "* Words within [ ] will not be part of the line, the line has a gap.\n", " But these words still belong to the text.\n", "* The gapped words will have a feature `gap=1`.\n", "* Lines will be split into words: the slot nodes.\n", "* We separate the word from its punctuation, which gets added in a `punc` feature." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fire up\n", "\n", "Now we start the engines: Text-Fabric, and the *walker* conversion module." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import os\n", "import re\n", "\n", "from tf.fabric import Fabric\n", "from tf.convert.walker import CV" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We call up TF and let it look into the directory where the output has to land,\n", "in this case a subdirectory of the tutorials repo on annotation." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# TF_DIR = os.path.expanduser('~/Downloads/banks/tf') # if you want it in your Downloads directory instead\n", "BASE = os.path.expanduser('~/github')\n", "ORG = 'annotation'\n", "REPO = 'banks'\n", "RELATIVE = 'tf'\n", "\n", "TF_DIR = os.path.expanduser(f'{BASE}/{ORG}/{REPO}/{RELATIVE}')\n", "\n", "VERSION = '0.2'\n", "\n", "TF_PATH = f'{TF_DIR}/{VERSION}'\n", "TF = Fabric(locations=TF_PATH, silent=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## TF configuration\n", "\n", "A Text-Fabric dataset is a bunch of individual `.tf` files that start with a little bit of metadata and then contain \n", "a stream of data, typically the values of a single feature for each node or edge in the graph.\n", "\n", "We specify the metadata bit by bit.\n", "\n", "### slot type\n", "\n", "A crucial design aspect of each TF dataset is its granularity. What are the slots?\n", "\n", "Words, morphemes, characters?\n", "\n", "You decide." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "slotType = 'word'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Provenance\n", "\n", "Users that encounter your TF data in the wild, will be thankful to you if you took the\n", "trouble to say in a few key-value pairs what that data is about.\n", "\n", "The metadata you specify here will end up in all generated TF features." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "generic = {\n", " 'name': 'Culture quotes from Iain Banks',\n", " 'compiler': 'Dirk Roorda',\n", " 'source': 'Good Reads',\n", " 'url': 'https://www.goodreads.com/work/quotes/14366-consider-phlebas',\n", " 'version': '0.2',\n", " 'purpose': 'exposition',\n", " 'status': 'with for similarities in a separate module'\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Text matters\n", "\n", "A few things concerning the presentation of your text can be specified in the `otext` feature.\n", "This is a TF feature without data, it has only metadata.\n", "\n", "It contains the specs for the section structure of your corpus and the text formats.\n", "\n", "#### Section structure\n", "\n", "TF assumes that there are two or three section levels it can work with.\n", "For each level you have to specify the corresponding node type and the feature that contains the section name or number\n", "(`sectionTypes` and `sectionFeatures`).\n", "\n", "#### Your own section structure\n", "\n", "But you can also define a more extensive and flexible section structure for your own purposes\n", "(`structureTypes` and `structureFeatures`).\n", "Both systems may use the same types and features, but they are completely independent.\n", "\n", "#### Text formats\n", "\n", "When you ask TF to render slot nodes (the nodes with text), TF needs to know\n", "which features to render. \n", "\n", "A text format is a template with placeholders for the features you want to use." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "otext = {\n", " 'fmt:text-orig-full': '{letters}{punc} ',\n", " 'fmt:line-term': 'line#{terminator} ',\n", " 'fmt:line-default': '{letters:XXX}{terminator} ',\n", " 'sectionTypes': 'book,chapter,sentence',\n", " 'sectionFeatures': 'title,number,number',\n", " 'structureTypes': 'book,chapter,sentence,line',\n", " 'structureFeatures': 'title,number,number,number',\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Typing\n", "\n", "The values of features are usually strings.\n", "But if you know that they are always integers, you can declare a feature as an integer valued feature.\n", "\n", "The only thing you have to do is to declare them in the following set:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "intFeatures = {\n", " 'number',\n", " 'gap',\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Descriptions\n", "\n", "You can say per feature what it does.\n", "Use as many key-values as you like.\n", "\n", "A good *description* is particularly helpful.\n", "\n", "It is surprising how often you want to consult those descriptions yourself." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "featureMeta = {\n", " 'number': {\n", " 'description': 'number of chapter, or sentence in chapter, or line in sentence',\n", " },\n", " 'gap': {\n", " 'description': '1 for words that occur between [ ], which are inserted by the editor',\n", " },\n", " 'title': {\n", " 'description': 'the title of a book',\n", " },\n", " 'author': {\n", " 'description': 'the author of a book',\n", " },\n", " 'terminator': {\n", " 'description': 'the last character of a line',\n", " },\n", " 'letters': {\n", " 'description': 'the letters of a word',\n", " },\n", " 'punc': {\n", " 'description': 'the punctuation after a word',\n", " },\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Director\n", "\n", "This is the heart of the matter.\n", "\n", "The director is a function to be written by you.\n", "\n", "It needs to plough through your source material, and fire actions towards the TF machinery.\n", "The TF work is done for you by these actions, you can concentrate on the aspects of your\n", "source data.\n", "\n", "Every time the director encounters a new textual object in the source,\n", "it issues an action requesting a new node.\n", "The director gets a receipt, by which it can issue subsequent\n", "actions for that node, like adding feature values to it.\n", "\n", "And when the object is done, the director issues a `terminate` action.\n", "\n", "During all this, the `cv` machine is busy to translate these actions into\n", "the construction of a proper TF graph representing all the\n", "source material that you have exposed to it.\n", "\n", "A few things to note\n", "\n", "* If you want to terminate a node, you do not have to worry whether the node exists or has already\n", " been terminated: just do it;\n", "* If you have terminated a node, you can resume it later;\n", "* When you add nodes, slot and non-slots, there is magic behind the scenes:\n", " * when a **slot** node is added, it will be linked to all active non-slot nodes,\n", " i.e. the ones that have not been terminated or have been resumed;\n", " * when a **non-slot** node is added, is becomes automatically active,\n", " in the sense that it will be linked to subsequent slot nodes, before it is terminated,\n", " or after it has been resumed;\n", "* If a fatal error is encountered, the director can simply say `cv.stop(message)`;\n", " the director is responsible for returning control after issuing a `cv.stop)`;\n", "* If the actions involve section nodes, it will be checked whether all slots occur in a section,\n", " and whether big sections such as books will not start, end, or terminate inside small sections such\n", " as verses. Warnings will be issued, but you can suppress them;\n", "* After the director is done, TF will perform several checks on the result." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "def director(cv):\n", " counter = dict(\n", " sentence=0,\n", " line=0,\n", " )\n", " cur = dict(\n", " book=None,\n", " chapter=None,\n", " sentence=None,\n", " )\n", "\n", " wordRe = re.compile(r'^(.*?)([^A-Za-z0-9]*)$')\n", " metaRe = re.compile(r'^\\$\\s*([^= ]+)\\s*=\\s*(.*)')\n", " \n", " for line in source.strip().split('\\n'):\n", " line = line.rstrip()\n", " if not line:\n", " cv.terminate(cur['sentence']) # action\n", " for ntp in counter:\n", " counter[ntp] += 1\n", " cur['sentence'] = cv.node('sentence') # action\n", " cv.feature(\n", " cur['sentence'],\n", " number=counter['sentence'],\n", " ) # action\n", " continue\n", " \n", " if line.startswith('# '):\n", " for ntp in ('sentence', 'chapter', 'book'):\n", " cv.terminate(cur[ntp]) # action\n", " cur[ntp] = None \n", " title = line[2:].strip()\n", " cur['book'] = cv.node('book') # action\n", " for ntp in counter:\n", " counter[ntp] = 0\n", " cv.feature(\n", " cur['book'],\n", " title=title,\n", " ) # action\n", " continue\n", "\n", " if line.startswith('## '):\n", " for ntp in ('sentence', 'chapter'):\n", " cv.terminate(cur[ntp]) # action\n", " cur[ntp] = None \n", " number = line[2:].strip()\n", " cur['chapter'] = cv.node('chapter') # action\n", " for ntp in counter:\n", " counter[ntp] = 0\n", " cv.feature(\n", " cur['chapter'],\n", " number=number,\n", " ) # action\n", " continue\n", "\n", " if line.startswith('$'):\n", " match = metaRe.match(line)\n", " if not match:\n", " cv.stop(\n", " f'Malformed metadata line: \"{line}\"'\n", " ) # action\n", " return\n", " name = match.group(1)\n", " value = match.group(2)\n", " cv.feature(\n", " cur['book'],\n", " **{name: value},\n", " ) # action\n", " continue\n", " \n", " if not cur['sentence']:\n", " cur['sentence'] = cv.node('sentence') # action\n", " counter['sentence'] += 1\n", " cv.feature(\n", " cur['sentence'],\n", " number=counter['sentence'],\n", " ) # action\n", " \n", " cur['line'] = cv.node('line') # action\n", " counter['line'] += 1\n", " cv.feature(\n", " cur['line'],\n", " terminator=line[-1],\n", " number=counter['line'],\n", " ) # action\n", " \n", " gap = False\n", " for word in line.split():\n", " if word.startswith('['):\n", " gap = True\n", " cv.terminate(cur['line']) # action\n", " w = cv.slot() # action\n", " cv.feature(w, gap=1) # action\n", " word = word[1:]\n", " elif word.endswith(']'):\n", " w = cv.slot() # action\n", " cv.resume(cur['line']) # action\n", " cv.feature(w, gap=1) # action\n", " gap = False\n", " word = word[0:-1]\n", " else:\n", " w = cv.slot()\n", " if gap:\n", " cv.feature(w, gap=1) # action\n", "\n", " (letters, punc) = wordRe.findall(word)[0]\n", " cv.feature(w, letters=letters) # action\n", " if punc:\n", " cv.feature(w, punc=punc) # action\n", " cv.terminate(cur['line']) # action\n", " curLine = None\n", " \n", " # just for informational purposes\n", " print('\\nINFORMATION:', cv.activeTypes(), '\\n') # action\n", " \n", " for ntp in ('sentence', 'chapter', 'book'):\n", " cv.terminate(cur[ntp]) # action\n", "\n", " cv.meta(\n", " 'punc', remark='a bit more info is needed',\n", " ) # action" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Run\n", "\n", "We are going to run the conversion and check whether all is well." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we initialize the conversion machinery: we obtain an object with methods." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Importing data from walking through the source ...\n", " | 0.00s Preparing metadata... \n", " | SECTION TYPES: book, chapter, sentence\n", " | SECTION FEATURES: title, number, number\n", " | STRUCTURE TYPES: book, chapter, sentence, line\n", " | STRUCTURE FEATURES: title, number, number, number\n", " | TEXT FEATURES:\n", " | | line-default letters, terminator\n", " | | line-term terminator\n", " | | text-orig-full letters, punc\n", " | 0.01s OK\n", " | 0.00s Following director... \n", "\n", "INFORMATION: {'sentence', 'chapter', 'book'} \n", "\n", " | 0.00s \"edge\" actions: 0\n", " | 0.00s \"feature\" actions: 144\n", " | 0.00s \"node\" actions: 20\n", " | 0.00s \"resume\" actions: 2\n", " | 0.00s \"slot\" actions: 99\n", " | 0.00s \"terminate\" actions: 27\n", " | 1 x \"book\" node \n", " | 2 x \"chapter\" node \n", " | 12 x \"line\" node \n", " | 5 x \"sentence\" node \n", " | 99 x \"word\" node = slot type\n", " | 119 nodes of all types\n", " | 0.01s OK\n", " | 0.00s Removing unlinked nodes ... \n", " | | -0.00s 2 unlinked \"sentence\" nodes: [1, 4]\n", " | | 0.00s 2 unlinked nodes\n", " | | 0.00s Leaving 117 nodes\n", " | 0.00s checking for nodes and edges ... \n", " | 0.00s OK\n", " | 0.00s checking features ... \n", " | 0.00s OK\n", " | 0.00s reordering nodes ...\n", " | 0.00s Sorting 1 nodes of type \"book\"\n", " | 0.00s Sorting 2 nodes of type \"chapter\"\n", " | 0.00s Sorting 12 nodes of type \"line\"\n", " | 0.00s Sorting 3 nodes of type \"sentence\"\n", " | 0.00s Max node = 117\n", " | 0.00s OK\n", " | 0.00s reassigning feature values ...\n", " | | 0.01s node feature \"author\" with 1 node\n", " | | 0.01s node feature \"gap\" with 7 nodes\n", " | | 0.01s node feature \"letters\" with 99 nodes\n", " | | 0.01s node feature \"number\" with 17 nodes\n", " | | 0.01s node feature \"punc\" with 17 nodes\n", " | | 0.01s node feature \"terminator\" with 12 nodes\n", " | | 0.01s node feature \"title\" with 1 node\n", " | 0.00s OK\n", " 0.00s Exporting 8 node and 1 edge and 1 config features to /Users/dirk/github/annotation/banks/tf/0.2:\n", " 0.00s VALIDATING oslots feature\n", " 0.00s VALIDATING oslots feature\n", " 0.00s maxSlot= 99\n", " 0.00s maxNode= 117\n", " 0.00s OK: oslots is valid\n", " | 0.00s T author to /Users/dirk/github/annotation/banks/tf/0.2\n", " | 0.00s T gap to /Users/dirk/github/annotation/banks/tf/0.2\n", " | 0.01s T letters to /Users/dirk/github/annotation/banks/tf/0.2\n", " | 0.01s T number to /Users/dirk/github/annotation/banks/tf/0.2\n", " | 0.01s T otype to /Users/dirk/github/annotation/banks/tf/0.2\n", " | 0.00s T punc to /Users/dirk/github/annotation/banks/tf/0.2\n", " | 0.00s T terminator to /Users/dirk/github/annotation/banks/tf/0.2\n", " | 0.00s T title to /Users/dirk/github/annotation/banks/tf/0.2\n", " | 0.00s T oslots to /Users/dirk/github/annotation/banks/tf/0.2\n", " | 0.00s M otext to /Users/dirk/github/annotation/banks/tf/0.2\n", " 0.05s Exported 8 node features and 1 edge features and 1 config features to /Users/dirk/github/annotation/banks/tf/0.2\n" ] }, { "data": { "text/plain": [ "True" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cv = CV(TF)\n", "\n", "good = cv.walk(\n", " director,\n", " slotType,\n", " otext=otext,\n", " generic=generic,\n", " intFeatures=intFeatures,\n", " featureMeta=featureMeta,\n", ")\n", "\n", "good" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Inspect\n", "\n", "Let's inspect some of the files:\n", "\n", "### `otype`" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "@node\n", "@compiler=Dirk Roorda\n", "@name=Culture quotes from Iain Banks\n", "@purpose=exposition\n", "@source=Good Reads\n", "@status=with for similarities in a separate module\n", "@url=https://www.goodreads.com/work/quotes/14366-consider-phlebas\n", "@valueType=str\n", "@version=0.2\n", "@writtenBy=Text-Fabric\n", "@dateWritten=2020-02-13T06:46:28Z\n", "\n", "1-99\tword\n", "100\tbook\n", "101-102\tchapter\n", "103-114\tline\n", "115-117\tsentence\n", "\n" ] } ], "source": [ "with open(f'{TF_PATH}/otype.tf') as fh:\n", " print(fh.read())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### `otext`" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "@config\n", "@compiler=Dirk Roorda\n", "@fmt:line-default={letters:XXX}{terminator} \n", "@fmt:line-term=line#{terminator} \n", "@fmt:text-orig-full={letters}{punc} \n", "@name=Culture quotes from Iain Banks\n", "@purpose=exposition\n", "@sectionFeatures=title,number,number\n", "@sectionTypes=book,chapter,sentence\n", "@source=Good Reads\n", "@status=with for similarities in a separate module\n", "@structureFeatures=title,number,number,number\n", "@structureTypes=book,chapter,sentence,line\n", "@url=https://www.goodreads.com/work/quotes/14366-consider-phlebas\n", "@version=0.2\n", "@writtenBy=Text-Fabric\n", "@dateWritten=2020-02-13T06:46:28Z\n", "\n", "\n" ] } ], "source": [ "with open(f'{TF_PATH}/otext.tf') as fh:\n", " print(fh.read())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### `oslots`" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "@edge\n", "@compiler=Dirk Roorda\n", "@name=Culture quotes from Iain Banks\n", "@purpose=exposition\n", "@source=Good Reads\n", "@status=with for similarities in a separate module\n", "@url=https://www.goodreads.com/work/quotes/14366-consider-phlebas\n", "@valueType=str\n", "@version=0.2\n", "@writtenBy=Text-Fabric\n", "@dateWritten=2020-02-13T06:46:28Z\n", "\n", "100\t1-99\n", "1-55\n", "56-99\n", "1-3\n", "4-6\n", "7-9,14-20\n", "21-27\n", "28-38\n", "39-51\n", "52-55\n", "56\n", "57-75\n", "76-77,81-83\n", "84-88\n", "89-99\n", "1-27\n", "28-55\n", "56-99\n", "\n" ] } ], "source": [ "with open(f'{TF_PATH}/oslots.tf') as fh:\n", " print(fh.read())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Explanation\n", "\n", "* line `100\t1-99` tells that node 100 (the first book node, see `otext`, is linked to slots 1-99, which are all slots.\n", "* the next line has only `1-55`. These are the slots of node 101, being 1 + the previous node.\n", "\n", "And so on." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### number" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "@node\n", "@compiler=Dirk Roorda\n", "@description=number of chapter, or sentence in chapter, or line in sentence\n", "@name=Culture quotes from Iain Banks\n", "@purpose=exposition\n", "@source=Good Reads\n", "@status=with for similarities in a separate module\n", "@url=https://www.goodreads.com/work/quotes/14366-consider-phlebas\n", "@valueType=int\n", "@version=0.2\n", "@writtenBy=Text-Fabric\n", "@dateWritten=2020-02-13T06:46:28Z\n", "\n", "101\t1\n", "2\n", "1\n", "2\n", "3\n", "4\n", "6\n", "7\n", "8\n", "1\n", "2\n", "3\n", "4\n", "5\n", "1\n", "2\n", "1\n", "\n" ] } ], "source": [ "with open(f'{TF_PATH}/number.tf') as fh:\n", " print(fh.read())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Explanation\n", "\n", "Node 101 is the first chapter node, which has chapter number 1.\n", "\n", "The next line is about node 102, the second chapter, with number 2.\n", "\n", "The following line refers to node 103, and a quick glance at the `otype` feature shows that this is a line.\n", "\n", "The last three lines are about the three sentences, which are numbered within their chapter:\n", "`1` then `2` and then again `1`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### letters" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "@node\n", "@compiler=Dirk Roorda\n", "@description=the letters of a word\n", "@name=Culture quotes from Iain Banks\n", "@purpose=exposition\n", "@source=Good Reads\n", "@status=with for similarities in a separate module\n", "@url=https://www.goodreads.com/work/quotes/14366-consider-phlebas\n", "@valueType=str\n", "@version=0.2\n", "@writtenBy=Text-Fabric\n", "@dateWritten=2020-02-13T06:46:28Z\n", "\n", "Everything\n", "about\n", "us\n", "everything\n", "around\n", "us\n", "everything\n", "we\n", "know\n", "and\n", "can\n", "know\n", "of\n", "is\n", "composed\n", "ultimately\n", "of\n", "patterns\n", "of\n", "nothing\n", "that’s\n", "the\n", "bottom\n", "line\n", "the\n", "final\n", "truth\n", "So\n", "where\n", "we\n", "find\n", "we\n", "have\n", "any\n", "control\n", "over\n", "those\n", "patterns\n", "why\n", "not\n", "make\n", "the\n", "most\n", "elegant\n", "ones\n", "the\n", "most\n", "enjoyable\n", "and\n", "good\n", "ones\n", "in\n", "our\n", "own\n", "terms\n", "Besides\n", "it\n", "left\n", "the\n", "humans\n", "in\n", "the\n", "Culture\n", "free\n", "to\n", "take\n", "care\n", "of\n", "the\n", "things\n", "that\n", "really\n", "mattered\n", "in\n", "life\n", "such\n", "as\n", "sports\n", "games\n", "romance\n", "studying\n", "dead\n", "languages\n", "barbarian\n", "societies\n", "and\n", "impossible\n", "problems\n", "and\n", "climbing\n", "high\n", "mountains\n", "without\n", "the\n", "aid\n", "of\n", "a\n", "safety\n", "harness\n", "\n" ] } ], "source": [ "with open(f'{TF_PATH}/letters.tf') as fh:\n", " print(fh.read())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Explanation\n", "\n", "The plain, clean text of everything." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### `punc`" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "@node\n", "@compiler=Dirk Roorda\n", "@description=the punctuation after a word\n", "@name=Culture quotes from Iain Banks\n", "@purpose=exposition\n", "@remark=a bit more info is needed\n", "@source=Good Reads\n", "@status=with for similarities in a separate module\n", "@url=https://www.goodreads.com/work/quotes/14366-consider-phlebas\n", "@valueType=str\n", "@version=0.2\n", "@writtenBy=Text-Fabric\n", "@dateWritten=2020-02-13T06:46:28Z\n", "\n", "3\t,\n", "6\t,\n", "20\t;\n", "24\t,\n", "27\t.\n", "38\t,\n", "45\t,\n", "51\t,\n", "55\t?\n", ",\n", "75\t,\n", "78\t,\n", ",\n", ",\n", "83\t,\n", "88\t,\n", "99\t.\n", "\n" ] } ], "source": [ "with open(f'{TF_PATH}/punc.tf') as fh:\n", " print(fh.read())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note the metadata field `remark=a bit more info is needed` which was added \"last-minute\" by means of a `cv.meta()` action." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Links\n", "\n", "Of course, there is much more to TF.\n", "\n", "Have a look through tutorials for several corpora: Hebrew and Syriac Bible, Quran, Uruk Cuneiform.\n", "\n", "Navigate from [here](https://nbviewer.jupyter.org/github/annotation/tutorials/tree/master/).\n", "\n", "Now conversion is this easy, more corpora will follow.\n", "\n", "The docs for conversion are [here](https://annotation.github.io/text-fabric/Create/Convert/)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "Tutorial for Banks\n", "\n", "* [use](https://nbviewer.jupyter.org/github/annotation/tutorials/blob/master/banks/use.ipynb)\n", "\n", "---" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.2" } }, "nbformat": 4, "nbformat_minor": 4 }