{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n", "\n", "\n", "# Tutorial\n", "\n", "This notebook gets you started with using\n", "[Text-Fabric](https://dans-labs.github.io/text-fabric/) for coding in the Hebrew Bible.\n", "\n", "Chances are that a bit of reading about the underlying\n", "[data model](https://dans-labs.github.io/text-fabric/Model/Data-Model/)\n", "helps you to follow the exercises below, and vice versa." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Installing Text-Fabric\n", "\n", "### Python\n", "\n", "You need to have Python on your system. Most systems have it out of the box,\n", "but alas, that is python2 and we need at least python **3.6**.\n", "\n", "Install it from [python.org](https://www.python.org) or from\n", "[Anaconda](https://www.anaconda.com/download).\n", "\n", "### Jupyter notebook\n", "\n", "You need [Jupyter](http://jupyter.org).\n", "\n", "If it is not already installed:\n", "\n", "```\n", "pip3 install jupyter\n", "```\n", "\n", "### TF itself\n", "\n", "```\n", "pip3 install text-fabric\n", "```" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%load_ext autoreload\n", "%autoreload 2" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:17:16.202764Z", "start_time": "2018-05-18T09:17:16.197546Z" } }, "outputs": [], "source": [ "import sys, os, collections\n", "from IPython.display import HTML" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:17:17.537171Z", "start_time": "2018-05-18T09:17:17.517809Z" } }, "outputs": [], "source": [ "from tf.fabric import Fabric\n", "from tf.extra.bhsa import Bhsa" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Call Text-Fabric\n", "\n", "Everything starts by calling up Text-Fabric.\n", "It needs to know where to look for data.\n", "\n", "The Hebrew Bible is in the same repository as this tutorial.\n", "I assume you have cloned [bhsa](https://github.com/etcbc/bhsa)\n", "and [phono](https://github.com/etcbc/phono)\n", "in your directory `~/github/etcbc`, so that your directory structure looks like this\n", "\n", " your home direcectory\\\n", " | - github\\\n", " | | - etcbc\\\n", " | | | - bhsa\n", " | | | - phono\n", " \n", "## Tip\n", "If you start computing with this tutorial, first copy its parent directory to somewhere else,\n", "outside your `bhsa` directory.\n", "If you pull changes from the `bhsa` repository later, your work will not be overwritten.\n", "Where you put your tutorial directory is up till you.\n", "It will work from any directory." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:17:19.878701Z", "start_time": "2018-05-18T09:17:19.859972Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "This is Text-Fabric 5.4.2\n", "Api reference : https://dans-labs.github.io/text-fabric/Api/General/\n", "Tutorial : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb\n", "Example data : https://github.com/Dans-labs/text-fabric-data\n", "\n", "118 features found and 0 ignored\n" ] } ], "source": [ "VERSION = '2017'\n", "DATABASE = '~/github/etcbc'\n", "BHSA = f'bhsa/tf/{VERSION}'\n", "PHONO = f'phono/tf/{VERSION}'\n", "TF = Fabric(locations=[DATABASE], modules=[BHSA, PHONO], silent=False )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that we have added a module `phono`. \n", "The BHSA data has a special 1-1 transcription from Hebrew to ASCII, \n", "but not a *phonetic* transcription.\n", "\n", "I have made a \n", "[notebook](https://github.com/etcbc/phono/blob/master/programs/phono.ipynb)\n", "that tries hard to find phonological representations for all the words.\n", "The result is a module in text-fabric format.\n", "We'll encounter that later.\n", "\n", "**NB:** This is a real-world example of how to add data to an existing data source as a module." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Load Features\n", "The data of the BHSA is organized in features.\n", "They are *columns* of data.\n", "Think of the Hebrew Bible as a gigantic spreadsheet, where row 1 corresponds to the\n", "first word, row 2 to the second word, and so on, for all 425,000 words.\n", "\n", "The information which part-of-speech each word is, constitutes a column in that spreadsheet.\n", "The BHSA contains over 100 columns, not only for the 425,000 words, but also for a million more\n", "textual objects.\n", "\n", "Instead of putting that information in one big table, the data is organized in separate columns.\n", "We call those columns **features**.\n", "\n", "We just load the features we need for this tutorial.\n", "Later on, where we use them, it will become clear what they mean." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:17:31.204738Z", "start_time": "2018-05-18T09:17:25.793730Z" }, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s loading features ...\n", " 0.67s All features loaded/computed - for details use loadLog()\n" ] } ], "source": [ "api = TF.load('''\n", " sp lex voc_lex_utf8\n", " g_word trailer\n", " g_lex_utf8\n", " qere qere_trailer\n", " language freq_lex gloss\n", " mother\n", "''')\n", "api.makeAvailableIn(globals())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result of this all is that we have a bunch of special variables at our disposal\n", "that give us access to the text and data of the Hebrew Bible.\n", "\n", "At this point it is helpful to throw a quick glance at the text-fabric\n", "[API documentation](https://dans-labs.github.io/text-fabric/Api/General/).\n", "\n", "The most essential thing for now is that we can use `F` to access the data in the features\n", "we've loaded.\n", "But there is more, such as `N`, which helps us to walk over the text, as we see in a minute." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## More power\n", "\n", "There are extra functions on top of Text-Fabric that know about the Hebrew Bible.\n", "Lets acquire additional power." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:17:35.677198Z", "start_time": "2018-05-18T09:17:34.694968Z" } }, "outputs": [ { "data": { "text/markdown": [ "**Documentation:** BHSA Feature docs BHSA API Text-Fabric API 5.4.2 Search Reference" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "\n", "This notebook online:\n", "NBViewer\n", "GitHub\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "B = Bhsa(api, 'start', version=VERSION)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A few things to note:\n", "\n", "* You supply the `api` as first argument to `Bhsa()`\n", "* You supply the plain *name* of the notebook that you are writing as the second argument\n", "* You supply the *version* of the BHSA data as the third argument\n", "\n", "The result is that you have a few handy links to \n", "\n", "* the data provenance and documentation\n", "* the BHSA API and the Text-Fabric API\n", "* the online versions of this notebook on GitHub and NBViewer." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Search\n", "Text-Fabric contains a flexible search engine, that does not only work for the BHSA data,\n", "but also for data that you add to it.\n", "\n", "**Search is the quickest way to come up-to-speed with your data, without too much programming.**\n", "\n", "Jump to the dedicated [search](search.ipynb) search tutorial first, to whet your appetite.\n", "And if you already know MQL queries, you can build from that in\n", "[searchFromMQL](searchFromMQL.ipynb).\n", "\n", "The real power of search lies in the fact that it is integrated in a programming environment.\n", "You can use programming to:\n", "\n", "* compose dynamic queries\n", "* process query results\n", "\n", "Therefore, the rest of this tutorial is still important when you want to tap that power.\n", "If you continue here, you learn all the basics of data-navigation with Text-Fabric." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Counting\n", "\n", "In order to get acquainted with the data, we start with the simple task of counting.\n", "\n", "## Count all nodes\n", "We use the \n", "[`N()` generator](https://dans-labs.github.io/text-fabric/Api/General/#navigating-nodes)\n", "to walk through the nodes.\n", "\n", "We compared the BHSA data to a gigantic spreadsheet, where the rows correspond to the words.\n", "In Text-Fabric, we call the rows `slots`, because they are the textual positions that can be filled with words.\n", "\n", "We also mentioned that there are also 1,000,000 more textual objects. \n", "They are the phrases, clauses, sentences, verses, chapters and books.\n", "They also correspond to rows in the big spreadsheet.\n", "\n", "In Text-Fabric we call all these rows *nodes*, and the `N()` generator\n", "carries us through those nodes in the textual order.\n", "\n", "Just one extra thing: the `info` statements generate timed messages.\n", "If you use them instead of `print` you'll get a sense of the amount of time that \n", "the various processing steps typically need." 
] }, { "cell_type": "code", "execution_count": 6, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:17:43.894153Z", "start_time": "2018-05-18T09:17:43.597128Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Counting nodes ...\n", " 0.30s 1446635 nodes\n" ] } ], "source": [ "indent(reset=True)\n", "info('Counting nodes ...')\n", "\n", "i = 0\n", "for n in N(): i += 1\n", "\n", "info('{} nodes'.format(i))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here you see it: 1,4 M nodes!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What are those million nodes?\n", "Every node has a type, like word, or phrase, sentence.\n", "We know that we have approximately 425,000 words and a million other nodes.\n", "But what exactly are they?\n", "\n", "Text-Fabric has two special features, `otype` and `oslots`, that must occur in every Text-Fabric data set.\n", "`otype` tells you for each node its type, and you can ask for the number of `slot`s in the text.\n", "\n", "Here we go!" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:17:47.820323Z", "start_time": "2018-05-18T09:17:47.812328Z" } }, "outputs": [ { "data": { "text/plain": [ "'word'" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "F.otype.slotType" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:17:48.549430Z", "start_time": "2018-05-18T09:17:48.543371Z" } }, "outputs": [ { "data": { "text/plain": [ "426584" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "F.otype.maxSlot" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:17:49.251302Z", "start_time": "2018-05-18T09:17:49.244467Z" } }, "outputs": [ { "data": { "text/plain": [ "1446635" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "F.otype.maxNode" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:17:49.922863Z", "start_time": "2018-05-18T09:17:49.916078Z" } }, "outputs": [ { "data": { "text/plain": [ "('book',\n", " 'chapter',\n", " 'lex',\n", " 'verse',\n", " 'half_verse',\n", " 'sentence',\n", " 'sentence_atom',\n", " 'clause',\n", " 'clause_atom',\n", " 'phrase',\n", " 'phrase_atom',\n", " 'subphrase',\n", " 'word')" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "F.otype.all" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:17:51.782779Z", "start_time": "2018-05-18T09:17:51.774167Z" } }, "outputs": [ { "data": { "text/plain": [ "(('book', 10938.051282051281, 426585, 426623),\n", " ('chapter', 459.18622174381056, 426624, 427552),\n", " ('lex', 46.2021011588866, 1437403, 1446635),\n", " ('verse', 18.37694395381898, 1414190, 1437402),\n", " ('half_verse', 9.441876936697653, 606323, 651502),\n", " ('sentence', 6.695609863288914, 1172209, 1235919),\n", " ('sentence_atom', 6.615141270973544, 1235920, 1300405),\n", " ('clause', 4.841988172665464, 427553, 515653),\n", " ('clause_atom', 4.704849507549438, 515654, 606322),\n", " ('phrase', 1.6848574373881755, 651503, 904689),\n", " ('phrase_atom', 1.5945932812248849, 904690, 1172208),\n", " ('subphrase', 1.4240578640230612, 1300406, 1414189),\n", " ('word', 1, 1, 426584))" ] }, "execution_count": 11, "metadata": 
{}, "output_type": "execute_result" } ], "source": [ "C.levels.data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is interesting: above you see all the textual objects, with the average size of their objects,\n", "the node where they start, and the node where they end." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Count individual object types\n", "This is an intuitive way to count the number of nodes in each type.\n", "Note in passing, how we use the `indent` in conjunction with `info` to produce neat timed \n", "and indented progress messages." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:17:57.806821Z", "start_time": "2018-05-18T09:17:57.558523Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s counting objects ...\n", " | 0.00s 39 books\n", " | 0.00s 929 chapters\n", " | 0.00s 9233 lexs\n", " | 0.01s 23213 verses\n", " | 0.01s 45180 half_verses\n", " | 0.01s 63711 sentences\n", " | 0.01s 64486 sentence_atoms\n", " | 0.01s 88101 clauses\n", " | 0.01s 90669 clause_atoms\n", " | 0.04s 253187 phrases\n", " | 0.04s 267519 phrase_atoms\n", " | 0.01s 113784 subphrases\n", " | 0.06s 426584 words\n", " 0.24s Done\n" ] } ], "source": [ "indent(reset=True)\n", "info('counting objects ...')\n", "\n", "for otype in F.otype.all:\n", " i = 0\n", "\n", " indent(level=1, reset=True)\n", "\n", " for n in F.otype.s(otype): i+=1\n", "\n", " info('{:>7} {}s'.format(i, otype))\n", "\n", "indent(level=0)\n", "info('Done')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Viewing textual objects\n", "\n", "We use the BHSA API (the extra power) to peek into the corpus." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First a word. Node 100,000 is a slot. Let's see what it is and where it is." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:18:02.282178Z", "start_time": "2018-05-18T09:18:02.274117Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", "
in
\n", "\n", "
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "wordShow = 100000\n", "B.pretty(wordShow)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note \n", "* if you click on the word\n", " you go to a page in SHEBANQ that shows a list of all occurrences of this lexeme;\n", "* if you hover on the part-of-speech (`prep` here), you see the passage, \n", " and if you click on it, you go to SHEBANQ, to exactly this verse." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us do the same for more complex objects, such as phrases, sentences, etc." ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:18:04.566580Z", "start_time": "2018-05-18T09:18:04.557891Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "
\n", "\n", "
\n", " phrase Frnt\n", " PP\n", "
\n", "
\n", "\n", "
\n", "\n", "\n", "
end
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "
<object marker>
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "
the
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "
word
\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "phraseShow = 700001\n", "B.pretty(phraseShow)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:18:07.079086Z", "start_time": "2018-05-18T09:18:07.072541Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "
\n", "\n", "
\n", " clause NA\n", " xYq0\n", "
\n", "
\n", "\n", "
\n", "\n", "
\n", " phrase Conj\n", " CP\n", "
\n", "
\n", "\n", "
\n", "\n", "\n", "
if
\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "
\n", "\n", "
\n", " phrase Adju\n", " PP\n", "
\n", "
\n", "\n", "
\n", "\n", "\n", "
to
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "
loyalty
\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "
\n", "\n", "
\n", " phrase PreO\n", " VP\n", "
\n", "
\n", "\n", "
\n", "\n", "\n", "
find
\n", "
hif
\n", "
impf
\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "clauseShow = 500002\n", "B.pretty(clauseShow)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:18:08.036171Z", "start_time": "2018-05-18T09:18:08.027419Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "
\n", "\n", "
\n", " sentence \n", "
\n", "
\n", "\n", "
\n", "\n", "
\n", " clause NA\n", " Way0\n", "
\n", "
\n", "\n", "
\n", "\n", "
\n", " phrase Conj\n", " CP\n", "
\n", "
\n", "\n", "
\n", "\n", "\n", "
and
\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "
\n", "\n", "
\n", " phrase Pred\n", " VP\n", "
\n", "
\n", "\n", "
\n", "\n", "\n", "
bind
\n", "
qal
\n", "
wayq
\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "
\n", "\n", "
\n", " phrase Cmpl\n", " PP\n", "
\n", "
\n", "\n", "
\n", "\n", "\n", "
upon
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "
Samaria
\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "sentenceShow = 1200001\n", "B.pretty(sentenceShow)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:18:09.107442Z", "start_time": "2018-05-18T09:18:09.095951Z" }, "scrolled": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "
\n", " \n", "
\n", "\n", "
\n", "\n", "
\n", " sentence \n", "
\n", "
\n", "\n", "
\n", "\n", "
\n", " clause Adju\n", " xQt0\n", "
\n", "
\n", "\n", "
\n", "\n", "
\n", " phrase Conj\n", " CP\n", "
\n", "
\n", "\n", "
\n", "\n", "\n", "
upon
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "
<relative>
\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "
\n", "\n", "
\n", " phrase Pred\n", " VP\n", "
\n", "
\n", "\n", "
\n", "\n", "\n", "
be unfaithful
\n", "
qal
\n", "
perf
\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "
\n", "\n", "
\n", " phrase Cmpl\n", " PP\n", "
\n", "
\n", "\n", "
\n", "\n", "\n", "
in
\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "
\n", "\n", "
\n", " phrase Adju\n", " PP\n", "
\n", "
\n", "\n", "
\n", "\n", "\n", "
in
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "
midst
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "
son
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "
Israel
\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "
\n", "\n", "
\n", " phrase Loca\n", " PP\n", "
\n", "
\n", "\n", "
\n", "\n", "\n", "
in
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "
water
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "
quarrel
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "
Kadesh
\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "
\n", "\n", "
\n", " phrase Loca\n", " NP\n", "
\n", "
\n", "\n", "
\n", "\n", "\n", "
desert
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "
Zin
\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "
\n", "\n", "
\n", " clause Coor\n", " xQt0\n", "
\n", "
\n", "\n", "
\n", "\n", "
\n", " phrase Conj\n", " CP\n", "
\n", "
\n", "\n", "
\n", "\n", "\n", "
upon
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "
<relative>
\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "
\n", "\n", "
\n", " phrase Nega\n", " NegP\n", "
\n", "
\n", "\n", "
\n", "\n", "\n", "
not
\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "
\n", "\n", "
\n", " phrase Pred\n", " VP\n", "
\n", "
\n", "\n", "
\n", "\n", "\n", "
be holy
\n", "
piel
\n", "
perf
\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "
\n", "\n", "
\n", " phrase Objc\n", " PP\n", "
\n", "
\n", "\n", "
\n", "\n", "\n", "
<object marker>
\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "
\n", "\n", "
\n", " phrase Adju\n", " PP\n", "
\n", "
\n", "\n", "
\n", "\n", "\n", "
in
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "
midst
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "
son
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "
Israel
\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "verseShow = 1420000\n", "B.pretty(verseShow)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:18:12.160489Z", "start_time": "2018-05-18T09:18:12.150554Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "chapter\n" ] }, { "data": { "text/html": [ "Isaiah 43" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "chapterShow = 427000\n", "print(F.otype.v(chapterShow))\n", "\n", "B.pretty(chapterShow)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you need a link to shebanq for just any node:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:18:15.022571Z", "start_time": "2018-05-18T09:18:15.016639Z" } }, "outputs": [ { "data": { "text/html": [ "1_Samuel 25:29" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "million = 1000000\n", "B.shbLink(million)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Feature statistics\n", "\n", "`F`\n", "gives access to all features.\n", "Every feature has a method\n", "`freqList()`\n", "to generate a frequency list of its values, higher frequencies first.\n", "Here are the parts of speech:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:18:18.039544Z", "start_time": "2018-05-18T09:18:17.784073Z" } }, "outputs": [ { "data": { "text/plain": [ "(('subs', 125558),\n", " ('verb', 75450),\n", " ('prep', 73298),\n", " ('conj', 62737),\n", " ('nmpr', 35696),\n", " ('art', 30387),\n", " ('adjv', 10075),\n", " ('nega', 6059),\n", " ('prps', 5035),\n", " ('advb', 4603),\n", " ('prde', 2678),\n", " ('intj', 1912),\n", " ('inrg', 1303),\n", " ('prin', 1026))" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "F.sp.freqList()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Lexeme matters\n", "\n", "## Top 10 frequent verbs\n", "\n", "If we count the frequency of words, we usually mean the frequency of their\n", "corresponding lexemes.\n", "\n", "There are several methods for working with lexemes.\n", "\n", "### Method 1: counting words" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:18:22.590359Z", "start_time": "2018-05-18T09:18:22.247265Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Collecting data\n", " 0.33s Done\n", ">MR[: 5378\n", "HJH[: 3561\n", "[: 2570\n", "NTN[: 2017\n", "HLK[: 1554\n", "R>H[: 1298\n", "CM<[: 1168\n", "DBR[: 1138\n", "JCB[: 1082\n", "\n" ] } ], "source": [ "verbs = collections.Counter()\n", "indent(reset=True)\n", "info('Collecting data')\n", "\n", "for w in F.otype.s('word'):\n", " if F.sp.v(w) != 'verb': continue\n", " verbs[F.lex.v(w)] +=1\n", "\n", "info('Done')\n", "print(''.join(\n", " '{}: {}\\n'.format(verb, cnt) for (verb, cnt) in sorted(\n", " verbs.items() , key=lambda x: (-x[1], x[0]))[0:10],\n", " )\n", ") " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Method 2: counting lexemes\n", "\n", "An alternative way to do this is to use the feature `freq_lex`, defined for `lex` nodes.\n", "Now we walk the lexemes instead of the occurrences.\n", "\n", "Note that the feature `sp` (part-of-speech) is defined for nodes of type `word` as well as `lex`.\n", "Both 
also have the `lex` feature." ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:18:25.695727Z", "start_time": "2018-05-18T09:18:25.667486Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Collecting data\n", " 0.01s Done\n", ">MR[: 5378\n", "HJH[: 3561\n", "[: 2570\n", "NTN[: 2017\n", "HLK[: 1554\n", "R>H[: 1298\n", "CM<[: 1168\n", "DBR[: 1138\n", "JCB[: 1082\n", "\n" ] } ], "source": [ "verbs = collections.Counter()\n", "indent(reset=True)\n", "info('Collecting data')\n", "for w in F.otype.s('lex'):\n", " if F.sp.v(w) != 'verb': continue\n", " verbs[F.lex.v(w)] += F.freq_lex.v(w)\n", "info('Done')\n", "print(''.join(\n", " '{}: {}\\n'.format(verb, cnt) for (verb, cnt) in sorted(\n", " verbs.items() , key=lambda x: (-x[1], x[0]))[0:10],\n", " )\n", ") " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is an order of magnitude faster. In this case, that means the difference between a third of a second and a\n", "hundredth of a second, not a big gain in absolute terms.\n", "But suppose you need to run this a 1000 times in a loop.\n", "Then it is the difference between 5 minutes and 10 seconds.\n", "A five minute wait is not pleasant in interactive computing!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### A frequency mapping of lexemes\n", "\n", "We make a mapping between lexeme forms and the number of occurrences of those lexemes." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "lexeme_dict = {\n", " F.g_lex_utf8.v(n): F.freq_lex.v(n) \n", " for n in F.otype.s('word')\n", "}" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('בְּ', 15542),\n", " ('רֵאשִׁית', 51),\n", " ('בָּרָא', 48),\n", " ('אֱלֹה', 2601),\n", " ('אֵת', 10997),\n", " ('הַ', 30386),\n", " ('שָּׁמַי', 421),\n", " ('וְ', 50272),\n", " ('הָ', 30386),\n", " ('אָרֶץ', 2504)]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(lexeme_dict.items())[0:10]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Real work\n", "\n", "As a primer of real world work on lexeme distribution, have a look at James Cuénod's notebook on \n", "[Collocation MI Analysis of the Hebrew Bible](https://nbviewer.jupyter.org/github/jcuenod/hebrewCollocations/blob/master/Collocation%20MI%20Analysis%20of%20the%20Hebrew%20Bible.ipynb)\n", "\n", "It is a nice example how you collect data with TF API calls, then do research with your own methods and tools, and then use TF for presenting results.\n", "\n", "In case the name has changed, the enclosing repo is\n", "[here](https://nbviewer.jupyter.org/github/jcuenod/hebrewCollocations/tree/master/)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Lexeme distribution\n", "\n", "Let's do a bit more fancy lexeme stuff.\n", "\n", "### Hapaxes\n", "\n", "A hapax can be found by inspecting lexemes and see to how many word nodes they are linked.\n", "If that is number is one, we have a hapax.\n", "\n", "We print 10 hapaxes with their glosses." 
] }, { "cell_type": "code", "execution_count": 23, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:18:31.003571Z", "start_time": "2018-05-18T09:18:30.839888Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.17s 3072 hapaxes found\n", "No zeroes found\n", "\tPJCWN/ Pishon\n", "\tCWP[ bruise\n", "\tHRWN/ pregnancy\n", "\tZL/ Mehujael\n", "\tMXJJ>L/ Mehujael\n", "\tJBL=/ Jabal\n" ] } ], "source": [ "indent(reset=True)\n", "\n", "hapax = []\n", "zero = set()\n", "\n", "for l in F.otype.s('lex'):\n", " occs = L.d(l, otype='word')\n", " n = len(occs)\n", " if n == 0: # that's weird: should not happen\n", " zero.add(l)\n", " elif n == 1: # hapax found!\n", " hapax.append(l)\n", "\n", "info('{} hapaxes found'.format(len(hapax)))\n", "\n", "if zero:\n", " error('{} zeroes found'.format(len(zero)), tm=False)\n", "else:\n", " info('No zeroes found', tm=False)\n", "for h in hapax[0:10]:\n", " print('\\t{:<8} {}'.format(F.lex.v(h), F.gloss.v(h)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Small occurrence base\n", "\n", "The occurrence base of a lexeme are the verses, chapters and books in which occurs.\n", "Let's look for lexemes that occur in a single chapter.\n", "\n", "If a lexeme occurs in a single chapter, its slots are a subset of the slots of that chapter.\n", "So, if you go *up* from the lexeme, you encounter the chapter.\n", "\n", "Normally, lexemes occur in many chapters, and then none of them totally includes all occurrences of it,\n", "so if you go up from such lexemes, you don not find chapters.\n", "\n", "Let's check it out.\n", "\n", "Oh yes, we have already found the hapaxes, we will skip them here." ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:18:36.257701Z", "start_time": "2018-05-18T09:18:36.082461Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Finding single chapter lexemes\n", " 0.16s 450 single chapter lexemes found\n", "No chapter embedders of multiple lexemes found\n", "Genesis 4:1 QJN=/ \n", "Genesis 4:2 HBL=/ \n", "Genesis 4:18 L/\n", "Genesis 4:19 YLH/ \n", "Genesis 4:22 TWBL_QJN/\n", "Genesis 10:11 KLX=/ \n", "Genesis 14:1 >MRPL/\n", "Genesis 14:1 >RJWK/\n", "Genesis 14:1 >LSR/ \n" ] } ], "source": [ "indent(reset=True)\n", "info('Finding single chapter lexemes')\n", "\n", "singleCh = []\n", "multiple = []\n", "\n", "for l in F.otype.s('lex'):\n", " chapters = L.u(l, 'chapter')\n", " if len(chapters) == 1:\n", " if l not in hapax:\n", " singleCh.append(l)\n", " elif len(chapters) > 0: # should not happen\n", " multipleCh.append(l)\n", "\n", "info('{} single chapter lexemes found'.format(len(singleCh)))\n", "\n", "if multiple:\n", " error('{} chapter embedders of multiple lexemes found'.format(len(multiple)), tm=False)\n", "else:\n", " info('No chapter embedders of multiple lexemes found', tm=False)\n", "for s in singleCh[0:10]:\n", " print('{:<20} {:<6}'.format(\n", " '{} {}:{}'.format(*T.sectionFromNode(s)),\n", " F.lex.v(s),\n", " ))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Confined to books\n", "\n", "As a final exercise with lexemes, lets make a list of all books, and show their total number of lexemes and\n", "the number of lexemes that occur exclusively in that book." 
] }, { "cell_type": "code", "execution_count": 25, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:18:43.959960Z", "start_time": "2018-05-18T09:18:39.536067Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Making book-lexeme index\n", " 4.48s Found 9233 lexemes\n" ] } ], "source": [ "indent(reset=True)\n", "info('Making book-lexeme index')\n", "\n", "allBook = collections.defaultdict(set)\n", "allLex = set()\n", "\n", "for b in F.otype.s('book'):\n", " for w in L.d(b, 'word'):\n", " l = L.u(w, 'lex')[0]\n", " allBook[b].add(l)\n", " allLex.add(l)\n", "\n", "info('Found {} lexemes'.format(len(allLex)))" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:18:45.949852Z", "start_time": "2018-05-18T09:18:45.892985Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Finding single book lexemes\n", " 0.05s found 4226 single book lexemes\n" ] } ], "source": [ "indent(reset=True)\n", "info('Finding single book lexemes')\n", "\n", "singleBook = collections.defaultdict(lambda:0)\n", "for l in F.otype.s('lex'):\n", " book = L.u(l, 'book')\n", " if len(book) == 1:\n", " singleBook[book[0]] += 1\n", "\n", "info('found {} single book lexemes'.format(sum(singleBook.values())))" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:18:52.143337Z", "start_time": "2018-05-18T09:18:52.130385Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "book #all #own %own\n", "-----------------------------------\n", "Daniel 1121 428 38.2%\n", "1_Chronicles 2015 488 24.2%\n", "Ezra 991 199 20.1%\n", "Joshua 1175 206 17.5%\n", "Esther 472 67 14.2%\n", "Isaiah 2553 350 13.7%\n", "Numbers 1457 197 13.5%\n", "Ezekiel 1718 212 12.3%\n", "Song_of_songs 503 60 11.9%\n", "Job 1717 202 11.8%\n", "Genesis 1817 208 11.4%\n", "Nehemiah 1076 110 10.2%\n", "Psalms 2251 216 9.6%\n", "Leviticus 960 89 9.3%\n", "Judges 1210 99 8.2%\n", "Ecclesiastes 575 46 8.0%\n", "Proverbs 1356 103 7.6%\n", "Jeremiah 1949 147 7.5%\n", "2_Samuel 1304 89 6.8%\n", "1_Samuel 1256 85 6.8%\n", "2_Kings 1266 85 6.7%\n", "Exodus 1425 92 6.5%\n", "1_Kings 1291 81 6.3%\n", "Deuteronomy 1449 80 5.5%\n", "Lamentations 592 31 5.2%\n", "2_Chronicles 1411 67 4.7%\n", "Nahum 357 16 4.5%\n", "Hosea 742 33 4.4%\n", "Ruth 319 14 4.4%\n", "Habakkuk 393 17 4.3%\n", "Amos 652 27 4.1%\n", "Joel 398 14 3.5%\n", "Zechariah 726 25 3.4%\n", "Obadiah 167 5 3.0%\n", "Micah 586 16 2.7%\n", "Zephaniah 367 10 2.7%\n", "Jonah 252 5 2.0%\n", "Haggai 208 3 1.4%\n", "Malachi 314 4 1.3%\n" ] } ], "source": [ "print('{:<20}{:>5}{:>5}{:>5}\\n{}'.format(\n", " 'book', '#all', '#own', '%own',\n", " '-'*35,\n", "))\n", "booklist = []\n", "\n", "for b in F.otype.s('book'):\n", " book = T.bookName(b)\n", " a = len(allBook[b])\n", " o = singleBook.get(b, 0)\n", " p = 100 * o / a\n", " booklist.append((book, a, o, p))\n", "\n", "for x in sorted(booklist, key=lambda e: (-e[3], -e[1], e[0])):\n", " print('{:<20} {:>4} {:>4} {:>4.1f}%'.format(*x))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The book names may sound a bit unfamiliar, they are in Latin here.\n", "Later we'll see that you can also get them in English, or in Swahili." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Locality API\n", "We travel upwards and downwards, forwards and backwards through the nodes.\n", "The Locality-API (`L`) provides functions: `u()` for going up, and `d()` for going down,\n", "`n()` for going to next nodes and `p()` for going to previous nodes.\n", "\n", "These directions are indirect notions: nodes are just numbers, but by means of the\n", "`oslots` feature they are linked to slots. One node *contains* an other node, if the one is linked to a set of slots that contains the set of slots that the other is linked to.\n", "And one if next or previous to an other, if its slots follow or precede the slots of the other one.\n", "\n", "`L.u(node)` **Up** is going to nodes that embed `node`.\n", "\n", "`L.d(node)` **Down** is the opposite direction, to those that are contained in `node`.\n", "\n", "`L.n(node)` **Next** are the next *adjacent* nodes, i.e. nodes whose first slot comes immediately after the last slot of `node`.\n", "\n", "`L.p(node)` **Previous** are the previous *adjacent* nodes, i.e. nodes whose last slot comes immediately before the first slot of `node`.\n", "\n", "All these functions yield nodes of all possible otypes.\n", "By passing an optional parameter, you can restrict the results to nodes of that type.\n", "\n", "The result are ordered according to the order of things in the text.\n", "\n", "The functions return always a tuple, even if there is just one node in the result.\n", "\n", "## Going up\n", "We go from the first word to the book it contains.\n", "Note the `[0]` at the end. You expect one book, yet `L` returns a tuple. \n", "To get the only element of that tuple, you need to do that `[0]`.\n", "\n", "If you are like me, you keep forgetting it, and that will lead to weird error messages later on." ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:18:55.410034Z", "start_time": "2018-05-18T09:18:55.404051Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "426585\n" ] } ], "source": [ "firstBook = L.u(1, otype='book')[0]\n", "print(firstBook)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And let's see all the containing objects of word 3:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:18:56.772513Z", "start_time": "2018-05-18T09:18:56.766324Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "word 3 is contained in book 426585\n", "word 3 is contained in chapter 426624\n", "word 3 is contained in lex 1437405\n", "word 3 is contained in verse 1414190\n", "word 3 is contained in half_verse 606323\n", "word 3 is contained in sentence 1172209\n", "word 3 is contained in sentence_atom 1235920\n", "word 3 is contained in clause 427553\n", "word 3 is contained in clause_atom 515654\n", "word 3 is contained in phrase 651504\n", "word 3 is contained in phrase_atom 904691\n", "word 3 is contained in subphrase x\n" ] } ], "source": [ "w = 3\n", "for otype in F.otype.all:\n", " if otype == F.otype.slotType: continue\n", " up = L.u(w, otype=otype)\n", " upNode = 'x' if len(up) == 0 else up[0]\n", " print('word {} is contained in {} {}'.format(w, otype, upNode))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Going next\n", "Let's go to the next nodes of the first book." 
] }, { "cell_type": "code", "execution_count": 30, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:18:58.821681Z", "start_time": "2018-05-18T09:18:58.814893Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 28764: word first slot=28764 , last slot=28764 \n", " 923447: phrase_atom first slot=28764 , last slot=28764 \n", " 669484: phrase first slot=28764 , last slot=28764 \n", " 521793: clause_atom first slot=28764 , last slot=28768 \n", " 433543: clause first slot=28764 , last slot=28768 \n", " 609323: half_verse first slot=28764 , last slot=28771 \n", "1240568: sentence_atom first slot=28764 , last slot=28773 \n", "1176828: sentence first slot=28764 , last slot=28792 \n", "1415723: verse first slot=28764 , last slot=28777 \n", " 426674: chapter first slot=28764 , last slot=29112 \n", " 426586: book first slot=28764 , last slot=52511 \n" ] } ], "source": [ "afterFirstBook = L.n(firstBook)\n", "for n in afterFirstBook:\n", " print('{:>7}: {:<13} first slot={:<6}, last slot={:<6}'.format(\n", " n, F.otype.v(n),\n", " E.oslots.s(n)[0],\n", " E.oslots.s(n)[-1],\n", " ))\n", "secondBook = L.n(firstBook, otype='book')[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Going previous\n", "\n", "And let's see what is right before the second book." ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:19:00.163973Z", "start_time": "2018-05-18T09:19:00.154857Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 426585: book first slot=1 , last slot=28763 \n", " 426673: chapter first slot=28259 , last slot=28763 \n", "1415722: verse first slot=28746 , last slot=28763 \n", " 609322: half_verse first slot=28754 , last slot=28763 \n", "1176827: sentence first slot=28757 , last slot=28763 \n", "1240567: sentence_atom first slot=28757 , last slot=28763 \n", " 433542: clause first slot=28757 , last slot=28763 \n", " 521792: clause_atom first slot=28757 , last slot=28763 \n", " 669483: phrase first slot=28762 , last slot=28763 \n", " 923446: phrase_atom first slot=28762 , last slot=28763 \n", " 28763: word first slot=28763 , last slot=28763 \n" ] } ], "source": [ "for n in L.p(secondBook):\n", " print('{:>7}: {:<13} first slot={:<6}, last slot={:<6}'.format(\n", " n, F.otype.v(n),\n", " E.oslots.s(n)[0],\n", " E.oslots.s(n)[-1],\n", " ))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Going down" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We go to the chapters of the second book, and just count them." ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:19:02.530705Z", "start_time": "2018-05-18T09:19:02.475279Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "40\n" ] } ], "source": [ "chapters = L.d(secondBook, otype='chapter')\n", "print(len(chapters))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The first verse\n", "We pick the first verse and the first word, and explore what is above and below them." 
] }, { "cell_type": "code", "execution_count": 33, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:19:04.024679Z", "start_time": "2018-05-18T09:19:03.995207Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Node 1\n", " | UP\n", " | | 1437403 lex\n", " | | 904690 phrase_atom\n", " | | 651503 phrase\n", " | | 606323 half_verse\n", " | | 515654 clause_atom\n", " | | 427553 clause\n", " | | 1235920 sentence_atom\n", " | | 1172209 sentence\n", " | | 1414190 verse\n", " | | 426624 chapter\n", " | | 426585 book\n", " | DOWN\n", " | | \n", "Node 1414190\n", " | UP\n", " | | 426624 chapter\n", " | | 426585 book\n", " | DOWN\n", " | | 1172209 sentence\n", " | | 1235920 sentence_atom\n", " | | 427553 clause\n", " | | 515654 clause_atom\n", " | | 606323 half_verse\n", " | | 651503 phrase\n", " | | 904690 phrase_atom\n", " | | 1 word\n", " | | 2 word\n", " | | 651504 phrase\n", " | | 904691 phrase_atom\n", " | | 3 word\n", " | | 651505 phrase\n", " | | 904692 phrase_atom\n", " | | 4 word\n", " | | 606324 half_verse\n", " | | 651506 phrase\n", " | | 904693 phrase_atom\n", " | | 1300406 subphrase\n", " | | 5 word\n", " | | 6 word\n", " | | 7 word\n", " | | 8 word\n", " | | 1300407 subphrase\n", " | | 9 word\n", " | | 10 word\n", " | | 11 word\n", "Done\n" ] } ], "source": [ "for n in [1, L.u(1, otype='verse')[0]]:\n", " indent(level=0)\n", " info('Node {}'.format(n), tm=False)\n", " indent(level=1)\n", " info('UP', tm=False)\n", " indent(level=2)\n", " info('\\n'.join(['{:<15} {}'.format(u, F.otype.v(u)) for u in L.u(n)]), tm=False)\n", " indent(level=1)\n", " info('DOWN', tm=False)\n", " indent(level=2)\n", " info('\\n'.join(['{:<15} {}'.format(u, F.otype.v(u)) for u in L.d(n)]), tm=False)\n", "indent(level=0)\n", "info('Done', tm=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Text API\n", "\n", "So far, we have mainly seen nodes and their numbers, and the names of node types.\n", "You would almost forget that we are dealing with text.\n", "So let's try to see some text.\n", "\n", "In the same way as `F` gives access to feature data,\n", "`T` gives access to the text.\n", "That is also feature data, but you can tell Text-Fabric which features are specifically\n", "carrying the text, and in return Text-Fabric offers you\n", "a Text API: `T`.\n", "\n", "## Formats\n", "Hebrew text can be represented in a number of ways:\n", "\n", "* fully pointed (vocalized and accented), or consonantal,\n", "* in transliteration, phonetic transcription or in Hebrew characters,\n", "* showing the actual text or only the lexemes,\n", "* following the ketiv or the qere, at places where they deviate from each other.\n", "\n", "If you wonder where the information about text formats is stored: \n", "not in the program text-fabric, but in the data set.\n", "It has a feature `otext`, which specifies the formats and which features\n", "must be used to produce them. `otext` is the third special feature in a TF data set,\n", "next to `otype` and `oslots`. \n", "It is an optional feature. \n", "If it is absent, there will be no `T` API.\n", "\n", "Here is a list of all available formats in this data set." 
] }, { "cell_type": "code", "execution_count": 34, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:19:05.606582Z", "start_time": "2018-05-18T09:19:05.593486Z" } }, "outputs": [ { "data": { "text/plain": [ "['lex-orig-full',\n", " 'lex-orig-plain',\n", " 'lex-trans-full',\n", " 'lex-trans-plain',\n", " 'text-orig-full',\n", " 'text-orig-full-ketiv',\n", " 'text-orig-plain',\n", " 'text-phono-full',\n", " 'text-trans-full',\n", " 'text-trans-full-ketiv',\n", " 'text-trans-plain']" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sorted(T.formats)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note the `text-phono-full` format here.\n", "It does not come from the main data source `bhsa`, but from the module `phono`.\n", "Look in your data directory, find `~/github/etcbc/phono/tf/2017/otext@phono.tf`,\n", "and you'll see this format defined there." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using the formats\n", "Now let's use those formats to print out the first verse of the Hebrew Bible." ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:19:10.077589Z", "start_time": "2018-05-18T09:19:10.070503Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "lex-orig-full:\n", "\tבְּ רֵאשִׁית בָּרָא אֱלֹה אֵת הַ שָּׁמַי וְ אֵת הָ אָרֶץ \n", "lex-orig-plain:\n", "\tב ראשׁית ברא אלהים את ה שׁמים ו את ה ארץ \n", "lex-trans-full:\n", "\tB.:- R;>CIJT B.@R@> >:ELOH >;T HA- C.@MAJ W:- >;T H@- >@REY \n", "lex-trans-plain:\n", "\tB R>CJT BR> >LHJM >T H CMJM W >T H >RY \n", "text-orig-full:\n", "\tבְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃ \n", "text-orig-full-ketiv:\n", "\tבְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃ \n", "text-orig-plain:\n", "\tבראשׁית ברא אלהים את השׁמים ואת הארץ׃ \n", "text-phono-full:\n", "\tbᵊrēšˌîṯ bārˈā ʔᵉlōhˈîm ʔˌēṯ haššāmˌayim wᵊʔˌēṯ hāʔˈāreṣ . \n", "text-trans-full:\n", "\tB.:-R;>CI73JT B.@R@74> >:ELOHI92JM >;71T HA-C.@MA73JIM W:->;71T H@->@75REY00 \n", "text-trans-full-ketiv:\n", "\tB.:-R;>CI73JT B.@R@74> >:ELOHI92JM >;71T HA-C.@MA73JIM W:->;71T H@->@75REY00 \n", "text-trans-plain:\n", "\tBR>CJT BR> >LHJM >T HCMJM W>T H>RY00 \n" ] } ], "source": [ "for fmt in sorted(T.formats):\n", " print('{}:\\n\\t{}'.format(fmt, T.text(range(1,12), fmt=fmt)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we do not specify a format, the **default** format is used (`text-orig-full`)." ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:19:13.490426Z", "start_time": "2018-05-18T09:19:13.486053Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃ \n" ] } ], "source": [ "print(T.text(range(1,12)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Whole text in all formats in just 10 seconds\n", "Part of the pleasure of working with computers is that they can crunch massive amounts of data.\n", "The text of the Hebrew Bible is a piece of cake.\n", "\n", "It takes just ten seconds to have that cake and eat it. \n", "In nearly a dozen formats." 
] }, { "cell_type": "code", "execution_count": 37, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:19:27.839331Z", "start_time": "2018-05-18T09:19:18.526400Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s writing plain text of whole Bible in all formats\n", " 9.32s done 11 formats\n", "lex-orig-full\n", "בְּ רֵאשִׁית בָּרָא אֱלֹה אֵת הַ שָּׁמַי וְ אֵת הָ אָרֶץ \n", "וְ הָ אָרֶץ הָי תֹהוּ וָ בֹהוּ וְ חֹשֶׁךְ עַל פְּן תְהֹום וְ רוּחַ אֱלֹה רַחֶף עַל פְּן הַ מָּי \n", "וַ אמֶר אֱלֹה הִי אֹור וַ הִי אֹור \n", "וַ רְא אֱלֹה אֶת הָ אֹור כִּי טֹוב וַ בְדֵּל אֱלֹה בֵּין הָ אֹור וּ בֵין הַ חֹשֶׁךְ \n", "וַ קְרָא אֱלֹה לָ אֹור יֹום וְ לַ חֹשֶׁךְ קָרָא לָיְלָה וַ הִי עֶרֶב וַ הִי בֹקֶר יֹום אֶחָד \n", "\n", "lex-orig-plain\n", "ב ראשׁית ברא אלהים את ה שׁמים ו את ה ארץ \n", "ו ה ארץ היה תהו ו בהו ו חשׁך על פנה תהום ו רוח אלהים רחף על פנה ה מים \n", "ו אמר אלהים היה אור ו היה אור \n", "ו ראה אלהים את ה אור כי טוב ו בדל אלהים בין ה אור ו בין ה חשׁך \n", "ו קרא אלהים ל ה אור יום ו ל ה חשׁך קרא לילה ו היה ערב ו היה בקר יום אחד \n", "\n", "lex-trans-full\n", "B.:- R;>CIJT B.@R@> >:ELOH >;T HA- C.@MAJ W:- >;T H@- >@REY \n", "W:- H@- >@REY H@J TOHW. W@- BOHW. W:- XOCEK: :ELOH RAXEP MER >:ELOH HIJ >OWR WA- HIJ >OWR \n", "WA- R:> >:ELOH >ET H@- >OWR K.IJ VOWB WA- B:D.;L >:ELOH B.;JN H@- >OWR W.- B;JN HA- XOCEK: \n", "WA- Q:R@> >:ELOH L@- - >OWR JOWM W:- LA- - XOCEK: Q@R@> L@J:L@H WA- HIJ EX@D \n", "\n", "lex-trans-plain\n", "B R>CJT BR> >LHJM >T H CMJM W >T H >RY \n", "W H >RY HJH THW W BHW W XCK LHJM RXP MR >LHJM HJH >WR W HJH >WR \n", "W R>H >LHJM >T H >WR KJ VWB W BDL >LHJM BJN H >WR W BJN H XCK \n", "W QR> >LHJM L H >WR JWM W L H XCK QR> LJLH W HJH XD \n", "\n", "text-orig-full\n", "בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃ \n", "וְהָאָ֗רֶץ הָיְתָ֥ה תֹ֨הוּ֙ וָבֹ֔הוּ וְחֹ֖שֶׁךְ עַל־פְּנֵ֣י תְהֹ֑ום וְר֣וּחַ אֱלֹהִ֔ים מְרַחֶ֖פֶת עַל־פְּנֵ֥י הַמָּֽיִם׃ \n", "וַיֹּ֥אמֶר אֱלֹהִ֖ים יְהִ֣י אֹ֑ור וַֽיְהִי־אֹֽור׃ \n", "וַיַּ֧רְא אֱלֹהִ֛ים אֶת־הָאֹ֖ור כִּי־טֹ֑וב וַיַּבְדֵּ֣ל אֱלֹהִ֔ים בֵּ֥ין הָאֹ֖ור וּבֵ֥ין הַחֹֽשֶׁךְ׃ \n", "וַיִּקְרָ֨א אֱלֹהִ֤ים׀ לָאֹור֙ יֹ֔ום וְלַחֹ֖שֶׁךְ קָ֣רָא לָ֑יְלָה וַֽיְהִי־עֶ֥רֶב וַֽיְהִי־בֹ֖קֶר יֹ֥ום אֶחָֽד׃ פ \n", "\n", "text-orig-full-ketiv\n", "בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃ \n", "וְהָאָ֗רֶץ הָיְתָ֥ה תֹ֨הוּ֙ וָבֹ֔הוּ וְחֹ֖שֶׁךְ עַל־פְּנֵ֣י תְהֹ֑ום וְר֣וּחַ אֱלֹהִ֔ים מְרַחֶ֖פֶת עַל־פְּנֵ֥י הַמָּֽיִם׃ \n", "וַיֹּ֥אמֶר אֱלֹהִ֖ים יְהִ֣י אֹ֑ור וַֽיְהִי־אֹֽור׃ \n", "וַיַּ֧רְא אֱלֹהִ֛ים אֶת־הָאֹ֖ור כִּי־טֹ֑וב וַיַּבְדֵּ֣ל אֱלֹהִ֔ים בֵּ֥ין הָאֹ֖ור וּבֵ֥ין הַחֹֽשֶׁךְ׃ \n", "וַיִּקְרָ֨א אֱלֹהִ֤ים׀ לָאֹור֙ יֹ֔ום וְלַחֹ֖שֶׁךְ קָ֣רָא לָ֑יְלָה וַֽיְהִי־עֶ֥רֶב וַֽיְהִי־בֹ֖קֶר יֹ֥ום אֶחָֽד׃ פ \n", "\n", "text-orig-plain\n", "בראשׁית ברא אלהים את השׁמים ואת הארץ׃ \n", "והארץ היתה תהו ובהו וחשׁך על־פני תהום ורוח אלהים מרחפת על־פני המים׃ \n", "ויאמר אלהים יהי אור ויהי־אור׃ \n", "וירא אלהים את־האור כי־טוב ויבדל אלהים בין האור ובין החשׁך׃ \n", "ויקרא אלהים׀ לאור יום ולחשׁך קרא לילה ויהי־ערב ויהי־בקר יום אחד׃ פ \n", "\n", "text-phono-full\n", "bᵊrēšˌîṯ bārˈā ʔᵉlōhˈîm ʔˌēṯ haššāmˌayim wᵊʔˌēṯ hāʔˈāreṣ . \n", "wᵊhāʔˈāreṣ hāyᵊṯˌā ṯˈōhû wāvˈōhû wᵊḥˌōšeḵ ʕal-pᵊnˈê ṯᵊhˈôm wᵊrˈûₐḥ ʔᵉlōhˈîm mᵊraḥˌefeṯ ʕal-pᵊnˌê hammˈāyim . \n", "wayyˌōmer ʔᵉlōhˌîm yᵊhˈî ʔˈôr wˈayᵊhî-ʔˈôr . \n", "wayyˈar ʔᵉlōhˈîm ʔeṯ-hāʔˌôr kî-ṭˈôv wayyavdˈēl ʔᵉlōhˈîm bˌên hāʔˌôr ûvˌên haḥˈōšeḵ . \n", "wayyiqrˌā ʔᵉlōhˈîm lāʔôr yˈôm wᵊlaḥˌōšeḵ qˈārā lˈāyᵊlā wˈayᵊhî-ʕˌerev wˈayᵊhî-vˌōqer yˌôm ʔeḥˈāḏ . 
f \n", "\n", "text-trans-full\n", "B.:-R;>CI73JT B.@R@74> >:ELOHI92JM >;71T HA-C.@MA73JIM W:->;71T H@->@75REY00 \n", "W:-H@->@81REY H@J:T@71H TO33HW.03 W@-BO80HW. W:-XO73CEK: :ELOHI80JM M:RAXE73PET MER >:ELOHI73JM J:HI74J >O92WR WA45-J:HIJ&>O75WR00 \n", "WA-J.A94R:> >:ELOHI91JM >ET&H@->O73WR K.IJ&VO92WB WA-J.AB:D.;74L >:ELOHI80JM B.;71JN H@->O73WR W.-B;71JN HA-XO75CEK:00 \n", "WA-J.IQ:R@63> >:ELOHI70JM05 L@-->OWR03 JO80WM W:-LA--XO73CEK: Q@74R@> L@92J:L@H WA45-J:HIJ&EX@75D00_P \n", "\n", "text-trans-full-ketiv\n", "B.:-R;>CI73JT B.@R@74> >:ELOHI92JM >;71T HA-C.@MA73JIM W:->;71T H@->@75REY00 \n", "W:-H@->@81REY H@J:T@71H TO33HW.03 W@-BO80HW. W:-XO73CEK: :ELOHI80JM M:RAXE73PET MER >:ELOHI73JM J:HI74J >O92WR WA45-J:HIJ&>O75WR00 \n", "WA-J.A94R:> >:ELOHI91JM >ET&H@->O73WR K.IJ&VO92WB WA-J.AB:D.;74L >:ELOHI80JM B.;71JN H@->O73WR W.-B;71JN HA-XO75CEK:00 \n", "WA-J.IQ:R@63> >:ELOHI70JM05 L@-->OWR03 JO80WM W:-LA--XO73CEK: Q@74R@> L@92J:L@H WA45-J:HIJ&EX@75D00_P \n", "\n", "text-trans-plain\n", "BR>CJT BR> >LHJM >T HCMJM W>T H>RY00 \n", "WH>RY HJTH THW WBHW WXCK LHJM MRXPT MR >LHJM JHJ >WR WJHJ&>WR00 \n", "WJR> >LHJM >T&H>WR KJ&VWB WJBDL >LHJM BJN H>WR WBJN HXCK00 \n", "WJQR> >LHJM05 L>WR JWM WLXCK QR> LJLH WJHJ&XD00_P \n", "\n" ] } ], "source": [ "indent(reset=True)\n", "info('writing plain text of whole Bible in all formats')\n", "\n", "text = collections.defaultdict(list)\n", "\n", "for v in F.otype.s('verse'):\n", " words = L.d(v, 'word')\n", " for fmt in sorted(T.formats):\n", " text[fmt].append(T.text(words, fmt=fmt))\n", "\n", "info('done {} formats'.format(len(text)))\n", "\n", "for fmt in sorted(text):\n", " print('{}\\n{}\\n'.format(fmt, '\\n'.join(text[fmt][0:5])))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The full plain text\n", "We write a few formats to file, in your `Downloads` folder." ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:19:33.220077Z", "start_time": "2018-05-18T09:19:33.212947Z" } }, "outputs": [ { "data": { "text/plain": [ "{'lex-orig-full',\n", " 'lex-orig-plain',\n", " 'lex-trans-full',\n", " 'lex-trans-plain',\n", " 'text-orig-full',\n", " 'text-orig-full-ketiv',\n", " 'text-orig-plain',\n", " 'text-phono-full',\n", " 'text-trans-full',\n", " 'text-trans-full-ketiv',\n", " 'text-trans-plain'}" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "T.formats" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:19:34.250294Z", "start_time": "2018-05-18T09:19:34.156658Z" } }, "outputs": [], "source": [ "for fmt in '''\n", " text-orig-full\n", " text-phono-full\n", "'''.strip().split():\n", " with open(os.path.expanduser(f'~/Downloads/{fmt}.txt'), 'w') as f:\n", " f.write('\\n'.join(text[fmt]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Book names\n", "\n", "For Bible book names, we can use several languages.\n", "\n", "### Languages\n", "Here are the languages that we can use for book names.\n", "These languages come from the features `book@ll`, where `ll` is a two letter\n", "ISO language code. Have a look in your data directory, you can't miss them." 
] }, { "cell_type": "code", "execution_count": 40, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:19:36.977529Z", "start_time": "2018-05-18T09:19:36.969202Z" } }, "outputs": [ { "data": { "text/plain": [ "{'': {'language': 'default', 'languageEnglish': 'default'},\n", " 'am': {'language': 'ኣማርኛ', 'languageEnglish': 'amharic'},\n", " 'ar': {'language': 'العَرَبِية', 'languageEnglish': 'arabic'},\n", " 'bn': {'language': 'বাংলা', 'languageEnglish': 'bengali'},\n", " 'da': {'language': 'Dansk', 'languageEnglish': 'danish'},\n", " 'de': {'language': 'Deutsch', 'languageEnglish': 'german'},\n", " 'el': {'language': 'Ελληνικά', 'languageEnglish': 'greek'},\n", " 'en': {'language': 'English', 'languageEnglish': 'english'},\n", " 'es': {'language': 'Español', 'languageEnglish': 'spanish'},\n", " 'fa': {'language': 'فارسی', 'languageEnglish': 'farsi'},\n", " 'fr': {'language': 'Français', 'languageEnglish': 'french'},\n", " 'he': {'language': 'עברית', 'languageEnglish': 'hebrew'},\n", " 'hi': {'language': 'हिन्दी', 'languageEnglish': 'hindi'},\n", " 'id': {'language': 'Bahasa Indonesia', 'languageEnglish': 'indonesian'},\n", " 'ja': {'language': '日本語', 'languageEnglish': 'japanese'},\n", " 'ko': {'language': '한국어', 'languageEnglish': 'korean'},\n", " 'la': {'language': 'Latina', 'languageEnglish': 'latin'},\n", " 'nl': {'language': 'Nederlands', 'languageEnglish': 'dutch'},\n", " 'pa': {'language': 'ਪੰਜਾਬੀ', 'languageEnglish': 'punjabi'},\n", " 'pt': {'language': 'Português', 'languageEnglish': 'portuguese'},\n", " 'ru': {'language': 'Русский', 'languageEnglish': 'russian'},\n", " 'sw': {'language': 'Kiswahili', 'languageEnglish': 'swahili'},\n", " 'syc': {'language': 'ܠܫܢܐ ܣܘܪܝܝܐ', 'languageEnglish': 'syriac'},\n", " 'tr': {'language': 'Türkçe', 'languageEnglish': 'turkish'},\n", " 'ur': {'language': 'اُردُو', 'languageEnglish': 'urdu'},\n", " 'yo': {'language': 'èdè Yorùbá', 'languageEnglish': 'yoruba'},\n", " 'zh': {'language': '中文', 'languageEnglish': 'chinese'}}" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "T.languages" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Book names in Swahili\n", "Get the book names in Swahili." 
] }, { "cell_type": "code", "execution_count": 41, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:19:38.495048Z", "start_time": "2018-05-18T09:19:38.488011Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "426585 = Mwanzo\n", "426586 = Kutoka\n", "426587 = Mambo_ya_Walawi\n", "426588 = Hesabu\n", "426589 = Kumbukumbu_la_Torati\n", "426590 = Yoshua\n", "426591 = Waamuzi\n", "426592 = 1_Samweli\n", "426593 = 2_Samweli\n", "426594 = 1_Wafalme\n", "426595 = 2_Wafalme\n", "426596 = Isaya\n", "426597 = Yeremia\n", "426598 = Ezekieli\n", "426599 = Hosea\n", "426600 = Yoeli\n", "426601 = Amosi\n", "426602 = Obadia\n", "426603 = Yona\n", "426604 = Mika\n", "426605 = Nahumu\n", "426606 = Habakuki\n", "426607 = Sefania\n", "426608 = Hagai\n", "426609 = Zekaria\n", "426610 = Malaki\n", "426611 = Zaburi\n", "426612 = Ayubu\n", "426613 = Mithali\n", "426614 = Ruthi\n", "426615 = Wimbo_Ulio_Bora\n", "426616 = Mhubiri\n", "426617 = Maombolezo\n", "426618 = Esta\n", "426619 = Danieli\n", "426620 = Ezra\n", "426621 = Nehemia\n", "426622 = 1_Mambo_ya_Nyakati\n", "426623 = 2_Mambo_ya_Nyakati\n", "\n" ] } ], "source": [ "nodeToSwahili = ''\n", "for b in F.otype.s('book'):\n", " nodeToSwahili += '{} = {}\\n'.format(b, T.bookName(b, lang='sw'))\n", "print(nodeToSwahili)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Book nodes from Swahili\n", "OK, there they are. We copy them into a string, and do the opposite: get the nodes back.\n", "We check whether we get exactly the same nodes as the ones we started with." ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:19:40.311912Z", "start_time": "2018-05-18T09:19:40.302946Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Going from nodes to booknames and back yields the original nodes\n" ] } ], "source": [ "swahiliNames = '''\n", "Mwanzo\n", "Kutoka\n", "Mambo_ya_Walawi\n", "Hesabu\n", "Kumbukumbu_la_Torati\n", "Yoshua\n", "Waamuzi\n", "1_Samweli\n", "2_Samweli\n", "1_Wafalme\n", "2_Wafalme\n", "Isaya\n", "Yeremia\n", "Ezekieli\n", "Hosea\n", "Yoeli\n", "Amosi\n", "Obadia\n", "Yona\n", "Mika\n", "Nahumu\n", "Habakuki\n", "Sefania\n", "Hagai\n", "Zekaria\n", "Malaki\n", "Zaburi\n", "Ayubu\n", "Mithali\n", "Ruthi\n", "Wimbo_Ulio_Bora\n", "Mhubiri\n", "Maombolezo\n", "Esta\n", "Danieli\n", "Ezra\n", "Nehemia\n", "1_Mambo_ya_Nyakati\n", "2_Mambo_ya_Nyakati\n", "'''.strip().split()\n", "\n", "swahiliToNode = ''\n", "for nm in swahiliNames:\n", " swahiliToNode += '{} = {}\\n'.format(T.bookNode(nm, lang='sw'), nm)\n", " \n", "if swahiliToNode != nodeToSwahili:\n", " print('Something is not right with the book names')\n", "else:\n", " print('Going from nodes to booknames and back yields the original nodes')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Sections\n", "\n", "A section in the Hebrew bible is a book, a chapter or a verse.\n", "Knowledge of sections is not baked into Text-Fabric. \n", "The config feature `otext.tf` may specify three section levels, and tell\n", "what the corresponding node types and features are.\n", "\n", "From that knowledge it can construct mappings from nodes to sections, e.g. 
from verse\n", "nodes to tuples of the form:\n", "\n", "    (bookName, chapterNumber, verseNumber)\n", "\n", "Here are examples of getting the section that corresponds to a node, and vice versa.\n", "\n", "**NB:** `sectionFromNode` always delivers a verse specification, either from the\n", "first slot belonging to that node, or, if `lastSlot=True`, from the last slot\n", "belonging to that node." ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:19:43.056511Z", "start_time": "2018-05-18T09:19:43.043552Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "section of first word ('Genesis', 1, 1)\n", "node of Gen 1:1 1414190\n", "idem 1414190\n", "node of book Genesis 426585\n", "node of Genesis 1 426624\n", "section of book node ('Jeremiah', 36, 13)\n", "idem, now last word ('Jeremiah', 36, 13)\n", "section of chapter node ('Jeremiah', 36, 21)\n", "idem, now last word ('Jeremiah', 36, 21)\n" ] } ], "source": [ "for x in (\n", " ('section of first word', T.sectionFromNode(1) ),\n", " ('node of Gen 1:1', T.nodeFromSection(('Genesis', 1, 1)) ),\n", " ('idem', T.nodeFromSection(('Mwanzo', 1, 1), lang='sw') ),\n", " ('node of book Genesis', T.nodeFromSection(('Genesis',)) ),\n", " ('node of Genesis 1', T.nodeFromSection(('Genesis', 1)) ),\n", " ('section of book node', T.sectionFromNode(1367534) ),\n", " ('idem, now last word', T.sectionFromNode(1367534, lastSlot=True) ),\n", " ('section of chapter node', T.sectionFromNode(1367573) ),\n", " ('idem, now last word', T.sectionFromNode(1367573, lastSlot=True) ),\n", "): print('{:<30} {}'.format(*x))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Sentences spanning multiple verses\n", "If you go up from a sentence node, you expect to find a verse node.\n", "But some sentences span multiple verses, and in that case you will not find an enclosing\n", "verse node, because it is not there.\n", "\n", "Here is a piece of code to detect and list all cases where sentences span multiple verses.\n", "\n", "The idea is to pick the first and the last word of a sentence, use `T.sectionFromNode` to\n", "discover the verse in which each of them occurs, and if those verses differ: bingo!\n", "\n", "We show the first 10 of ca. 900 cases." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By the way: doing this in the `2016` version of the data yields 915 results.\n", "The splitting up of the text into sentences is not carved in stone!"
] }, { "cell_type": "code", "execution_count": 44, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:19:53.984718Z", "start_time": "2018-05-18T09:19:49.190240Z" }, "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Get sentences that span multiple verses\n", " 4.88s Found 892 cases\n", " 4.88s \n", "Genesis 1:17-18\n", "Genesis 1:29-30\n", "Genesis 2:4-7\n", "Genesis 7:2-3\n", "Genesis 7:8-9\n", "Genesis 7:13-14\n", "Genesis 9:9-10\n", "Genesis 10:11-12\n", "Genesis 10:13-14\n", "Genesis 10:15-18\n" ] } ], "source": [ "indent(reset=True)\n", "info('Get sentences that span multiple verses')\n", "\n", "spanSentences = []\n", "for s in F.otype.s('sentence'):\n", " f = T.sectionFromNode(s, lastSlot=False)\n", " l = T.sectionFromNode(s, lastSlot=True)\n", " if f != l:\n", " spanSentences.append('{} {}:{}-{}'.format(f[0], f[1], f[2], l[2]))\n", "\n", "info('Found {} cases'.format(len(spanSentences)))\n", "info('\\n{}'.format('\\n'.join(spanSentences[0:10])))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A different way, with better display, is:" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:19:59.897561Z", "start_time": "2018-05-18T09:19:58.291284Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Get sentences that span multiple verses\n", " 1.64s Found 892 cases\n" ] }, { "data": { "text/markdown": [ "n | sentence | verse | verse\n", "--- | --- | --- | ---\n", "1 | וַיִּתֵּ֥ן אֹתָ֛ם אֱלֹהִ֖ים בִּרְקִ֣יעַ הַשָּׁמָ֑יִם לְהָאִ֖יר עַל־הָאָֽרֶץ׃ וְלִמְשֹׁל֙ בַּיֹּ֣ום וּבַלַּ֔יְלָה וּֽלֲהַבְדִּ֔יל בֵּ֥ין הָאֹ֖ור וּבֵ֣ין הַחֹ֑שֶׁךְ | Genesis 1:17 | Genesis 1:18\n", "2 | הִנֵּה֩ נָתַ֨תִּי לָכֶ֜ם אֶת־כָּל־עֵ֣שֶׂב׀ זֹרֵ֣עַ זֶ֗רַע אֲשֶׁר֙ עַל־פְּנֵ֣י כָל־הָאָ֔רֶץ וְאֶת־כָּל־הָעֵ֛ץ אֲשֶׁר־בֹּ֥ו פְרִי־עֵ֖ץ זֹרֵ֣עַ זָ֑רַע וּֽלְכָל־חַיַּ֣ת הָ֠אָרֶץ וּלְכָל־עֹ֨וף הַשָּׁמַ֜יִם וּלְכֹ֣ל׀ רֹומֵ֣שׂ עַל־הָאָ֗רֶץ אֲשֶׁר־בֹּו֙ נֶ֣פֶשׁ חַיָּ֔ה אֶת־כָּל־יֶ֥רֶק עֵ֖שֶׂב לְאָכְלָ֑ה | Genesis 1:29 | Genesis 1:30\n", "3 | בְּיֹ֗ום עֲשֹׂ֛ות יְהוָ֥ה אֱלֹהִ֖ים אֶ֥רֶץ וְשָׁמָֽיִם׃ וַיִּיצֶר֩ יְהוָ֨ה אֱלֹהִ֜ים אֶת־הָֽאָדָ֗ם עָפָר֙ מִן־הָ֣אֲדָמָ֔ה | Genesis 2:4 | Genesis 2:7\n", "4 | מִכֹּ֣ל׀ הַבְּהֵמָ֣ה הַטְּהֹורָ֗ה תִּֽקַּח־לְךָ֛ שִׁבְעָ֥ה שִׁבְעָ֖ה אִ֣ישׁ וְאִשְׁתֹּ֑ו וּמִן־הַבְּהֵמָ֡ה אֲ֠שֶׁר לֹ֣א טְהֹרָ֥ה הִ֛וא שְׁנַ֖יִם אִ֥ישׁ וְאִשְׁתֹּֽו׃ גַּ֣ם מֵעֹ֧וף הַשָּׁמַ֛יִם שִׁבְעָ֥ה שִׁבְעָ֖ה זָכָ֣ר וּנְקֵבָ֑ה לְחַיֹּ֥ות זֶ֖רַע עַל־פְּנֵ֥י כָל־הָאָֽרֶץ׃ | Genesis 7:2 | Genesis 7:3\n", "5 | מִן־הַבְּהֵמָה֙ הַטְּהֹורָ֔ה וּמִן־הַ֨בְּהֵמָ֔ה אֲשֶׁ֥ר אֵינֶ֖נָּה טְהֹרָ֑ה וּמִ֨ן־הָעֹ֔וף וְכֹ֥ל אֲשֶׁר־רֹמֵ֖שׂ עַל־הָֽאֲדָמָֽה׃ שְׁנַ֨יִם שְׁנַ֜יִם בָּ֧אוּ אֶל־נֹ֛חַ אֶל־הַתֵּבָ֖ה זָכָ֣ר וּנְקֵבָ֑ה כַּֽאֲשֶׁ֛ר צִוָּ֥ה אֱלֹהִ֖ים אֶת־נֹֽחַ׃ | Genesis 7:8 | Genesis 7:9\n", "6 | בְּעֶ֨צֶם הַיֹּ֤ום הַזֶּה֙ בָּ֣א נֹ֔חַ וְשֵׁם־וְחָ֥ם וָיֶ֖פֶת בְּנֵי־נֹ֑חַ וְאֵ֣שֶׁת נֹ֗חַ וּשְׁלֹ֧שֶׁת נְשֵֽׁי־בָנָ֛יו אִתָּ֖ם אֶל־הַתֵּבָֽה׃ הֵ֜מָּה וְכָל־הַֽחַיָּ֣ה לְמִינָ֗הּ וְכָל־הַבְּהֵמָה֙ לְמִינָ֔הּ וְכָל־הָרֶ֛מֶשׂ הָרֹמֵ֥שׂ עַל־הָאָ֖רֶץ לְמִינֵ֑הוּ וְכָל־הָעֹ֣וף לְמִינֵ֔הוּ כֹּ֖ל צִפֹּ֥ור כָּל־כָּנָֽף׃ | Genesis 7:13 | Genesis 7:14\n", "7 | וַאֲנִ֕י הִנְנִ֥י מֵקִ֛ים אֶת־בְּרִיתִ֖י אִתְּכֶ֑ם וְאֶֽת־זַרְעֲכֶ֖ם אַֽחֲרֵיכֶֽם׃ וְאֵ֨ת כָּל־נֶ֤פֶשׁ הַֽחַיָּה֙ אֲשֶׁ֣ר אִתְּכֶ֔ם בָּעֹ֧וף בַּבְּהֵמָ֛ה וּֽבְכָל־חַיַּ֥ת הָאָ֖רֶץ אִתְּכֶ֑ם מִכֹּל֙ יֹצְאֵ֣י הַתֵּבָ֔ה לְכֹ֖ל חַיַּ֥ת הָאָֽרֶץ׃ | Genesis 9:9 | Genesis 9:10\n", "8 | וַיִּ֨בֶן֙ אֶת־נִ֣ינְוֵ֔ה וְאֶת־רְחֹבֹ֥ת עִ֖יר וְאֶת־כָּֽלַח׃ וְֽאֶת־רֶ֔סֶן בֵּ֥ין 
נִֽינְוֵ֖ה וּבֵ֣ין כָּ֑לַח | Genesis 10:11 | Genesis 10:12\n", "9 | וּמִצְרַ֡יִם יָלַ֞ד אֶת־לוּדִ֧ים וְאֶת־עֲנָמִ֛ים וְאֶת־לְהָבִ֖ים וְאֶת־נַפְתֻּחִֽים׃ וְֽאֶת־פַּתְרֻסִ֞ים וְאֶת־כַּסְלֻחִ֗ים אֲשֶׁ֨ר יָצְא֥וּ מִשָּׁ֛ם פְּלִשְׁתִּ֖ים וְאֶת־כַּפְתֹּרִֽים׃ ס | Genesis 10:13 | Genesis 10:14\n", "10 | וּכְנַ֗עַן יָלַ֛ד אֶת־צִידֹ֥ן בְּכֹרֹ֖ו וְאֶת־חֵֽת׃ וְאֶת־הַיְבוּסִי֙ וְאֶת־הָ֣אֱמֹרִ֔י וְאֵ֖ת הַגִּרְגָּשִֽׁי׃ וְאֶת־הַֽחִוִּ֥י וְאֶת־הַֽעַרְקִ֖י וְאֶת־הַסִּינִֽי׃ וְאֶת־הָֽאַרְוָדִ֥י וְאֶת־הַצְּמָרִ֖י וְאֶת־הַֽחֲמָתִ֑י | Genesis 10:15 | Genesis 10:18" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "indent(reset=True)\n", "info('Get sentences that span multiple verses')\n", "\n", "spanSentences = []\n", "for s in F.otype.s('sentence'):\n", " words = L.d(s, otype='word')\n", " fw = words[0]\n", " lw = words[-1]\n", " fVerse = L.u(fw, otype='verse')[0]\n", " lVerse = L.u(lw, otype='verse')[0]\n", " if fVerse != lVerse:\n", " spanSentences.append((s, fVerse, lVerse))\n", "\n", "info('Found {} cases'.format(len(spanSentences)))\n", "B.table(spanSentences, end=10, linked=2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can zoom in:" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:20:03.251841Z", "start_time": "2018-05-18T09:20:03.227631Z" } }, "outputs": [ { "data": { "text/markdown": [ "\n", "##### Result 6\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "
\n", " \n", "
\n", "\n", "
\n", "\n", "
sentence \n", "
\n", "
\n", "\n", "
\n", "\n", "
clause NA\n", " xQtX\n", "
\n", "
\n", "\n", "
\n", "\n", "
phrase Time\n", " PP\n", "
\n", "
\n", "\n", "
\n", "\n", "\n", "
in
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "
bone
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "
the
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "
day
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "
the
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "
this
\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "
\n", "\n", "
phrase Pred\n", " VP\n", "
\n", "
\n", "\n", "
\n", "\n", "\n", "
come
\n", "
qal
\n", "
perf
\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "
\n", "\n", "
phrase Subj\n", " PrNP\n", "
\n", "
\n", "\n", "
\n", "\n", "\n", "
Noah
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "
and
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "
Shem
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "
and
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "
Ham
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "
and
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "
Japheth
\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "
\n", "\n", "
phrase Subj\n", " PrNP\n", "
\n", "
\n", "\n", "
\n", "\n", "\n", "
son
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "
Noah
\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "
\n", "\n", "
phrase Subj\n", " PrNP\n", "
\n", "
\n", "\n", "
\n", "\n", "\n", "
and
\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "
\n", "\n", "
phrase Subj\n", " PrNP\n", "
\n", "
\n", "\n", "
\n", "\n", "\n", "
woman
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "
Noah
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "
and
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "
three
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "
woman
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "
son
\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "
\n", "\n", "
phrase Subj\n", " PrNP\n", "
\n", "
\n", "\n", "
\n", "\n", "\n", "
together with
\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "
\n", "\n", "
phrase Cmpl\n", " PP\n", "
\n", "
\n", "\n", "
\n", "\n", "\n", "
to
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "
the
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "
ark
\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "B.show(spanSentences, condensed=False, start=6, end=6)" ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:20:05.180376Z", "start_time": "2018-05-18T09:20:05.164837Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "
\n", "\n", "
sentence \n", "
\n", "
\n", "\n", "
\n", "\n", "
clause NA\n", " xQtX\n", "
\n", "
\n", "\n", "
\n", "\n", "
phrase Time\n", " PP\n", "
\n", "
\n", "\n", "
\n", "\n", "\n", "
in
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "
bone
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "
the
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "
day
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "
the
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "
this
\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "
\n", "\n", "
phrase Pred\n", " VP\n", "
\n", "
\n", "\n", "
\n", "\n", "\n", "
come
\n", "
qal
\n", "
perf
\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "
\n", "\n", "
phrase Subj\n", " PrNP\n", "
\n", "
\n", "\n", "
\n", "\n", "\n", "
Noah
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "
and
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "
Shem
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "
and
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "
Ham
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "
and
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "
Japheth
\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "
\n", "\n", "
phrase Subj\n", " PrNP\n", "
\n", "
\n", "\n", "
\n", "\n", "\n", "
son
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "
Noah
\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "
\n", "\n", "
phrase Subj\n", " PrNP\n", "
\n", "
\n", "\n", "
\n", "\n", "\n", "
and
\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "
\n", "\n", "
phrase Subj\n", " PrNP\n", "
\n", "
\n", "\n", "
\n", "\n", "\n", "
woman
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "
Noah
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "
and
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "
three
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "
woman
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "
son
\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "
\n", "\n", "
phrase Subj\n", " PrNP\n", "
\n", "
\n", "\n", "
\n", "\n", "\n", "
together with
\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "
\n", "\n", "
phrase Cmpl\n", " PP\n", "
\n", "
\n", "\n", "
\n", "\n", "\n", "
to
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "
the
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "
ark
\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "
\n", "\n", "
clause NA\n", " Ellp\n", "
\n", "
\n", "\n", "
\n", "\n", "
phrase Subj\n", " PPrP\n", "
\n", "
\n", "\n", "
\n", "\n", "\n", "
they
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "
and
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "
whole
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "
the
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "
wild animal
\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "
\n", "\n", "
phrase Subj\n", " PPrP\n", "
\n", "
\n", "\n", "
\n", "\n", "\n", "
to
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "
kind
\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "
\n", "\n", "
phrase Subj\n", " PPrP\n", "
\n", "
\n", "\n", "
\n", "\n", "\n", "
and
\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "
\n", "\n", "
phrase Subj\n", " PPrP\n", "
\n", "
\n", "\n", "
\n", "\n", "\n", "
whole
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "
the
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "
cattle
\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "
\n", "\n", "
phrase Subj\n", " PPrP\n", "
\n", "
\n", "\n", "
\n", "\n", "\n", "
to
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "
kind
\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "
\n", "\n", "
phrase Subj\n", " PPrP\n", "
\n", "
\n", "\n", "
\n", "\n", "\n", "
and
\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "
\n", "\n", "
phrase Subj\n", " PPrP\n", "
\n", "
\n", "\n", "
\n", "\n", "\n", "
whole
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "
the
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "
creeping animals
\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "
\n", "\n", "
clause Attr\n", " Ptcp\n", "
\n", "
\n", "\n", "
\n", "\n", "
phrase Rela\n", " CP\n", "
\n", "
\n", "\n", "
\n", "\n", "\n", "
the
\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "
\n", "\n", "
phrase PreC\n", " VP\n", "
\n", "
\n", "\n", "
\n", "\n", "\n", "
creep
\n", "
qal
\n", "
ptca
\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "
\n", "\n", "
phrase Cmpl\n", " PP\n", "
\n", "
\n", "\n", "
\n", "\n", "\n", "
upon
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "
the
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "
earth
\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "
\n", "\n", "
clause NA\n", " Ellp\n", "
\n", "
\n", "\n", "
\n", "\n", "
phrase Subj\n", " PPrP\n", "
\n", "
\n", "\n", "
\n", "\n", "\n", "
to
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "
kind
\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "
\n", "\n", "
phrase Subj\n", " PPrP\n", "
\n", "
\n", "\n", "
\n", "\n", "\n", "
and
\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "
\n", "\n", "
phrase Subj\n", " PPrP\n", "
\n", "
\n", "\n", "
\n", "\n", "\n", "
whole
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "
the
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "
birds
\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "
\n", "\n", "
phrase Subj\n", " PPrP\n", "
\n", "
\n", "\n", "
\n", "\n", "\n", "
to
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "
kind
\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "
\n", "\n", "
phrase Subj\n", " PPrP\n", "
\n", "
\n", "\n", "
\n", "\n", "\n", "
whole
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "
bird
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "
whole
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "
wing
\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "B.pretty(spanSentences[5][0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Ketiv Qere\n", "Let us explore where Ketiv/Qere pairs are and how they render." ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:20:09.687854Z", "start_time": "2018-05-18T09:20:09.498982Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1892 qeres\n", "3897: ketiv = \"*HWY>\"+\" \" qere = \"HAJ:Y;74>\"+\" \"\n", "4420: ketiv = \"*>HLH\"+\" \" qere = \">@H:@LO75W\"+\"00\"\n", "5645: ketiv = \"*>HLH\"+\" \" qere = \">@H:@LO92W\"+\" \"\n", "5912: ketiv = \"*>HLH\"+\" \" qere = \">@95H:@LOW03\"+\" \"\n", "6246: ketiv = \"*YBJJM\"+\" \" qere = \"Y:BOWJI80m\"+\" \"\n", "6354: ketiv = \"*YBJJM\"+\" \" qere = \"Y:BOWJI80m\"+\" \"\n", "11761: ketiv = \"*W-\"+\"\" qere = \"WA\"+\"\"\n", "11762: ketiv = \"*JJFM\"+\" \" qere = \"J.W.FA70m\"+\" \"\n", "12783: ketiv = \"*GJJM\"+\" \" qere = \"GOWJIm03\"+\" \"\n", "13684: ketiv = \"*YJDH\"+\" \" qere = \"Y@75JID\"+\"00\"\n" ] } ], "source": [ "qeres = [w for w in F.otype.s('word') if F.qere.v(w) != None]\n", "print('{} qeres'.format(len(qeres)))\n", "for w in qeres[0:10]:\n", " print('{}: ketiv = \"{}\"+\"{}\" qere = \"{}\"+\"{}\"'.format(\n", " w, F.g_word.v(w), F.trailer.v(w), F.qere.v(w), F.qere_trailer.v(w),\n", " ))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Show a ketiv-qere pair\n", "Let us print all text representations of the verse in which word node 4419 occurs." ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:20:11.158371Z", "start_time": "2018-05-18T09:20:11.149950Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Genesis 9:21\n", "text-orig-full וַיֵּ֥שְׁתְּ מִן־הַיַּ֖יִן וַיִּשְׁכָּ֑ר וַיִּתְגַּ֖ל בְּתֹ֥וךְ אָהֳלֹֽו׃\n", "text-orig-full-ketiv וַיֵּ֥שְׁתְּ מִן־הַיַּ֖יִן וַיִּשְׁכָּ֑ר וַיִּתְגַּ֖ל בְּתֹ֥וךְ אהלה \n", "text-orig-plain וישׁת מן־היין וישׁכר ויתגל בתוך אהלה \n", "text-phono-full wayyˌēšt min-hayyˌayin wayyiškˈār wayyiṯgˌal bᵊṯˌôḵ *ʔohᵒlˈô .\n", "text-trans-full WA-J.;71C:T.: MIN&HA-J.A73JIN WA-J.IC:K.@92R WA-J.IT:G.A73L B.:-TO71WK: >@H:@LO75W00\n", "text-trans-full-ketiv WA-J.;71C:T.: MIN&HA-J.A73JIN WA-J.IC:K.@92R WA-J.IT:G.A73L B.:-TO71WK: *>HLH \n", "text-trans-plain WJCT MN&HJJN WJCKR WJTGL BTWK >HLH \n" ] } ], "source": [ "refWord = 4419\n", "vn = L.u(refWord, otype='verse')[0]\n", "ws = L.d(vn, otype='word')\n", "print('{} {}:{}'.format(*T.sectionFromNode(refWord)))\n", "for fmt in sorted(T.formats):\n", " if fmt.startswith('text-'):\n", " print('{:<25} {}'.format(fmt, T.text(ws, fmt=fmt)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Edge features: mother\n", "\n", "We have not talked about edges much. 
If the nodes correspond to the rows in the big spreadsheet,\n", "the edges point from one row to another.\n", "\n", "One edge we have encountered: the special feature `oslots`.\n", "Each non-slot node is linked by `oslots` to all of its slot nodes.\n", "\n", "An edge is really a feature as well.\n", "Whereas a node feature is a column of information,\n", "one cell per node,\n", "an edge feature is also a column of information, one cell per pair of nodes.\n", "\n", "Linguists use more relationships between textual objects, for example:\n", "linguistic dependency.\n", "In the BHSA all cases of linguistic dependency are coded in the edge feature `mother`.\n", "\n", "Let us do a few basic enquiries on an edge feature:\n", "[mother](https://etcbc.github.io/bhsa/features/hebrew/2017/mother).\n", "\n", "We count how many mothers nodes can have (it turns out to be 0 or 1).\n", "We walk through all nodes, retrieve the mother nodes of each node, and\n", "store the lengths (if non-zero) in a dictionary (`motherLen`).\n", "\n", "We see that nodes have at most one mother.\n", "\n", "We also count the inverse relationship: daughters." ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:20:24.066854Z", "start_time": "2018-05-18T09:20:20.609907Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 16s Counting mothers\n", " 19s 182159 nodes have mothers\n", " 19s 144059 nodes have daughters\n", "mothers Counter({1: 182159})\n", "daughters Counter({1: 117926, 2: 17408, 3: 6272, 4: 1843, 5: 462, 6: 122, 7: 21, 8: 5})\n" ] } ], "source": [ "info('Counting mothers')\n", "\n", "motherLen = {}\n", "daughterLen = {}\n", "\n", "for c in N():\n", " lms = E.mother.f(c) or []\n", " lds = E.mother.t(c) or []\n", " nms = len(lms)\n", " nds = len(lds)\n", " if nms: motherLen[c] = nms\n", " if nds: daughterLen[c] = nds\n", "\n", "info('{} nodes have mothers'.format(len(motherLen)))\n", "info('{} nodes have daughters'.format(len(daughterLen)))\n", "\n", "motherCount = collections.Counter()\n", "daughterCount = collections.Counter()\n", "\n", "for (n, lm) in motherLen.items(): motherCount[lm] += 1\n", "for (n, ld) in daughterLen.items(): daughterCount[ld] += 1\n", "\n", "print('mothers', motherCount)\n", "print('daughters', daughterCount)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Next steps\n", "\n", "By now you have an impression of how to compute around in the Hebrew Bible.\n", "While this is still the beginning, I hope you already sense the power of unlimited programmatic access\n", "to all the bits and bytes in the data set.\n", "\n", "Here are a few directions for unleashing that power.\n", "\n", "## Explore additional data\n", "The ETCBC has a few other repositories with data that work in conjunction with the BHSA data.\n", "One of them you have already seen:\n", "[phono](https://github.com/ETCBC/phono),\n", "for phonetic transcriptions.\n", "\n", "There is also\n", "[parallels](https://github.com/ETCBC/parallels)\n", "for detecting parallel passages,\n", "and\n", "[valence](https://github.com/ETCBC/valence)\n", "for studying patterns around verbs that determine their meanings.\n", "\n", "## Add your own data\n", "If you study the additional data, you can observe how that data is created and also\n", "how it is turned into a Text-Fabric data module.\n", "The last step is incredibly easy. 
You can write out every Python dictionary whose keys are node numbers\n", "and whose values are strings or numbers as a Text-Fabric feature\n", "(a sketch follows below, just before the MQL export).\n", "When you are creating data, you have already constructed those dictionaries, so writing\n", "them out is just one method call.\n", "See for example how the\n", "[flowchart](https://github.com/ETCBC/valence/blob/master/programs/flowchart.ipynb#Add-sense-feature-to-valence-module)\n", "notebook in valence writes out verb sense data.\n", "![flow](images/valence.png)\n", "\n", "You can then easily share your new features on GitHub, so that your colleagues everywhere\n", "can try them out for themselves." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Export to Emdros MQL\n", "\n", "[EMDROS](http://emdros.org), written by Ulrik Petersen,\n", "is a text database system with the powerful *topographic* query language MQL.\n", "The ideas are based on a model devised by Crist-Jan Doedens in\n", "[Text Databases: One Database Model and Several Retrieval Languages](https://books.google.nl/books?id=9ggOBRz1dO4C).\n", "\n", "Text-Fabric's model of slots, nodes and edges is a fairly straightforward translation of the models of Doedens and Petersen.\n", "\n", "[SHEBANQ](https://shebanq.ancient-data.org) uses EMDROS to let users execute and save MQL queries against the Hebrew Text Database of the ETCBC.\n", "\n", "So it is both natural and convenient to be able to work with a Text-Fabric resource through MQL.\n", "\n", "If you have obtained an MQL dataset somehow, you can turn it into a Text-Fabric data set with `importMQL()`,\n", "which we will not show here.\n", "\n", "And if you want to export a Text-Fabric data set to MQL, that is also possible.\n", "\n", "After the `Fabric(modules=...)` call, you can call `exportMQL()` to save all features of the\n", "indicated modules into one big MQL dump, which can be imported by an EMDROS database."
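] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But first the sketch of saving your own feature, promised above.\n", "It is hypothetical: the feature name `sense`, the node numbers, and the values are all made up here,\n", "and we assume the `TF.save()` method with per-feature metadata:\n", "\n", "```python\n", "# a node feature is just a dictionary: node number -> value\n", "sense = {\n", "    651573: 'lead',   # hypothetical phrase nodes ...\n", "    651574: 'carry',  # ... with hypothetical senses\n", "}\n", "\n", "metaData = {\n", "    'sense': {\n", "        'valueType': 'str',\n", "        'description': 'verbal sense (hypothetical example)',\n", "    },\n", "}\n", "\n", "# one method call writes the feature out as a sense.tf file\n", "TF.save(nodeFeatures={'sense': sense}, metaData=metaData)\n", "```\n", "\n", "Once written, `sense.tf` behaves like any other feature: put it in a module directory and add `sense` to your `TF.load()` call."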
] }, { "cell_type": "code", "execution_count": 39, "metadata": { "ExecuteTime": { "end_time": "2018-02-15T09:27:12.673630Z", "start_time": "2018-02-15T09:25:52.241804Z" }, "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Checking features of dataset mybhsa\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ " | 0.00s feature \"book@am\" => \"book_am\"\n", " | 0.00s feature \"book@ar\" => \"book_ar\"\n", " | 0.00s feature \"book@bn\" => \"book_bn\"\n", " | 0.00s feature \"book@da\" => \"book_da\"\n", " | 0.00s feature \"book@de\" => \"book_de\"\n", " | 0.00s feature \"book@el\" => \"book_el\"\n", " | 0.00s feature \"book@en\" => \"book_en\"\n", " | 0.00s feature \"book@es\" => \"book_es\"\n", " | 0.00s feature \"book@fa\" => \"book_fa\"\n", " | 0.00s feature \"book@fr\" => \"book_fr\"\n", " | 0.00s feature \"book@he\" => \"book_he\"\n", " | 0.00s feature \"book@hi\" => \"book_hi\"\n", " | 0.00s feature \"book@id\" => \"book_id\"\n", " | 0.00s feature \"book@ja\" => \"book_ja\"\n", " | 0.00s feature \"book@ko\" => \"book_ko\"\n", " | 0.00s feature \"book@la\" => \"book_la\"\n", " | 0.00s feature \"book@nl\" => \"book_nl\"\n", " | 0.00s feature \"book@pa\" => \"book_pa\"\n", " | 0.00s feature \"book@pt\" => \"book_pt\"\n", " | 0.00s feature \"book@ru\" => \"book_ru\"\n", " | 0.00s feature \"book@sw\" => \"book_sw\"\n", " | 0.00s feature \"book@syc\" => \"book_syc\"\n", " | 0.00s feature \"book@tr\" => \"book_tr\"\n", " | 0.00s feature \"book@ur\" => \"book_ur\"\n", " | 0.00s feature \"book@yo\" => \"book_yo\"\n", " | 0.00s feature \"book@zh\" => \"book_zh\"\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ " | 0.00s M code from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.00s M det from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.00s M dist from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.00s M dist_unit from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.00s M distributional_parent from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.00s M domain from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.00s M freq_occ from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.00s M function from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.00s M functional_parent from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.00s M g_nme from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.00s M g_nme_utf8 from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.00s M g_pfm from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.00s M g_pfm_utf8 from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.00s M g_prs from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.00s M g_prs_utf8 from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.00s M g_uvf from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.00s M g_uvf_utf8 from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.00s M g_vbe from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.00s M g_vbe_utf8 from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.00s M g_vbs from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.00s M g_vbs_utf8 from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.00s M gn from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.00s M instruction from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.00s M is_root from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.00s M kind from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.00s M kq_hybrid from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.00s M kq_hybrid_utf8 from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.00s M 
label from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.00s M languageISO from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.00s M lexeme_count from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.00s M ls from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.00s M mother_object_type from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.00s M nametype from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.00s M nme from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.00s M nu from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.01s M number from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.00s M omap@2016-2017 from /Users/dirk/github/etcbc/bhsa/tf/2017\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ " | 0.00s feature \"omap@2016-2017\" => \"omap_2016_2017\"\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ " | 0.00s M otext@phono from /Users/dirk/github/etcbc/phono/tf/2017\n", " | 0.00s M pargr from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.00s M pdp from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.00s M pfm from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.00s M prs from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.00s M prs_gn from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.00s M prs_nu from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.00s M prs_ps from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.00s M ps from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.00s M rank_lex from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.00s M rank_occ from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.00s M rela from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.00s M root from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.00s M st from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.00s M suffix_gender from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.00s M suffix_number from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.00s M suffix_person from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.00s M tab from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.00s M txt from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.00s M typ from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.00s M uvf from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.00s M vbe from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.00s M vbs from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.00s M voc_lex from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.00s M vs from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.00s M vt from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " 0.22s 114 features to export to MQL ...\n", " 0.22s Loading 114 features\n", " | 0.05s B code from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.20s B det from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.15s B dist from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.21s B dist_unit from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 1.89s B distributional_parent from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.03s B domain from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.15s B freq_occ from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.15s B function from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 2.23s B functional_parent from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.10s B g_nme from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.20s B g_nme_utf8 from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.12s B g_pfm from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.11s B g_pfm_utf8 from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.11s B g_prs from 
/Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.12s B g_prs_utf8 from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.08s B g_uvf from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.07s B g_uvf_utf8 from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.08s B g_vbe from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.08s B g_vbe_utf8 from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.07s B g_vbs from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.07s B g_vbs_utf8 from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.10s B gn from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.03s B instruction from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.03s B is_root from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.03s B kind from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.10s B kq_hybrid from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.12s B kq_hybrid_utf8 from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.03s B label from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.20s B languageISO from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.13s B lexeme_count from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.20s B ls from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.10s B mother_object_type from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.00s B nametype from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.16s B nme from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.13s B nu from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.26s B number from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.81s B omap@2016-2017 from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.03s B pargr from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.15s B pdp from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.13s B pfm from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.13s B prs from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.14s B prs_gn from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.14s B prs_nu from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.13s B prs_ps from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.13s B ps from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.09s B rank_lex from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.09s B rank_occ from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.23s B rela from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.00s B root from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.11s B st from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.13s B suffix_gender from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.13s B suffix_number from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.14s B suffix_person from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.02s B tab from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.03s B txt from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.23s B typ from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.12s B uvf from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.13s B vbe from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.13s B vbs from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.01s B voc_lex from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.14s B vs from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " | 0.17s B vt from /Users/dirk/github/etcbc/bhsa/tf/2017\n", " 12s Writing enumerations\n", "\tbook_am : 39 values, 39 not a name, e.g. «መኃልየ_መኃልይ_ዘሰሎሞን»\n", "\tbook_ar : 39 values, 39 not a name, e.g. «1_اخبار»\n", "\tbook_bn : 39 values, 39 not a name, e.g. «আদিপুস্তক»\n", "\tbook_da : 39 values, 13 not a name, e.g. «1.Kongebog»\n", "\tbook_de : 39 values, 7 not a name, e.g. 
«1_Chronik»\n", "\tbook_el : 39 values, 39 not a name, e.g. «Άσμα_Ασμάτων»\n", "\tbook_en : 39 values, 6 not a name, e.g. «1_Chronicles»\n", "\tbook_es : 39 values, 22 not a name, e.g. «1_Crónicas»\n", "\tbook_fa : 39 values, 39 not a name, e.g. «استر»\n", "\tbook_fr : 39 values, 19 not a name, e.g. «1_Chroniques»\n", "\tbook_he : 39 values, 39 not a name, e.g. «איוב»\n", "\tbook_hi : 39 values, 39 not a name, e.g. «1_इतिहास»\n", "\tbook_id : 39 values, 7 not a name, e.g. «1_Raja-raja»\n", "\tbook_ja : 39 values, 39 not a name, e.g. «アモス書»\n", "\tbook_ko : 39 values, 39 not a name, e.g. «나훔»\n", "\tbook_nl : 39 values, 8 not a name, e.g. «1_Koningen»\n", "\tbook_pa : 39 values, 39 not a name, e.g. «1_ਇਤਹਾਸ»\n", "\tbook_pt : 39 values, 21 not a name, e.g. «1_Crônicas»\n", "\tbook_ru : 39 values, 39 not a name, e.g. «1-я_Паралипоменон»\n", "\tbook_sw : 39 values, 6 not a name, e.g. «1_Mambo_ya_Nyakati»\n", "\tbook_syc : 39 values, 39 not a name, e.g. «ܐ_ܒܪܝܡܝܢ»\n", "\tbook_tr : 39 values, 16 not a name, e.g. «1_Krallar»\n", "\tbook_ur : 39 values, 39 not a name, e.g. «احبار»\n", "\tbook_yo : 39 values, 8 not a name, e.g. «Amọsi»\n", "\tbook_zh : 38 values, 37 not a name, e.g. «以斯帖记»\n", "\tdomain : 4 values, 1 not a name, e.g. «?»\n", "\tg_nme : 108 values, 108 not a name, e.g. «»\n", "\tg_nme_utf8 : 106 values, 106 not a name, e.g. «»\n", "\tg_pfm : 87 values, 87 not a name, e.g. «»\n", "\tg_pfm_utf8 : 86 values, 86 not a name, e.g. «»\n", "\tg_prs : 127 values, 127 not a name, e.g. «»\n", "\tg_prs_utf8 : 126 values, 126 not a name, e.g. «»\n", "\tg_uvf : 19 values, 19 not a name, e.g. «»\n", "\tg_uvf_utf8 : 17 values, 17 not a name, e.g. «»\n", "\tg_vbe : 101 values, 101 not a name, e.g. «»\n", "\tg_vbe_utf8 : 97 values, 97 not a name, e.g. «»\n", "\tg_vbs : 66 values, 66 not a name, e.g. «»\n", "\tg_vbs_utf8 : 65 values, 65 not a name, e.g. «»\n", "\tinstruction : 35 values, 20 not a name, e.g. «.#»\n", "\tnametype : 9 values, 4 not a name, e.g. «gens,topo»\n", "\tnme : 20 values, 7 not a name, e.g. «»\n", "\tpfm : 11 values, 4 not a name, e.g. «»\n", "\tphono_trailer : 4 values, 4 not a name, e.g. «»\n", "\tprs : 22 values, 4 not a name, e.g. «H=»\n", "\tqere_trailer : 5 values, 5 not a name, e.g. «»\n", "\tqere_trailer_utf8: 5 values, 5 not a name, e.g. «»\n", "\troot : 648 values, 187 not a name, e.g. «»\n", "\ttrailer : 12 values, 12 not a name, e.g. «»\n", "\ttrailer_utf8 : 12 values, 12 not a name, e.g. «»\n", "\ttxt : 136 values, 59 not a name, e.g. «?»\n", "\tuvf : 6 values, 1 not a name, e.g. «>»\n", "\tvbe : 19 values, 6 not a name, e.g. «»\n", "\tvbs : 11 values, 3 not a name, e.g. 
«>»\n", " | 2.23s Writing an all-in-one enum with 232 values\n", " 14s Mapping 114 features onto 13 object types\n", " 20s Writing 114 features as data in 13 object types\n", " | 0.00s word data ...\n", " | | 4.58s batch of size 46.6MB with 50000 of 50000 words\n", " | | 9.17s batch of size 46.6MB with 50000 of 100000 words\n", " | | 14s batch of size 46.8MB with 50000 of 150000 words\n", " | | 19s batch of size 46.8MB with 50000 of 200000 words\n", " | | 24s batch of size 47.0MB with 50000 of 250000 words\n", " | | 28s batch of size 47.0MB with 50000 of 300000 words\n", " | | 34s batch of size 47.2MB with 50000 of 350000 words\n", " | | 39s batch of size 47.0MB with 50000 of 400000 words\n", " | | 42s batch of size 24.9MB with 26584 of 426584 words\n", " | 42s word data: 426584 objects\n", " | 0.00s subphrase data ...\n", " | | 0.63s batch of size 7.2MB with 50000 of 50000 subphrases\n", " | | 1.26s batch of size 7.1MB with 50000 of 100000 subphrases\n", " | | 1.44s batch of size 2.0MB with 13784 of 113784 subphrases\n", " | 1.44s subphrase data: 113784 objects\n", " | 0.00s phrase_atom data ...\n", " | | 0.98s batch of size 10.9MB with 50000 of 50000 phrase_atoms\n", " | | 1.94s batch of size 10.9MB with 50000 of 100000 phrase_atoms\n", " | | 2.91s batch of size 11.0MB with 50000 of 150000 phrase_atoms\n", " | | 3.88s batch of size 11.0MB with 50000 of 200000 phrase_atoms\n", " | | 4.87s batch of size 11.0MB with 50000 of 250000 phrase_atoms\n", " | | 5.22s batch of size 3.9MB with 17519 of 267519 phrase_atoms\n", " | 5.22s phrase_atom data: 267519 objects\n", " | 0.00s phrase data ...\n", " | | 0.90s batch of size 9.8MB with 50000 of 50000 phrases\n", " | | 1.79s batch of size 9.9MB with 50000 of 100000 phrases\n", " | | 2.87s batch of size 9.9MB with 50000 of 150000 phrases\n", " | | 3.83s batch of size 9.9MB with 50000 of 200000 phrases\n", " | | 4.73s batch of size 9.9MB with 50000 of 250000 phrases\n", " | | 4.80s batch of size 649.0KB with 3187 of 253187 phrases\n", " | 4.80s phrase data: 253187 objects\n", " | 0.00s clause_atom data ...\n", " | | 1.29s batch of size 13.3MB with 50000 of 50000 clause_atoms\n", " | | 2.31s batch of size 10.8MB with 40669 of 90669 clause_atoms\n", " | 2.32s clause_atom data: 90669 objects\n", " | 0.00s clause data ...\n", " | | 1.24s batch of size 12.2MB with 50000 of 50000 clauses\n", " | | 2.21s batch of size 9.3MB with 38101 of 88101 clauses\n", " | 2.21s clause data: 88101 objects\n", " | 0.00s sentence_atom data ...\n", " | | 0.58s batch of size 6.6MB with 50000 of 50000 sentence_atoms\n", " | | 0.75s batch of size 1.9MB with 14486 of 64486 sentence_atoms\n", " | 0.75s sentence_atom data: 64486 objects\n", " | 0.00s sentence data ...\n", " | | 0.46s batch of size 5.1MB with 50000 of 50000 sentences\n", " | | 0.58s batch of size 1.4MB with 13711 of 63711 sentences\n", " | 0.58s sentence data: 63711 objects\n", " | 0.00s half_verse data ...\n", " | | 0.42s batch of size 4.5MB with 45180 of 45180 half_verses\n", " | 0.42s half_verse data: 45180 objects\n", " | 0.00s verse data ...\n", " | | 0.34s batch of size 3.4MB with 23213 of 23213 verses\n", " | 0.34s verse data: 23213 objects\n", " | 0.00s lex data ...\n", " | | 0.54s batch of size 4.7MB with 9233 of 9233 lexs\n", " | 0.54s lex data: 9233 objects\n", " | 0.00s chapter data ...\n", " | | 0.07s batch of size 110.2KB with 929 of 929 chapters\n", " | 0.07s chapter data: 929 objects\n", " | 0.00s book data ...\n", " | | 0.05s batch of size 28.3KB with 39 of 39 books\n", " | 0.05s book data: 39 
objects\n", " 1m 20s Done\n" ] } ], "source": [ "TF.exportMQL('mybhsa','~/Downloads')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now you have a file `~/Downloads/mybhsa.mql` of 530 MB.\n", "You can import it into an Emdros database by saying:\n", "\n", " cd ~/Downloads\n", " rm mybhsa.mql\n", " mql -b 3 < mybhsa.mql\n", " \n", "The result is an SQLite3 database `mybhsa` in the same directory (168 MB).\n", "You can run a query against it by creating a text file test.mql with this contents:\n", "\n", " select all objects where\n", " [lex gloss ~ 'make'\n", " [word FOCUS]\n", " ]\n", " \n", "And then say\n", "\n", " mql -b 3 -d mybhsa test.mql\n", " \n", "You will see raw query results: all word occurrences that belong to lexemes with `make` in their gloss.\n", " \n", "It is not very pretty, and probably you should use a more visual Emdros tool to run those queries.\n", "You see a lot of node numbers, but the good thing is, you can look those node numbers up in Text-Fabric." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Clean caches\n", "\n", "Text-Fabric pre-computes data for you, so that it can be loaded faster.\n", "If the original data is updated, Text-Fabric detects it, and will recompute that data.\n", "\n", "But there are cases, when the algorithms of Text-Fabric have changed, without any changes in the data, that you might\n", "want to clear the cache of precomputed results.\n", "\n", "There are two ways to do that:\n", "\n", "* Locate the `.tf` directory of your dataset, and remove all `.tfx` files in it.\n", " This might be a bit awkward to do, because the `.tf` directory is hidden on Unix-like systems.\n", "* Call `TF.clearCache()`, which does exactly the same.\n", "\n", "It is not handy to execute the following cell all the time, that's why I have commented it out.\n", "So if you really want to clear the cache, remove the comment sign below." ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# TF.clearCache()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.0" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": "block", "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }