{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n", "\n", "You might want to consider the [start](search.ipynb) of this tutorial.\n", "\n", "Short introductions to other TF datasets:\n", "\n", "* [Dead Sea Scrolls](https://nbviewer.jupyter.org/github/annotation/tutorials/blob/master/lorentz2020/dss.ipynb),\n", "* [Old Babylonian Letters](https://nbviewer.jupyter.org/github/annotation/tutorials/blob/master/lorentz2020/oldbabylonian.ipynb),\n", "or the\n", "* [Quran](https://nbviewer.jupyter.org/github/annotation/tutorials/blob/master/lorentz2020/quran.ipynb)\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%load_ext autoreload\n", "%autoreload 2" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T10:06:39.818664Z", "start_time": "2018-05-24T10:06:39.796588Z" } }, "outputs": [], "source": [ "from tf.app import use" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T10:06:41.254515Z", "start_time": "2018-05-24T10:06:41.238046Z" } }, "outputs": [], "source": [ "VERSION = \"2021\"" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "**Locating corpus resources ...**" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "app: ~/text-fabric-data/github/etcbc/bhsa/app" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "data: ~/text-fabric-data/github/etcbc/bhsa/tf/2021" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "data: ~/text-fabric-data/github/etcbc/phono/tf/2021" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "data: ~/text-fabric-data/github/etcbc/parallels/tf/2021" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", " Text-Fabric: Text-Fabric API 11.4.6, etcbc/bhsa/app v3, Search Reference
\n", " Data: etcbc - bhsa 2021, Character table, Feature docs
\n", "
Node types\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "
Name# of nodes# slots/node% coverage
book3910938.21100
chapter929459.19100
lex923046.22100
verse2321318.38100
half_verse451799.44100
sentence637176.70100
sentence_atom645146.61100
clause881314.84100
clause_atom907044.70100
phrase2532031.68100
phrase_atom2675321.59100
subphrase1138501.4238
word4265901.00100
\n", " Sets: no custom sets
\n", " Features:
\n", "
Parallel Passages\n", "
\n", "\n", "
\n", "
\n", "crossref\n", "
\n", "
int
\n", "\n", " 🆗 links between similar passages\n", "\n", "
\n", "\n", "
\n", "
\n", "\n", "
BHSA = Biblia Hebraica Stuttgartensia Amstelodamensis\n", "
\n", "\n", "
\n", "
\n", "book\n", "
\n", "
str
\n", "\n", " ✅ book name in Latin (Genesis; Numeri; Reges1; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "book@ll\n", "
\n", "
str
\n", "\n", " ✅ book name in amharic (ኣማርኛ)\n", "\n", "
\n", "\n", "
\n", "
\n", "chapter\n", "
\n", "
int
\n", "\n", " ✅ chapter number (1; 2; 3; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "code\n", "
\n", "
int
\n", "\n", " ✅ identifier of a clause atom relationship (0; 74; 367; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "det\n", "
\n", "
str
\n", "\n", " ✅ determinedness of phrase(atom) (det; und; NA.)\n", "\n", "
\n", "\n", "
\n", "
\n", "domain\n", "
\n", "
str
\n", "\n", " ✅ text type of clause (? (Unknown); N (narrative); D (discursive); Q (Quotation).)\n", "\n", "
\n", "\n", "
\n", "
\n", "freq_lex\n", "
\n", "
int
\n", "\n", " ✅ frequency of lexemes\n", "\n", "
\n", "\n", "
\n", "
\n", "function\n", "
\n", "
str
\n", "\n", " ✅ syntactic function of phrase (Cmpl; Objc; Pred; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "g_cons\n", "
\n", "
str
\n", "\n", " ✅ word consonantal-transliterated (B R>CJT BR> >LHJM ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "g_cons_utf8\n", "
\n", "
str
\n", "\n", " ✅ word consonantal-Hebrew (ב ראשׁית ברא אלהים)\n", "\n", "
\n", "\n", "
\n", "
\n", "g_lex\n", "
\n", "
str
\n", "\n", " ✅ lexeme pointed-transliterated (B.:- R;>CIJT B.@R@> >:ELOH ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "g_lex_utf8\n", "
\n", "
str
\n", "\n", " ✅ lexeme pointed-Hebrew (בְּ רֵאשִׁית בָּרָא אֱלֹה)\n", "\n", "
\n", "\n", "
\n", "
\n", "g_word\n", "
\n", "
str
\n", "\n", " ✅ word pointed-transliterated (B.:- R;>CI73JT B.@R@74> >:ELOHI92JM)\n", "\n", "
\n", "\n", "
\n", "
\n", "g_word_utf8\n", "
\n", "
str
\n", "\n", " ✅ word pointed-Hebrew (בְּ רֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים)\n", "\n", "
\n", "\n", "
\n", "
\n", "gloss\n", "
\n", "
str
\n", "\n", " 🆗 english translation of lexeme (beginning create god(s))\n", "\n", "
\n", "\n", "
\n", "
\n", "gn\n", "
\n", "
str
\n", "\n", " ✅ grammatical gender (m; f; NA; unknown.)\n", "\n", "
\n", "\n", "
\n", "
\n", "label\n", "
\n", "
str
\n", "\n", " ✅ (half-)verse label (half verses: A; B; C; verses: GEN 01,02)\n", "\n", "
\n", "\n", "
\n", "
\n", "language\n", "
\n", "
str
\n", "\n", " ✅ of word or lexeme (Hebrew; Aramaic.)\n", "\n", "
\n", "\n", "
\n", "
\n", "lex\n", "
\n", "
str
\n", "\n", " ✅ lexeme consonantal-transliterated (B R>CJT/ BR>[ >LHJM/)\n", "\n", "
\n", "\n", "
\n", "
\n", "lex_utf8\n", "
\n", "
str
\n", "\n", " ✅ lexeme consonantal-Hebrew (ב ראשׁית֜ ברא אלהים֜)\n", "\n", "
\n", "\n", "
\n", "
\n", "ls\n", "
\n", "
str
\n", "\n", " ✅ lexical set, subclassification of part-of-speech (card; ques; mult)\n", "\n", "
\n", "\n", "
\n", "
\n", "nametype\n", "
\n", "
str
\n", "\n", " ⚠️ named entity type (pers; mens; gens; topo; ppde.)\n", "\n", "
\n", "\n", "
\n", "
\n", "nme\n", "
\n", "
str
\n", "\n", " ✅ nominal ending consonantal-transliterated (absent; n/a; JM, ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "nu\n", "
\n", "
str
\n", "\n", " ✅ grammatical number (sg; du; pl; NA; unknown.)\n", "\n", "
\n", "\n", "
\n", "
\n", "number\n", "
\n", "
int
\n", "\n", " ✅ sequence number of an object within its context\n", "\n", "
\n", "\n", "
\n", "
\n", "otype\n", "
\n", "
str
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "
\n", "pargr\n", "
\n", "
str
\n", "\n", " 🆗 hierarchical paragraph number (1; 1.2; 1.2.3.4; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "pdp\n", "
\n", "
str
\n", "\n", " ✅ phrase dependent part-of-speech (art; verb; subs; nmpr, ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "pfm\n", "
\n", "
str
\n", "\n", " ✅ preformative consonantal-transliterated (absent; n/a; J, ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "prs\n", "
\n", "
str
\n", "\n", " ✅ pronominal suffix consonantal-transliterated (absent; n/a; W; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "prs_gn\n", "
\n", "
str
\n", "\n", " ✅ pronominal suffix gender (m; f; NA; unknown.)\n", "\n", "
\n", "\n", "
\n", "
\n", "prs_nu\n", "
\n", "
str
\n", "\n", " ✅ pronominal suffix number (sg; du; pl; NA; unknown.)\n", "\n", "
\n", "\n", "
\n", "
\n", "prs_ps\n", "
\n", "
str
\n", "\n", " ✅ pronominal suffix person (p1; p2; p3; NA; unknown.)\n", "\n", "
\n", "\n", "
\n", "
\n", "ps\n", "
\n", "
str
\n", "\n", " ✅ grammatical person (p1; p2; p3; NA; unknown.)\n", "\n", "
\n", "\n", "
\n", "
\n", "qere\n", "
\n", "
str
\n", "\n", " ✅ word pointed-transliterated masoretic reading correction\n", "\n", "
\n", "\n", "
\n", "
\n", "qere_trailer\n", "
\n", "
str
\n", "\n", " ✅ interword material -pointed-transliterated (Masoretic correction)\n", "\n", "
\n", "\n", "
\n", "
\n", "qere_trailer_utf8\n", "
\n", "
str
\n", "\n", " ✅ interword material -pointed-transliterated (Masoretic correction)\n", "\n", "
\n", "\n", "
\n", "
\n", "qere_utf8\n", "
\n", "
str
\n", "\n", " ✅ word pointed-Hebrew masoretic reading correction\n", "\n", "
\n", "\n", "
\n", "
\n", "rank_lex\n", "
\n", "
int
\n", "\n", " ✅ ranking of lexemes based on frequency\n", "\n", "
\n", "\n", "
\n", "
\n", "rela\n", "
\n", "
str
\n", "\n", " ✅ linguistic relation between clause/(sub)phrase(atom) (ADJ; MOD; ATR; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "sp\n", "
\n", "
str
\n", "\n", " ✅ part-of-speech (art; verb; subs; nmpr, ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "st\n", "
\n", "
str
\n", "\n", " ✅ state of a noun (a (absolute); c (construct); e (emphatic).)\n", "\n", "
\n", "\n", "
\n", "
\n", "tab\n", "
\n", "
int
\n", "\n", " ✅ clause atom: its level in the linguistic embedding\n", "\n", "
\n", "\n", "
\n", "
\n", "trailer\n", "
\n", "
str
\n", "\n", " ✅ interword material pointed-transliterated (& 00 05 00_P ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "trailer_utf8\n", "
\n", "
str
\n", "\n", " ✅ interword material pointed-Hebrew (־ ׃)\n", "\n", "
\n", "\n", "
\n", "
\n", "txt\n", "
\n", "
str
\n", "\n", " ✅ text type of clause and surrounding (repetition of ? N D Q as in feature domain)\n", "\n", "
\n", "\n", "
\n", "
\n", "typ\n", "
\n", "
str
\n", "\n", " ✅ clause/phrase(atom) type (VP; NP; Ellp; Ptcp; WayX)\n", "\n", "
\n", "\n", "
\n", "
\n", "uvf\n", "
\n", "
str
\n", "\n", " ✅ univalent final consonant consonantal-transliterated (absent; N; J; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "vbe\n", "
\n", "
str
\n", "\n", " ✅ verbal ending consonantal-transliterated (n/a; W; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "vbs\n", "
\n", "
str
\n", "\n", " ✅ root formation consonantal-transliterated (absent; n/a; H; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "verse\n", "
\n", "
int
\n", "\n", " ✅ verse number\n", "\n", "
\n", "\n", "
\n", "
\n", "voc_lex\n", "
\n", "
str
\n", "\n", " ✅ vocalized lexeme pointed-transliterated (B.: R;>CIJT BR> >:ELOHIJM)\n", "\n", "
\n", "\n", "
\n", "
\n", "voc_lex_utf8\n", "
\n", "
str
\n", "\n", " ✅ vocalized lexeme pointed-Hebrew (בְּ רֵאשִׁית ברא אֱלֹהִים)\n", "\n", "
\n", "\n", "
\n", "
\n", "vs\n", "
\n", "
str
\n", "\n", " ✅ verbal stem (qal; piel; hif; apel; pael)\n", "\n", "
\n", "\n", "
\n", "
\n", "vt\n", "
\n", "
str
\n", "\n", " ✅ verbal tense (perf; impv; wayq; infc)\n", "\n", "
\n", "\n", "
\n", "
\n", "mother\n", "
\n", "
none
\n", "\n", " ✅ linguistic dependency between textual objects\n", "\n", "
\n", "\n", "
\n", "
\n", "oslots\n", "
\n", "
none
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "
\n", "\n", "
Phonetic Transcriptions\n", "
\n", "\n", "
\n", "
\n", "phono\n", "
\n", "
str
\n", "\n", " 🆗 phonological transcription (bᵊ rēšˌîṯ bārˈā ʔᵉlōhˈîm)\n", "\n", "
\n", "\n", "
\n", "
\n", "phono_trailer\n", "
\n", "
str
\n", "\n", " 🆗 interword material in phonological transcription\n", "\n", "
\n", "\n", "
\n", "
\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Text-Fabric API: names N F E L T S C TF directly usable

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "A = use(\"ETCBC/bhsa\", hoist=globals())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Rough edges\n", "\n", "It might be helpful to peek under the hood, especially when exploring searches that go slow.\n", "\n", "If you went through the previous parts of the tutorial you have encountered cases where things come\n", "to a grinding halt.\n", "\n", "Yet we can get a hunch of what is going on, even in those cases.\n", "For that, we use the lower-level search API `S` of Text-Fabric, and not the\n", "wrappers that the high level `A` API provides.\n", "\n", "The main difference is, that `S.search()` returns a *generator* of the results,\n", "whereas `A.search()` returns a list of the results.\n", "In fact, `A.search()` calls the generator function delivered by `S.search()` as often as needed.\n", "\n", "For some queries, the fetching of results is quite costly, so costly that we do not want to fetch\n", "all results up-front. Rather we want to fetch a few, to see how it goes.\n", "In these cases, directly using `S.search()` is preferred over `A.search()`." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T08:49:43.870215Z", "start_time": "2018-05-24T08:49:43.866722Z" } }, "outputs": [], "source": [ "query = \"\"\"\n", "book\n", " chapter\n", " verse\n", " phrase det=und\n", " word lex=>LHJM/\n", "\"\"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Study\n", "\n", "First we call `S.study(query)`.\n", "\n", "The syntax will be checked, features loaded, the search space will be set up, narrowed down,\n", "and the fetching of results will be prepared, but not yet executed.\n", "\n", "In order to make the query a bit more interesting, we lift the constraint that the results must be in Genesis 1-2." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T08:49:46.451394Z", "start_time": "2018-05-24T08:49:45.192096Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Checking search template ...\n", " 0.00s Setting up search space for 5 objects ...\n", " 0.25s Constraining search space with 4 relations ...\n", " 0.29s \t2 edges thinned\n", " 0.29s Setting up retrieval plan with strategy small_choice_multi ...\n", " 0.32s Ready to deliver results from 3345 nodes\n", "Iterate over S.fetch() to get the results\n", "See S.showPlan() to interpret the results\n" ] } ], "source": [ "S.study(query)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before we rush to the results, lets have a look at the *plan*." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T08:49:49.104091Z", "start_time": "2018-05-24T08:49:49.088781Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 1.47s The results are connected to the original search template as follows:\n", " 0 \n", " 1 R0 book\n", " 2 R1 chapter\n", " 3 R2 verse\n", " 4 R3 phrase det=und\n", " 5 R4 word lex=>LHJM\n", " 6 \n" ] } ], "source": [ "S.showPlan()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here you see already what your results will look like.\n", "Each result `r` is a *tuple* of nodes:\n", "```\n", "(R0, R1, R2, R3, R4)\n", "```\n", "that instantiate the objects in your template.\n", "\n", "In case you are curious, you can get details about the search space as well:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T08:50:03.622134Z", "start_time": "2018-05-24T08:50:03.589828Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Search with 5 objects and 4 relations\n", "Results are instantiations of the following objects:\n", "node 0-book 39 choices\n", "node 1-chapter 929 choices\n", "node 2-verse 754 choices\n", "node 3-phrase 805 choices\n", "node 4-word 818 choices\n", "Performance parameters:\n", "\tyarnRatio = 1.25\n", "\ttryLimitFrom = 40\n", "\ttryLimitTo = 40\n", "Instantiations are computed along the following relations:\n", "node 0-book 39 choices\n", "edge 0-book [[ 1-chapter 23.8 choices\n", "edge 1-chapter [[ 2-verse 1.0 choices\n", "edge 2-verse [[ 3-phrase 1.1 choices (thinned)\n", "edge 3-phrase [[ 4-word 1.0 choices (thinned)\n", " 3.15s The results are connected to the original search template as follows:\n", " 0 \n", " 1 R0 book\n", " 2 R1 chapter\n", " 3 R2 verse\n", " 4 R3 phrase det=und\n", " 5 R4 word lex=>LHJM\n", " 6 \n" ] } ], "source": [ "S.showPlan(details=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The part about the *nodes* shows you how many possible instantiations for each object in your template\n", "has been found.\n", "These are not results yet, because only combinations of instantiations\n", "that satisfy all constraints are results.\n", "\n", "The constraints come from the relations between the objects that you specified.\n", "In this case, there is only an implicit relation: embedding `[[`.\n", "Later on we'll examine all\n", "[spatial relations](https://annotation.github.io/text-fabric/tf/about/searchusage.html#relational-operators).\n", "\n", "The part about the *edges* shows you the constraints,\n", "and in what order they will be computed when stitching results together.\n", "In this case the order is exactly the order by which the relations appear in the template,\n", "but that will not always be the case.\n", "Text-Fabric spends some time and ingenuity to find out an optimal *stitch plan*.\n", "Fetching results is like selecting a node, stitching it to another node with an edge,\n", "and so on, until a full stitch of nodes intersects with all the node sets from which they\n", "must be chosen (the yarns).\n", "\n", "Fetching results may take time.\n", "\n", "For some queries, it can take a large amount of time to walk through all results.\n", "Even worse, it may happen that it takes a large amount of time before getting the *first* result.\n", "During stitching, many stitchings will be tried and fail before they can be completed.\n", "\n", "This has to do with search strategies on the one hand,\n", 
"and the very likely possibility to encounter *pathological* search patterns,\n", "which have billions of results, mostly unintended.\n", "For example, a simple query that asks for 5 words in the Hebrew Bible without further constraints,\n", "will have 425,000 to the power of 5 results.\n", "That is 10-e28 (a one with 28 zeros),\n", "roughly the number of molecules in a few hundred litres of air.\n", "That may not sound much, but it is 10,000 times the amount of bytes\n", "that can be currently stored on the whole Internet.\n", "\n", "Text-Fabric search is not yet done with finding optimal search strategies,\n", "and I hope to refine its arsenal of methods in the future, depending on what you report." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Counting results\n", "It is always a good idea to get a feel for the amount of results, before you dive into them head-on." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T08:50:45.871673Z", "start_time": "2018-05-24T08:50:45.847217Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Counting results per 1 up to 5 ...\n", " | 0.00s 1\n", " | 0.00s 2\n", " | 0.00s 3\n", " | 0.00s 4\n", " | 0.00s 5\n", " 0.01s Done: 6 results\n" ] } ], "source": [ "S.count(progress=1, limit=5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We asked for 5 results in total, with a progress message for every one.\n", "That was a bit conservative." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T08:50:48.710519Z", "start_time": "2018-05-24T08:50:48.598126Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Counting results per 100 up to 500 ...\n", " | 0.01s 100\n", " | 0.02s 200\n", " | 0.03s 300\n", " | 0.04s 400\n", " | 0.05s 500\n", " 0.05s Done: 501 results\n" ] } ], "source": [ "S.count(progress=100, limit=500)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Still pretty quick, now we want to count all results." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T08:50:50.003468Z", "start_time": "2018-05-24T08:50:49.859589Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Counting results per 200 ...\n", " | 0.02s 200\n", " | 0.04s 400\n", " | 0.06s 600\n", " | 0.07s 800\n", " 0.07s Done: 818 results\n" ] } ], "source": [ "S.count(progress=200, limit=None)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Fetching results\n", "\n", "It is time to see something of those results." 
] }, { "cell_type": "code", "execution_count": 12, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T08:51:06.009618Z", "start_time": "2018-05-24T08:51:05.993571Z" } }, "outputs": [ { "data": { "text/plain": [ "((426626, 427478, 1435353, 882995, 381820),\n", " (426626, 427478, 1435364, 883090, 382059),\n", " (426627, 427485, 1435532, 884992, 385801),\n", " (426627, 427486, 1435548, 885229, 386188),\n", " (426627, 427492, 1435804, 887032, 390487),\n", " (426627, 427493, 1435830, 887367, 391119),\n", " (426627, 427493, 1435831, 887394, 391159),\n", " (426628, 427497, 1435979, 888253, 392968),\n", " (426628, 427498, 1436032, 888574, 393786),\n", " (426628, 427498, 1436037, 888618, 393895))" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "S.fetch(limit=10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Not very informative.\n", "Just a quick observation: look at the last column.\n", "These are the result nodes for the `word` part in the query, indicated as `R7` by `showPlan()` before.\n", "And indeed, they are all below 425,000, the number of words in the Hebrew Bible.\n", "\n", "Nevertheless, we want to glean a bit more information off them." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T08:51:14.214022Z", "start_time": "2018-05-24T08:51:14.118975Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Ezra 8:17 phrase[מְשָׁרְתִ֖ים לְבֵ֥ית אֱלֹהֵֽינוּ׃ ] אֱלֹהֵֽינוּ׃ \n", " Ezra 8:28 phrase[נְדָבָ֔ה לַיהוָ֖ה אֱלֹהֵ֥י אֲבֹתֵיכֶֽם׃ ] אֱלֹהֵ֥י \n", " Nehemiah 5:15 phrase[מִפְּנֵ֖י יִרְאַ֥ת אֱלֹהִֽים׃ ] אֱלֹהִֽים׃ \n", " Nehemiah 6:12 phrase[אֱלֹהִ֖ים ] אֱלֹהִ֖ים \n", " Nehemiah 12:46 phrase[לֵֽאלֹהִֽים׃ ] אלֹהִֽים׃ \n", " Nehemiah 13:25 phrase[בֵּֽאלֹהִ֗ים ] אלֹהִ֗ים \n", " Nehemiah 13:26 phrase[אֱלֹהִ֔ים ] אֱלֹהִ֔ים \n", " 1_Chronicles 4:10 phrase[אֱלֹהִ֖ים ] אֱלֹהִ֖ים \n", " 1_Chronicles 5:20 phrase[לֵאלֹהִ֤ים ] אלֹהִ֤ים \n", " 1_Chronicles 5:25 phrase[אֱלֹהִ֖ים ] אֱלֹהִ֖ים \n" ] } ], "source": [ "for r in S.fetch(limit=10):\n", " print(S.glean(r))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Caution\n", "> It is not possible to do `len(S.fetch())`.\n", "Because `fetch()` is a *generator*, not a list.\n", "It will deliver a result every time it is being asked and for as long as there are results,\n", "but it does not know in advance how many there will be.\n", "\n", ">Fetching a result can be costly, because due to the constraints, a lot of possibilities\n", "may have to be tried and rejected before a the next result is found.\n", "\n", "> That is why you often see results coming in at varying speeds when counting them." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also use `A.table()` to make a list of results.\n", "This function is part of the `Bhsa` API, not of the generic Text-Fabric machinery, as opposed to `S.glean()`.\n", "\n", "So, you can use `S.glean()` for every Text-Fabric corpus, but the output is still not very nice.\n", "`A.table()` gives much nicer output." ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T08:51:19.071119Z", "start_time": "2018-05-24T08:51:18.970211Z" } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "
npbookchapterversephraseword
1Ezra 8:17EzraEzra 8מְשָׁרְתִ֖ים לְבֵ֥ית אֱלֹהֵֽינוּ׃ אֱלֹהֵֽינוּ׃
2Ezra 8:28EzraEzra 8נְדָבָ֔ה לַיהוָ֖ה אֱלֹהֵ֥י אֲבֹתֵיכֶֽם׃ אֱלֹהֵ֥י
3Nehemiah 5:15NehemiahNehemiah 5מִפְּנֵ֖י יִרְאַ֥ת אֱלֹהִֽים׃ אֱלֹהִֽים׃
4Nehemiah 6:12NehemiahNehemiah 6אֱלֹהִ֖ים אֱלֹהִ֖ים
5Nehemiah 12:46NehemiahNehemiah 12לֵֽאלֹהִֽים׃ אלֹהִֽים׃
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "A.table(S.fetch(limit=5))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Queries with abundant results\n", "\n", "Above we mentioned that there are queries with astronomically many results.\n", "Here we present one:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "query = \"\"\"\n", "word\n", "# word\n", "\"\"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We are asking for any pair of different words. That will give roughly 425,000 * 425,000 results,\n", "which is 180 billion results. \n", "This is a lot to produce, it will take time on even the best of computers,\n", "and once you've got the results, what would you do with them.\n", "Let's see what happens if we count these results." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Checking search template ...\n", " 0.00s Setting up search space for 2 objects ...\n", " 0.10s Constraining search space with 1 relations ...\n", " 0.10s \t0 edges thinned\n", " 0.10s Setting up retrieval plan with strategy small_choice_multi ...\n", " 0.10s Ready to deliver results from 853180 nodes\n", "Iterate over S.fetch() to get the results\n", "See S.showPlan() to interpret the results\n" ] } ], "source": [ "S.study(query)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Counting results per 500000 ...\n", " | 0.16s 500000\n", " | 0.31s 1000000\n", " | 0.46s 1500000\n", " | 0.61s 2000000\n", " | 0.76s 2500000\n", " | 0.91s 3000000\n", " | 1.05s 3500000\n", " | 1.20s 4000000\n", " | 1.35s 4500000\n", " | 1.50s 5000000\n", " | 1.65s 5500000\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ " 1.74s cut off at 5787324 results. There are more ...\n" ] } ], "source": [ "S.count(progress=500000)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Text-fabric has cut off the process at a certain limit.\n", "This limit is a number of times the maximum node in your corpus:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "4.0" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "5787324 / F.otype.maxNode" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you really need more results than this limit, you can specify a higher limit:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Counting results per 500000 up to 11574648 ...\n", " | 0.16s 500000\n", " | 0.31s 1000000\n", " | 0.46s 1500000\n", " | 0.61s 2000000\n", " | 0.76s 2500000\n", " | 0.90s 3000000\n", " | 1.05s 3500000\n", " | 1.20s 4000000\n", " | 1.35s 4500000\n", " | 1.50s 5000000\n", " | 1.65s 5500000\n", " | 1.80s 6000000\n", " | 1.94s 6500000\n", " | 2.09s 7000000\n", " | 2.24s 7500000\n", " | 2.39s 8000000\n", " | 2.54s 8500000\n", " | 2.69s 9000000\n", " | 2.84s 9500000\n", " | 2.99s 10000000\n", " | 3.14s 10500000\n", " | 3.28s 11000000\n", " | 3.43s 11500000\n", " 3.46s Done: 11574649 results\n" ] } ], "source": [ "S.count(progress=500000, limit=8 * F.otype.maxNode)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now you do not get an error message, because you got what you've asked for." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or, in the advanced interface, let's fetch the standard maximum of results:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ " 3.11s cut off at 5787324 results. There are more ...\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ " 4.92s 5787324 results\n" ] } ], "source": [ "results = A.search(query)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or, with a modified limit:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 6.45s 7234155 results\n" ] } ], "source": [ "results = A.search(query, limit=5 * F.otype.maxNode)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Again, you do not get an error message, because you got what you've asked for." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Slow queries\n", "\n", "The search template above has some pretty tight constraints on one of its objects,\n", "so the amount of data to deal with is pretty limited.\n", "\n", "If the constraints are weak, search may become slow.\n", "\n", "For example, here is a query that looks for pairs of phrases in the same clause in such a way that\n", "one is engulfed by the other." ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T08:53:35.245388Z", "start_time": "2018-05-24T08:53:35.241669Z" } }, "outputs": [], "source": [ "query = \"\"\"\n", "% test\n", "% verse book=Genesis chapter=2 verse=25\n", "verse\n", " clause\n", "\n", " p1:phrase\n", " w1:word\n", " w3:word\n", " w1 < w3\n", "\n", " p2:phrase\n", " w2:word\n", " w1 < w2\n", " w3 > w2\n", "\n", " p1 < p2\n", "\"\"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A couple of remarks you may have encountered before.\n", "\n", "* some objects have got a name\n", "* there are additional relations specified between named objects\n", "* `<` means: *comes before*, and `>`: *comes after* in the canonical order for nodes,\n", " which for words means: comes textually before/after, but for other nodes the meaning\n", " is explained [here](https://annotation.github.io/text-fabric/tf/core/nodes.html)\n", "* later on we describe those relations in more detail\n", "\n", "> **Note on order**\n", "Look at the words `w1` and `w3` below phrase `p1`.\n", "Although in the template `w1` comes before `w3`, this is not\n", "translated in a search constraint of the same nature.\n", "\n", "> Order between objects in a template is never significant, only embedding is.\n", "\n", "Because order is not significant, you have to specify order yourself, using relations.\n", "\n", "It turns out that this is better than the other way around.\n", "In MQL order *is* significant, and it is very difficult to\n", "search for `w1` and `w2` in any order.\n", "Especially if your are looking for more than 2 complex objects with lots of feature\n", "conditions, your search template would explode if you had to spell out all\n", "possible permutations. 
See the example of Reinoud Oosting below.\n", "\n", "> **Note on gaps**\n", "Look at the phrases `p1` and `p2`.\n", "We do not specify an order here, only that they are different.\n", "In order to prevent duplicated searches with `p1` and `p2` interchanged, we even\n", "stipulate that `p1 < p2`.\n", "There are many spatial relationships possible between different objects.\n", "In many cases, neither the one comes before the other, nor vice versa.\n", "They can overlap, one can occur in a gap of the other, they can be completely disjoint\n", "and interleaved, etc." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "# ignore this\n", "# S.tweakPerformance(yarnRatio=2)" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T08:53:38.402967Z", "start_time": "2018-05-24T08:53:37.837161Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Checking search template ...\n", " 0.00s Setting up search space for 7 objects ...\n", " 0.22s Constraining search space with 10 relations ...\n", " 0.80s \t6 edges thinned\n", " 0.80s Setting up retrieval plan with strategy small_choice_multi ...\n", " 0.84s Ready to deliver results from 1894471 nodes\n", "Iterate over S.fetch() to get the results\n", "See S.showPlan() to interpret the results\n" ] } ], "source": [ "S.study(query)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Text-Fabric knows that narrowing down the search space in this case would take ages,\n", "without resulting in a significantly shrunken space.\n", "So it skips doing so for most constraints.\n", "\n", "Let us see the plan, with details." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Search with 7 objects and 10 relations\n", "Results are instantiations of the following objects:\n", "node 0-verse 23207 choices\n", "node 1-clause 88081 choices\n", "node 2-phrase 252998 choices\n", "node 3-word 425729 choices\n", "node 4-word 425729 choices\n", "node 5-phrase 252998 choices\n", "node 6-word 425729 choices\n", "Performance parameters:\n", "\tyarnRatio = 1.25\n", "\ttryLimitFrom = 40\n", "\ttryLimitTo = 40\n", "Instantiations are computed along the following relations:\n", "node 0-verse 23207 choices\n", "edge 0-verse [[ 1-clause 4.4 choices (thinned)\n", "edge 1-clause [[ 5-phrase 2.8 choices (thinned)\n", "edge 5-phrase [[ 6-word 1.6 choices (thinned)\n", "edge 1-clause [[ 2-phrase 3.2 choices (thinned)\n", "edge 5-phrase > 2-phrase 0 choices\n", "edge 2-phrase [[ 4-word 1.7 choices (thinned)\n", "edge 6-word < 4-word 0 choices\n", "edge 2-phrase [[ 3-word 1.9 choices (thinned)\n", "edge 6-word > 3-word 0 choices\n", "edge 3-word < 4-word 0 choices\n", " 6.61s The results are connected to the original search template as follows:\n", " 0 \n", " 1 % test\n", " 2 % verse book=Genesis chapter=2 verse=25\n", " 3 R0 verse\n", " 4 R1 clause\n", " 5 \n", " 6 R2 p1:phrase\n", " 7 R3 w1:word\n", " 8 R4 w3:word\n", " 9 w1 < w3\n", "10 \n", "11 R5 p2:phrase\n", "12 R6 w2:word\n", "13 w1 < w2\n", "14 w3 > w2\n", "15 \n", "16 p1 < p2\n", "17 \n" ] } ], "source": [ "S.showPlan(details=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you see, we have a hefty search space here.\n", "Let us play with the `count()` function." 
] }, { "cell_type": "code", "execution_count": 26, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T08:53:43.176732Z", "start_time": "2018-05-24T08:53:42.972063Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Counting results per 10 up to 100 ...\n", " | 0.02s 10\n", " | 0.02s 20\n", " | 0.02s 30\n", " | 0.03s 40\n", " | 0.03s 50\n", " | 0.03s 60\n", " | 0.03s 70\n", " | 0.03s 80\n", " | 0.03s 90\n", " | 0.03s 100\n", " 0.03s Done: 101 results\n" ] } ], "source": [ "S.count(progress=10, limit=100)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can be bolder than this!" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T08:53:45.993373Z", "start_time": "2018-05-24T08:53:45.182241Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Counting results per 100 up to 1000 ...\n", " | 0.03s 100\n", " | 0.03s 200\n", " | 0.04s 300\n", " | 0.06s 400\n", " | 0.08s 500\n", " | 0.08s 600\n", " | 0.09s 700\n", " | 0.12s 800\n", " | 0.12s 900\n", " | 0.16s 1000\n", " 0.16s Done: 1001 results\n" ] } ], "source": [ "S.count(progress=100, limit=1000)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "OK, not too bad, but note that it takes a big fraction of a second to get just 100 results.\n", "\n", "Now let us go for all of them by the thousand." ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T08:53:59.440736Z", "start_time": "2018-05-24T08:53:51.899813Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Counting results per 1000 ...\n", " | 0.15s 1000\n", " | 0.23s 2000\n", " | 0.32s 3000\n", " | 0.40s 4000\n", " | 0.50s 5000\n", " | 0.68s 6000\n", " | 1.07s 7000\n", " 1.27s Done: 7593 results\n" ] } ], "source": [ "S.count(progress=1000, limit=None)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "See? This is substantial work." ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T08:54:02.778931Z", "start_time": "2018-05-24T08:54:02.657595Z" } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "
npverseclausephrasewordwordphraseword
1Genesis 2:25וַיִּֽהְי֤וּ שְׁנֵיהֶם֙ עֲרוּמִּ֔ים הָֽאָדָ֖ם וְאִשְׁתֹּ֑ו שְׁנֵיהֶם֙ הָֽאָדָ֖ם וְאִשְׁתֹּ֑ו שְׁנֵיהֶם֙ הָֽעֲרוּמִּ֔ים עֲרוּמִּ֔ים
2Genesis 2:25וַיִּֽהְי֤וּ שְׁנֵיהֶם֙ עֲרוּמִּ֔ים הָֽאָדָ֖ם וְאִשְׁתֹּ֑ו שְׁנֵיהֶם֙ הָֽאָדָ֖ם וְאִשְׁתֹּ֑ו שְׁנֵיהֶם֙ אָדָ֖ם עֲרוּמִּ֔ים עֲרוּמִּ֔ים
3Genesis 2:25וַיִּֽהְי֤וּ שְׁנֵיהֶם֙ עֲרוּמִּ֔ים הָֽאָדָ֖ם וְאִשְׁתֹּ֑ו שְׁנֵיהֶם֙ הָֽאָדָ֖ם וְאִשְׁתֹּ֑ו שְׁנֵיהֶם֙ וְעֲרוּמִּ֔ים עֲרוּמִּ֔ים
4Genesis 2:25וַיִּֽהְי֤וּ שְׁנֵיהֶם֙ עֲרוּמִּ֔ים הָֽאָדָ֖ם וְאִשְׁתֹּ֑ו שְׁנֵיהֶם֙ הָֽאָדָ֖ם וְאִשְׁתֹּ֑ו שְׁנֵיהֶם֙ אִשְׁתֹּ֑ו עֲרוּמִּ֔ים עֲרוּמִּ֔ים
5Genesis 4:4וְהֶ֨בֶל הֵבִ֥יא גַם־ה֛וּא מִבְּכֹרֹ֥ות צֹאנֹ֖ו וּמֵֽחֶלְבֵהֶ֑ן הֶ֨בֶל גַם־ה֛וּא הֶ֨בֶל גַם־הֵבִ֥יא הֵבִ֥יא
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "A.table(S.fetch(limit=5))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Hand-coding\n", "\n", "As a check, here is some code that looks for basically the same phenomenon:\n", "a phrase within the gap of another phrase.\n", "It does not use search, and it gets a bit more focused results, in half the time compared\n", "to the search with the template.\n", "\n", "> **Hint**\n", "If you are comfortable with programming, and what you look for is fairly generic,\n", "you may be better off without search, provided you can translate your insight in the\n", "data into an effective procedure within Text-Fabric.\n", "But wait till we are completely done with this example!" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T08:54:13.108437Z", "start_time": "2018-05-24T08:54:10.074685Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Getting gapped phrases\n", " 0.79s 368 results\n" ] } ], "source": [ "TF.indent(reset=True)\n", "TF.info(\"Getting gapped phrases\")\n", "results = []\n", "for v in F.otype.s(\"verse\"):\n", " for c in L.d(v, otype=\"clause\"):\n", " ps = L.d(c, otype=\"phrase\")\n", " first = {}\n", " last = {}\n", " slots = {}\n", " # make index of phrase boundaries\n", " for p in ps:\n", " words = L.d(p, otype=\"word\")\n", " first[p] = words[0]\n", " last[p] = words[-1]\n", " slots[p] = set(words)\n", " for p1 in ps:\n", " for p2 in ps:\n", " if p2 < p1:\n", " continue\n", " if len(slots[p1] & slots[p2]) != 0:\n", " continue\n", " if first[p1] < first[p2] and last[p2] < last[p1]:\n", " results.append(\n", " (v, c, p1, p2, first[p1], first[p2], last[p2], last[p1])\n", " )\n", "TF.info(\"{} results\".format(len(results)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Pretty printing\n", "\n", "We can use the pretty printing of `A.table()` and `A.show()` here as well, even though we have\n", "not used search!\n", "\n", "Not that you can show the node numbers. In this case it helps to see where the gaps are." ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T08:54:34.120441Z", "start_time": "2018-05-24T08:54:34.112139Z" } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "
npverseclausephrasephrasewordwordwordword
1Genesis 2:251414444 427773 וַיִּֽהְי֤וּ 652217 1159 שְׁנֵיהֶם֙ 652218 1160 עֲרוּמִּ֔ים 652217 הָֽאָדָ֖ם וְ1164 אִשְׁתֹּ֑ו 652217 1159 שְׁנֵיהֶם֙ 652217 הָֽאָדָ֖ם וְ1164 אִשְׁתֹּ֑ו 652218 1160 עֲרוּמִּ֔ים 1159 שְׁנֵיהֶם֙ 1160 עֲרוּמִּ֔ים 1160 עֲרוּמִּ֔ים 1164 אִשְׁתֹּ֑ו
2Genesis 4:41414472 427895 וְ652574 1720 הֶ֨בֶל 652575 1721 הֵבִ֥יא 652574 גַם־1723 ה֛וּא מִבְּכֹרֹ֥ות צֹאנֹ֖ו וּמֵֽחֶלְבֵהֶ֑ן 652574 1720 הֶ֨בֶל 652574 גַם־1723 ה֛וּא 652575 1721 הֵבִ֥יא 1720 הֶ֨בֶל 1721 הֵבִ֥יא 1721 הֵבִ֥יא 1723 ה֛וּא
3Genesis 10:211414644 428392 654172 4819 גַּם־ה֑וּא 654173 4821 אֲבִי֙ כָּל־בְּנֵי־4824 עֵ֔בֶר 654172 אֲחִ֖י יֶ֥פֶת הַ4828 גָּדֹֽול׃ 654172 4819 גַּם־ה֑וּא 654172 אֲחִ֖י יֶ֥פֶת הַ4828 גָּדֹֽול׃ 654173 4821 אֲבִי֙ כָּל־בְּנֵי־4824 עֵ֔בֶר 4819 גַּם־4821 אֲבִי֙ 4824 עֵ֔בֶר 4828 גָּדֹֽול׃
4Genesis 12:171414704 428575 וַיְנַגַּ֨ע יְהוָ֧ה׀ 654748 5803 אֶת־פַּרְעֹ֛ה 654749 5805 נְגָעִ֥ים 5806 גְּדֹלִ֖ים 654748 וְאֶת־5809 בֵּיתֹ֑ו עַל־דְּבַ֥ר שָׂרַ֖י אֵ֥שֶׁת אַבְרָֽם׃ 654748 5803 אֶת־פַּרְעֹ֛ה 654748 וְאֶת־5809 בֵּיתֹ֑ו 654749 5805 נְגָעִ֥ים 5806 גְּדֹלִ֖ים 5803 אֶת־5805 נְגָעִ֥ים 5806 גְּדֹלִ֖ים 5809 בֵּיתֹ֑ו
5Genesis 13:11414708 428591 וַיַּעַל֩ 654795 5868 אַבְרָ֨ם 654796 5869 מִ5870 מִּצְרַ֜יִם 654795 ה֠וּא וְאִשְׁתֹּ֧ו וְ5875 כָל־428591 הַנֶּֽגְבָּה׃ 654795 5868 אַבְרָ֨ם 654795 ה֠וּא וְאִשְׁתֹּ֧ו וְ5875 כָל־654796 5869 מִ5870 מִּצְרַ֜יִם 5868 אַבְרָ֨ם 5869 מִ5870 מִּצְרַ֜יִם 5875 כָל־
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

result 1" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

verse
sentence 59
clause WayX NA
sentence 60
clause WxY0 NA
phrase CP Conj
phrase NegP Nega
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "A.table(results, withNodes=True, end=5)\n", "A.show(results, start=1, end=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**NB**\n", "Gaps are a tricky phenomenon. In [gaps](searchGaps.ipynb) we will deal with them cruelly." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Performance tuning\n", "\n", "Here is an example by Yanniek van der Schans (2018-09-21)." ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "query = \"\"\"\n", "c:clause\n", " PreGap:phrase_atom\n", " LastPhrase:phrase_atom\n", " :=\n", "\n", "Gap:clause_atom\n", " :: word\n", "\n", "PreGap < Gap\n", "Gap < LastPhrase\n", "c || Gap\n", "\"\"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here are the current settings of the performance parameters:" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Performance parameters, current values:\n", "\ttryLimitFrom = 40\n", "\ttryLimitTo = 40\n", "\tyarnRatio = 1.25\n" ] } ], "source": [ "S.tweakPerformance()" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Checking search template ...\n", " 0.00s Setting up search space for 5 objects ...\n", " 0.13s Constraining search space with 8 relations ...\n", " 0.29s \t2 edges thinned\n", " 0.29s Setting up retrieval plan with strategy small_choice_multi ...\n", " 0.30s Ready to deliver results from 454184 nodes\n", "Iterate over S.fetch() to get the results\n", "See S.showPlan() to interpret the results\n", "Search with 5 objects and 8 relations\n", "Results are instantiations of the following objects:\n", "node 0-clause 88131 choices\n", "node 1-phrase_atom 267532 choices\n", "node 2-phrase_atom 88131 choices\n", "node 3-clause_atom 5195 choices\n", "node 4-word 5195 choices\n", "Performance parameters:\n", "\tyarnRatio = 1.25\n", "\ttryLimitFrom = 40\n", "\ttryLimitTo = 40\n", "Instantiations are computed along the following relations:\n", "node 3-clause_atom 5195 choices\n", "edge 3-clause_atom [[ 4-word 1.0 choices\n", "edge 3-clause_atom :: 4-word 0 choices\n", "edge 3-clause_atom < 2-phrase_atom 44065.5 choices\n", "edge 2-phrase_atom := 0-clause 1.0 choices (thinned)\n", "edge 2-phrase_atom ]] 0-clause 0 choices\n", "edge 0-clause || 3-clause_atom 0 choices\n", "edge 0-clause [[ 1-phrase_atom 2.7 choices\n", "edge 1-phrase_atom < 3-clause_atom 0 choices\n", " 0.31s The results are connected to the original search template as follows:\n", " 0 \n", " 1 R0 c:clause\n", " 2 R1 PreGap:phrase_atom\n", " 3 R2 LastPhrase:phrase_atom\n", " 4 :=\n", " 5 \n", " 6 R3 Gap:clause_atom\n", " 7 R4 :: word\n", " 8 \n", " 9 PreGap < Gap\n", "10 Gap < LastPhrase\n", "11 c || Gap\n", "12 \n" ] } ], "source": [ "S.study(query)\n", "S.showPlan(details=True)" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Counting results per 1 up to 3 ...\n", " | 0.00s 1\n", " | 0.00s 2\n", " | 1.62s 3\n", " 3.32s Done: 4 results\n" ] } ], "source": [ "S.count(progress=1, limit=3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Can we do better?\n", "\n", "The performance parameter `yarnRatio` can be used to increase the amount of pre-processing, and we can\n", "increase to number of random samples 
that we make by `tryLimitFrom` and `tryLimitTo`.\n", "\n", "We start with increasing the amount of up-front edge-spinning." ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Performance parameters, current values:\n", "\ttryLimitFrom = 10000\n", "\ttryLimitTo = 10000\n", "\tyarnRatio = 0.2\n" ] } ], "source": [ "S.tweakPerformance(yarnRatio=0.2, tryLimitFrom=10000, tryLimitTo=10000)" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Checking search template ...\n", " 0.00s Setting up search space for 5 objects ...\n", " 0.12s Constraining search space with 8 relations ...\n", " 0.41s \t2 edges thinned\n", " 0.41s Setting up retrieval plan with strategy small_choice_multi ...\n", " 0.50s Ready to deliver results from 454184 nodes\n", "Iterate over S.fetch() to get the results\n", "See S.showPlan() to interpret the results\n", "Search with 5 objects and 8 relations\n", "Results are instantiations of the following objects:\n", "node 0-clause 88131 choices\n", "node 1-phrase_atom 267532 choices\n", "node 2-phrase_atom 88131 choices\n", "node 3-clause_atom 5195 choices\n", "node 4-word 5195 choices\n", "Performance parameters:\n", "\tyarnRatio = 0.2\n", "\ttryLimitFrom = 10000\n", "\ttryLimitTo = 10000\n", "Instantiations are computed along the following relations:\n", "node 3-clause_atom 5195 choices\n", "edge 3-clause_atom [[ 4-word 1.0 choices\n", "edge 3-clause_atom :: 4-word 0 choices\n", "edge 3-clause_atom < 2-phrase_atom 44065.5 choices\n", "edge 2-phrase_atom := 0-clause 1.0 choices (thinned)\n", "edge 2-phrase_atom ]] 0-clause 0 choices\n", "edge 0-clause || 3-clause_atom 0 choices\n", "edge 0-clause [[ 1-phrase_atom 3.0 choices\n", "edge 1-phrase_atom < 3-clause_atom 0 choices\n", " 0.50s The results are connected to the original search template as follows:\n", " 0 \n", " 1 R0 c:clause\n", " 2 R1 PreGap:phrase_atom\n", " 3 R2 LastPhrase:phrase_atom\n", " 4 :=\n", " 5 \n", " 6 R3 Gap:clause_atom\n", " 7 R4 :: word\n", " 8 \n", " 9 PreGap < Gap\n", "10 Gap < LastPhrase\n", "11 c || Gap\n", "12 \n" ] } ], "source": [ "S.study(query)\n", "S.showPlan(details=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It seems to be the same plan." ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Counting results per 1 up to 3 ...\n", " | 0.00s 1\n", " | 0.00s 2\n", " | 1.58s 3\n", " 3.29s Done: 4 results\n" ] } ], "source": [ "S.count(progress=1, limit=3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "No improvement." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What if we decrease the amount of edge spinning?" 
] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Performance parameters, current values:\n", "\ttryLimitFrom = 10000\n", "\ttryLimitTo = 10000\n", "\tyarnRatio = 5\n" ] } ], "source": [ "S.tweakPerformance(yarnRatio=5, tryLimitFrom=10000, tryLimitTo=10000)" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Checking search template ...\n", " 0.00s Setting up search space for 5 objects ...\n", " 0.14s Constraining search space with 8 relations ...\n", " 0.32s \t2 edges thinned\n", " 0.32s Setting up retrieval plan with strategy small_choice_multi ...\n", " 0.41s Ready to deliver results from 454184 nodes\n", "Iterate over S.fetch() to get the results\n", "See S.showPlan() to interpret the results\n", "Search with 5 objects and 8 relations\n", "Results are instantiations of the following objects:\n", "node 0-clause 88131 choices\n", "node 1-phrase_atom 267532 choices\n", "node 2-phrase_atom 88131 choices\n", "node 3-clause_atom 5195 choices\n", "node 4-word 5195 choices\n", "Performance parameters:\n", "\tyarnRatio = 5\n", "\ttryLimitFrom = 10000\n", "\ttryLimitTo = 10000\n", "Instantiations are computed along the following relations:\n", "node 3-clause_atom 5195 choices\n", "edge 3-clause_atom [[ 4-word 1.0 choices\n", "edge 3-clause_atom :: 4-word 0 choices\n", "edge 3-clause_atom < 2-phrase_atom 44065.5 choices\n", "edge 2-phrase_atom := 0-clause 1.0 choices (thinned)\n", "edge 2-phrase_atom ]] 0-clause 0 choices\n", "edge 0-clause || 3-clause_atom 0 choices\n", "edge 0-clause [[ 1-phrase_atom 3.0 choices\n", "edge 1-phrase_atom < 3-clause_atom 0 choices\n", " 0.42s The results are connected to the original search template as follows:\n", " 0 \n", " 1 R0 c:clause\n", " 2 R1 PreGap:phrase_atom\n", " 3 R2 LastPhrase:phrase_atom\n", " 4 :=\n", " 5 \n", " 6 R3 Gap:clause_atom\n", " 7 R4 :: word\n", " 8 \n", " 9 PreGap < Gap\n", "10 Gap < LastPhrase\n", "11 c || Gap\n", "12 \n" ] } ], "source": [ "S.study(query)\n", "S.showPlan(details=True)" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Counting results per 1 up to 3 ...\n", " | 0.00s 1\n", " | 0.00s 2\n", " | 1.61s 3\n", " 3.33s Done: 4 results\n" ] } ], "source": [ "S.count(progress=1, limit=3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Again, no improvement." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll look for queries where the parameters matter more in the future." 
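, "\n", "\n", "If you want to experiment further, here is a minimal sketch of a timing loop around the calls used above. The helper `timedStudyCount` is hypothetical (not part of Text-Fabric); it re-applies the given parameters, redoes the study, and times a small count, so you can compare settings on your own queries." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import time\n", "\n", "def timedStudyCount(**params):  # hypothetical helper, not a TF API\n", "    # apply performance parameters, redo the study, and time a small count\n", "    S.tweakPerformance(**params)\n", "    S.study(query)\n", "    start = time.time()\n", "    S.count(progress=1, limit=3)\n", "    return time.time() - start\n", "\n", "print(timedStudyCount(yarnRatio=1.25, tryLimitFrom=40, tryLimitTo=40))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Whatever you try, do not forget to clean up afterwards."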
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is how to reset the performance parameters:" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Performance parameters, current values:\n", "\ttryLimitFrom = 40\n", "\ttryLimitTo = 40\n", "\tyarnRatio = 1.25\n" ] } ], "source": [ "S.tweakPerformance(yarnRatio=None, tryLimitFrom=None, tryLimitTo=None)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Next\n", "\n", "You have seen cases where the implementation is to blame.\n", "\n", "Now I want to point to gaps in your understanding:\n", "[gaps](searchGaps.ipynb)\n", "\n", "---\n", "\n", "[basic](search.ipynb)\n", "[advanced](searchAdvanced.ipynb)\n", "[sets](searchSets.ipynb)\n", "[relations](searchRelations.ipynb)\n", "[quantifiers](searchQuantifiers.ipynb)\n", "rough\n", "[gaps](searchGaps.ipynb)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# All steps\n", "\n", "* **[start](start.ipynb)** your first step in mastering the bible computationally\n", "* **[display](display.ipynb)** become an expert in creating pretty displays of your text structures\n", "* **[search](search.ipynb)** turbo charge your hand-coding with search templates\n", "\n", "---\n", "\n", "[advanced](searchAdvanced.ipynb)\n", "[sets](searchSets.ipynb)\n", "[relations](searchRelations.ipynb)\n", "[quantifiers](searchQuantifiers.ipynb)\n", "[from MQL](searchFromMQL.ipynb)\n", "rough\n", "\n", "You have seen cases where the implementation is to blame.\n", "\n", "Now I want to point to gaps in your understanding:\n", "\n", "[gaps](searchGaps.ipynb)\n", "\n", "---\n", "\n", "* **[export Excel](exportExcel.ipynb)** make tailor-made spreadsheets out of your results\n", "* **[share](share.ipynb)** draw in other people's data and let them use yours\n", "* **[export](export.ipynb)** export your dataset as an Emdros database\n", "* **[annotate](annotate.ipynb)** annotate plain text by means of other tools and import the annotations as TF features\n", "* **[map](map.ipynb)** map somebody else's annotations to a new version of the corpus\n", "* **[volumes](volumes.ipynb)** work with selected books only\n", "* **[trees](trees.ipynb)** work with the BHSA data as syntax trees\n", "\n", "CC-BY Dirk Roorda" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.1" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 4 }