{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n", "\n", "You might want to consider the [start](search.ipynb) of this tutorial." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%load_ext autoreload\n", "%autoreload 2" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T10:06:39.818664Z", "start_time": "2018-05-24T10:06:39.796588Z" } }, "outputs": [], "source": [ "from tf.fabric import Fabric\n", "from tf.extra.bhsa import Bhsa" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T10:06:41.254515Z", "start_time": "2018-05-24T10:06:41.238046Z" } }, "outputs": [], "source": [ "VERSION = '2017'\n", "DATABASE = '~/github/etcbc'\n", "BHSA = f'bhsa/tf/{VERSION}'\n", "PARA = f'parallels/tf/{VERSION}'\n", "TF = Fabric(locations=[DATABASE], modules=[BHSA, PARA], silent=True )" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T10:06:48.865143Z", "start_time": "2018-05-24T10:06:44.712958Z" } }, "outputs": [ { "data": { "text/markdown": [ "**Documentation:** BHSA Feature docs BHSA API Text-Fabric API 5.0.4 Search Reference" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "\n", "This notebook online:\n", "NBViewer\n", "GitHub\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "api = TF.load('', silent=True)\n", "api.makeAvailableIn(globals())\n", "B = Bhsa(api, 'search', version=VERSION)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Rough edges\n", "\n", "It might be helpful to peek under the hood, especially when exploring searches that go slow.\n", "\n", "If you went through the previous parts of the tutorial 
you have encountered cases where things come\n", "to a grinding halt.\n", "\n", "Yet we can get an idea of what is going on, even in those cases.\n", "For that, we use the lower-level search API `S` of Text-Fabric, and not the \n", "wrappers that the corpus-specific `B` API provides.\n", "\n", "The main difference is that `S.search()` returns a *generator* of the results, \n", "whereas `B.search()` returns a list of the results.\n", "In fact, `B.search()` calls the generator function delivered by `S.search()` as often as needed.\n", "\n", "For some queries, the fetching of results is quite costly, so costly that we do not want to fetch\n", "all results up-front. Rather we want to fetch a few, to see how it goes.\n", "In these cases, directly using `S.search()` is preferred over `B.search()`." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T08:49:43.870215Z", "start_time": "2018-05-24T08:49:43.866722Z" } }, "outputs": [], "source": [ "query = '''\n", "book\n", "  chapter\n", "    verse\n", "      phrase det=und\n", "        word lex=>LHJM/\n", "'''" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Study\n", "\n", "First we call `S.study(query)`.\n", "\n", "The syntax will be checked, features loaded, the search space will be set up, narrowed down, \n", "and the fetching of results will be prepared, but not yet executed.\n", "\n", "In order to make the query a bit more interesting, we lift the constraint that the results must be in Genesis 1-2." 
] }, { "cell_type": "code", "execution_count": 6, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T08:49:46.451394Z", "start_time": "2018-05-24T08:49:45.192096Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " | 0.00s Feature overview: 109 for nodes; 8 for edges; 1 configs; 7 computed\n", " 0.00s Checking search template ...\n", " 0.19s Setting up search space for 5 objects ...\n", " 1.23s Constraining search space with 4 relations ...\n", " 1.24s Setting up retrieval plan ...\n", " 1.26s Ready to deliver results from 2735 nodes\n", "Iterate over S.fetch() to get the results\n", "See S.showPlan() to interpret the results\n" ] } ], "source": [ "S.study(query)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before we rush to the results, let us have a look at the *plan*." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T08:49:49.104091Z", "start_time": "2018-05-24T08:49:49.088781Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 5.16s The results are connected to the original search template as follows:\n", " 0 \n", " 1 R0 book\n", " 2 R1 chapter\n", " 3 R2 verse\n", " 4 R3 phrase det=und\n", " 5 R4 word lex=>LHJM/\n", " 6 \n" ] } ], "source": [ "S.showPlan()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here you see already what your results will look like.\n", "Each result `r` is a *tuple* of nodes:\n", "```\n", "(R0, R1, R2, R3, R4)\n", "```\n", "that instantiate the objects in your template.\n", "\n", "In case you are curious, you can get details about the search space as well:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T08:50:03.622134Z", "start_time": "2018-05-24T08:50:03.589828Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Search with 5 objects and 4 relations\n", "Results are instantiations of the following objects:\n", "node 
0-book ( 29 choices)\n", "node 1-chapter ( 329 choices)\n", "node 2-verse ( 754 choices)\n", "node 3-phrase ( 805 choices)\n", "node 4-word ( 818 choices)\n", "Instantiations are computed along the following relations:\n", "node 0-book ( 29 choices)\n", "edge 0-book [[ 1-chapter ( 15.0 choices)\n", "edge 1-chapter [[ 2-verse ( 2.4 choices)\n", "edge 2-verse [[ 3-phrase ( 1.0 choices)\n", "edge 3-phrase [[ 4-word ( 1.0 choices)\n", " 7.44s The results are connected to the original search template as follows:\n", " 0 \n", " 1 R0 book\n", " 2 R1 chapter\n", " 3 R2 verse\n", " 4 R3 phrase det=und\n", " 5 R4 word lex=>LHJM/\n", " 6 \n" ] } ], "source": [ "S.showPlan(details=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The part about the *nodes* shows you how many possible instantiations for each object in your template\n", "have been found.\n", "These are not results yet, because only combinations of instantiations\n", "that satisfy all constraints are results.\n", "\n", "The constraints come from the relations between the objects that you specified.\n", "In this case, there is only an implicit relation: embedding `[[`. \n", "Later on we'll examine all basic relations.\n", "\n", "The part about the *edges* shows you the constraints,\n", "and in what order they will be computed when stitching results together.\n", "In this case the order is exactly the order in which the relations appear in the template,\n", "but that will not always be the case.\n", "Text-Fabric spends some time and ingenuity to find an optimal *stitch plan*.\n", "Fetching results is like selecting a node, stitching it to another node with an edge,\n", "and so on, until a full stitch of nodes intersects with all the node sets from which they\n", "must be chosen (the yarns).\n", "\n", "Fetching results may take time. 
\n", "\n", "For some queries, it can take a large amount of time to walk through all results.\n", "Even worse, it may happen that it takes a large amount of time before getting the *first* result.\n", "During stitching, many stitchings will be tried and fail before they can be completed.\n", "\n", "This has to do with search strategies on the one hand,\n", "and on the other hand with the very real possibility of encountering *pathological* search patterns,\n", "which have billions of results, mostly unintended.\n", "For example, a simple query that asks for 5 words in the Hebrew Bible without further constraints,\n", "will have 425,000 to the power of 5 results.\n", "That is about 10^28 (a one followed by 28 zeros),\n", "roughly the number of molecules in a few hundred cubic meters of air.\n", "That may not sound like much, but it is 10,000 times the number of bytes\n", "that can be currently stored on the whole Internet.\n", "\n", "Text-Fabric search is not yet done with finding optimal search strategies,\n", "and I hope to refine its arsenal of methods in the future, depending on what you report." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Counting results\n", "It is always a good idea to get a feel for the number of results, before you dive into them head-on." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T08:50:45.871673Z", "start_time": "2018-05-24T08:50:45.847217Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Counting results per 1 up to 5 ...\n", " | 0.01s 1\n", " | 0.01s 2\n", " | 0.01s 3\n", " | 0.01s 4\n", " | 0.01s 5\n", " 0.02s Done: 5 results\n" ] } ], "source": [ "S.count(progress=1, limit=5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We asked for 5 results in total, with a progress message for every one.\n", "That was a bit conservative." 
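, "\n",
"\n",
"What `S.count()` does can be pictured with a minimal sketch in plain Python. This is an illustration only, not the Text-Fabric implementation: the name `count_results` and the stand-in generator are hypothetical.\n",
"\n",
"```python\n",
"from itertools import islice\n",
"\n",
"\n",
"def count_results(results, progress=100, limit=1000):\n",
"    # Count items from any iterator without storing them.\n",
"    # A non-positive limit means: count until the iterator is exhausted.\n",
"    if limit > 0:\n",
"        results = islice(results, limit)\n",
"    n = 0\n",
"    for n, _ in enumerate(results, start=1):\n",
"        if n % progress == 0:\n",
"            print(f'{n} results so far')\n",
"    return n\n",
"\n",
"\n",
"# hypothetical stand-in for a costly result generator\n",
"fake_results = ((i, i + 1) for i in range(818))\n",
"print(count_results(fake_results, progress=200, limit=-1))\n",
"```\n",
"\n",
"Because the results are consumed one at a time, a small `limit` gives a quick impression of the speed before you commit to counting everything."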
] }, { "cell_type": "code", "execution_count": 10, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T08:50:48.710519Z", "start_time": "2018-05-24T08:50:48.598126Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Counting results per 100 up to 500 ...\n", " | 0.01s 100\n", " | 0.03s 200\n", " | 0.05s 300\n", " | 0.08s 400\n", " | 0.12s 500\n", " 0.12s Done: 500 results\n" ] } ], "source": [ "S.count(progress=100, limit=500)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Still pretty quick. Now we want to count all results." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T08:50:50.003468Z", "start_time": "2018-05-24T08:50:49.859589Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Counting results per 200 up to the end of the results ...\n", " | 0.02s 200\n", " | 0.07s 400\n", " | 0.11s 600\n", " | 0.13s 800\n", " 0.14s Done: 818 results\n" ] } ], "source": [ "S.count(progress=200, limit=-1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Fetching results\n", "\n", "It is time to see something of those results." 
] }, { "cell_type": "code", "execution_count": 12, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T08:51:06.009618Z", "start_time": "2018-05-24T08:51:05.993571Z" } }, "outputs": [ { "data": { "text/plain": [ "((426585, 426624, 1414190, 651505, 4),\n", " (426585, 426624, 1414191, 651515, 26),\n", " (426585, 426624, 1414192, 651520, 34),\n", " (426585, 426624, 1414193, 651528, 42),\n", " (426585, 426624, 1414193, 651534, 50),\n", " (426585, 426624, 1414194, 651538, 60),\n", " (426585, 426624, 1414195, 651554, 81),\n", " (426585, 426624, 1414196, 651564, 97),\n", " (426585, 426624, 1414197, 651578, 127),\n", " (426585, 426624, 1414198, 651590, 142))" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "S.fetch(limit=10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Not very informative.\n", "Just a quick observation: look at the last column.\n", "These are the result nodes for the `word` part in the query, indicated as `R4` by `showPlan()` before.\n", "And indeed, they are all below 425,000, the approximate number of words in the Hebrew Bible.\n", "\n", "Nevertheless, we want to glean a bit more information off them." 
] }, { "cell_type": "code", "execution_count": 13, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T08:51:14.214022Z", "start_time": "2018-05-24T08:51:14.118975Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Genesis 1:1 phrase[אֱלֹהִ֑ים ] אֱלֹהִ֑ים \n", " Genesis 1:2 phrase[ר֣וּחַ אֱלֹהִ֔ים ] אֱלֹהִ֔ים \n", " Genesis 1:3 phrase[אֱלֹהִ֖ים ] אֱלֹהִ֖ים \n", " Genesis 1:4 phrase[אֱלֹהִ֛ים ] אֱלֹהִ֛ים \n", " Genesis 1:4 phrase[אֱלֹהִ֔ים ] אֱלֹהִ֔ים \n", " Genesis 1:5 phrase[אֱלֹהִ֤ים׀ ] אֱלֹהִ֤ים׀ \n", " Genesis 1:6 phrase[אֱלֹהִ֔ים ] אֱלֹהִ֔ים \n", " Genesis 1:7 phrase[אֱלֹהִים֮ ] אֱלֹהִים֮ \n", " Genesis 1:8 phrase[אֱלֹהִ֛ים ] אֱלֹהִ֛ים \n", " Genesis 1:9 phrase[אֱלֹהִ֗ים ] אֱלֹהִ֗ים \n" ] } ], "source": [ "for r in S.fetch(limit=10):\n", "    print(S.glean(r))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Caution\n", "> It is not possible to do `len(S.fetch())`,\n", "because `fetch()` is a *generator*, not a list.\n", "It will deliver a result every time it is being asked and for as long as there are results,\n", "but it does not know in advance how many there will be.\n", "\n", "> Fetching a result can be costly, because due to the constraints, a lot of possibilities\n", "may have to be tried and rejected before the next result is found.\n", "\n", "> That is why you often see results coming in at varying speeds when counting them." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also use `B.table()` to make a list of results.\n", "This function is part of the `Bhsa` API, not of the generic Text-Fabric machinery, as opposed to `S.glean()`.\n", "\n", "So, you can use `S.glean()` for every Text-Fabric corpus, but the output is still not very nice.\n", "`B.table()` gives much nicer output, but works only for the BHSA corpus.\n", "\n", "We put hyperlinks to SHEBANQ under column 3." 
] }, { "cell_type": "code", "execution_count": 14, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T08:51:19.071119Z", "start_time": "2018-05-24T08:51:18.970211Z" } }, "outputs": [ { "data": { "text/markdown": [ "n | book | chapter | verse | phrase | word\n", "--- | --- | --- | --- | --- | ---\n", "1|Genesis|Genesis 1|Genesis 1:1|אֱלֹהִ֑ים |אֱלֹהִ֑ים \n", "2|Genesis|Genesis 1|Genesis 1:2|ר֣וּחַ אֱלֹהִ֔ים |אֱלֹהִ֔ים \n", "3|Genesis|Genesis 1|Genesis 1:3|אֱלֹהִ֖ים |אֱלֹהִ֖ים \n", "4|Genesis|Genesis 1|Genesis 1:4|אֱלֹהִ֛ים |אֱלֹהִ֛ים \n", "5|Genesis|Genesis 1|Genesis 1:4|אֱלֹהִ֔ים |אֱלֹהִ֔ים \n", "6|Genesis|Genesis 1|Genesis 1:5|אֱלֹהִ֤ים׀ |אֱלֹהִ֤ים׀ \n", "7|Genesis|Genesis 1|Genesis 1:6|אֱלֹהִ֔ים |אֱלֹהִ֔ים \n", "8|Genesis|Genesis 1|Genesis 1:7|אֱלֹהִים֮ |אֱלֹהִים֮ \n", "9|Genesis|Genesis 1|Genesis 1:8|אֱלֹהִ֛ים |אֱלֹהִ֛ים \n", "10|Genesis|Genesis 1|Genesis 1:9|אֱלֹהִ֗ים |אֱלֹהִ֗ים " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "B.table(S.fetch(limit=10), linked=3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Slow queries\n", "\n", "The search template above has some pretty tight constraints on one of its objects,\n", "so the amount of data to deal with is pretty limited.\n", "\n", "If the constraints are weak, search may become slow.\n", "\n", "For example:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T08:53:35.245388Z", "start_time": "2018-05-24T08:53:35.241669Z" } }, "outputs": [], "source": [ "query = '''\n", "# test\n", "# verse book=Genesis chapter=2 verse=25\n", "verse\n", "  clause\n", "\n", "    p1:phrase\n", "      w1:word\n", "      w3:word\n", "      w1 < w3\n", "\n", "    p2:phrase\n", "      w2:word\n", "      w1 < w2\n", "      w3 > w2\n", "\n", "    p1 < p2\n", "'''" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A couple of remarks you may have encountered before.\n", "\n", "* some objects have got a name\n", "* there are additional 
relations specified between named objects\n", "* `<` means: *comes before*, and `>`: *comes after* in the canonical order for nodes,\n", " which for words means: comes textually before/after, but for other nodes the meaning\n", " is explained [here](https://dans-labs.github.io/text-fabric/Api/General/#navigating-nodes)\n", "* later on we describe those relations in more detail\n", "\n", "> **Note on order**\n", "Look at the words `w1` and `w3` below phrase `p1`.\n", "Although in the template `w1` comes before `w3`, this is not \n", "translated into a search constraint of the same nature.\n", "\n", "> Order between objects in a template is never significant, only embedding is.\n", "\n", "Because order is not significant, you have to specify order yourself, using relations.\n", "\n", "It turns out that this is better than the other way around.\n", "In MQL order *is* significant, and it is very difficult to \n", "search for `w1` and `w2` in any order.\n", "Especially if you are looking for more than 2 complex objects with lots of feature\n", "conditions, your search template would explode if you had to spell out all\n", "possible permutations. See the example of Reinoud Oosting below.\n", "\n", "> **Note on gaps**\n", "Look at the phrases `p1` and `p2`.\n", "We do not specify an order here, only that they are different.\n", "In order to prevent duplicated searches with `p1` and `p2` interchanged, we even \n", "stipulate that `p1 < p2`.\n", "There are many spatial relationships possible between different objects.\n", "In many cases, neither the one comes before the other, nor vice versa.\n", "They can overlap, one can occur in a gap of the other, they can be completely disjoint\n", "and interleaved, etc." 
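, "\n",
"\n",
"These spatial possibilities can be made concrete with a small sketch in plain Python, in which an object is represented by the set of word positions it occupies. The function `spatial_relation` and the sample sets are hypothetical illustrations, not part of Text-Fabric.\n",
"\n",
"```python\n",
"def spatial_relation(a, b):\n",
"    # a and b are non-empty sets of word positions (slots)\n",
"    if a & b:\n",
"        return 'overlap'\n",
"    if max(a) < min(b):\n",
"        return 'a entirely before b'\n",
"    if max(b) < min(a):\n",
"        return 'b entirely before a'\n",
"    if min(a) < min(b) and max(b) < max(a):\n",
"        return 'b inside the span of a, in its gaps'\n",
"    if min(b) < min(a) and max(a) < max(b):\n",
"        return 'a inside the span of b, in its gaps'\n",
"    return 'interleaved'\n",
"\n",
"\n",
"print(spatial_relation({1, 2}, {3, 4}))\n",
"print(spatial_relation({1, 4, 5}, {2, 3}))\n",
"print(spatial_relation({1, 3}, {2, 4}))\n",
"```\n",
"\n",
"Only the two *entirely before* outcomes correspond to a simple before/after order; in all other cases neither object comes entirely before the other."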
] }, { "cell_type": "code", "execution_count": 16, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T08:53:38.402967Z", "start_time": "2018-05-24T08:53:37.837161Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " | 0.00s Feature overview: 109 for nodes; 8 for edges; 1 configs; 7 computed\n", " 0.00s Checking search template ...\n", " 0.00s Setting up search space for 7 objects ...\n", " 0.45s Constraining search space with 10 relations ...\n", " 0.48s Setting up retrieval plan ...\n", " 0.53s Ready to deliver results from 1897440 nodes\n", "Iterate over S.fetch() to get the results\n", "See S.showPlan() to interpret the results\n" ] } ], "source": [ "S.study(query)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That was quick!\n", "Well, Text-Fabric knows that narrowing down the search space in this case would take ages,\n", "without resulting in a significantly shrunken space.\n", "So it skips doing so for most constraints.\n", "\n", "Let us see the plan, with details." 
] }, { "cell_type": "code", "execution_count": 17, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T08:53:40.974372Z", "start_time": "2018-05-24T08:53:40.920239Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Search with 7 objects and 10 relations\n", "Results are instantiations of the following objects:\n", "node 0-verse ( 23213 choices)\n", "node 1-clause ( 88101 choices)\n", "node 2-phrase (253187 choices)\n", "node 3-word (426584 choices)\n", "node 4-word (426584 choices)\n", "node 5-phrase (253187 choices)\n", "node 6-word (426584 choices)\n", "Instantiations are computed along the following relations:\n", "node 0-verse ( 23213 choices)\n", "edge 0-verse [[ 1-clause ( 3.9 choices)\n", "edge 1-clause [[ 5-phrase ( 2.6 choices)\n", "edge 5-phrase [[ 6-word ( 1.0 choices)\n", "edge 1-clause [[ 2-phrase ( 2.9 choices)\n", "edge 2-phrase < 5-phrase (126593.5 choices)\n", "edge 2-phrase [[ 4-word ( 1.1 choices)\n", "edge 4-word > 6-word (213292.0 choices)\n", "edge 2-phrase [[ 3-word ( 2.2 choices)\n", "edge 3-word < 6-word (213292.0 choices)\n", "edge 3-word < 4-word (213292.0 choices)\n", " 2.55s The results are connected to the original search template as follows:\n", " 0 \n", " 1 # test\n", " 2 # verse book=Genesis chapter=2 verse=25\n", " 3 R0 verse\n", " 4 R1 clause\n", " 5 \n", " 6 R2 p1:phrase\n", " 7 R3 w1:word\n", " 8 R4 w3:word\n", " 9 w1 < w3\n", "10 \n", "11 R5 p2:phrase\n", "12 R6 w2:word\n", "13 w1 < w2 \n", "14 w3 > w2\n", "15 \n", "16 p1 < p2 \n", "17 \n" ] } ], "source": [ "S.showPlan(details=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you see, we have a hefty search space here.\n", "Let us play with the `count()` function." 
] }, { "cell_type": "code", "execution_count": 18, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T08:53:43.176732Z", "start_time": "2018-05-24T08:53:42.972063Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Counting results per 10 up to 100 ...\n", " | 0.08s 10\n", " | 0.09s 20\n", " | 0.09s 30\n", " | 0.11s 40\n", " | 0.11s 50\n", " | 0.12s 60\n", " | 0.13s 70\n", " | 0.13s 80\n", " | 0.13s 90\n", " | 0.14s 100\n", " 0.14s Done: 100 results\n" ] } ], "source": [ "S.count(progress=10, limit=100)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can be bolder than this!" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T08:53:45.993373Z", "start_time": "2018-05-24T08:53:45.182241Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Counting results per 100 up to 1000 ...\n", " | 0.11s 100\n", " | 0.14s 200\n", " | 0.15s 300\n", " | 0.30s 400\n", " | 0.35s 500\n", " | 0.35s 600\n", " | 0.41s 700\n", " | 0.52s 800\n", " | 0.54s 900\n", " | 0.67s 1000\n", " 0.68s Done: 1000 results\n" ] } ], "source": [ "S.count(progress=100, limit=1000)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "OK, not too bad, but note that it takes a big fraction of a second to get just 100 results.\n", "\n", "Now let us go for all of them by the thousand." 
] }, { "cell_type": "code", "execution_count": 20, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T08:53:59.440736Z", "start_time": "2018-05-24T08:53:51.899813Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Counting results per 1000 up to the end of the results ...\n", " | 0.63s 1000\n", " | 0.98s 2000\n", " | 1.35s 3000\n", " | 1.70s 4000\n", " | 2.14s 5000\n", " | 3.08s 6000\n", " | 4.81s 7000\n", " 5.57s Done: 7618 results\n" ] } ], "source": [ "S.count(progress=1000, limit=-1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "See? This is substantial work." ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T08:54:02.778931Z", "start_time": "2018-05-24T08:54:02.657595Z" } }, "outputs": [ { "data": { "text/markdown": [ "n | verse | clause | phrase | word | word | phrase | word\n", "--- | --- | --- | --- | --- | --- | --- | ---\n", "1|Genesis 2:25|וַיִּֽהְי֤וּ שְׁנֵיהֶם֙ עֲרוּמִּ֔ים הָֽאָדָ֖ם וְאִשְׁתֹּ֑ו |שְׁנֵיהֶם֙ הָֽאָדָ֖ם וְאִשְׁתֹּ֑ו |שְׁנֵיהֶם֙ |הָֽ|עֲרוּמִּ֔ים |עֲרוּמִּ֔ים \n", "2|Genesis 2:25|וַיִּֽהְי֤וּ שְׁנֵיהֶם֙ עֲרוּמִּ֔ים הָֽאָדָ֖ם וְאִשְׁתֹּ֑ו |שְׁנֵיהֶם֙ הָֽאָדָ֖ם וְאִשְׁתֹּ֑ו |שְׁנֵיהֶם֙ |אָדָ֖ם |עֲרוּמִּ֔ים |עֲרוּמִּ֔ים \n", "3|Genesis 2:25|וַיִּֽהְי֤וּ שְׁנֵיהֶם֙ עֲרוּמִּ֔ים הָֽאָדָ֖ם וְאִשְׁתֹּ֑ו |שְׁנֵיהֶם֙ הָֽאָדָ֖ם וְאִשְׁתֹּ֑ו |שְׁנֵיהֶם֙ |וְ|עֲרוּמִּ֔ים |עֲרוּמִּ֔ים \n", "4|Genesis 2:25|וַיִּֽהְי֤וּ שְׁנֵיהֶם֙ עֲרוּמִּ֔ים הָֽאָדָ֖ם וְאִשְׁתֹּ֑ו |שְׁנֵיהֶם֙ הָֽאָדָ֖ם וְאִשְׁתֹּ֑ו |שְׁנֵיהֶם֙ |אִשְׁתֹּ֑ו |עֲרוּמִּ֔ים |עֲרוּמִּ֔ים \n", "5|Genesis 4:4|וְהֶ֨בֶל הֵבִ֥יא גַם־ה֛וּא מִבְּכֹרֹ֥ות צֹאנֹ֖ו וּמֵֽחֶלְבֵהֶ֑ן |הֶ֨בֶל גַם־ה֛וּא |הֶ֨בֶל |גַם־|הֵבִ֥יא |הֵבִ֥יא \n", "6|Genesis 4:4|וְהֶ֨בֶל הֵבִ֥יא גַם־ה֛וּא מִבְּכֹרֹ֥ות צֹאנֹ֖ו וּמֵֽחֶלְבֵהֶ֑ן |הֶ֨בֶל גַם־ה֛וּא |הֶ֨בֶל |ה֛וּא |הֵבִ֥יא |הֵבִ֥יא \n", "7|Genesis 10:21|גַּם־ה֑וּא אֲבִי֙ כָּל־בְּנֵי־עֵ֔בֶר אֲחִ֖י יֶ֥פֶת הַגָּדֹֽול׃ |גַּם־ה֑וּא אֲחִ֖י יֶ֥פֶת הַגָּדֹֽול׃ |גַּם־|אֲחִ֖י 
|אֲבִי֙ כָּל־בְּנֵי־עֵ֔בֶר |עֵ֔בֶר \n", "8|Genesis 10:21|גַּם־ה֑וּא אֲבִי֙ כָּל־בְּנֵי־עֵ֔בֶר אֲחִ֖י יֶ֥פֶת הַגָּדֹֽול׃ |גַּם־ה֑וּא אֲחִ֖י יֶ֥פֶת הַגָּדֹֽול׃ |ה֑וּא |אֲחִ֖י |אֲבִי֙ כָּל־בְּנֵי־עֵ֔בֶר |עֵ֔בֶר \n", "9|Genesis 10:21|גַּם־ה֑וּא אֲבִי֙ כָּל־בְּנֵי־עֵ֔בֶר אֲחִ֖י יֶ֥פֶת הַגָּדֹֽול׃ |גַּם־ה֑וּא אֲחִ֖י יֶ֥פֶת הַגָּדֹֽול׃ |גַּם־|יֶ֥פֶת |אֲבִי֙ כָּל־בְּנֵי־עֵ֔בֶר |עֵ֔בֶר \n", "10|Genesis 10:21|גַּם־ה֑וּא אֲבִי֙ כָּל־בְּנֵי־עֵ֔בֶר אֲחִ֖י יֶ֥פֶת הַגָּדֹֽול׃ |גַּם־ה֑וּא אֲחִ֖י יֶ֥פֶת הַגָּדֹֽול׃ |ה֑וּא |יֶ֥פֶת |אֲבִי֙ כָּל־בְּנֵי־עֵ֔בֶר |עֵ֔בֶר " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "B.table(S.fetch(limit=10))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Hand-coding\n", "\n", "As a check, here is some code that looks for basically the same phenomenon:\n", "a phrase within the gap of another phrase.\n", "It does not use search, and it gets a bit more focused results, in half the time compared\n", "to the search with the template.\n", "\n", "> **Hint**\n", "If you are comfortable with programming, and what you look for is fairly generic,\n", "you may be better off without search, provided you can translate your insight in the\n", "data into an effective procedure within Text-Fabric.\n", "But wait till we are completely done with this example!" 
] }, { "cell_type": "code", "execution_count": 22, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T08:54:13.108437Z", "start_time": "2018-05-24T08:54:10.074685Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Getting gapped phrases\n", " 3.04s 368 results\n" ] } ], "source": [ "indent(reset=True)\n", "info('Getting gapped phrases')\n", "results = []\n", "for v in F.otype.s('verse'):\n", "    for c in L.d(v, otype='clause'):\n", "        ps = L.d(c, otype='phrase')\n", "        first = {}\n", "        last = {}\n", "        slots = {}\n", "        # make index of phrase boundaries\n", "        for p in ps:\n", "            words = L.d(p, otype='word')\n", "            first[p] = words[0]\n", "            last[p] = words[-1]\n", "            slots[p] = set(words)\n", "        for p1 in ps:\n", "            for p2 in ps:\n", "                if p2 < p1: continue\n", "                if len(slots[p1] & slots[p2]) != 0: continue\n", "                if first[p1] < first[p2] and last[p2] < last[p1]:\n", "                    results.append((v, c, p1, p2, first[p1], first[p2], last[p2], last[p1]))\n", "info('{} results'.format(len(results)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Pretty printing\n", "\n", "We can use the pretty printing of `B.table()` and `B.show()` here as well, even though we have\n", "not used search!\n", "\n", "Note that you can show the node numbers. In this case it helps to see where the gaps are." 
] }, { "cell_type": "code", "execution_count": 23, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T08:54:34.120441Z", "start_time": "2018-05-24T08:54:34.112139Z" } }, "outputs": [ { "data": { "text/markdown": [ "n | verse | clause | phrase | phrase | word | word | word | word\n", "--- | --- | --- | --- | --- | --- | --- | --- | ---\n", "1|Genesis 2:25 *1414245* |וַיִּֽהְי֤וּ שְׁנֵיהֶם֙ עֲרוּמִּ֔ים הָֽאָדָ֖ם וְאִשְׁתֹּ֑ו *427767* |שְׁנֵיהֶם֙ הָֽאָדָ֖ם וְאִשְׁתֹּ֑ו *652147* |עֲרוּמִּ֔ים *652148* |שְׁנֵיהֶם֙ *1159* |עֲרוּמִּ֔ים *1160* |עֲרוּמִּ֔ים *1160* |אִשְׁתֹּ֑ו *1164* \n", "2|Genesis 4:4 *1414273* |וְהֶ֨בֶל הֵבִ֥יא גַם־ה֛וּא מִבְּכֹרֹ֥ות צֹאנֹ֖ו וּמֵֽחֶלְבֵהֶ֑ן *427889* |הֶ֨בֶל גַם־ה֛וּא *652504* |הֵבִ֥יא *652505* |הֶ֨בֶל *1720* |הֵבִ֥יא *1721* |הֵבִ֥יא *1721* |ה֛וּא *1723* \n", "3|Genesis 10:21 *1414445* |גַּם־ה֑וּא אֲבִי֙ כָּל־בְּנֵי־עֵ֔בֶר אֲחִ֖י יֶ֥פֶת הַגָּדֹֽול׃ *428386* |גַּם־ה֑וּא אֲחִ֖י יֶ֥פֶת הַגָּדֹֽול׃ *654102* |אֲבִי֙ כָּל־בְּנֵי־עֵ֔בֶר *654103* |גַּם־ *4819* |אֲבִי֙ *4821* |עֵ֔בֶר *4824* |גָּדֹֽול׃ *4828* \n", "4|Genesis 12:17 *1414505* |וַיְנַגַּ֨ע יְהוָ֧ה׀ אֶת־פַּרְעֹ֛ה נְגָעִ֥ים גְּדֹלִ֖ים וְאֶת־בֵּיתֹ֑ו עַל־דְּבַ֥ר שָׂרַ֖י אֵ֥שֶׁת אַבְרָֽם׃ *428569* |אֶת־פַּרְעֹ֛ה וְאֶת־בֵּיתֹ֑ו *654678* |נְגָעִ֥ים גְּדֹלִ֖ים *654679* |אֶת־ *5803* |נְגָעִ֥ים *5805* |גְּדֹלִ֖ים *5806* |בֵּיתֹ֑ו *5809* \n", "5|Genesis 13:1 *1414509* |וַיַּעַל֩ אַבְרָ֨ם מִמִּצְרַ֜יִם ה֠וּא וְאִשְׁתֹּ֧ו וְכָל־הַנֶּֽגְבָּה׃ *428585* |אַבְרָ֨ם ה֠וּא וְאִשְׁתֹּ֧ו וְכָל־ *654725* |מִמִּצְרַ֜יִם *654726* |אַבְרָ֨ם *5868* |מִ *5869* |מִּצְרַ֜יִם *5870* |כָל־ *5875* \n", "6|Genesis 14:16 *1414542* |וְגַם֩ אֶת־לֹ֨וט אָחִ֤יו וּרְכֻשֹׁו֙ הֵשִׁ֔יב וְגַ֥ם אֶת־הַנָּשִׁ֖ים וְאֶת־הָעָֽם׃ *428692* |גַם֩ אֶת־לֹ֨וט אָחִ֤יו וּרְכֻשֹׁו֙ וְגַ֥ם אֶת־הַנָּשִׁ֖ים וְאֶת־הָעָֽם׃ *655061* |הֵשִׁ֔יב *655062* |גַם֩ *6515* |הֵשִׁ֔יב *6521* |הֵשִׁ֔יב *6521* |עָֽם׃ *6530* \n", "7|Genesis 17:7 *1414594* |לִהְיֹ֤ות לְךָ֙ לֵֽאלֹהִ֔ים וּֽלְזַרְעֲךָ֖ אַחֲרֶֽיךָ׃ *428886* |לְךָ֙ וּֽלְזַרְעֲךָ֖ אַחֲרֶֽיךָ׃ 
*655642* |לֵֽאלֹהִ֔ים *655643* |לְךָ֙ *7431* |לֵֽ *7432* |אלֹהִ֔ים *7433* |אַחֲרֶֽיךָ׃ *7437* \n", "8|Genesis 19:4 *1414651* |וְאַנְשֵׁ֨י הָעִ֜יר אַנְשֵׁ֤י סְדֹם֙ נָסַ֣בּוּ עַל־הַבַּ֔יִת מִנַּ֖עַר וְעַד־זָקֵ֑ן כָּל־הָעָ֖ם מִקָּצֶֽה׃ *429128* |אַנְשֵׁ֨י הָעִ֜יר אַנְשֵׁ֤י סְדֹם֙ מִנַּ֖עַר וְעַד־זָקֵ֑ן כָּל־הָעָ֖ם מִקָּצֶֽה׃ *656353* |נָסַ֣בּוּ *656354* |אַנְשֵׁ֨י *8502* |נָסַ֣בּוּ *8507* |נָסַ֣בּוּ *8507* |קָּצֶֽה׃ *8520* \n", "9|Genesis 19:4 *1414651* |וְאַנְשֵׁ֨י הָעִ֜יר אַנְשֵׁ֤י סְדֹם֙ נָסַ֣בּוּ עַל־הַבַּ֔יִת מִנַּ֖עַר וְעַד־זָקֵ֑ן כָּל־הָעָ֖ם מִקָּצֶֽה׃ *429128* |אַנְשֵׁ֨י הָעִ֜יר אַנְשֵׁ֤י סְדֹם֙ מִנַּ֖עַר וְעַד־זָקֵ֑ן כָּל־הָעָ֖ם מִקָּצֶֽה׃ *656353* |עַל־הַבַּ֔יִת *656355* |אַנְשֵׁ֨י *8502* |עַל־ *8508* |בַּ֔יִת *8510* |קָּצֶֽה׃ *8520* \n", "10|Genesis 22:3 *1414740* |וַיִּקַּ֞ח אֶת־שְׁנֵ֤י נְעָרָיו֙ אִתֹּ֔ו וְאֵ֖ת יִצְחָ֣ק בְּנֹ֑ו *429497* |אֶת־שְׁנֵ֤י נְעָרָיו֙ וְאֵ֖ת יִצְחָ֣ק בְּנֹ֑ו *657505* |אִתֹּ֔ו *657506* |אֶת־ *10284* |אִתֹּ֔ו *10287* |אִתֹּ֔ו *10287* |בְּנֹ֑ו *10291* " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "\n", "\n", "**verse** *1*\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
[Rendered HTML display of Genesis 2:25 omitted. Sentence 59, clause WayX: phrase Conj CP (conj and); phrase Pred VP (verb be, qal wayq); phrase Subj NP (subs two); phrase PreC AdjP (adjv naked); phrase Subj NP (art the; subs human, mankind; conj and; subs woman). Sentence 60, clause WxY0: phrase Conj CP (conj and); phrase Nega NegP (nega not); phrase Pred VP (verb be ashamed, hit impf).]
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "B.table(results, withNodes=True, end=10)\n", "B.show(results, start=1, end=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**NB**\n", "Gaps are a tricky phenomenon. In [gaps](searchGaps.ipynb) we will deal with them cruelly." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Next\n", "\n", "You have seen cases where the implementation is to blame.\n", "\n", "Now I want to point to gaps in your understanding:\n", "[gaps](searchGaps.ipynb)\n", "\n", "---\n", "\n", "[basic](search.ipynb)\n", "[advanced](searchAdvanced.ipynb)\n", "[relations](searchRelations.ipynb)\n", "[quantifiers](searchQuantifiers.ipynb)\n", "rough\n", "[gaps](searchGaps.ipynb)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }