{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n", "\n", "\n", "# Search\n", "\n", "*Search* in Text-Fabric is a template-based way of looking for structural patterns in your dataset.\n", "\n", "It is inspired by the idea of\n", "[topographic query](http://books.google.nl/books?id=9ggOBRz1dO4C),\n", "as worked out in \n", "[MQL](https://shebanq.ancient-data.org/shebanq/static/docs/MQL-Query-Guide.pdf)\n", "which has been implemented in \n", "[Emdros](http://emdros.org).\n", "See also [pitfalls of MQL](https://etcbc.github.io/bhsa/mql#pitfalls-of-mql).\n", "\n", "Within Text-Fabric we have the unique possibility to combine the ease of formulating search templates for\n", "complicated syntactical patterns with the power of programmatically processing the results.\n", "\n", "This notebook will show you how to get up and running.\n", "\n", "See the notebook\n", "[searchFromMQL](searchFromMQL.ipynb)\n", "for examples of how MQL queries can be expressed in Text-Fabric search.\n", "\n", "# Before we continue\n", "Search is a big feature in Text-Fabric.\n", "It is also a very recent addition.\n", "\n", "##### Caution:\n", "> There might be bugs.\n", "\n", "Search is also costly.\n", "Quite a bit of the implementation work has been dedicated to optimizing performance.\n", "But it is worth the price: search templates are powerful for a wide range of purposes.\n", "I do not pretend, however, to have found optimal strategies for all \n", "possible search templates.\n", "\n", "That being said, I think search might turn out to be helpful in many cases,\n", "and I welcome your feedback.\n", "\n", "*Dirk Roorda, 2016-12-23, updates 2017-10-10*\n", "\n", "# Search command\n", "\n", "Search is as simple as saying (just an example)\n", "\n", "```python\n", "for r in S.search(template): print(S.glean(r))\n", "```\n", "\n", "See all ins and outs in the\n", "[search template 
reference](https://github.com/Dans-labs/text-fabric/wiki/Api#search-template-reference).\n", "\n", "All search related things use the\n", "[`S` api](https://github.com/Dans-labs/text-fabric/wiki/Api#search)." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from tf.fabric import Fabric" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "This is Text-Fabric 3.0.9\n", "Api reference : https://github.com/Dans-labs/text-fabric/wiki/Api\n", "Tutorial : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb\n", "Example data : https://github.com/Dans-labs/text-fabric-data\n", "\n", "114 features found and 0 ignored\n" ] } ], "source": [ "DATABASE = '~/github/etcbc'\n", "BHSA = 'bhsa/tf/2017'\n", "TF = Fabric(locations=[DATABASE], modules=[BHSA], silent=False )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us just *not* load any specific features." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "api = TF.load('', silent=True)\n", "api.makeAvailableIn(globals())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Basic search command\n", "\n", "We start with the most simple form of issuing a query.\n", "Let's look for the word Elohim in undetermined phrases, only in Genesis 1-2.\n", "\n", "All work involved in searching takes place under the hood." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Genesis 1:1 phrase[אֱלֹהִ֑ים ] אֱלֹהִ֑ים \n", " Genesis 1:2 phrase[ר֣וּחַ אֱלֹהִ֔ים ] אֱלֹהִ֔ים \n", " Genesis 1:3 phrase[אֱלֹהִ֖ים ] אֱלֹהִ֖ים \n", " Genesis 1:4 phrase[אֱלֹהִ֛ים ] אֱלֹהִ֛ים \n", " Genesis 1:4 phrase[אֱלֹהִ֔ים ] אֱלֹהִ֔ים \n", " Genesis 1:5 phrase[אֱלֹהִ֤ים׀ ] אֱלֹהִ֤ים׀ \n", " Genesis 1:6 phrase[אֱלֹהִ֔ים ] אֱלֹהִ֔ים \n", " Genesis 1:7 phrase[אֱלֹהִים֮ ] אֱלֹהִים֮ \n", " Genesis 1:8 phrase[אֱלֹהִ֛ים ] אֱלֹהִ֛ים \n", " Genesis 1:9 phrase[אֱלֹהִ֗ים ] אֱלֹהִ֗ים \n", " Genesis 1:10 phrase[אֱלֹהִ֤ים׀ ] אֱלֹהִ֤ים׀ \n", " Genesis 1:10 phrase[אֱלֹהִ֖ים ] אֱלֹהִ֖ים \n", " Genesis 1:11 phrase[אֱלֹהִ֗ים ] אֱלֹהִ֗ים \n", " Genesis 1:12 phrase[אֱלֹהִ֖ים ] אֱלֹהִ֖ים \n", " Genesis 1:14 phrase[אֱלֹהִ֗ים ] אֱלֹהִ֗ים \n", " Genesis 1:16 phrase[אֱלֹהִ֔ים ] אֱלֹהִ֔ים \n", " Genesis 1:17 phrase[אֱלֹהִ֖ים ] אֱלֹהִ֖ים \n", " Genesis 1:18 phrase[אֱלֹהִ֖ים ] אֱלֹהִ֖ים \n", " Genesis 1:20 phrase[אֱלֹהִ֔ים ] אֱלֹהִ֔ים \n", " Genesis 1:21 phrase[אֱלֹהִ֖ים ] אֱלֹהִ֖ים \n", " Genesis 1:21 phrase[אֱלֹהִ֔ים ] אֱלֹהִ֔ים \n", " Genesis 1:22 phrase[אֱלֹהִ֖ים ] אֱלֹהִ֖ים \n", " Genesis 1:24 phrase[אֱלֹהִ֗ים ] אֱלֹהִ֗ים \n", " Genesis 1:25 phrase[אֱלֹהִים֩ ] אֱלֹהִים֩ \n", " Genesis 1:25 phrase[אֱלֹהִ֖ים ] אֱלֹהִ֖ים \n", " Genesis 1:26 phrase[אֱלֹהִ֔ים ] אֱלֹהִ֔ים \n", " Genesis 1:27 phrase[אֱלֹהִ֤ים׀ ] אֱלֹהִ֤ים׀ \n", " Genesis 1:27 phrase[בְּצֶ֥לֶם אֱלֹהִ֖ים ] אֱלֹהִ֖ים \n", " Genesis 1:28 phrase[אֱלֹהִ֗ים ] אֱלֹהִ֗ים \n", " Genesis 1:28 phrase[אֱלֹהִים֒ ] אֱלֹהִים֒ \n", " Genesis 1:29 phrase[אֱלֹהִ֗ים ] אֱלֹהִ֗ים \n", " Genesis 1:31 phrase[אֱלֹהִים֙ ] אֱלֹהִים֙ \n", " Genesis 2:2 phrase[אֱלֹהִים֙ ] אֱלֹהִים֙ \n", " Genesis 2:3 phrase[אֱלֹהִים֙ ] אֱלֹהִים֙ \n", " Genesis 2:3 phrase[אֱלֹהִ֖ים ] אֱלֹהִ֖ים \n" ] } ], "source": [ "query = '''\n", "book book=Genesis\n", "    chapter chapter=1|2\n", "        verse\n", "            phrase det=und\n", "                word lex=>LHJM/\n", 
"'''\n", "for r in S.search(query): print(S.glean(r))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Under the hood\n", "\n", "It might be helpful to peek under the hood, especially when exploring new searches.\n", "We feed the query to the search API, which will *study* it.\n", "The syntax will be checked, features loaded, the search space will be set up, narrowed down, \n", "and the fetching of results will be prepared, but not yet executed.\n", "\n", "In order to make the query a bit more interesting, we lift the constraint that the results must be in Genesis 1-2." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "query = '''\n", "book\n", "    chapter\n", "        verse\n", "            phrase det=und\n", "                word lex=>LHJM/\n", "'''" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Checking search template ...\n", " 0.00s Setting up search space for 5 objects ...\n", " 0.99s Constraining search space with 4 relations ...\n", " 1.00s Setting up retrieval plan ...\n", " 1.02s Ready to deliver results from 2735 nodes\n", "Iterate over S.fetch() to get the results\n", "See S.showPlan() to interpret the results\n" ] } ], "source": [ "S.study(query)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before we rush to the results, let's have a look at the *plan*." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 11s The results are connected to the original search template as follows:\n", " 0 \n", " 1 R0 book\n", " 2 R1 chapter\n", " 3 R2 verse\n", " 4 R3 phrase det=und\n", " 5 R4 word lex=>LHJM/\n", " 6 \n" ] } ], "source": [ "S.showPlan()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here you can already see what your results will look like.\n", "Each result `r` is a *tuple* of nodes:\n", "```\n", "(R0, R1, R2, R3, R4)\n", "```\n", "that instantiate the objects in your template.\n", "\n", "## Excursion\n", "In case you are curious, you can get details about the search space as well:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Search with 5 objects and 4 relations\n", "Results are instantiations of the following objects:\n", "node 0-book ( 29 choices)\n", "node 1-chapter ( 329 choices)\n", "node 2-verse ( 754 choices)\n", "node 3-phrase ( 805 choices)\n", "node 4-word ( 818 choices)\n", "Instantiations are computed along the following relations:\n", "node 0-book ( 29 choices)\n", "edge 0-book [[ 1-chapter ( 10.6 choices)\n", "edge 1-chapter [[ 2-verse ( 1.8 choices)\n", "edge 2-verse [[ 3-phrase ( 1.1 choices)\n", "edge 3-phrase [[ 4-word ( 1.0 choices)\n", " 19s The results are connected to the original search template as follows:\n", " 0 \n", " 1 R0 book\n", " 2 R1 chapter\n", " 3 R2 verse\n", " 4 R3 phrase det=und\n", " 5 R4 word lex=>LHJM/\n", " 6 \n" ] } ], "source": [ "S.showPlan(details=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The part about the *nodes* shows you how many possible instantiations for each object in your template\n", "have been found.\n", "These are not results yet, because only combinations of instantiations\n", "that satisfy all constraints are results.\n", "\n", "The constraints come 
from the relations between the objects that you specified.\n", "In this case, there is only an implicit relation: embedding `[[`. \n", "Later on we'll examine all basic relations.\n", "\n", "The part about the *edges* shows you the constraints,\n", "and in what order they will be computed when stitching results together.\n", "In this case the order is exactly the order in which the relations appear in the template,\n", "but that will not always be the case.\n", "Text-Fabric spends some time and ingenuity on finding an optimal *stitch plan*.\n", "\n", "Nevertheless, fetching results may take time. \n", "\n", "For some queries, it can take a large amount of time to walk through all results.\n", "Even worse, it may take a large amount of time before you get the *first* result.\n", "\n", "This has to do with search strategies on the one hand,\n", "and on the other hand with the very real possibility of encountering *pathological* search patterns,\n", "which have billions of results, mostly unintended.\n", "For example, a simple query that asks for 5 words in the Hebrew Bible without further constraints\n", "will have 425,000 to the power of 5 results.\n", "That is about 10^28 (a one followed by 28 zeros),\n", "roughly the number of molecules in a few hundred cubic meters of air.\n", "That may not sound like much, but it is 10,000 times the number of bytes\n", "that can currently be stored on the whole Internet.\n", "\n", "Text-Fabric search is not yet done with finding optimal search strategies,\n", "and I hope to refine its arsenal of methods in the future, depending on what you report.\n", "\n", "## Back to business\n", "It is always a good idea to get a feel for the number of results before you dive into them head-on." 
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Counting results per 1 up to 5 ...\n", " | 0.01s 1\n", " | 0.01s 2\n", " | 0.01s 3\n", " | 0.01s 4\n", " | 0.02s 5\n", " 0.02s Done: 5 results\n" ] } ], "source": [ "S.count(progress=1, limit=5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We asked for 5 results in total, with a progress message for every one.\n", "That was a bit conservative." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Counting results per 100 up to 500 ...\n", " | 0.01s 100\n", " | 0.03s 200\n", " | 0.07s 300\n", " | 0.09s 400\n", " | 0.13s 500\n", " 0.14s Done: 500 results\n" ] } ], "source": [ "S.count(progress=100, limit=500)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Still pretty quick, now we want to count all results." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Counting results per 200 up to the end of the results ...\n", " | 0.03s 200\n", " | 0.09s 400\n", " | 0.13s 600\n", " | 0.15s 800\n", " 0.15s Done: 818 results\n" ] } ], "source": [ "S.count(progress=200, limit=-1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is time to see something of those results." 
] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((426585, 426624, 1414190, 651505, 4),\n", " (426585, 426624, 1414191, 651515, 26),\n", " (426585, 426624, 1414192, 651520, 34),\n", " (426585, 426624, 1414193, 651528, 42),\n", " (426585, 426624, 1414193, 651534, 50),\n", " (426585, 426624, 1414194, 651538, 60),\n", " (426585, 426624, 1414195, 651554, 81),\n", " (426585, 426624, 1414196, 651564, 97),\n", " (426585, 426624, 1414197, 651578, 127),\n", " (426585, 426624, 1414198, 651590, 142))" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "S.fetch(limit=10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Not very informative.\n", "Just a quick observation: look at the last column.\n", "These are the result nodes for the `word` part in the query, indicated as `R4` by `showPlan()` before.\n", "And indeed, they are all below 425,000, roughly the number of words in the Hebrew Bible.\n", "\n", "Nevertheless, we want to glean a bit more information from them." 
] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Genesis 1:1 phrase[אֱלֹהִ֑ים ] אֱלֹהִ֑ים \n", " Genesis 1:2 phrase[ר֣וּחַ אֱלֹהִ֔ים ] אֱלֹהִ֔ים \n", " Genesis 1:3 phrase[אֱלֹהִ֖ים ] אֱלֹהִ֖ים \n", " Genesis 1:4 phrase[אֱלֹהִ֛ים ] אֱלֹהִ֛ים \n", " Genesis 1:4 phrase[אֱלֹהִ֔ים ] אֱלֹהִ֔ים \n", " Genesis 1:5 phrase[אֱלֹהִ֤ים׀ ] אֱלֹהִ֤ים׀ \n", " Genesis 1:6 phrase[אֱלֹהִ֔ים ] אֱלֹהִ֔ים \n", " Genesis 1:7 phrase[אֱלֹהִים֮ ] אֱלֹהִים֮ \n", " Genesis 1:8 phrase[אֱלֹהִ֛ים ] אֱלֹהִ֛ים \n", " Genesis 1:9 phrase[אֱלֹהִ֗ים ] אֱלֹהִ֗ים \n" ] } ], "source": [ "for r in S.fetch(limit=10):\n", "    print(S.glean(r))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Caution\n", "> It is not possible to do `len(S.fetch())`,\n", "because `fetch()` is a *generator*, not a list.\n", "It will deliver a result every time it is asked, for as long as there are results,\n", "but it does not know in advance how many there will be.\n", "\n", "> Fetching a result can be costly, because due to the constraints, a lot of possibilities\n", "may have to be tried and rejected before the next result is found.\n", "\n", "> That is why you often see results coming in at varying speeds when counting them." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This search template has some pretty tight constraints on one of its objects,\n", "so the amount of data to deal with is pretty limited.\n", "\n", "Let us turn to a template where this is not so." 
] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "query = '''\n", "# test\n", "# verse book=Genesis chapter=2 verse=25\n", "verse\n", "    clause\n", "\n", "        p1:phrase\n", "            w1:word\n", "            w3:word\n", "            w1 < w3\n", "\n", "        p2:phrase\n", "            w2:word\n", "            w1 < w2\n", "            w3 > w2\n", "\n", "        p1 < p2\n", "'''" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A couple of remarks.\n", "\n", "* some objects have been given a name\n", "* there are additional relations specified between named objects\n", "* `<` means: *comes before*, and `>`: *comes after* in the canonical order for nodes,\n", "  which for words means: comes textually before/after, but for other nodes the meaning\n", "  is explained [here](https://github.com/Dans-labs/text-fabric/wiki/Api#sorting-nodes)\n", "* later on we describe those relations in more detail\n", "\n", "##### Note on order\n", "> Look at the words `w1` and `w3` below phrase `p1`.\n", "Although in the template `w1` comes before `w3`, this is not \n", "translated into a search constraint of the same nature.\n", "\n", "> Order between objects in a template is never significant, only embedding is.\n", "\n", "Because order is not significant, you have to specify order yourself, using relations.\n", "\n", "It turns out that this is better than the other way around.\n", "In MQL order *is* significant, and it is very difficult to \n", "search for `w1` and `w2` in any order.\n", "Especially if you are looking for more than 2 complex objects with lots of feature\n", "conditions, your search template would explode if you had to spell out all\n", "possible permutations. 
See the example of Reinoud Oosting below.\n", "\n", "##### Note on gaps\n", "> Look at the phrases `p1` and `p2`.\n", "We do not specify an order here, only that they are different.\n", "In order to prevent duplicated searches with `p1` and `p2` interchanged, we even \n", "stipulate that `p1 < p2`.\n", "There are many spatial relationships possible between different objects.\n", "In many cases, neither the one comes before the other, nor vice versa.\n", "They can overlap, one can occur in a gap of the other, they can be completely disjoint\n", "and interleaved, etc." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Checking search template ...\n", " 0.00s Setting up search space for 7 objects ...\n", " 0.48s Constraining search space with 10 relations ...\n", " 0.51s Setting up retrieval plan ...\n", " 0.56s Ready to deliver results from 1897440 nodes\n", "Iterate over S.fetch() to get the results\n", "See S.showPlan() to interpret the results\n" ] } ], "source": [ "S.study(query)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That was quick!\n", "Well, Text-Fabric knows that narrowing down the search space in this case would take ages,\n", "without resulting in a significantly shrunken space.\n", "So it skips doing so for most constraints.\n", "\n", "Let us see the plan, with details." 
] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Search with 7 objects and 10 relations\n", "Results are instantiations of the following objects:\n", "node 0-verse ( 23213 choices)\n", "node 1-clause ( 88101 choices)\n", "node 2-phrase (253187 choices)\n", "node 3-word (426584 choices)\n", "node 4-word (426584 choices)\n", "node 5-phrase (253187 choices)\n", "node 6-word (426584 choices)\n", "Instantiations are computed along the following relations:\n", "node 0-verse ( 23213 choices)\n", "edge 0-verse [[ 1-clause ( 3.6 choices)\n", "edge 1-clause [[ 5-phrase ( 2.5 choices)\n", "edge 5-phrase [[ 6-word ( 2.3 choices)\n", "edge 1-clause [[ 2-phrase ( 3.4 choices)\n", "edge 2-phrase < 5-phrase (126593.5 choices)\n", "edge 2-phrase [[ 3-word ( 1.6 choices)\n", "edge 3-word < 6-word (213292.0 choices)\n", "edge 2-phrase [[ 4-word ( 1.6 choices)\n", "edge 4-word > 6-word (213292.0 choices)\n", "edge 3-word < 4-word (213292.0 choices)\n", " 4.13s The results are connected to the original search template as follows:\n", " 0 \n", " 1 # test\n", " 2 # verse book=Genesis chapter=2 verse=25\n", " 3 R0 verse\n", " 4 R1 clause\n", " 5 \n", " 6 R2 p1:phrase\n", " 7 R3 w1:word\n", " 8 R4 w3:word\n", " 9 w1 < w3\n", "10 \n", "11 R5 p2:phrase\n", "12 R6 w2:word\n", "13 w1 < w2 \n", "14 w3 > w2\n", "15 \n", "16 p1 < p2 \n", "17 \n" ] } ], "source": [ "S.showPlan(details=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you see, we have a hefty search space here.\n", "Let us play with the `count()` function." 
] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Counting results per 10 up to 100 ...\n", " | 0.14s 10\n", " | 0.14s 20\n", " | 0.14s 30\n", " | 0.17s 40\n", " | 0.17s 50\n", " | 0.18s 60\n", " | 0.20s 70\n", " | 0.20s 80\n", " | 0.20s 90\n", " | 0.21s 100\n", " 0.21s Done: 100 results\n" ] } ], "source": [ "S.count(progress=10, limit=100)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can be bolder than this!" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Counting results per 100 up to 1000 ...\n", " | 0.18s 100\n", " | 0.23s 200\n", " | 0.23s 300\n", " | 0.46s 400\n", " | 0.53s 500\n", " | 0.54s 600\n", " | 0.64s 700\n", " | 0.81s 800\n", " | 0.84s 900\n", " | 1.07s 1000\n", " 1.07s Done: 1000 results\n" ] } ], "source": [ "S.count(progress=100, limit=1000)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "OK, not too bad, but note that it takes a big fraction of a second to get just 100 results.\n", "\n", "Now let us go for all of them by the thousand." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Counting results per 1000 up to the end of the results ...\n", " | 1.01s 1000\n", " | 1.60s 2000\n", " | 2.26s 3000\n", " | 2.87s 4000\n", " | 3.58s 5000\n", " | 4.82s 6000\n", " | 7.56s 7000\n", " 8.98s Done: 7618 results\n" ] } ], "source": [ "S.count(progress=1000, limit=-1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "See? This is substantial work." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Genesis 2:25 clause[וַיִּֽהְי֤וּ שְׁנֵיהֶם֙ עֲרוּמִּ֔ים הָֽ...] 
phrase[שְׁנֵיהֶם֙ הָֽאָדָ֖ם וְאִשְׁתֹּ֑ו ] שְׁנֵיהֶם֙ הָֽ phrase[עֲרוּמִּ֔ים ] עֲרוּמִּ֔ים \n", "Genesis 2:25 clause[וַיִּֽהְי֤וּ שְׁנֵיהֶם֙ עֲרוּמִּ֔ים הָֽ...] phrase[שְׁנֵיהֶם֙ הָֽאָדָ֖ם וְאִשְׁתֹּ֑ו ] שְׁנֵיהֶם֙ אָדָ֖ם phrase[עֲרוּמִּ֔ים ] עֲרוּמִּ֔ים \n", "Genesis 2:25 clause[וַיִּֽהְי֤וּ שְׁנֵיהֶם֙ עֲרוּמִּ֔ים הָֽ...] phrase[שְׁנֵיהֶם֙ הָֽאָדָ֖ם וְאִשְׁתֹּ֑ו ] שְׁנֵיהֶם֙ וְ phrase[עֲרוּמִּ֔ים ] עֲרוּמִּ֔ים \n", "Genesis 2:25 clause[וַיִּֽהְי֤וּ שְׁנֵיהֶם֙ עֲרוּמִּ֔ים הָֽ...] phrase[שְׁנֵיהֶם֙ הָֽאָדָ֖ם וְאִשְׁתֹּ֑ו ] שְׁנֵיהֶם֙ אִשְׁתֹּ֑ו phrase[עֲרוּמִּ֔ים ] עֲרוּמִּ֔ים \n", "Genesis 4:4 clause[וְהֶ֨בֶל הֵבִ֥יא גַם־ה֛וּא ...] phrase[הֶ֨בֶל גַם־ה֛וּא ] הֶ֨בֶל גַם־ phrase[הֵבִ֥יא ] הֵבִ֥יא \n", "Genesis 4:4 clause[וְהֶ֨בֶל הֵבִ֥יא גַם־ה֛וּא ...] phrase[הֶ֨בֶל גַם־ה֛וּא ] הֶ֨בֶל ה֛וּא phrase[הֵבִ֥יא ] הֵבִ֥יא \n", "Genesis 10:21 clause[גַּם־ה֑וּא אֲבִי֙ כָּל־בְּנֵי־...] phrase[גַּם־ה֑וּא אֲחִ֖י יֶ֥פֶת הַ...] גַּם־ אֲחִ֖י phrase[אֲבִי֙ כָּל־בְּנֵי־עֵ֔בֶר ] עֵ֔בֶר \n", "Genesis 10:21 clause[גַּם־ה֑וּא אֲבִי֙ כָּל־בְּנֵי־...] phrase[גַּם־ה֑וּא אֲחִ֖י יֶ֥פֶת הַ...] גַּם־ יֶ֥פֶת phrase[אֲבִי֙ כָּל־בְּנֵי־עֵ֔בֶר ] עֵ֔בֶר \n", "Genesis 10:21 clause[גַּם־ה֑וּא אֲבִי֙ כָּל־בְּנֵי־...] phrase[גַּם־ה֑וּא אֲחִ֖י יֶ֥פֶת הַ...] גַּם־ הַ phrase[אֲבִי֙ כָּל־בְּנֵי־עֵ֔בֶר ] עֵ֔בֶר \n", "Genesis 10:21 clause[גַּם־ה֑וּא אֲבִי֙ כָּל־בְּנֵי־...] phrase[גַּם־ה֑וּא אֲחִ֖י יֶ֥פֶת הַ...] 
גַּם־ גָּדֹֽול׃ phrase[אֲבִי֙ כָּל־בְּנֵי־עֵ֔בֶר ] עֵ֔בֶר \n" ] } ], "source": [ "for r in S.fetch(limit=10):\n", "    print(S.glean(r))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a check, here is some code that looks for basically the same phenomenon:\n", "a phrase within the gap of another phrase.\n", "It does not use search, and it gets somewhat more focused results, in half the time compared\n", "to the search with the template.\n", "\n", "##### Hint\n", "> If you are comfortable with programming, and what you look for is fairly generic,\n", "you may be better off without search, provided you can translate your insight into the\n", "data into an effective procedure within Text-Fabric.\n", "But wait till we are completely done with this example!" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Getting gapped phrases\n", " 3.21s 368 results\n", "(1414245, 427767, 652147, 652148, 1159, 1160, 1160, 1164)\n", "(1414273, 427889, 652504, 652505, 1720, 1721, 1721, 1723)\n", "(1414445, 428386, 654102, 654103, 4819, 4821, 4824, 4828)\n", "(1414505, 428569, 654678, 654679, 5803, 5805, 5806, 5809)\n", "(1414509, 428585, 654725, 654726, 5868, 5869, 5870, 5875)\n", "(1414542, 428692, 655061, 655062, 6515, 6521, 6521, 6530)\n", "(1414594, 428886, 655642, 655643, 7431, 7432, 7433, 7437)\n", "(1414651, 429128, 656353, 656354, 8502, 8507, 8507, 8520)\n", "(1414651, 429128, 656353, 656355, 8502, 8508, 8510, 8520)\n", "(1414740, 429497, 657505, 657506, 10284, 10287, 10287, 10291)\n" ] } ], "source": [ "indent(reset=True)\n", "info('Getting gapped phrases')\n", "results = []\n", "for v in F.otype.s('verse'):\n", "    for c in L.d(v, otype='clause'):\n", "        ps = L.d(c, otype='phrase')\n", "        first = {}\n", "        last = {}\n", "        slots = {}\n", "        # make index of phrase boundaries\n", "        for p in ps:\n", "            words = L.d(p, otype='word')\n", "            first[p] = words[0]\n", 
"            last[p] = words[-1]\n", "            slots[p] = set(words)\n", "        for p1 in ps:\n", "            for p2 in ps:\n", "                if p2 < p1: continue\n", "                if len(slots[p1] & slots[p2]) != 0: continue\n", "                if first[p1] < first[p2] and last[p2] < last[p1]:\n", "                    results.append((v, c, p1, p2, first[p1], first[p2], last[p2], last[p1]))\n", "info('{} results'.format(len(results)))\n", "for r in results[0:10]:\n", "    print(r)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But we can use the pretty printing of `glean()` here as well, even though we have\n", "not used search!" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Genesis 2:25 clause[וַיִּֽהְי֤וּ שְׁנֵיהֶם֙ עֲרוּמִּ֔ים הָֽ...] phrase[שְׁנֵיהֶם֙ הָֽאָדָ֖ם וְאִשְׁתֹּ֑ו ] phrase[עֲרוּמִּ֔ים ] שְׁנֵיהֶם֙ עֲרוּמִּ֔ים עֲרוּמִּ֔ים אִשְׁתֹּ֑ו \n", "Genesis 4:4 clause[וְהֶ֨בֶל הֵבִ֥יא גַם־ה֛וּא ...] phrase[הֶ֨בֶל גַם־ה֛וּא ] phrase[הֵבִ֥יא ] הֶ֨בֶל הֵבִ֥יא הֵבִ֥יא ה֛וּא \n", "Genesis 10:21 clause[גַּם־ה֑וּא אֲבִי֙ כָּל־בְּנֵי־...] phrase[גַּם־ה֑וּא אֲחִ֖י יֶ֥פֶת הַ...] phrase[אֲבִי֙ כָּל־בְּנֵי־עֵ֔בֶר ] גַּם־ אֲבִי֙ עֵ֔בֶר גָּדֹֽול׃ \n", "Genesis 12:17 clause[וַיְנַגַּ֨ע יְהוָ֧ה׀ אֶת־פַּרְעֹ֛ה ...] phrase[אֶת־פַּרְעֹ֛ה וְאֶת־בֵּיתֹ֑ו ] phrase[נְגָעִ֥ים גְּדֹלִ֖ים ] אֶת־ נְגָעִ֥ים גְּדֹלִ֖ים בֵּיתֹ֑ו \n", "Genesis 13:1 clause[וַיַּעַל֩ אַבְרָ֨ם מִמִּצְרַ֜יִם ...] phrase[אַבְרָ֨ם ה֠וּא וְאִשְׁתֹּ֧ו וְ...] phrase[מִמִּצְרַ֜יִם ] אַבְרָ֨ם מִ מִּצְרַ֜יִם כָל־\n", "Genesis 14:16 clause[וְגַם֩ אֶת־לֹ֨וט אָחִ֤יו ...] phrase[גַם֩ אֶת־לֹ֨וט אָחִ֤יו וּ...] phrase[הֵשִׁ֔יב ] גַם֩ הֵשִׁ֔יב הֵשִׁ֔יב עָֽם׃ \n", "Genesis 17:7 clause[לִהְיֹ֤ות לְךָ֙ לֵֽאלֹהִ֔ים ...] phrase[לְךָ֙ וּֽלְזַרְעֲךָ֖ אַחֲרֶֽיךָ׃ ] phrase[לֵֽאלֹהִ֔ים ] לְךָ֙ לֵֽ אלֹהִ֔ים אַחֲרֶֽיךָ׃ \n", "Genesis 19:4 clause[וְאַנְשֵׁ֨י הָעִ֜יר אַנְשֵׁ֤י ...] phrase[אַנְשֵׁ֨י הָעִ֜יר אַנְשֵׁ֤י סְדֹם֙ ...] phrase[נָסַ֣בּוּ ] אַנְשֵׁ֨י נָסַ֣בּוּ נָסַ֣בּוּ קָּצֶֽה׃ \n", "Genesis 19:4 clause[וְאַנְשֵׁ֨י הָעִ֜יר אַנְשֵׁ֤י ...] 
phrase[אַנְשֵׁ֨י הָעִ֜יר אַנְשֵׁ֤י סְדֹם֙ ...] phrase[עַל־הַבַּ֔יִת ] אַנְשֵׁ֨י עַל־ בַּ֔יִת קָּצֶֽה׃ \n", "Genesis 22:3 clause[וַיִּקַּ֞ח אֶת־שְׁנֵ֤י נְעָרָיו֙ ...] phrase[אֶת־שְׁנֵ֤י נְעָרָיו֙ וְאֵ֖ת ...] phrase[אִתֹּ֔ו ] אֶת־ אִתֹּ֔ו אִתֹּ֔ו בְּנֹ֑ו \n" ] } ], "source": [ "for r in results[0:10]: print(S.glean(r))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Further on we have another example with gaps, and we get the results by means of a search template in a slightly other way." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Refine the search template\n", "\n", "A second look at the results of our search template reveals \n", "that there are multiple results per pair of phrases,\n", "because there are in general multiple words in both\n", "phrases that satisfy the condition.\n", "We can make the search template stricter,\n", "by requiring alignment of the words with the starts and ends of the phrases\n", "they are in.\n", "\n", "For this, we employ a convenient device in search templates that we have not explained yet.\n", "\n", "**Before each atom we may put a relational operator.**\n", "\n", "The meaning is that this relation holds between the preceding atom and the current one.\n", "If there is a lonely operator all by itself on a line, it means that \n", "this relation holds between the preceding sibling and the parent.\n", "\n", "These operators are very handy to indicate that there is an order between siblings,\n", "and also that a child should start or end where the parent starts or ends." 
] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "query = '''\n", "verse\n", "    clause\n", "\n", "        p1:phrase\n", "            =: w1:word\n", "            < w3:word\n", "            :=\n", "\n", "        p2:phrase\n", "            =: w2:word\n", "\n", "        p1 < p2\n", "        w1 < p2\n", "        w2 < w3\n", "\n", "'''" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The line \n", "\n", "```\n", "=: w1:word\n", "``` \n", "\n", "constrains word `w1` to start exactly at the start of its parent, phrase `p1`.\n", "\n", "The line\n", "\n", "```\n", "< w3:word\n", "```\n", "\n", "constrains the preceding sibling `w1` to come before `w3` in the canonical node ordering.\n", "Because `w1` and `w3` are words, this means that `w1` comes textually before `w3`.\n", "\n", "The line \n", "\n", "```\n", ":=\n", "``` \n", "\n", "constrains the preceding sibling, word `w3`, to end exactly at the end of its parent,\n", "phrase `p1`.\n", "\n", "The line \n", "\n", "```\n", "=: w2:word\n", "```\n", "\n", "constrains word `w2` to start exactly at the start of its parent, phrase `p2`.\n", "\n", "Given two phrases `p1` and `p2`, the positions of all three words `w1`, `w2`, `w3` are fixed, so for every pair `p1`, `p2` that satisfies the conditions, \n", "there is exactly one result." 
] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Checking search template ...\n", " 0.05s Setting up search space for 7 objects ...\n", " 0.53s Constraining search space with 13 relations ...\n", " 0.57s Setting up retrieval plan ...\n", " 0.66s Ready to deliver results from 1897440 nodes\n", "Iterate over S.fetch() to get the results\n", "See S.showPlan() to interpret the results\n", "Search with 7 objects and 13 relations\n", "Results are instantiations of the following objects:\n", "node 0-verse ( 23213 choices)\n", "node 1-clause ( 88101 choices)\n", "node 2-phrase (253187 choices)\n", "node 3-word (426584 choices)\n", "node 4-word (426584 choices)\n", "node 5-phrase (253187 choices)\n", "node 6-word (426584 choices)\n", "Instantiations are computed along the following relations:\n", "node 0-verse ( 23213 choices)\n", "edge 0-verse [[ 1-clause ( 4.1 choices)\n", "edge 1-clause [[ 2-phrase ( 3.0 choices)\n", "edge 2-phrase =: 3-word ( 1.0 choices)\n", "edge 3-word ]] 2-phrase ( 1.0 choices)\n", "edge 2-phrase := 4-word ( 1.0 choices)\n", "edge 4-word ]] 2-phrase ( 1.0 choices)\n", "edge 3-word < 4-word (213292.0 choices)\n", "edge 1-clause [[ 5-phrase ( 3.3 choices)\n", "edge 5-phrase > 2-phrase (126593.5 choices)\n", "edge 3-word < 5-phrase (126593.5 choices)\n", "edge 5-phrase =: 6-word ( 1.0 choices)\n", "edge 6-word ]] 5-phrase ( 1.0 choices)\n", "edge 6-word < 4-word (213292.0 choices)\n", " 0.69s The results are connected to the original search template as follows:\n", " 0 \n", " 1 R0 verse\n", " 2 R1 clause\n", " 3 \n", " 4 R2 p1:phrase\n", " 5 R3 =: w1:word\n", " 6 R4 < w3:word\n", " 7 :=\n", " 8 \n", " 9 R5 p2:phrase\n", "10 R6 =: w2:word\n", "11 \n", "12 p1 < p2\n", "13 w1 < p2\n", "14 w2 < w3\n", "15 \n", "16 \n", " 0.00s Counting results per 100 up to the end of the results ...\n", " | 0.45s 100\n", " | 1.01s 200\n", " | 1.95s 300\n", " 3.64s Done: 368 
results\n", "Genesis 2:25 clause[וַיִּֽהְי֤וּ שְׁנֵיהֶם֙ עֲרוּמִּ֔ים הָֽ...] phrase[שְׁנֵיהֶם֙ הָֽאָדָ֖ם וְאִשְׁתֹּ֑ו ] שְׁנֵיהֶם֙ אִשְׁתֹּ֑ו phrase[עֲרוּמִּ֔ים ] עֲרוּמִּ֔ים \n", "Genesis 4:4 clause[וְהֶ֨בֶל הֵבִ֥יא גַם־ה֛וּא ...] phrase[הֶ֨בֶל גַם־ה֛וּא ] הֶ֨בֶל ה֛וּא phrase[הֵבִ֥יא ] הֵבִ֥יא \n", "Genesis 10:21 clause[גַּם־ה֑וּא אֲבִי֙ כָּל־בְּנֵי־...] phrase[גַּם־ה֑וּא אֲחִ֖י יֶ֥פֶת הַ...] גַּם־ גָּדֹֽול׃ phrase[אֲבִי֙ כָּל־בְּנֵי־עֵ֔בֶר ] אֲבִי֙ \n", "Genesis 12:17 clause[וַיְנַגַּ֨ע יְהוָ֧ה׀ אֶת־פַּרְעֹ֛ה ...] phrase[אֶת־פַּרְעֹ֛ה וְאֶת־בֵּיתֹ֑ו ] אֶת־ בֵּיתֹ֑ו phrase[נְגָעִ֥ים גְּדֹלִ֖ים ] נְגָעִ֥ים \n", "Genesis 13:1 clause[וַיַּעַל֩ אַבְרָ֨ם מִמִּצְרַ֜יִם ...] phrase[אַבְרָ֨ם ה֠וּא וְאִשְׁתֹּ֧ו וְ...] אַבְרָ֨ם כָל־ phrase[מִמִּצְרַ֜יִם ] מִ\n", "Genesis 14:16 clause[וְגַם֩ אֶת־לֹ֨וט אָחִ֤יו ...] phrase[גַם֩ אֶת־לֹ֨וט אָחִ֤יו וּ...] גַם֩ עָֽם׃ phrase[הֵשִׁ֔יב ] הֵשִׁ֔יב \n", "Genesis 17:7 clause[לִהְיֹ֤ות לְךָ֙ לֵֽאלֹהִ֔ים ...] phrase[לְךָ֙ וּֽלְזַרְעֲךָ֖ אַחֲרֶֽיךָ׃ ] לְךָ֙ אַחֲרֶֽיךָ׃ phrase[לֵֽאלֹהִ֔ים ] לֵֽ\n", "Genesis 19:4 clause[וְאַנְשֵׁ֨י הָעִ֜יר אַנְשֵׁ֤י ...] phrase[אַנְשֵׁ֨י הָעִ֜יר אַנְשֵׁ֤י סְדֹם֙ ...] אַנְשֵׁ֨י קָּצֶֽה׃ phrase[נָסַ֣בּוּ ] נָסַ֣בּוּ \n", "Genesis 19:4 clause[וְאַנְשֵׁ֨י הָעִ֜יר אַנְשֵׁ֤י ...] phrase[אַנְשֵׁ֨י הָעִ֜יר אַנְשֵׁ֤י סְדֹם֙ ...] אַנְשֵׁ֨י קָּצֶֽה׃ phrase[עַל־הַבַּ֔יִת ] עַל־\n", "Genesis 22:3 clause[וַיִּקַּ֞ח אֶת־שְׁנֵ֤י נְעָרָיו֙ ...] phrase[אֶת־שְׁנֵ֤י נְעָרָיו֙ וְאֵ֖ת ...] 
אֶת־ בְּנֹ֑ו phrase[אִתֹּ֔ו ] אִתֹּ֔ו \n" ] } ], "source": [ "S.study(query)\n", "S.showPlan(details=True)\n", "S.count(progress=100, limit=-1)\n", "for r in S.fetch(limit=10): print(S.glean(r))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And here we have exactly the same results as our hand-written piece of code.\n", "\n", "> Note\n", "Now, with the \"duplicate\" results prevented, the search with the template has\n", "only a slight performance overhead compared to the manual piece of code!\n", "\n", "But beware of complications. \n", "Search templates are powerful, but sometimes they lead to a different\n", "result set from what you might think.\n", "Here is an example.\n", "\n", "# A tricky example\n", "\n", "Suppose we want to count the clauses consisting of exactly two phrases.\n", "The following template should do it:\n", "a clause, starting with a phrase, followed by an adjacent phrase,\n", "which terminates the clause." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "query = '''\n", "clause\n", " =: phrase\n", " <: phrase\n", " :=\n", "'''" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Checking search template ...\n", " 0.00s Setting up search space for 3 objects ...\n", " 0.15s Constraining search space with 5 relations ...\n", " 0.17s Setting up retrieval plan ...\n", " 0.22s Ready to deliver results from 594475 nodes\n", "Iterate over S.fetch() to get the results\n", "See S.showPlan() to interpret the results\n", "Search with 3 objects and 5 relations\n", "Results are instantiations of the following objects:\n", "node 0-clause ( 88101 choices)\n", "node 1-phrase (253187 choices)\n", "node 2-phrase (253187 choices)\n", "Instantiations are computed along the following relations:\n", "node 0-clause ( 88101 choices)\n", "edge 0-clause := 2-phrase ( 1.0 choices)\n", "edge 2-phrase ]] 0-clause ( 1.0 
choices)\n", "edge 2-phrase :> 1-phrase ( 1.0 choices)\n", "edge 1-phrase =: 0-clause ( 0.2 choices)\n", "edge 1-phrase ]] 0-clause ( 1.0 choices)\n", " 0.25s The results are connected to the original search template as follows:\n", " 0 \n", " 1 R0 clause\n", " 2 R1 =: phrase\n", " 3 R2 <: phrase\n", " 4 :=\n", " 5 \n", " 1.20s Done: found 23483\n" ] } ], "source": [ "S.study(query)\n", "S.showPlan(details=True)\n", "qresults = sorted(r[0] for r in S.fetch())\n", "info(f'Done: found {len(qresults)}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us check this with a piece of hand-written code." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s counting ...\n", " 1.55s Done: found 23399\n" ] } ], "source": [ "indent(reset=True)\n", "info('counting ...')\n", "\n", "cresults = []\n", "for c in F.otype.s('clause'):\n", "    wc = L.d(c, otype='word')\n", "    ps = L.d(c, otype='phrase')\n", "    if len(ps) == 2:\n", "        (fp, lp) = ps\n", "        wf = L.d(fp, otype='word')\n", "        wl = L.d(lp, otype='word')\n", "        # first phrase starts the clause, second follows immediately and ends the clause\n", "        if wf[0] == wc[0] and wf[-1] + 1 == wl[0] and wl[-1] == wc[-1]:\n", "            cresults.append(c)\n", "cresults = sorted(cresults)\n", "info(f'Done: found {len(cresults)}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Strange, we end up with fewer cases. What is happening? Let us compare the results.\n", "We look at the first result where both methods diverge." 
] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "23119 differences\n", "(428692, 428697)\n" ] } ], "source": [ "diff = [x for x in zip(qresults, cresults) if x[0] != x[1]]\n", "print(f'{len(diff)} differences')\n", "print(diff[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The two sorted lists are compared position by position, so once they diverge nearly every later pair differs as well;\n", "only the first difference is informative.\n", "Let's look at the phrases of the first difference:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Phrase 655060 has words [6514]\n", "Phrase 655061 has words [6515, 6516, 6517, 6518, 6519, 6520, 6522, 6523, 6524, 6525, 6526, 6527, 6528, 6529, 6530]\n", "Phrase 655062 has words [6521]\n" ] } ], "source": [ "for p in L.d(diff[0][0], otype='phrase'):\n", "    print(f'Phrase {p} has words {L.d(p, otype=\"word\")}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This clause has three phrases, but the third one lies inside a gap of the second one,\n", "so that the clause indeed satisfies the pattern of two adjacent phrases.\n", "\n", "Can we adjust the pattern to exclude cases like this? \n", "At the moment, our search template mechanism is not powerful enough for that.\n", "\n", "We can count how often it happens, however. \n", "We require a third phrase to be present, different from both of the first two." 
] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "query = '''\n", "clause\n", "    =: p1:phrase\n", "    <: p2:phrase\n", "    :=\n", "    p3:phrase\n", "    p1 # p3\n", "    p2 # p3\n", "'''" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Checking search template ...\n", " 0.02s Setting up search space for 4 objects ...\n", " 0.24s Constraining search space with 8 relations ...\n", " 0.27s Setting up retrieval plan ...\n", " 0.32s Ready to deliver results from 847662 nodes\n", "Iterate over S.fetch() to get the results\n", "See S.showPlan() to interpret the results\n", " 0.32s The results are connected to the original search template as follows:\n", " 0 \n", " 1 R0 clause\n", " 2 R1 =: p1:phrase\n", " 3 R2 <: p2:phrase\n", " 4 :=\n", " 5 R3 p3:phrase\n", " 6 p1 # p3\n", " 7 p2 # p3\n", " 8 \n", " 1.32s Done: found 118\n" ] } ], "source": [ "S.study(query)\n", "S.showPlan()\n", "rresults = sorted(r[0] for r in S.fetch())\n", "info(f'Done: found {len(rresults)}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But we have to filter this, because for each `p1`, `p2` pair there might be multiple `p3` that satisfy the constraints.\n", "Since every clause has exactly one such pair, let's gather the set of distinct clauses." ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "84" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(set(rresults))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And this is exactly the difference (23483 - 23399 = 84) between \n", "the number of results of the search template and the hand-written piece of code." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Testing basic relations\n", "\n", "Basic relations are about the identity and spatial ordering of objects.\n", "Are they the same, do they occupy the same slots, do they overlap, is one embedded in the other,\n", "does one come before the other?\n", "\n", "We also have edge features, which specify relationships between nodes.\n", "\n", "Although the basic relationships are easy to define, and even easy to implement,\n", "they may be very costly to use. \n", "When searching, most of them have to be computed a great many times.\n", "\n", "Some of them have been precomputed and stored in an index, e.g. the embedding relationships.\n", "They can be used without penalty.\n", "\n", "Other relations are not suitable for pre-computing: most inequality relations are of that kind.\n", "It would require an enormous amount of storage to pre-compute for each node the set of nodes that\n", "occupy different slots. This type of relation will not be used in narrowing down the search space,\n", "which means that it may take more time to get the results.\n", "\n", "We are going to test all of our basic relationships here.\n", "\n", "Let us first see what relations we have:" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " = left equal to right (as node)\n", " # left unequal to right (as node)\n", " < left before right (in canonical node ordering)\n", " > left after right (in canonical node ordering)\n", " == left occupies same slots as right\n", " && left has overlapping slots with right\n", " ## left and right do not have the same slot set\n", " || left and right do not have common slots\n", " [[ left embeds right\n", " ]] left embedded in right\n", " << left completely before right\n", " >> left completely after right\n", " =: left and right start at the same slot\n", " := left and right end at the same slot\n", " :: left and right start 
and end at the same slot\n", " <: left immediately before right\n", " :> left immediately after right\n", " =k: left and right start at k-nearly the same slot\n", " :k= left and right end at k-nearly the same slot\n", " :k: left and right start and end at k-near slots\n", " <k: left k-nearly before right\n", " :k> left k-nearly after right\n", "-distributional_parent> edge feature \"distributional_parent\"\n", "-functional_parent> edge feature \"functional_parent\"\n", "-mother> edge feature \"mother\"\n", "-omap@2016-2017> edge feature \"omap@2016-2017\"\n", " `, which will constrain the results better.\n", "\n", "We have seen this in action in the search for gapped phrases." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# < and > (canonical)\n", "\n", "`n < m` if `n` comes before `m` in the\n", "[canonical ordering](https://github.com/Dans-labs/text-fabric/wiki/Api#sorting-nodes)\n", "of nodes.\n", "\n", "We have seen them in action before." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# == (same slots)\n", "\n", "Two objects are extensionally equal if they occupy exactly the same slots.\n", "\n", "Quite an expensive relation, as you will see: over 20 seconds for 3601 results." 
] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Isaiah 52:14 sentence[כַּאֲשֶׁ֨ר שָׁמְמ֤וּ עָלֶ֨יךָ֙ רַבִּ֔ים ...]\n", "Proverbs 23:34 sentence[וְ֭הָיִיתָ כְּשֹׁכֵ֣ב בְּ...]\n", "Proverbs 24:4 sentence[וּ֭בְדַעַת חֲדָרִ֣ים יִמָּלְא֑וּ ...]\n", "Leviticus 12:1 sentence[וַיְדַבֵּ֥ר יְהוָ֖ה אֶל־מֹשֶׁ֥ה ...]\n", "Leviticus 12:3 sentence[וּבַיֹּ֖ום הַ...]\n", "Proverbs 24:8 sentence[מְחַשֵּׁ֥ב לְהָרֵ֑עַ לֹ֝֗ו בַּֽעַל־...]\n", "Leviticus 12:6 sentence[וּבִמְלֹ֣את׀ יְמֵ֣י טָהֳרָ֗הּ ...]\n", "Leviticus 13:1 sentence[וַיְדַבֵּ֣ר יְהוָ֔ה אֶל־מֹשֶׁ֥ה ...]\n", "Isaiah 54:12 sentence[וְשַׂמְתִּ֤י כַּֽדְכֹד֙ שִׁמְשֹׁתַ֔יִךְ וּ...]\n", "Leviticus 13:9 sentence[נֶ֣גַע צָרַ֔עַת כִּ֥י תִהְיֶ֖ה בְּ...]\n", " 0.00s Counting results per 1000 up to 10000 ...\n", " | 6.46s 1000\n", " | 13s 2000\n", " | 19s 3000\n", " 23s Done: 3601 results\n" ] } ], "source": [ "query = '''\n", "v:verse\n", " s:sentence\n", "v == s\n", "'''\n", "for r in S.search(query, limit=10): print(S.glean(r))\n", "S.count(progress=1000, limit=10000)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# && (overlap)\n", "\n", "Two objects overlap if and only if they share at least one slot.\n", "This is quite costly to use in some cases." ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Genesis 1:14 phrase[לְאֹתֹת֙ וּלְמֹ֣ועֲדִ֔ים ...] subphrase[לְאֹתֹת֙ וּלְמֹ֣ועֲדִ֔ים ] subphrase[לְאֹתֹת֙ ]\n", "Genesis 1:14 phrase[לְאֹתֹת֙ וּלְמֹ֣ועֲדִ֔ים ...] subphrase[לְאֹתֹת֙ וּלְמֹ֣ועֲדִ֔ים ] subphrase[לְמֹ֣ועֲדִ֔ים ]\n", "Genesis 1:14 phrase[לְאֹתֹת֙ וּלְמֹ֣ועֲדִ֔ים ...] subphrase[לְאֹתֹת֙ ] subphrase[לְאֹתֹת֙ וּלְמֹ֣ועֲדִ֔ים ]\n", "Genesis 1:14 phrase[לְאֹתֹת֙ וּלְמֹ֣ועֲדִ֔ים ...] subphrase[לְמֹ֣ועֲדִ֔ים ] subphrase[לְאֹתֹת֙ וּלְמֹ֣ועֲדִ֔ים ]\n", "Genesis 1:14 phrase[לְאֹתֹת֙ וּלְמֹ֣ועֲדִ֔ים ...] 
subphrase[לְיָמִ֖ים וְשָׁנִֽים׃ ] subphrase[יָמִ֖ים ]\n", "Genesis 1:14 phrase[לְאֹתֹת֙ וּלְמֹ֣ועֲדִ֔ים ...] subphrase[לְיָמִ֖ים וְשָׁנִֽים׃ ] subphrase[שָׁנִֽים׃ ]\n", "Genesis 1:14 phrase[לְאֹתֹת֙ וּלְמֹ֣ועֲדִ֔ים ...] subphrase[יָמִ֖ים ] subphrase[לְיָמִ֖ים וְשָׁנִֽים׃ ]\n", "Genesis 1:14 phrase[לְאֹתֹת֙ וּלְמֹ֣ועֲדִ֔ים ...] subphrase[שָׁנִֽים׃ ] subphrase[לְיָמִ֖ים וְשָׁנִֽים׃ ]\n", "Genesis 1:16 phrase[אֶת־שְׁנֵ֥י הַמְּאֹרֹ֖ת הַ...] subphrase[הַמְּאֹרֹ֖ת הַגְּדֹלִ֑ים ] subphrase[הַגְּדֹלִ֑ים ]\n", "Genesis 1:16 phrase[אֶת־שְׁנֵ֥י הַמְּאֹרֹ֖ת הַ...] subphrase[הַמְּאֹרֹ֖ת ] subphrase[הַמְּאֹרֹ֖ת הַגְּדֹלִ֑ים ]\n" ] } ], "source": [ "query = '''\n", "verse\n", "    phrase\n", "        s1:subphrase\n", "        s2:subphrase\n", "        s1 # s2\n", "        s1 && s2\n", "'''\n", "for r in S.search(query, limit=10): print(S.glean(r))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# ## (not the same slots)\n", "\n", "True when the two objects in question do not occupy exactly the same set of slots.\n", "This is a very loose relationship.\n", "\n", "It is implemented, but not yet tested, and at the moment I do not have a clear use case for it.\n", "\n", "# || (disjoint slots)\n", "\n", "True when the two objects in question do not share any slots.\n", "This is a rather loose relationship.\n", "\n", "This can be used for locating gaps: a textual object that lies inside a gap of another object." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# [[ and ]] (embedding)\n", "\n", "`n [[ m` if object `n` embeds `m`.\n", "\n", "`n ]] m` if object `n` lies embedded in `m`.\n", "\n", "These relations are used implicitly in templates when there is indentation:\n", "\n", "```\n", "s:sentence\n", "    p:phrase\n", "        w1:word gn=f\n", "        w2:word gn=m\n", "```\n", "\n", "The template above implicitly states the following embeddings:\n", "\n", "* `s [[ p`\n", "* `p [[ w1`\n", "* `p [[ w2`\n", "\n", "We have seen these relations in action." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# << and >> (before and after with slots)\n", "\n", "These relations test whether one object comes before or after an other,\n", "in the sense that the slots\n", "occupied by the one object lie completely \n", "before or after the slots occupied by the other object." ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Genesis 1:11 sentence[תַּֽדְשֵׁ֤א הָאָ֨רֶץ֙ דֶּ֔שֶׁא עֵ֚שֶׂב ...] clause[תַּֽדְשֵׁ֤א הָאָ֨רֶץ֙ דֶּ֔שֶׁא עֵ֚שֶׂב ...] phrase[עֹ֤שֶׂה ] clause[אֲשֶׁ֥ר זַרְעֹו־בֹ֖ו ]\n", "Genesis 1:11 sentence[תַּֽדְשֵׁ֤א הָאָ֨רֶץ֙ דֶּ֔שֶׁא עֵ֚שֶׂב ...] clause[תַּֽדְשֵׁ֤א הָאָ֨רֶץ֙ דֶּ֔שֶׁא עֵ֚שֶׂב ...] phrase[פְּרִי֙ ] clause[אֲשֶׁ֥ר זַרְעֹו־בֹ֖ו ]\n", "Genesis 1:11 sentence[תַּֽדְשֵׁ֤א הָאָ֨רֶץ֙ דֶּ֔שֶׁא עֵ֚שֶׂב ...] clause[תַּֽדְשֵׁ֤א הָאָ֨רֶץ֙ דֶּ֔שֶׁא עֵ֚שֶׂב ...] phrase[לְמִינֹ֔ו ] clause[אֲשֶׁ֥ר זַרְעֹו־בֹ֖ו ]\n", "Genesis 1:11 sentence[תַּֽדְשֵׁ֤א הָאָ֨רֶץ֙ דֶּ֔שֶׁא עֵ֚שֶׂב ...] clause[מַזְרִ֣יעַ זֶ֔רַע ] phrase[עֹ֤שֶׂה ] clause[אֲשֶׁ֥ר זַרְעֹו־בֹ֖ו ]\n", "Genesis 1:11 sentence[תַּֽדְשֵׁ֤א הָאָ֨רֶץ֙ דֶּ֔שֶׁא עֵ֚שֶׂב ...] clause[מַזְרִ֣יעַ זֶ֔רַע ] phrase[פְּרִי֙ ] clause[אֲשֶׁ֥ר זַרְעֹו־בֹ֖ו ]\n", "Genesis 1:11 sentence[תַּֽדְשֵׁ֤א הָאָ֨רֶץ֙ דֶּ֔שֶׁא עֵ֚שֶׂב ...] clause[מַזְרִ֣יעַ זֶ֔רַע ] phrase[לְמִינֹ֔ו ] clause[אֲשֶׁ֥ר זַרְעֹו־בֹ֖ו ]\n", "Genesis 1:12 sentence[וַתֹּוצֵ֨א הָאָ֜רֶץ דֶּ֠שֶׁא ...] clause[וַתֹּוצֵ֨א הָאָ֜רֶץ דֶּ֠שֶׁא ...] phrase[עֹ֥שֶׂה ] clause[אֲשֶׁ֥ר זַרְעֹו־בֹ֖ו ]\n", "Genesis 1:12 sentence[וַתֹּוצֵ֨א הָאָ֜רֶץ דֶּ֠שֶׁא ...] clause[וַתֹּוצֵ֨א הָאָ֜רֶץ דֶּ֠שֶׁא ...] phrase[פְּרִ֛י ] clause[אֲשֶׁ֥ר זַרְעֹו־בֹ֖ו ]\n", "Genesis 1:12 sentence[וַתֹּוצֵ֨א הָאָ֜רֶץ דֶּ֠שֶׁא ...] clause[מַזְרִ֤יעַ זֶ֨רַע֙ לְמִינֵ֔הוּ ] phrase[עֹ֥שֶׂה ] clause[אֲשֶׁ֥ר זַרְעֹו־בֹ֖ו ]\n", "Genesis 1:12 sentence[וַתֹּוצֵ֨א הָאָ֜רֶץ דֶּ֠שֶׁא ...] 
clause[מַזְרִ֤יעַ זֶ֨רַע֙ לְמִינֵ֔הוּ ] phrase[פְּרִ֛י ] clause[אֲשֶׁ֥ר זַרְעֹו־בֹ֖ו ]\n" ] } ], "source": [ "query = '''\n", "verse\n", "    sentence\n", "        c1:clause\n", "        p:phrase\n", "        c2:clause\n", "        c1 << p\n", "        c2 >> p\n", "'''\n", "for r in S.search(query, limit=10): print(S.glean(r))" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "# =: (start at same slots)\n", "This relation holds when the left and right hand sides are nodes that have the same first slot.\n", "It serves to enforce that a child of a parent is textually the first thing inside that\n", "parent. We have seen it in action before.\n", "\n", "# := (end at same slots)\n", "This relation holds when the left and right hand sides are nodes that have the same last slot.\n", "It serves to enforce that a child of a parent is textually the last thing inside that\n", "parent. We have seen it in action before.\n", "\n", "# :: (same start and end slots)\n", "This relation holds when `=:` and `:=` both hold between the left and right hand sides.\n", "It serves to look for parents with a single child, or at least for parents that are textually spanned by a single child." 
] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Genesis 1:5 clause[יֹ֥ום אֶחָֽד׃ פ ] phrase[יֹ֥ום אֶחָֽד׃ פ ]\n", "Genesis 1:8 clause[יֹ֥ום שֵׁנִֽי׃ פ ] phrase[יֹ֥ום שֵׁנִֽי׃ פ ]\n", "Genesis 1:13 clause[יֹ֥ום שְׁלִישִֽׁי׃ פ ] phrase[יֹ֥ום שְׁלִישִֽׁי׃ פ ]\n", "Genesis 1:19 clause[יֹ֥ום רְבִיעִֽי׃ פ ] phrase[יֹ֥ום רְבִיעִֽי׃ פ ]\n", "Genesis 1:22 clause[לֵאמֹ֑ר ] phrase[לֵאמֹ֑ר ]\n", "Genesis 1:22 clause[פְּר֣וּ ] phrase[פְּר֣וּ ]\n", "Genesis 1:23 clause[יֹ֥ום חֲמִישִֽׁי׃ פ ] phrase[יֹ֥ום חֲמִישִֽׁי׃ פ ]\n", "Genesis 1:28 clause[פְּר֥וּ ] phrase[פְּר֥וּ ]\n", "Genesis 1:31 clause[יֹ֥ום הַשִּׁשִּֽׁי׃ פ ] phrase[יֹ֥ום הַשִּׁשִּֽׁי׃ פ ]\n", "Genesis 2:3 clause[לַעֲשֹֽׂות׃ פ ] phrase[לַעֲשֹֽׂות׃ פ ]\n", " 0.00s Counting results per 1000 up to the end of the results ...\n", " | 0.10s 1000\n", " | 0.19s 2000\n", " | 0.27s 3000\n", " | 0.34s 4000\n", " | 0.38s 5000\n", " | 0.44s 6000\n", " | 0.48s 7000\n", " | 0.54s 8000\n", " | 0.60s 9000\n", " 0.65s Done: 9451 results\n" ] } ], "source": [ "query = '''\n", "verse\n", " clause\n", " :: phrase\n", "'''\n", "for r in S.search(query, limit=10): print(S.glean(r))\n", "S.count(progress=1000, limit=-1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Like before, there might be extra phrases in such clauses, lying embedded in the clause-spanning phrase." ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Genesis 10:21 clause[גַּם־ה֑וּא אֲבִי֙ כָּל־בְּנֵי־...] phrase[גַּם־ה֑וּא אֲחִ֖י יֶ֥פֶת הַ...] phrase[אֲבִי֙ כָּל־בְּנֵי־עֵ֔בֶר ]\n", "Genesis 24:24 clause[בַּת־בְּתוּאֵ֖ל אָנֹ֑כִי בֶּן־מִלְכָּ֕ה ] phrase[בַּת־בְּתוּאֵ֖ל בֶּן־מִלְכָּ֕ה ] phrase[אָנֹ֑כִי ]\n", "Genesis 31:16 clause[לָ֥נוּ ה֖וּא וּלְבָנֵ֑ינוּ ] phrase[לָ֥נוּ וּלְבָנֵ֑ינוּ ] phrase[ה֖וּא ]\n", "Genesis 31:53 clause[אֱלֹהֵ֨י אַבְרָהָ֜ם וֵֽאלֹהֵ֤י נָחֹור֙ ...] 
phrase[אֱלֹהֵ֨י אַבְרָהָ֜ם וֵֽאלֹהֵ֤י נָחֹור֙ ...] phrase[יִשְׁפְּט֣וּ ]\n", "Genesis 31:53 clause[אֱלֹהֵ֨י אַבְרָהָ֜ם וֵֽאלֹהֵ֤י נָחֹור֙ ...] phrase[אֱלֹהֵ֨י אַבְרָהָ֜ם וֵֽאלֹהֵ֤י נָחֹור֙ ...] phrase[בֵינֵ֔ינוּ ]\n", "Exodus 28:1 clause[לְכַהֲנֹו־לִ֑י אַהֲרֹ֕ן נָדָ֧ב ...] phrase[לְכַהֲנֹו־אַהֲרֹ֕ן נָדָ֧ב וַ...] phrase[לִ֑י ]\n", "Exodus 28:14 clause[מִגְבָּלֹ֛ת תַּעֲשֶׂ֥ה אֹתָ֖ם מַעֲשֵׂ֣ה עֲבֹ֑ת ] phrase[מִגְבָּלֹ֛ת מַעֲשֵׂ֣ה עֲבֹ֑ת ] phrase[תַּעֲשֶׂ֥ה ]\n", "Exodus 28:14 clause[מִגְבָּלֹ֛ת תַּעֲשֶׂ֥ה אֹתָ֖ם מַעֲשֵׂ֣ה עֲבֹ֑ת ] phrase[מִגְבָּלֹ֛ת מַעֲשֵׂ֣ה עֲבֹ֑ת ] phrase[אֹתָ֖ם ]\n", "Exodus 29:18 clause[עֹלָ֥ה ה֖וּא לַֽיהוָ֑ה רֵ֣יחַ ...] phrase[עֹלָ֥ה רֵ֣יחַ נִיחֹ֔וחַ ] phrase[ה֖וּא ]\n", "Exodus 29:18 clause[עֹלָ֥ה ה֖וּא לַֽיהוָ֑ה רֵ֣יחַ ...] phrase[עֹלָ֥ה רֵ֣יחַ נִיחֹ֔וחַ ] phrase[לַֽיהוָ֑ה ]\n", " 0.00s Counting results per 1000 up to the end of the results ...\n", " 0.57s Done: 80 results\n" ] } ], "source": [ "query = '''\n", "verse\n", "    clause\n", "        :: p1:phrase\n", "        p2:phrase\n", "        p1 # p2\n", "'''\n", "for r in S.search(query, limit=10): print(S.glean(r))\n", "S.count(progress=1000, limit=-1)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# <: (adjacent before) \n", "This relation holds when the left hand side ends in the slot immediately before the first slot of the right hand side.\n", "It serves to enforce that siblings of a parent are adjacent, in this order.\n", "\n", "# :> (adjacent after)\n", "This relation holds when the left hand side starts in the slot immediately after the last slot of the right hand side.\n", "\n", "As an example: are there clauses with multiple clause atoms without a gap between the two?" 
] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Counting results per 1000 up to the end of the results ...\n", " 0.78s Done: 0 results\n" ] } ], "source": [ "query = '''\n", "verse\n", " clause\n", " clause_atom\n", " <: clause_atom\n", "'''\n", "for r in S.search(query, limit=10): print(S.glean(r))\n", "S.count(progress=1000, limit=-1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Conclusion: there is always textual material between clause_atoms of the same clause.\n", "If we lift the adjacency to sequentially before (`<<`) we do get results:" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Genesis 1:7 clause[וַיַּבְדֵּ֗ל בֵּ֤ין הַמַּ֨יִם֙ ...] clause_atom[וַיַּבְדֵּ֗ל בֵּ֤ין הַמַּ֨יִם֙ ] clause_atom[וּבֵ֣ין הַמַּ֔יִם ]\n", "Genesis 1:11 clause[תַּֽדְשֵׁ֤א הָאָ֨רֶץ֙ דֶּ֔שֶׁא עֵ֚שֶׂב ...] clause_atom[תַּֽדְשֵׁ֤א הָאָ֨רֶץ֙ דֶּ֔שֶׁא עֵ֚שֶׂב ] clause_atom[עֵ֣ץ פְּרִ֞י ]\n", "Genesis 1:11 clause[עֹ֤שֶׂה פְּרִי֙ לְמִינֹ֔ו עַל־...] clause_atom[עֹ֤שֶׂה פְּרִי֙ לְמִינֹ֔ו ] clause_atom[עַל־הָאָ֑רֶץ ]\n", "Genesis 1:12 clause[וַתֹּוצֵ֨א הָאָ֜רֶץ דֶּ֠שֶׁא ...] clause_atom[וַתֹּוצֵ֨א הָאָ֜רֶץ דֶּ֠שֶׁא ...] clause_atom[וְעֵ֧ץ ]\n", "Genesis 1:12 clause[עֹ֥שֶׂה פְּרִ֛י לְמִינֵ֑הוּ ] clause_atom[עֹ֥שֶׂה פְּרִ֛י ] clause_atom[לְמִינֵ֑הוּ ]\n", "Genesis 1:21 clause[וַיִּבְרָ֣א אֱלֹהִ֔ים אֶת־הַ...] clause_atom[וַיִּבְרָ֣א אֱלֹהִ֔ים אֶת־הַ...] clause_atom[לְמִֽינֵהֶ֗ם ]\n", "Genesis 1:29 clause[הִנֵּה֩ נָתַ֨תִּי לָכֶ֜ם אֶת־כָּל־...] clause_atom[הִנֵּה֩ נָתַ֨תִּי לָכֶ֜ם אֶת־כָּל־...] clause_atom[וְאֶת־כָּל־הָעֵ֛ץ ]\n", "Genesis 1:30 clause[וּֽלְכָל־חַיַּ֣ת הָ֠...] clause_atom[וּֽלְכָל־חַיַּ֣ת הָ֠...] 
clause_atom[אֶת־כָּל־יֶ֥רֶק עֵ֖שֶׂב לְ...]\n", "Genesis 2:17 clause[כִּ֗י בְּיֹ֛ום מֹ֥ות תָּמֽוּת׃ ] clause_atom[כִּ֗י בְּיֹ֛ום ] clause_atom[מֹ֥ות תָּמֽוּת׃ ]\n", "Genesis 2:22 clause[וַיִּבֶן֩ יְהוָ֨ה אֱלֹהִ֧ים׀ אֶֽת־...] clause_atom[וַיִּבֶן֩ יְהוָ֨ה אֱלֹהִ֧ים׀ אֶֽת־...] clause_atom[לְאִשָּׁ֑ה ]\n", " 0.00s Counting results per 1000 up to the end of the results ...\n", " | 0.28s 1000\n", " | 0.59s 2000\n", " 0.73s Done: 2589 results\n" ] } ], "source": [ "query = '''\n", "verse\n", "    clause\n", "        clause_atom\n", "        << clause_atom\n", "'''\n", "for r in S.search(query, limit=10): print(S.glean(r))\n", "S.count(progress=1000, limit=-1)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "# Nearness for := =: :: :> <:\n", "\n", "The relations with `:` in their name always have a requirement somewhere that a slot of the\n", "left hand node equals a slot of the right hand node, or that the two are adjacent.\n", "\n", "All these relationships can be relaxed by a **nearness number**.\n", "If you put a number *k* inside the relationship symbols, those restrictions will be relaxed to\n", "*the one slot and the other slot should have a mutual distance of at most k*.\n", "\n", "Here is an example.\n", "\n", "First we look for clauses, with a phrase in it that starts at the\n", "same slot as the clause." ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Counting results per 100 up to the end of the results ...\n", " | 0.00s 100\n", " 0.00s Done: 126 results\n" ] } ], "source": [ "S.study('''\n", "chapter book=Genesis chapter=1\n", "    clause\n", "        =: phrase\n", "''', silent=True)\n", "S.count(progress=100, limit=-1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we add a bit of freedom, but not much: 0. Indeed, this is no extra\n", "freedom, and it should give the same number of results." 
] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Counting results per 100 up to the end of the results ...\n", " | 0.00s 100\n", " 0.00s Done: 126 results\n" ] } ], "source": [ "S.study('''\n", "chapter book=Genesis chapter=1\n", "    clause\n", "        =0: phrase\n", "''', silent=True)\n", "S.count(progress=100, limit=-1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we add real freedom: 1 and 2." ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Counting results per 100 up to the end of the results ...\n", " | 0.00s 100\n", " | 0.00s 200\n", " 0.01s Done: 236 results\n" ] } ], "source": [ "S.study('''\n", "chapter book=Genesis chapter=1\n", "    clause\n", "        =1: phrase\n", "''', silent=True)\n", "S.count(progress=100, limit=-1)" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Counting results per 100 up to the end of the results ...\n", " | 0.00s 100\n", " | 0.01s 200\n", " | 0.02s 300\n", " 0.02s Done: 315 results\n" ] } ], "source": [ "S.study('''\n", "chapter book=Genesis chapter=1\n", "    clause\n", "        =2: phrase\n", "''', silent=True)\n", "S.count(progress=100, limit=-1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us see some cases:" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " clause[בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת ...] phrase[בָּרָ֣א ]\n", " clause[בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת ...] phrase[בְּרֵאשִׁ֖ית ]\n", " clause[וְהָאָ֗רֶץ הָיְתָ֥ה תֹ֨הוּ֙ ...] phrase[וְ]\n", " clause[וְהָאָ֗רֶץ הָיְתָ֥ה תֹ֨הוּ֙ ...] 
phrase[הָאָ֗רֶץ ]\n", " clause[וְחֹ֖שֶׁךְ עַל־פְּנֵ֣י תְהֹ֑ום ] phrase[חֹ֖שֶׁךְ ]\n", " clause[וְחֹ֖שֶׁךְ עַל־פְּנֵ֣י תְהֹ֑ום ] phrase[עַל־פְּנֵ֣י תְהֹ֑ום ]\n", " clause[וְחֹ֖שֶׁךְ עַל־פְּנֵ֣י תְהֹ֑ום ] phrase[וְ]\n", " clause[וְר֣וּחַ אֱלֹהִ֔ים מְרַחֶ֖פֶת עַל־...] phrase[וְ]\n", " clause[וְר֣וּחַ אֱלֹהִ֔ים מְרַחֶ֖פֶת עַל־...] phrase[ר֣וּחַ אֱלֹהִ֔ים ]\n", " clause[וַיֹּ֥אמֶר אֱלֹהִ֖ים ] phrase[אֱלֹהִ֖ים ]\n" ] } ], "source": [ "for r in S.fetch(limit=10): print(S.glean(r))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first and second result show the same clause, with its first and second phrase respectively.\n", "\n", "Note that we look for phrases that lie embedded in their clause.\n", "So we do not get phrases of a preceding clause.\n", "\n", "But if we want, we can get those as well." ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Counting results per 100 up to the end of the results ...\n", " | 0.00s 100\n", " | 0.00s 200\n", " | 0.00s 300\n", " | 0.01s 400\n", " 0.01s Done: 485 results\n" ] } ], "source": [ "S.study('''\n", "chapter book=Genesis chapter=1\n", " c:clause\n", " p:phrase\n", " \n", " c =2: p\n", "''', silent=True)\n", "S.count(progress=100, limit=-1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have more results now. 
Here is a closer look:" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Genesis 1:3 clause[וַיֹּ֥אמֶר אֱלֹהִ֖ים ] phrase[אֱלֹהִ֖ים ]\n", "Genesis 1:3 clause[וַיֹּ֥אמֶר אֱלֹהִ֖ים ] phrase[וַ]\n", "Genesis 1:3 clause[וַיֹּ֥אמֶר אֱלֹהִ֖ים ] phrase[יֹּ֥אמֶר ]\n", "Genesis 1:3 clause[יְהִ֣י אֹ֑ור ] phrase[אֱלֹהִ֖ים ]\n", "Genesis 1:3 clause[יְהִ֣י אֹ֑ור ] phrase[יְהִ֣י ]\n", "Genesis 1:3 clause[יְהִ֣י אֹ֑ור ] phrase[אֹ֑ור ]\n", "Genesis 1:3 clause[יְהִ֣י אֹ֑ור ] phrase[וַֽ]\n", "Genesis 1:3 clause[יְהִ֣י אֹ֑ור ] phrase[יֹּ֥אמֶר ]\n", "Genesis 1:3 clause[וַֽיְהִי־אֹֽור׃ ] phrase[יְהִ֣י ]\n", "Genesis 1:3 clause[וַֽיְהִי־אֹֽור׃ ] phrase[אֹ֑ור ]\n", "Genesis 1:3 clause[וַֽיְהִי־אֹֽור׃ ] phrase[וַֽ]\n", "Genesis 1:3 clause[וַֽיְהִי־אֹֽור׃ ] phrase[יְהִי־]\n", "Genesis 1:3 clause[וַֽיְהִי־אֹֽור׃ ] phrase[אֹֽור׃ ]\n", " 0.00s Counting results per 100 up to the end of the results ...\n", " 0.00s Done: 13 results\n" ] } ], "source": [ "for r in S.search('''\n", "verse book=Genesis chapter=1 verse=3\n", " c:clause\n", " p:phrase\n", " \n", " c =2: p\n", "''', limit=100): print(S.glean(r))\n", "\n", "S.count(progress=100, limit=-1) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here you see in result 4 a phrase of the previous clause in the result." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Gaps\n", "\n", "A question raised by Cody Kingham: **gaps**!\n", "\n", "Search has no direct primitives to deal with gaps.\n", "For example, the MQL query\n", "```\n", "SELECT ALL OBJECTS WHERE\n", "\n", "[phrase FOCUS\n", " [word lex='L']\n", " [gap]\n", "]\n", "```\n", "looks for a phrase with a gap in it\n", "(i.e. 
one or more consecutive words between the start and the end of the phrase\n", "that do not belong to the phrase).\n", "The query additionally asks for those gap-containing phrases that have a certain word in front of the gap.\n", "\n", "Yet we can mimic this query in Search.\n", "\n", "## Find the gap" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [], "source": [ "query = '''\n", "p:phrase\n", "    =: wFirst:word\n", "    wLast:word\n", "    :=\n", "\n", "wGap:word\n", "wFirst < wGap\n", "wGap < wLast\n", "wGap || p\n", "'''" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Checking search template ...\n", " 0.00s Setting up search space for 4 objects ...\n", " 0.38s Constraining search space with 7 relations ...\n", " 0.41s Setting up retrieval plan ...\n", " 0.47s Ready to deliver results from 1532939 nodes\n", "Iterate over S.fetch() to get the results\n", "See S.showPlan() to interpret the results\n" ] } ], "source": [ "S.study(query)" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Search with 4 objects and 7 relations\n", "Results are instantiations of the following objects:\n", "node 0-phrase (253187 choices)\n", "node 1-word (426584 choices)\n", "node 2-word (426584 choices)\n", "node 3-word (426584 choices)\n", "Instantiations are computed along the following relations:\n", "node 0-phrase (253187 choices)\n", "edge 0-phrase := 2-word ( 1.0 choices)\n", "edge 2-word ]] 0-phrase ( 1.0 choices)\n", "edge 0-phrase =: 1-word ( 1.0 choices)\n", "edge 1-word ]] 0-phrase ( 1.0 choices)\n", "edge 2-word > 3-word (213292.0 choices)\n", "edge 1-word < 3-word (213292.0 choices)\n", "edge 3-word || 0-phrase (227868.3 choices)\n", " 2.65s The results are connected to the original search template as follows:\n", " 0 \n", " 1 R0 p:phrase\n", " 2 R1 =: 
wFirst:word\n", " 3 R2 wLast:word\n", " 4 :=\n", " 5 \n", " 6 R3 wGap:word\n", " 7 wFirst < wGap\n", " 8 wGap < wLast\n", " 9 wGap || p\n", "10 \n" ] } ], "source": [ "S.showPlan(details=True)" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Counting results per 2 up to 20 ...\n", " | 8.26s 2\n", " | 8.26s 4\n", " | 8.26s 6\n", " | 15s 8\n", " | 17s 10\n", " | 17s 12\n", " | 46s 14\n", " | 46s 16\n", " | 46s 18\n", " | 46s 20\n", " 46s Done: 20 results\n" ] } ], "source": [ "S.count(progress=2, limit=20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is not a fast query, to say the least.\n", "Let's add an additional constraint, a word with lexeme `L` in the phrase immediately before the gap, and see whether it goes faster." ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [], "source": [ "query = '''\n", "verse\n", "    p:phrase\n", "        =: wFirst:word\n", "        wBefore:word lex=L\n", "        wLast:word\n", "        :=\n", "\n", "wGap:word\n", "wFirst < wGap\n", "wGap < wLast\n", "p || wGap\n", "wBefore <: wGap\n", "'''" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Checking search template ...\n", " 0.00s Setting up search space for 6 objects ...\n", " 0.97s Constraining search space with 10 relations ...\n", " 1.01s Setting up retrieval plan ...\n", " 1.08s Ready to deliver results from 1576599 nodes\n", "Iterate over S.fetch() to get the results\n", "See S.showPlan() to interpret the results\n" ] } ], "source": [ "S.study(query)" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Search with 6 objects and 10 relations\n", "Results are instantiations of the following objects:\n", "node 0-verse ( 23213 choices)\n", "node 1-phrase (253187 choices)\n", "node 2-word (426584 choices)\n", "node 3-word ( 20447 choices)\n", "node 4-word 
(426584 choices)\n", "node 5-word (426584 choices)\n", "Instantiations are computed along the following relations:\n", "node 3-word ( 20447 choices)\n", "edge 3-word <: 5-word ( 1.0 choices)\n", "edge 3-word ]] 1-phrase ( 1.0 choices)\n", "edge 5-word || 1-phrase (227868.3 choices)\n", "edge 1-phrase ]] 0-verse ( 1.0 choices)\n", "edge 1-phrase := 4-word ( 1.0 choices)\n", "edge 4-word ]] 1-phrase ( 1.0 choices)\n", "edge 5-word < 4-word (213292.0 choices)\n", "edge 1-phrase =: 2-word ( 1.0 choices)\n", "edge 2-word ]] 1-phrase ( 1.0 choices)\n", "edge 2-word < 5-word (213292.0 choices)\n", " 3.92s The results are connected to the original search template as follows:\n", " 0 \n", " 1 R0 verse\n", " 2 R1 p:phrase\n", " 3 R2 =: wFirst:word\n", " 4 R3 wBefore:word lex=L\n", " 5 R4 wLast:word\n", " 6 :=\n", " 7 \n", " 8 R5 wGap:word\n", " 9 wFirst < wGap\n", "10 wGap < wLast\n", "11 p || wGap\n", "12 wBefore <: wGap\n", "13 \n" ] } ], "source": [ "S.showPlan(details=True)" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Counting results per 10 up to 1000 ...\n", " | 0.17s 10\n", " 0.23s Done: 13 results\n" ] } ], "source": [ "S.count(progress=10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That is much quicker.\n", "Let's see the results." ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Leviticus 25:6 phrase[לָכֶם֙ לְךָ֖ וּלְעַבְדְּךָ֣ ...] לָכֶם֙ לָכֶם֙ תֹושָׁ֣בְךָ֔ לְ\n", "Genesis 17:7 phrase[לְךָ֙ וּֽלְזַרְעֲךָ֖ אַחֲרֶֽיךָ׃ ] לְךָ֙ לְךָ֙ אַחֲרֶֽיךָ׃ לֵֽ\n", "Deuteronomy 26:11 phrase[לְךָ֛ וּלְבֵיתֶ֑ךָ ] לְךָ֛ לְךָ֛ בֵיתֶ֑ךָ יְהוָ֥ה \n", "Exodus 30:21 phrase[לָהֶ֧ם לֹ֥ו וּלְזַרְעֹ֖ו ] לָהֶ֧ם לָהֶ֧ם זַרְעֹ֖ו חָק־\n", "Genesis 28:4 phrase[לְךָ֙ לְךָ֖ וּלְזַרְעֲךָ֣ ...] 
לְךָ֙ לְךָ֙ אִתָּ֑ךְ אֶת־\n", "2_Kings 25:24 phrase[לָהֶ֤ם וּלְאַנְשֵׁיהֶ֔ם ] לָהֶ֤ם לָהֶ֤ם אַנְשֵׁיהֶ֔ם גְּדַלְיָ֨הוּ֙ \n", "Daniel 9:8 phrase[לָ֚נוּ לִמְלָכֵ֥ינוּ לְשָׂרֵ֖ינוּ ...] לָ֚נוּ לָ֚נוּ אֲבֹתֵ֑ינוּ בֹּ֣שֶׁת \n", "Genesis 31:16 phrase[לָ֥נוּ וּלְבָנֵ֑ינוּ ] לָ֥נוּ לָ֥נוּ בָנֵ֑ינוּ ה֖וּא \n", "Numbers 20:15 phrase[לָ֛נוּ וְלַאֲבֹתֵֽינוּ׃ ] לָ֛נוּ לָ֛נוּ אֲבֹתֵֽינוּ׃ מִצְרַ֖יִם \n", "Numbers 32:33 phrase[לָהֶ֣ם׀ לִבְנֵי־גָד֩ וְ...] לָהֶ֣ם׀ לָהֶ֣ם׀ יֹוסֵ֗ף מֹשֶׁ֡ה \n", "1_Samuel 25:31 phrase[לְךָ֡ לַאדֹנִ֗י ] לְךָ֡ לְךָ֡ אדֹנִ֗י לְ\n", "Jeremiah 40:9 phrase[לָהֶ֜ם וּלְאַנְשֵׁיהֶ֣ם ] לָהֶ֜ם לָהֶ֜ם אַנְשֵׁיהֶ֣ם גְּדַלְיָ֨הוּ \n", "Deuteronomy 1:36 phrase[לֹֽו־וּלְבָנָ֑יו ] לֹֽו־ לֹֽו־ בָנָ֑יו אֶתֵּ֧ן \n" ] } ], "source": [ "for r in S.fetch(): print(S.glean(r))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Go to SHEBANQ to inspect\n", "[Genesis 17:7](https://shebanq.ancient-data.org/hebrew/text?qactive=hlcustom&qsel_one=white&qpub=x&qget=v&wactive=hlcustom&wsel_one=black&wpub=x&wget=v&nactive=hlcustom&nsel_one=black&npub=x&nget=v&chapter=17&lang=en&book=Genesis&qw=n&tr=hb&tp=txt_p&iid=Mnx2YWxlbmNl&verse=1&version=4b&mr=m&page=1&c_q1510=turquoise&c_w1BRAv=yellow&wd4_statfl=v&ph_arela=x&wd4_statrl=v&sn_an=x&cl=x&wd1_lang=x&wd1_subpos=x&wd2_person=v&sp_rela=v&wd1_pdp=x&sn_n=v&wd3_uvf=x&ph_fun=x&wd1_nmtp=x&gl=x&sp_n=v&pt=x&ph_an=v&ph_typ=x&cl_typ=x&tt=x&wd4_statro=x&wd3_vbs=x&wd1=v&tl=v&wd3=x&wd4=x&wd2_gender=v&ph=v&wd3_vbe=v&wd1_pos=x&ph_det=x&ph_rela=x&wd4_statfo=x&tl_tlv=x&wd2_stem=v&wd2_state=v&ht=v&ph_n=v&tl_tlc=v&cl_tab=x&wd3_nme=x&hl=x&cl_par=x&cl_an=x&cl_n=v&wd3_prs=v&wd3_pfm=x&sp=x&cl_code=x&ht_hk=x&wd2=x&hl_hlc=x&cl_rela=x&wd2_gnumber=v&wd2_tense=v&cl_txt=x&wd1_n=v&sn=x&ht_ht=v&hl_hlv=v&pref=alt)\n", "and click the verse number to view the verse in data view.\n", "The last phrase looks like this\n", "\n", "![ll](ll.png)\n", "\n", "The number 7431 etc are slot (word) numbers, the numbers 2 and 3 are phrase numbers, relative to the 
surrounding clause,\n", "and the numbers 4354 are phrase atom numbers, relative to the surrounding book.\n", "\n", "The red bars highlight the spots where phrases get interrupted by other material.\n", "Here we see that phrase 2 gets interrupted after word 7432 by phrase 3.\n", "\n", "**Note** that in SHEBANQ you are looking at versions 4 and 4b, while we ran this search against version 2017.\n", "But here the versions agree." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.2" } }, "nbformat": 4, "nbformat_minor": 1 }