<img align="right" src="images/tf-small.png" width="128"/>
<img align="right" src="images/etcbc.png"/>
<img align="right" src="images/dans-small.png"/>

You might want to consider the [start](search.ipynb) of this tutorial.

Short introductions to other TF datasets:

* [Dead Sea Scrolls](https://nbviewer.jupyter.org/github/annotation/tutorials/blob/master/lorentz2020/dss.ipynb)
* [Old Babylonian Letters](https://nbviewer.jupyter.org/github/annotation/tutorials/blob/master/lorentz2020/oldbabylonian.ipynb)
* [Quran](https://nbviewer.jupyter.org/github/annotation/tutorials/blob/master/lorentz2020/quran.ipynb)


# Sets and queries

You can pass custom sets to the search function, as we have seen in [advanced](searchAdvanced.ipynb).
Now we want to give a real-world example of that, and also show how you can prepare sets for use
in the TF browser.

## Chapters with only "frequent" words

The following task comes from the department of education:

*Find the chapters without more than 20 rare words, where a rare word has a frequency (as lexeme) of less than 70.*

A question posed by Oliver Glanz.

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
import os

In [4]:
from tf.app import use
from tf.lib import writeSets, readSets

In [5]:
A = use("ETCBC/bhsa", hoist=globals())

This is Text-Fabric 9.2.3
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html

122 features found and 0 ignored


In [5]:
FREQ = 70
AMOUNT = 20

## Query

A straightforward query is:

In [6]:
query = f"""
chapter
/without/
  word freq_lex<{FREQ}
  < word freq_lex<{FREQ}
  < word freq_lex<{FREQ}
  < word freq_lex<{FREQ}
  < word freq_lex<{FREQ}
  < word freq_lex<{FREQ}
  < word freq_lex<{FREQ}
  < word freq_lex<{FREQ}
  < word freq_lex<{FREQ}
  < word freq_lex<{FREQ}
  < word freq_lex<{FREQ}
  < word freq_lex<{FREQ}
  < word freq_lex<{FREQ}
  < word freq_lex<{FREQ}
  < word freq_lex<{FREQ}
  < word freq_lex<{FREQ}
  < word freq_lex<{FREQ}
  < word freq_lex<{FREQ}
  < word freq_lex<{FREQ}
  < word freq_lex<{FREQ}
/-/
"""

Several problems with this query:

* it is very inelegant;
* it does not perform: in practice it runs so long that you cannot wait for it to finish;
* the logic is wasteful: the `/without/` query that expresses what should be left out
  denotes all possible combinations of 20 infrequent words, an astronomical number.

So, better not search with this one.
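
To get a feel for the size of that search space, the sketch below counts the ways to pick 20 words out of the rare words of a chapter. The numbers of rare words used here are made-up examples, not measurements from the BHSA. Because the template orders the words with `<`, each selection counts once, so a binomial coefficient is the appropriate count.

```python
from math import comb

AMOUNT = 20  # number of rare words named in the template


def rare_word_selections(n_rare, amount=AMOUNT):
    """Number of ways to select `amount` words, in text order, out of `n_rare` rare words."""
    return comb(n_rare, amount)


# hypothetical numbers of rare words per chapter, chosen only for illustration
for n_rare in (19, 20, 50, 200):
    print(f"{n_rare:>3} rare words: {rare_word_selections(n_rare):.3e} selections")
```

A chapter with fewer than 20 rare words yields no selection at all, which is exactly the case the `/without/` clause is meant to detect; every other chapter forces the search to consider a number of candidates that grows very quickly.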

In [7]:
# A.indent(reset=True)
# A.info('start query')
# results = S.search(query, limit=1)
# A.info('end query')
# len(results)

## By hand

On the other hand, with a bit of hand coding it is very easy, and almost instantaneous:

In [8]:
results = []
allChapters = F.otype.s("chapter")

for chapter in allChapters:
    if (
        len([word for word in L.d(chapter, otype="word") if F.freq_lex.v(word) < FREQ])
        < AMOUNT
    ):
        results.append(chapter)

print(f"{len(results)} chapters out of {len(allChapters)}")

60 chapters out of 929
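
The same counting logic can be tried without loading the corpus. The sketch below works on toy data: `chapter_words` and `lexeme_freq` are hypothetical stand-ins for `L.d(chapter, otype="word")` and `F.freq_lex.v(word)`, and the threshold for the amount is scaled down to fit the tiny example.

```python
def ordinary_chapters(chapter_words, lexeme_freq, freq, amount):
    """Return the chapters that contain fewer than `amount` words
    whose lexeme frequency is below `freq`."""
    ordinary = []
    for chapter, words in chapter_words.items():
        # count the rare words without building an intermediate list
        n_rare = sum(1 for word in words if lexeme_freq[word] < freq)
        if n_rare < amount:
            ordinary.append(chapter)
    return ordinary


# toy data: word -> frequency of its lexeme (hypothetical values)
lexeme_freq = {"and": 5000, "king": 300, "said": 900, "hapax": 1, "rare": 12}

# toy data: chapter -> the words it contains
chapter_words = {
    "A 1": ["and", "king", "said"],
    "A 2": ["and", "hapax", "rare", "rare"],
    "B 1": ["king", "rare"],
}

print(ordinary_chapters(chapter_words, lexeme_freq, freq=70, amount=2))
```

Note that every occurrence of a rare word counts, not every distinct rare lexeme, which matches the hand-coded loop above.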


In [9]:
resultsByBook = dict()

for chapter in results:
    (bk, ch) = T.sectionFromNode(chapter)
    resultsByBook.setdefault(bk, []).append(ch)

for (bk, chps) in resultsByBook.items():
    print("{} {}".format(bk, ", ".join(str(c) for c in chps)))

Exodus 11, 24
Leviticus 17
Deuteronomy 30
Joshua 23
Isaiah 12, 39
Jeremiah 45
Ezekiel 15
Hosea 3
Joel 3
Psalms 1, 3, 4, 13, 14, 15, 20, 23, 24, 26, 43, 47, 53, 54, 61, 67, 70, 82, 86, 87, 93, 97, 99, 100, 101, 110, 113, 114, 115, 117, 120, 121, 122, 123, 124, 125, 126, 127, 128, 130, 131, 133, 134, 136, 138, 150
Job 25
Esther 10
2_Chronicles 27


# Custom sets

Once you have these chapters, you can put them in a set and use them in queries.

We show how to restrict query results to those that occur in an "ordinary" chapter, i.e. one of the chapters found above.

First we search for a phenomenon in all chapters. The phenomenon is a clause with a subject consisting of a single noun in
the plural and a verb in the singular.

In [10]:
sets = dict(ochapter=set(results))

In [11]:
query1 = """
verse
  clause
    phrase function=Pred
      word pdp=verb nu=sg
    phrase function=Subj
      =: word pdp=subs nu=pl
      :=
"""

In [12]:
results1 = A.search(query1)

  1.58s 262 results


In [13]:
A.table(results1, start=1, end=5, skipCols="1")

n,p,clause,phrase,word,phrase.1,word.1
1,Genesis 1:1,בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃,בָּרָ֣א,בָּרָ֣א,אֱלֹהִ֑ים,אֱלֹהִ֑ים
2,Genesis 1:3,וַיֹּ֥אמֶר אֱלֹהִ֖ים,יֹּ֥אמֶר,יֹּ֥אמֶר,אֱלֹהִ֖ים,אֱלֹהִ֖ים
3,Genesis 1:4,וַיַּ֧רְא אֱלֹהִ֛ים אֶת־הָאֹ֖ור,יַּ֧רְא,יַּ֧רְא,אֱלֹהִ֛ים,אֱלֹהִ֛ים
4,Genesis 1:4,וַיַּבְדֵּ֣ל אֱלֹהִ֔ים בֵּ֥ין הָאֹ֖ור וּבֵ֥ין הַחֹֽשֶׁךְ׃,יַּבְדֵּ֣ל,יַּבְדֵּ֣ל,אֱלֹהִ֔ים,אֱלֹהִ֔ים
5,Genesis 1:5,וַיִּקְרָ֨א אֱלֹהִ֤ים׀ לָאֹור֙ יֹ֔ום,יִּקְרָ֨א,יִּקְרָ֨א,אֱלֹהִ֤ים׀,אֱלֹהִ֤ים׀


Now we want to restrict results to ordinary chapters:

In [14]:
query2 = """
ochapter
  verse
    clause
      phrase function=Pred
        word pdp=verb nu=sg
      phrase function=Subj
        =: word pdp=subs nu=pl
        :=
"""

Note that we use the name of a set here: `ochapter`.
It is not a known node type in the BHSA, so we have to tell Text-Fabric what it means.
We do that by passing a dictionary of custom sets:
the keys are the names of the sets, the values are the sets themselves (sets of nodes).

Then we may use those keys in queries, wherever a node type is expected.
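
In effect, a custom set name restricts the candidate nodes for that line of the template to the members of the set. The sketch below mimics the outcome on toy data with a plain post-filter. The node numbers and the `chapter_of` mapping are made up, and the helper `restrict_to_chapters` is hypothetical: it illustrates the result, not how Text-Fabric evaluates templates internally.

```python
def restrict_to_chapters(results, chapter_of, allowed_chapters):
    """Keep the result tuples whose first node lies in an allowed chapter,
    and prepend that chapter, as a template starting with the set name would."""
    restricted = []
    for result in results:
        chapter = chapter_of[result[0]]
        if chapter in allowed_chapters:
            restricted.append((chapter, *result))
    return restricted


# made-up node numbers: verse node -> chapter node
chapter_of = {101: 1, 102: 1, 201: 2, 301: 3}

# made-up results: (verse, clause) tuples
results = [(101, 1001), (102, 1002), (201, 2001), (301, 3001)]

# same shape as the dictionary passed to the search function
sets = dict(ochapter={1, 3})

print(restrict_to_chapters(results, chapter_of, sets["ochapter"]))
```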

In [15]:
results2 = A.search(query2, sets=sets)

  1.55s 6 results


In [16]:
A.table(results2)

n,p,chapter,verse,clause,phrase,word,phrase.1,word.1
1,Psalms 47:6,Psalms 47,,עָלָ֣ה אֱ֭לֹהִים בִּתְרוּעָ֑ה,עָלָ֣ה,עָלָ֣ה,אֱ֭לֹהִים,אֱ֭לֹהִים
2,Psalms 47:9,Psalms 47,,מָלַ֣ךְ אֱ֭לֹהִים עַל־גֹּויִ֑ם,מָלַ֣ךְ,מָלַ֣ךְ,אֱ֭לֹהִים,אֱ֭לֹהִים
3,Psalms 47:9,Psalms 47,,אֱ֝לֹהִ֗ים יָשַׁ֤ב׀ עַל־כִּסֵּ֬א קָדְשֹֽׁו׃,יָשַׁ֤ב׀,יָשַׁ֤ב׀,אֱ֝לֹהִ֗ים,אֱ֝לֹהִ֗ים
4,Psalms 53:3,Psalms 53,,אֱֽלֹהִ֗ים מִשָּׁמַיִם֮ הִשְׁקִ֢יף עַֽל־בְּנֵ֫י אָדָ֥ם,הִשְׁקִ֢יף,הִשְׁקִ֢יף,אֱֽלֹהִ֗ים,אֱֽלֹהִ֗ים
5,Psalms 53:6,Psalms 53,,כִּֽי־אֱלֹהִ֗ים פִּ֭זַּר עַצְמֹ֣ות חֹנָ֑ךְ,פִּ֭זַּר,פִּ֭זַּר,אֱלֹהִ֗ים,אֱלֹהִ֗ים
6,Psalms 70:5,Psalms 70,,יִגְדַּ֣ל אֱלֹהִ֑ים,יִגְדַּ֣ל,יִגְדַּ֣ל,אֱלֹהִ֑ים,אֱלֹהִ֑ים


## Custom sets in the browser

We save the sets in a file.
But before we do so, we also want to save all ordinary verses in a set (verses without any rare word),
and all ordinary words (words whose lexeme frequency is at least `FREQ`).

In [17]:
queryV = f"""
verse
/without/
  word freq_lex<{FREQ}
/-/
"""
resultsV = A.search(queryV, shallow=True)
sets["overse"] = resultsV

  0.52s 2751 results
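
Read as set arithmetic, this template takes all verses and removes those that contain at least one rare word. Below is a toy sketch of that reading, with made-up verse and word data rather than BHSA nodes; the function name is hypothetical.

```python
def verses_without_rare_words(verse_words, lexeme_freq, freq):
    """All verses minus the verses that contain a word rarer than `freq`."""
    all_verses = set(verse_words)
    verses_with_rare = {
        verse
        for verse, words in verse_words.items()
        if any(lexeme_freq[word] < freq for word in words)
    }
    return all_verses - verses_with_rare


# toy data: word -> frequency of its lexeme (hypothetical values)
lexeme_freq = {"and": 5000, "said": 900, "rare": 12}

# toy data: verse -> the words it contains
verse_words = {
    "1:1": ["and", "said"],
    "1:2": ["and", "rare"],
    "1:3": ["said"],
}

print(sorted(verses_without_rare_words(verse_words, lexeme_freq, freq=70)))
```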


In [18]:
sets["oword"] = {w for w in F.otype.s("word") if F.freq_lex.v(w) >= FREQ}

In [19]:
SETS_FILE = os.path.expanduser("~/Downloads/ordinary.set")
writeSets(sets, SETS_FILE)

True

As a test, we read back the sets from disk and compare the number of
elements with those in the original sets, which we still have in memory.

In [21]:
testSets = readSets(SETS_FILE)
for s in sorted(testSets):
    elems = len(testSets[s])
    oelems = len(sets[s])
    print(f"{s} with {elems} nb {elems - oelems}")

ochapter with 60 nb 0
overse with 2751 nb 0
oword with 361411 nb 0
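
Comparing sizes catches truncated or missing sets, but not sets of equal size with different members. A stricter check compares the contents. The helper `differing_sets` below is hypothetical (it is not part of Text-Fabric) and is shown on toy data; in this notebook it could presumably be called as `differing_sets(sets, testSets)`.

```python
def differing_sets(original, restored):
    """Return the sorted names of sets that are missing on either side
    or whose members differ."""
    names = set(original) | set(restored)
    return sorted(
        name
        for name in names
        if (name in original) != (name in restored)
        or set(original.get(name, ())) != set(restored.get(name, ()))
    )


# toy data standing in for the sets in memory and the sets read back from disk
original = dict(ochapter={1, 2}, overse={10, 11, 12})
restored_ok = dict(ochapter={2, 1}, overse={12, 11, 10})
restored_bad = dict(ochapter={1, 3}, overse={10, 11, 12})

print(differing_sets(original, restored_ok))
print(differing_sets(original, restored_bad))
```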


Now you can start your TF browser as follows:

```sh
text-fabric bhsa --sets=~/Downloads/ordinary.set
```

and then you can run the same queries over there!

# Appendix: investigation

Let's investigate the number of ordinary chapters with shifting definitions of ordinary.

In [22]:
allChapters = F.otype.s("chapter")
longestChapter = max(len(L.d(chapter, otype="word")) for chapter in allChapters)

print(f"There are {len(allChapters)} chapters, the longest is {longestChapter} words")

There are 929 chapters, the longest is 1603 words


In [23]:
def getOrdinary(freq, amount):
    results = []

    for chapter in allChapters:
        if (
            len(
                [
                    word
                    for word in L.d(chapter, otype="word")
                    if F.freq_lex.v(word) < freq
                ]
            )
            < amount
        ):
            results.append(chapter)
    return results

In [24]:
def overview(freq):
    for amount in range(20, 1700, 50):
        results = getOrdinary(freq, amount)
        print(
            f"for freq={freq:>3} and amount={amount:>4}: {len(results):>4} ordinary chapters"
        )
        if len(results) >= len(allChapters):
            break

In [25]:
for freq in (40, 70, 100):
    overview(freq)

for freq= 40 and amount=  20:  139 ordinary chapters
for freq= 40 and amount=  70:  757 ordinary chapters
for freq= 40 and amount= 120:  885 ordinary chapters
for freq= 40 and amount= 170:  908 ordinary chapters
for freq= 40 and amount= 220:  919 ordinary chapters
for freq= 40 and amount= 270:  923 ordinary chapters
for freq= 40 and amount= 320:  924 ordinary chapters
for freq= 40 and amount= 370:  925 ordinary chapters
for freq= 40 and amount= 420:  926 ordinary chapters
for freq= 40 and amount= 470:  928 ordinary chapters
for freq= 40 and amount= 520:  929 ordinary chapters
for freq= 70 and amount=  20:   60 ordinary chapters
for freq= 70 and amount=  70:  550 ordinary chapters
for freq= 70 and amount= 120:  842 ordinary chapters
for freq= 70 and amount= 170:  889 ordinary chapters
for freq= 70 and amount= 220:  915 ordinary chapters
for freq= 70 and amount= 270:  922 ordinary chapters
for freq= 70 and amount= 320:  923 ordinary chapters
for freq= 70 and amount= 370:  923 ordinary chapters
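
The sweep above recounts the rare words of every chapter for each value of `amount`. A possible refinement, sketched here on toy data, counts once per chapter, sorts the counts, and then answers each threshold with a binary search. The function names and the numbers are illustrative only; `chapter_words` and `lexeme_freq` are hypothetical stand-ins for the corpus lookups.

```python
from bisect import bisect_left


def rare_counts(chapter_words, lexeme_freq, freq):
    """Sorted list with, per chapter, the number of words rarer than `freq`."""
    return sorted(
        sum(1 for word in words if lexeme_freq[word] < freq)
        for words in chapter_words.values()
    )


def n_ordinary(sorted_counts, amount):
    """Number of chapters with fewer than `amount` rare words."""
    return bisect_left(sorted_counts, amount)


# toy data: word -> frequency of its lexeme (hypothetical values)
lexeme_freq = {"a": 100, "b": 50, "c": 5}

# toy data: chapter -> the words it contains
chapter_words = {
    "X 1": ["a", "a", "b"],
    "X 2": ["b", "c", "c"],
    "X 3": ["c", "c", "c", "c"],
}

counts = rare_counts(chapter_words, lexeme_freq, freq=70)
for amount in range(1, 6):
    print(f"amount={amount}: {n_ordinary(counts, amount)} ordinary chapters")
```

With the counts in hand, each extra threshold costs a logarithmic lookup instead of a pass over all words of the corpus.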

# All steps

* **[start](start.ipynb)** your first step in mastering the bible computationally
* **[display](display.ipynb)** become an expert in creating pretty displays of your text structures
* **[search](search.ipynb)** turbo charge your hand-coding with search templates

---

[advanced](searchAdvanced.ipynb)
sets

You have seen how to mingle sets with queries.

Time to enter the race for space:

[relations](searchRelations.ipynb)
[quantifiers](searchQuantifiers.ipynb)
[from MQL](searchFromMQL.ipynb)
[rough](searchRough.ipynb)
[gaps](searchGaps.ipynb)

---

* **[export Excel](exportExcel.ipynb)** make tailor-made spreadsheets out of your results
* **[share](share.ipynb)** draw in other people's data and let them use yours
* **[export](export.ipynb)** export your dataset as an Emdros database
* **[annotate](annotate.ipynb)** annotate plain text by means of other tools and import the annotations as TF features
* **[map](map.ipynb)** map somebody else's annotations to a new version of the corpus
* **[volumes](volumes.ipynb)** work with selected books only
* **[trees](trees.ipynb)** work with the BHSA data as syntax trees

CC-BY Dirk Roorda