{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n", "\n", "You might want to consider the [start](search.ipynb) of this tutorial.\n", "\n", "Short introductions to other TF datasets:\n", "\n", "* [Dead Sea Scrolls](https://nbviewer.jupyter.org/github/annotation/tutorials/blob/master/lorentz2020/dss.ipynb),\n", "* [Old Babylonian Letters](https://nbviewer.jupyter.org/github/annotation/tutorials/blob/master/lorentz2020/oldbabylonian.ipynb),\n", "or the\n", "* [Quran](https://nbviewer.jupyter.org/github/annotation/tutorials/blob/master/lorentz2020/quran.ipynb)\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%load_ext autoreload\n", "%autoreload 2" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T10:06:39.818664Z", "start_time": "2018-05-24T10:06:39.796588Z" } }, "outputs": [], "source": [ "from tf.app import use" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T10:06:48.865143Z", "start_time": "2018-05-24T10:06:44.712958Z" } }, "outputs": [ { "data": { "text/markdown": [ "**Locating corpus resources ...**" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "app: ~/text-fabric-data/github/ETCBC/bhsa/app" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "data: ~/text-fabric-data/github/ETCBC/bhsa/tf/2021" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "data: ~/text-fabric-data/github/ETCBC/phono/tf/2021" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "data: ~/text-fabric-data/github/ETCBC/parallels/tf/2021" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", " TF: TF API 12.1.2, ETCBC/bhsa/app v3, Search 
Reference
\n", " Data: ETCBC - bhsa 2021, Character table, Feature docs
\n", "
Node types\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "
Name# of nodes# slots / node% coverage
book3910938.21100
chapter929459.19100
lex923046.22100
verse2321318.38100
half_verse451799.44100
sentence637176.70100
sentence_atom645146.61100
clause881314.84100
clause_atom907044.70100
phrase2532031.68100
phrase_atom2675321.59100
subphrase1138501.4238
word4265901.00100
\n", " Sets: no custom sets
\n", " Features:
\n", "
Parallel Passages\n", "
\n", "\n", "
\n", "
\n", "crossref\n", "
\n", "
int
\n", "\n", " 🆗 links between similar passages\n", "\n", "
\n", "\n", "
\n", "
\n", "\n", "
BHSA = Biblia Hebraica Stuttgartensia Amstelodamensis\n", "
\n", "\n", "
\n", "
\n", "book\n", "
\n", "
str
\n", "\n", " ✅ book name in Latin (Genesis; Numeri; Reges1; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "book@ll\n", "
\n", "
str
\n", "\n", " ✅ book name in amharic (ኣማርኛ)\n", "\n", "
\n", "\n", "
\n", "
\n", "chapter\n", "
\n", "
int
\n", "\n", " ✅ chapter number (1; 2; 3; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "code\n", "
\n", "
int
\n", "\n", " ✅ identifier of a clause atom relationship (0; 74; 367; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "det\n", "
\n", "
str
\n", "\n", " ✅ determinedness of phrase(atom) (det; und; NA.)\n", "\n", "
\n", "\n", "
\n", "
\n", "domain\n", "
\n", "
str
\n", "\n", " ✅ text type of clause (? (Unknown); N (narrative); D (discursive); Q (Quotation).)\n", "\n", "
\n", "\n", "
\n", "
\n", "freq_lex\n", "
\n", "
int
\n", "\n", " ✅ frequency of lexemes\n", "\n", "
\n", "\n", "
\n", "
\n", "function\n", "
\n", "
str
\n", "\n", " ✅ syntactic function of phrase (Cmpl; Objc; Pred; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "g_cons\n", "
\n", "
str
\n", "\n", " ✅ word consonantal-transliterated (B R>CJT BR> >LHJM ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "g_cons_utf8\n", "
\n", "
str
\n", "\n", " ✅ word consonantal-Hebrew (ב ראשׁית ברא אלהים)\n", "\n", "
\n", "\n", "
\n", "
\n", "g_lex\n", "
\n", "
str
\n", "\n", " ✅ lexeme pointed-transliterated (B.:- R;>CIJT B.@R@> >:ELOH ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "g_lex_utf8\n", "
\n", "
str
\n", "\n", " ✅ lexeme pointed-Hebrew (בְּ רֵאשִׁית בָּרָא אֱלֹה)\n", "\n", "
\n", "\n", "
\n", "
\n", "g_word\n", "
\n", "
str
\n", "\n", " ✅ word pointed-transliterated (B.:- R;>CI73JT B.@R@74> >:ELOHI92JM)\n", "\n", "
\n", "\n", "
\n", "
\n", "g_word_utf8\n", "
\n", "
str
\n", "\n", " ✅ word pointed-Hebrew (בְּ רֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים)\n", "\n", "
\n", "\n", "
\n", "
\n", "gloss\n", "
\n", "
str
\n", "\n", " 🆗 english translation of lexeme (beginning create god(s))\n", "\n", "
\n", "\n", "
\n", "
\n", "gn\n", "
\n", "
str
\n", "\n", " ✅ grammatical gender (m; f; NA; unknown.)\n", "\n", "
\n", "\n", "
\n", "
\n", "label\n", "
\n", "
str
\n", "\n", " ✅ (half-)verse label (half verses: A; B; C; verses: GEN 01,02)\n", "\n", "
\n", "\n", "
\n", "
\n", "language\n", "
\n", "
str
\n", "\n", " ✅ of word or lexeme (Hebrew; Aramaic.)\n", "\n", "
\n", "\n", "
\n", "
\n", "lex\n", "
\n", "
str
\n", "\n", " ✅ lexeme consonantal-transliterated (B R>CJT/ BR>[ >LHJM/)\n", "\n", "
\n", "\n", "
\n", "
\n", "lex_utf8\n", "
\n", "
str
\n", "\n", " ✅ lexeme consonantal-Hebrew (ב ראשׁית֜ ברא אלהים֜)\n", "\n", "
\n", "\n", "
\n", "
\n", "ls\n", "
\n", "
str
\n", "\n", " ✅ lexical set, subclassification of part-of-speech (card; ques; mult)\n", "\n", "
\n", "\n", "
\n", "
\n", "nametype\n", "
\n", "
str
\n", "\n", " ⚠️ named entity type (pers; mens; gens; topo; ppde.)\n", "\n", "
\n", "\n", "
\n", "
\n", "nme\n", "
\n", "
str
\n", "\n", " ✅ nominal ending consonantal-transliterated (absent; n/a; JM, ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "nu\n", "
\n", "
str
\n", "\n", " ✅ grammatical number (sg; du; pl; NA; unknown.)\n", "\n", "
\n", "\n", "
\n", "
\n", "number\n", "
\n", "
int
\n", "\n", " ✅ sequence number of an object within its context\n", "\n", "
\n", "\n", "
\n", "
\n", "otype\n", "
\n", "
str
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "
\n", "pargr\n", "
\n", "
str
\n", "\n", " 🆗 hierarchical paragraph number (1; 1.2; 1.2.3.4; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "pdp\n", "
\n", "
str
\n", "\n", " ✅ phrase dependent part-of-speech (art; verb; subs; nmpr, ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "pfm\n", "
\n", "
str
\n", "\n", " ✅ preformative consonantal-transliterated (absent; n/a; J, ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "prs\n", "
\n", "
str
\n", "\n", " ✅ pronominal suffix consonantal-transliterated (absent; n/a; W; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "prs_gn\n", "
\n", "
str
\n", "\n", " ✅ pronominal suffix gender (m; f; NA; unknown.)\n", "\n", "
\n", "\n", "
\n", "
\n", "prs_nu\n", "
\n", "
str
\n", "\n", " ✅ pronominal suffix number (sg; du; pl; NA; unknown.)\n", "\n", "
\n", "\n", "
\n", "
\n", "prs_ps\n", "
\n", "
str
\n", "\n", " ✅ pronominal suffix person (p1; p2; p3; NA; unknown.)\n", "\n", "
\n", "\n", "
\n", "
\n", "ps\n", "
\n", "
str
\n", "\n", " ✅ grammatical person (p1; p2; p3; NA; unknown.)\n", "\n", "
\n", "\n", "
\n", "
\n", "qere\n", "
\n", "
str
\n", "\n", " ✅ word pointed-transliterated masoretic reading correction\n", "\n", "
\n", "\n", "
\n", "
\n", "qere_trailer\n", "
\n", "
str
\n", "\n", " ✅ interword material -pointed-transliterated (Masoretic correction)\n", "\n", "
\n", "\n", "
\n", "
\n", "qere_trailer_utf8\n", "
\n", "
str
\n", "\n", " ✅ interword material -pointed-transliterated (Masoretic correction)\n", "\n", "
\n", "\n", "
\n", "
\n", "qere_utf8\n", "
\n", "
str
\n", "\n", " ✅ word pointed-Hebrew masoretic reading correction\n", "\n", "
\n", "\n", "
\n", "
\n", "rank_lex\n", "
\n", "
int
\n", "\n", " ✅ ranking of lexemes based on frequency\n", "\n", "
\n", "\n", "
\n", "
\n", "rela\n", "
\n", "
str
\n", "\n", " ✅ linguistic relation between clause/(sub)phrase(atom) (ADJ; MOD; ATR; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "sp\n", "
\n", "
str
\n", "\n", " ✅ part-of-speech (art; verb; subs; nmpr, ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "st\n", "
\n", "
str
\n", "\n", " ✅ state of a noun (a (absolute); c (construct); e (emphatic).)\n", "\n", "
\n", "\n", "
\n", "
\n", "tab\n", "
\n", "
int
\n", "\n", " ✅ clause atom: its level in the linguistic embedding\n", "\n", "
\n", "\n", "
\n", "
\n", "trailer\n", "
\n", "
str
\n", "\n", " ✅ interword material pointed-transliterated (& 00 05 00_P ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "trailer_utf8\n", "
\n", "
str
\n", "\n", " ✅ interword material pointed-Hebrew (־ ׃)\n", "\n", "
\n", "\n", "
\n", "
\n", "txt\n", "
\n", "
str
\n", "\n", " ✅ text type of clause and surrounding (repetition of ? N D Q as in feature domain)\n", "\n", "
\n", "\n", "
\n", "
\n", "typ\n", "
\n", "
str
\n", "\n", " ✅ clause/phrase(atom) type (VP; NP; Ellp; Ptcp; WayX)\n", "\n", "
\n", "\n", "
\n", "
\n", "uvf\n", "
\n", "
str
\n", "\n", " ✅ univalent final consonant consonantal-transliterated (absent; N; J; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "vbe\n", "
\n", "
str
\n", "\n", " ✅ verbal ending consonantal-transliterated (n/a; W; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "vbs\n", "
\n", "
str
\n", "\n", " ✅ root formation consonantal-transliterated (absent; n/a; H; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "verse\n", "
\n", "
int
\n", "\n", " ✅ verse number\n", "\n", "
\n", "\n", "
\n", "
\n", "voc_lex\n", "
\n", "
str
\n", "\n", " ✅ vocalized lexeme pointed-transliterated (B.: R;>CIJT BR> >:ELOHIJM)\n", "\n", "
\n", "\n", "
\n", "
\n", "voc_lex_utf8\n", "
\n", "
str
\n", "\n", " ✅ vocalized lexeme pointed-Hebrew (בְּ רֵאשִׁית ברא אֱלֹהִים)\n", "\n", "
\n", "\n", "
\n", "
\n", "vs\n", "
\n", "
str
\n", "\n", " ✅ verbal stem (qal; piel; hif; apel; pael)\n", "\n", "
\n", "\n", "
\n", "
\n", "vt\n", "
\n", "
str
\n", "\n", " ✅ verbal tense (perf; impv; wayq; infc)\n", "\n", "
\n", "\n", "
\n", "
\n", "mother\n", "
\n", "
none
\n", "\n", " ✅ linguistic dependency between textual objects\n", "\n", "
\n", "\n", "
\n", "
\n", "oslots\n", "
\n", "
none
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "
\n", "\n", "
Phonetic Transcriptions\n", "
\n", "\n", "
\n", "
\n", "phono\n", "
\n", "
str
\n", "\n", " 🆗 phonological transcription (bᵊ rēšˌîṯ bārˈā ʔᵉlōhˈîm)\n", "\n", "
\n", "\n", "
\n", "
\n", "phono_trailer\n", "
\n", "
str
\n", "\n", " 🆗 interword material in phonological transcription\n", "\n", "
\n", "\n", "
\n", "
\n", "\n", " Settings:
specified
  1. apiVersion: 3
  2. appName: ETCBC/bhsa
  3. appPath: /Users/me/text-fabric-data/github/ETCBC/bhsa/app
  4. commit: gb112c161cfd21eae403d51a2733740d8743460e7
  5. css: ''
  6. dataDisplay:
    • exampleSectionHtml:<code>Genesis 1:1</code> (use <a href=\"https://github.com/{org}/{repo}/blob/master/tf/{version}/book%40en.tf\" target=\"_blank\">English book names</a>)
    • excludedFeatures:
      • g_uvf_utf8
      • g_vbs
      • kq_hybrid
      • languageISO
      • g_nme
      • lex0
      • is_root
      • g_vbs_utf8
      • g_uvf
      • dist
      • root
      • suffix_person
      • g_vbe
      • dist_unit
      • suffix_number
      • distributional_parent
      • kq_hybrid_utf8
      • crossrefSET
      • instruction
      • g_prs
      • lexeme_count
      • rank_occ
      • g_pfm_utf8
      • freq_occ
      • crossrefLCS
      • functional_parent
      • g_pfm
      • g_nme_utf8
      • g_vbe_utf8
      • kind
      • g_prs_utf8
      • suffix_gender
      • mother_object_type
    • noneValues:
      • absent
      • n/a
      • none
      • unknown
      • no value
      • NA
  7. docs:
    • docBase: {docRoot}/{repo}
    • docExt: ''
    • docPage: ''
    • docRoot: https://{org}.github.io
    • featurePage: 0_home
  8. interfaceDefaults: {}
  9. isCompatible: True
  10. local: local
  11. localDir: /Users/me/text-fabric-data/github/ETCBC/bhsa/_temp
  12. provenanceSpec:
    • corpus: BHSA = Biblia Hebraica Stuttgartensia Amstelodamensis
    • doi: 10.5281/zenodo.1007624
    • extraData: ner
    • moduleSpecs:
      • :
        • backend: no value
        • corpus: Phonetic Transcriptions
        • docUrl:https://nbviewer.jupyter.org/github/etcbc/phono/blob/master/programs/phono.ipynb
        • doi: 10.5281/zenodo.1007636
        • org: ETCBC
        • relative: /tf
        • repo: phono
      • :
        • backend: no value
        • corpus: Parallel Passages
        • docUrl:https://nbviewer.jupyter.org/github/ETCBC/parallels/blob/master/programs/parallels.ipynb
        • doi: 10.5281/zenodo.1007642
        • org: ETCBC
        • relative: /tf
        • repo: parallels
    • org: ETCBC
    • relative: /tf
    • repo: bhsa
    • version: 2021
    • webBase: https://shebanq.ancient-data.org/hebrew
    • webHint: Show this on SHEBANQ
    • webLang: la
    • webLexId: True
    • webUrl:{webBase}/text?book=<1>&chapter=<2>&verse=<3>&version={version}&mr=m&qw=q&tp=txt_p&tr=hb&wget=v&qget=v&nget=vt
    • webUrlLex: {webBase}/word?version={version}&id=<lid>
  13. release: v1.8.1
  14. typeDisplay:
    • clause:
      • label: {typ} {rela}
      • style: ''
    • clause_atom:
      • hidden: True
      • label: {code}
      • level: 1
      • style: ''
    • half_verse:
      • hidden: True
      • label: {label}
      • style: ''
      • verselike: True
    • lex:
      • featuresBare: gloss
      • label: {voc_lex_utf8}
      • lexOcc: word
      • style: orig
      • template: {voc_lex_utf8}
    • phrase:
      • label: {typ} {function}
      • style: ''
    • phrase_atom:
      • hidden: True
      • label: {typ} {rela}
      • level: 1
      • style: ''
    • sentence:
      • label: {number}
      • style: ''
    • sentence_atom:
      • hidden: True
      • label: {number}
      • level: 1
      • style: ''
    • subphrase:
      • hidden: True
      • label: {number}
      • style: ''
    • word:
      • features: pdp vs vt
      • featuresBare: lex:gloss
  15. writing: hbo
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
TF API: names N F E L T S C TF Fs Fall Es Eall Cs Call directly usable

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "A = use(\"ETCBC/bhsa\", hoist=globals())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Gaps and spans\n", "\n", "Searches often do not deliver the results you expect.\n", "Besides typos, lack of familiarity with the template formalism and bugs in the system, there is\n", "another cause: **difficult semantics of the data**.\n", "\n", "Most users reason about phrases, clauses and sentences as if they were consecutive blocks of words.\n", "But in the BHSA this is not the case: each of these objects may have **gaps**.\n", "\n", "Most of the time, verse boundaries coincide with the boundaries of sentences, clauses, and phrases.\n", "But not always: there are verse-**spanning** sentences.\n", "\n", "> **Note**\n", "These phenomena may wreak havoc with your intuitive reasoning about what search templates should deliver.\n", "Query templates do not require the objects to be consecutive and\n", "still they make sense. But that might not be your sense, unless you **Mind the gap!**\n", "\n", "We are going to show these issues in depth." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Gaps\n", "\n", "TF-search has no primitives to deal with gaps directly.\n", "Nodes correspond to textual objects such as words, phrases, clauses, verses, books.\n", "Usually these are consecutive sequences of one or more words,\n", "but in theory they can be arbitrary sets of slots.\n", "\n", "And, as far as the BHSA corpus is concerned, in practice too.\n", "If we look at phrases, then the overwhelming majority is consecutive, without gaps,\n", "but there is also a substantial number of phrases with gaps.\n", "\n", "People who are familiar with MQL (see [from MQL](searchFromMQL.ipynb))\n", "may remember that in MQL you can search for a gap.\n", "The MQL query\n", "\n", "```\n", "SELECT ALL OBJECTS WHERE\n", "\n", "[phrase FOCUS\n", "    [word lex='L']\n", "    [gap]\n", "]\n", "```\n", "\n", "looks for a phrase with a gap in it\n", "(i.e. one or more consecutive words between the start and the end of the phrase\n", "that do not belong to the phrase).\n", "The query additionally asks for those gap-containing phrases that have a certain word right in front of the gap.\n", "\n", "**We want this too!**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Find the gap\n", "\n", "We start with a query that aims to get the same results as the MQL query above.\n", "\n", "In our template, we require that there is a word `wPreGap` in the phrase that is just before the gap,\n", "and a word `wGap` that comes right after it, so it is in the gap, and hence does not belong to the phrase.\n", "All of this must happen before the last word `wLast` of the phrase." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T10:09:32.685437Z", "start_time": "2018-05-24T10:09:32.680670Z" } }, "outputs": [], "source": [ "query = \"\"\"\n", "verse\n", "    p:phrase\n", "        wPreGap:word lex=L\n", "        wLast:word\n", "        :=\n", "\n", "wGap:word\n", "wPreGap <: wGap\n", "wGap < wLast\n", "p || wGap\n", "\"\"\"" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.46s 12 results\n" ] } ], "source": [ "results = A.search(query)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Nice and quick.\n", "Let's see the results." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T10:09:43.410941Z", "start_time": "2018-05-24T10:09:43.194596Z" } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
npversephrasewordwordword
1Genesis 17:7לְךָ֙ וּֽלְזַרְעֲךָ֖ אַחֲרֶֽיךָ׃ לְךָ֙ אַחֲרֶֽיךָ׃ לֵֽ
2Genesis 28:4לְךָ֙ לְךָ֖ וּלְזַרְעֲךָ֣ אִתָּ֑ךְ לְךָ֙ אִתָּ֑ךְ אֶת־
3Exodus 30:21לָהֶ֧ם לֹ֥ו וּלְזַרְעֹ֖ו לָהֶ֧ם זַרְעֹ֖ו חָק־
4Leviticus 25:6לָכֶם֙ לְךָ֖ וּלְעַבְדְּךָ֣ וְלַאֲמָתֶ֑ךָ וְלִשְׂכִֽירְךָ֙ וּלְתֹושָׁ֣בְךָ֔ לָכֶם֙ תֹושָׁ֣בְךָ֔ לְ
5Numbers 20:15לָ֛נוּ וְלַאֲבֹתֵֽינוּ׃ לָ֛נוּ אֲבֹתֵֽינוּ׃ מִצְרַ֖יִם
6Numbers 32:33לָהֶ֣ם׀ לִבְנֵי־גָד֩ וְלִבְנֵ֨י רְאוּבֵ֜ן וְלַחֲצִ֣י׀ שֵׁ֣בֶט׀ מְנַשֶּׁ֣ה בֶן־יֹוסֵ֗ף לָהֶ֣ם׀ יֹוסֵ֗ף מֹשֶׁ֡ה
7Deuteronomy 1:36לֹֽו־וּלְבָנָ֑יו לֹֽו־בָנָ֑יו אֶתֵּ֧ן
8Deuteronomy 26:11לְךָ֛ וּלְבֵיתֶ֑ךָ לְךָ֛ בֵיתֶ֑ךָ יְהוָ֥ה
91_Samuel 25:31לְךָ֡ לַאדֹנִ֗י לְךָ֡ אדֹנִ֗י לְ
102_Kings 25:24לָהֶ֤ם וּלְאַנְשֵׁיהֶ֔ם לָהֶ֤ם אַנְשֵׁיהֶ֔ם גְּדַלְיָ֨הוּ֙
11Jeremiah 40:9לָהֶ֜ם וּלְאַנְשֵׁיהֶ֣ם לָהֶ֜ם אַנְשֵׁיהֶ֣ם גְּדַלְיָ֨הוּ
12Daniel 9:8לָ֚נוּ לִמְלָכֵ֥ינוּ לְשָׂרֵ֖ינוּ וְלַאֲבֹתֵ֑ינוּ לָ֚נוּ אֲבֹתֵ֑ינוּ בֹּ֣שֶׁת
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "A.table(results, skipCols=\"1\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's color the word in the gap differently." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "A.displaySetup(\n", " colorMap={2: \"aqua\", 3: \"yellow\", 4: \"magenta\"}, condenseType=\"clause\",\n", " skipCols=\"1\",\n", ")" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:16:40.841646Z", "start_time": "2018-05-18T09:16:40.654538Z" } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
npversephrasewordwordword
1Genesis 17:7לְךָ֙ וּֽלְזַרְעֲךָ֖ אַחֲרֶֽיךָ׃ לְךָ֙ אַחֲרֶֽיךָ׃ לֵֽ
2Genesis 28:4לְךָ֙ לְךָ֖ וּלְזַרְעֲךָ֣ אִתָּ֑ךְ לְךָ֙ אִתָּ֑ךְ אֶת־
3Exodus 30:21לָהֶ֧ם לֹ֥ו וּלְזַרְעֹ֖ו לָהֶ֧ם זַרְעֹ֖ו חָק־
4Leviticus 25:6לָכֶם֙ לְךָ֖ וּלְעַבְדְּךָ֣ וְלַאֲמָתֶ֑ךָ וְלִשְׂכִֽירְךָ֙ וּלְתֹושָׁ֣בְךָ֔ לָכֶם֙ תֹושָׁ֣בְךָ֔ לְ
5Numbers 20:15לָ֛נוּ וְלַאֲבֹתֵֽינוּ׃ לָ֛נוּ אֲבֹתֵֽינוּ׃ מִצְרַ֖יִם
6Numbers 32:33לָהֶ֣ם׀ לִבְנֵי־גָד֩ וְלִבְנֵ֨י רְאוּבֵ֜ן וְלַחֲצִ֣י׀ שֵׁ֣בֶט׀ מְנַשֶּׁ֣ה בֶן־יֹוסֵ֗ף לָהֶ֣ם׀ יֹוסֵ֗ף מֹשֶׁ֡ה
7Deuteronomy 1:36לֹֽו־וּלְבָנָ֑יו לֹֽו־בָנָ֑יו אֶתֵּ֧ן
8Deuteronomy 26:11לְךָ֛ וּלְבֵיתֶ֑ךָ לְךָ֛ בֵיתֶ֑ךָ יְהוָ֥ה
91_Samuel 25:31לְךָ֡ לַאדֹנִ֗י לְךָ֡ אדֹנִ֗י לְ
102_Kings 25:24לָהֶ֤ם וּלְאַנְשֵׁיהֶ֔ם לָהֶ֤ם אַנְשֵׁיהֶ֔ם גְּדַלְיָ֨הוּ֙
11Jeremiah 40:9לָהֶ֜ם וּלְאַנְשֵׁיהֶ֣ם לָהֶ֜ם אַנְשֵׁיהֶ֣ם גְּדַלְיָ֨הוּ
12Daniel 9:8לָ֚נוּ לִמְלָכֵ֥ינוּ לְשָׂרֵ֖ינוּ וְלַאֲבֹתֵ֑ינוּ לָ֚נוּ אֲבֹתֵ֑ינוּ בֹּ֣שֶׁת
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "A.table(results, condensed=False)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:16:40.841646Z", "start_time": "2018-05-18T09:16:40.654538Z" } }, "outputs": [ { "data": { "text/html": [ "

result 1" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

verse
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
clause InfC Adju
phrase VP Pred
lex=L
phrase PP Cmpl
phrase PP PreC
phrase PP Cmpl
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

result 2" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

verse
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
clause WYq0 NA
phrase CP Conj
lex=W
phrase VP Pred
phrase PP Cmpl
phrase PP Objc
phrase PP Cmpl
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

result 3" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

verse
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
clause WQtX NA
phrase CP Conj
lex=W
phrase VP Pred
phrase PP Cmpl
phrase NP Subj
lex=XQ/
phrase PP Cmpl
lex=W
lex=L
phrase PP Adju
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "A.show(results, end=3, condensed=False)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "A.displayReset()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## All gapped phrases\n", "\n", "These were particular gaps.\n", "Now we want to get *all* gapped phrases.\n", "\n", "We can simply drop the requirement that\n", "`wPreGap` has to satisfy a lexical condition." ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T10:09:32.685437Z", "start_time": "2018-05-24T10:09:32.680670Z" } }, "outputs": [], "source": [ "query = \"\"\"\n", "p:phrase\n", "    wPreGap:word\n", "    wLast:word\n", "    :=\n", "\n", "wGap:word\n", "wPreGap <: wGap\n", "wGap < wLast\n", "\n", "p || wGap\n", "\"\"\"" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.91s 716 results\n" ] } ], "source": [ "results = A.search(query)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Not too bad! We could wait for it. Here are some results." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "
npphrasewordwordword
5Genesis 2:25שְׁנֵיהֶם֙ הָֽאָדָ֖ם וְאִשְׁתֹּ֑ו שְׁנֵיהֶם֙ אִשְׁתֹּ֑ו עֲרוּמִּ֔ים
6Genesis 4:4הֶ֨בֶל גַם־ה֛וּא הֶ֨בֶל ה֛וּא הֵבִ֥יא
7Genesis 7:8מִן־הַבְּהֵמָה֙ הַטְּהֹורָ֔ה וּמִן־הַ֨בְּהֵמָ֔ה וּמִ֨ן־הָעֹ֔וף וְכֹ֥ל בְּהֵמָ֔ה כֹ֥ל אֲשֶׁ֥ר
8Genesis 7:14הֵ֜מָּה וְכָל־הַֽחַיָּ֣ה לְמִינָ֗הּ וְכָל־הַבְּהֵמָה֙ לְמִינָ֔הּ וְכָל־הָרֶ֛מֶשׂ לְמִינֵ֑הוּ וְכָל־הָעֹ֣וף לְמִינֵ֔הוּ כֹּ֖ל צִפֹּ֥ור כָּל־כָּנָֽף׃ רֶ֛מֶשׂ כָּנָֽף׃ הָ
9Genesis 7:21כָּל־בָּשָׂ֣ר׀ בָּעֹ֤וף וּבַבְּהֵמָה֙ וּבַ֣חַיָּ֔ה וּבְכָל־הַשֶּׁ֖רֶץ וְכֹ֖ל הָאָדָֽם׃ בָּשָׂ֣ר׀ אָדָֽם׃ הָ
10Genesis 7:21כָּל־בָּשָׂ֣ר׀ בָּעֹ֤וף וּבַבְּהֵמָה֙ וּבַ֣חַיָּ֔ה וּבְכָל־הַשֶּׁ֖רֶץ וְכֹ֖ל הָאָדָֽם׃ שֶּׁ֖רֶץ אָדָֽם׃ הַ
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "A.table(results, start=5, end=10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If a phrase has multiple gaps, we encounter it multiple times in our results.\n", "\n", "We show the two results in Genesis 7:21." ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

result 9" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

clause:428188 WayX NA
phrase:653516 CP Conj
3448 וַ
phrase:653517 VP Pred
phrase:653518 NP Subj
clause:428188 WayX NA
phrase:653518 NP Subj
3457 בָּ
3458
3460 וּ
3461 בַ
3462
3464 וּ
3465 בַ֣
3466
3468 וּ
3469 בְ
3471 הַ
clause:428188 WayX NA
phrase:653518 NP Subj
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
clause:428189 Ptcp Attr
phrase:653519 CP Rela
3452 הָ
phrase:653520 VP PreC
phrase:653521 PP Cmpl
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

result 10" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

clause:428188 WayX NA
phrase:653516 CP Conj
3448 וַ
phrase:653517 VP Pred
phrase:653518 NP Subj
clause:428188 WayX NA
phrase:653518 NP Subj
3457 בָּ
3458
3460 וּ
3461 בַ
3462
3464 וּ
3465 בַ֣
3466
3468 וּ
3469 בְ
3471 הַ
clause:428188 WayX NA
phrase:653518 NP Subj
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
clause:428190 Ptcp Attr
phrase:653522 CP Rela
3473 הַ
phrase:653523 VP PreC
phrase:653524 PP Cmpl
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "A.show(\n", " results,\n", " condensed=False,\n", " condenseType=\"clause\",\n", " start=9,\n", " end=10,\n", " colorMap={1: \"lightgreen\", 2: \"orange\", 4: \"magenta\"},\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we want just the phrases, and each of them only once, we can run the query in shallow mode, see [advanced](searchAdvanced.ipynb):" ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 1.08s 672 results\n" ] } ], "source": [ "gapQueryResults = A.search(query, shallow=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Excursion\n", "\n", "Sometimes there are two subphrases with exactly the same words in them. They only differ in their\n", "values for the feature `rela`. Let's find such a case.\n", "\n", "We do have to show the subphrases, though.\n", "\n", "Some types are hidden; let's find out which ones:" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
current display options
  1. baseTypes: {word}
  2. colorMap: no value
  3. condenseType: verse
  4. condensed: 0
  5. edgeFeatures: set()
  6. edgeHighlights: no value
  7. end: no value
  8. extraFeatures:
    • ()
    • {}
  9. fmt: text-orig-full
  10. forceEdges: 0
  11. full: 0
  12. hiddenTypes:
    • clause_atom
    • sentence_atom
    • phrase_atom
    • subphrase
    • half_verse
  13. hideTypes: True
  14. highlights: {}
  15. lineNumbers: no value
  16. multiFeatures: 0
  17. noneValues:
    • absent
    • n/a
    • none
    • unknown
    • no value
    • NA
  18. plainGaps: True
  19. prettyTypes: True
  20. queryFeatures: True
  21. showGraphics: no value
  22. showMath: 0
  23. skipCols: set()
  24. standardFeatures: 0
  25. start: no value
  26. suppress: set()
  27. tupleFeatures:
    • :
      • 0
      • ()
    • :
      • 1
      • ()
    • :
      • 2
      • ()
    • :
      • 3
      • ()
  28. withLabels: True
  29. withNodes: 0
  30. withPassage: True
  31. withTypes: 0
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "A.displayShow()" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.20s 2135 results\n" ] } ], "source": [ "dResults = A.search(\"\"\"\n", "s1:subphrase\n", "== s2:subphrase\n", "\n", "s1 < s2\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's pass a different set of hidden types when showing the results:" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

phrase 1" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

phrase:653102 PP Objc
subphrase:1301305
subphrase:1301306
subphrase:1301307
2598 וְ
subphrase:1301308
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

phrase 2" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

phrase:653192 PP Adju
2746 מֵֽ
subphrase:1301341
subphrase:1301342
subphrase:1301343
2752 וְ
subphrase:1301346
subphrase:1301344
subphrase:1301345
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

phrase 3" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

phrase:653214 NP Objc
subphrase:1301355
subphrase:1301356
subphrase:1301357
subphrase:1301358
subphrase:1301359
2792 וְ
subphrase:1301360
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "A.show(\n", " dResults,\n", " end=3,\n", " withNodes=True,\n", " condensed=True,\n", " hiddenTypes=\"clause_atom half_verse phrase_atom\",\n", " condenseType=\"phrase\",\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "More research has shown that these subphrases differ in the presence of the `rela` feature." ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

result 1" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

phrase:653102 PP Objc
subphrase:1301305
subphrase:1301306
rela=par
subphrase:1301307
2598 וְ
subphrase:1301308
rela=par
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

result 2" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

phrase:653192 PP Adju
2746 מֵֽ
subphrase:1301341
subphrase:1301342
rela=par
subphrase:1301343
2752 וְ
subphrase:1301346
rela=par
subphrase:1301344
subphrase:1301345
rela=rec
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

result 3" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

phrase:653214 NP Objc
subphrase:1301355
subphrase:1301356
rela=adj
subphrase:1301357
subphrase:1301358
rela=par
subphrase:1301359
2792 וְ
subphrase:1301360
rela=par
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "A.show(\n", " dResults,\n", " end=3,\n", " withNodes=True,\n", " colorMap={1: \"lightsalmon\", 2: \"lightblue\"},\n", " extraFeatures=\"rela\",\n", " hiddenTypes=\"clause_atom half_verse phrase_atom\",\n", " condenseType=\"phrase\",\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The red one has the feature `rela='par'`; the blue one does not.\n", "\n", "**Two nodes with the same node type and the same slots. Yet: different nodes, different feature annotations.**\n", "\n", "At the moment I do not know why the encoders of the BHSA have chosen to do this." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### A different query\n", "\n", "We can make an equivalent query to get the gaps." ] }, { "cell_type": "code", "execution_count": 73, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T08:41:30.980164Z", "start_time": "2018-05-24T08:41:30.974422Z" } }, "outputs": [], "source": [ "query = \"\"\"\n", "p:phrase\n", "    =: wFirst:word\n", "    wLast:word\n", "    :=\n", "\n", "wGap:word\n", "wFirst < wGap\n", "wLast > wGap\n", "\n", "p || wGap\n", "\"\"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Experience has shown that this is a slow query, so we handle it with care." 
] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Checking search template ...\n", " 0.00s Setting up search space for 4 objects ...\n", " 0.20s Constraining search space with 7 relations ...\n", " 0.50s \t2 edges thinned\n", " 0.50s Setting up retrieval plan with strategy small_choice_multi ...\n", " 0.53s Ready to deliver results from 1186199 nodes\n", "Iterate over S.fetch() to get the results\n", "See S.showPlan() to interpret the results\n", "Search with 4 objects and 6 relations\n", "Results are instantiations of the following objects:\n", "node 0-phrase 253203 choices\n", "node 1-word 253203 choices\n", "node 2-word 253203 choices\n", "node 3-word 426590 choices\n", "Performance parameters:\n", "\tyarnRatio = 1.25\n", "\ttryLimitFrom = 40\n", "\ttryLimitTo = 40\n", "Instantiations are computed along the following relations:\n", "node 0-phrase 253203 choices\n", "edge 0-phrase [[ 2-word 1.0 choices\n", "edge 0-phrase := 2-word 0 choices\n", "edge 0-phrase =: 1-word 1.0 choices (thinned)\n", "edge 1-word ]] 0-phrase 0 choices\n", "edge 1,2-word <,> 3-word 21329.5 choices\n", "edge 3-word || 0-phrase 0 choices\n", " 0.58s The results are connected to the original search template as follows:\n", " 0 \n", " 1 R0 p:phrase\n", " 2 R1 =: wFirst:word\n", " 3 R2 wLast:word\n", " 4 :=\n", " 5 \n", " 6 R3 wGap:word\n", " 7 wFirst < wGap\n", " 8 wLast > wGap\n", " 9 \n", "10 p || wGap\n", "11 \n" ] } ], "source": [ "S.study(query)\n", "S.showPlan(details=True)" ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Counting results per 1 up to 4 ...\n", " | 4.29s 1\n", " | 4.29s 2\n", " | 4.29s 3\n", " | 4.29s 4\n", " 4.29s Done: 5 results\n" ] } ], "source": [ "S.count(progress=1, limit=4)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is a good example of a query that is 
slow to deliver even its first result.\n", "And that is bad, because it is such a straightforward query.\n", "\n", "Why is this one so slow, while the previous one went so smoothly?\n", "\n", "The crucial thing is the `wGap` word. In the latter template, `wGap` is not embedded in anything.\n", "It is constrained by `wFirst < wGap` and `wGap < wLast`.\n", "However, the way the search strategy works is by examining all possibilities for `wFirst < wGap`\n", "and only then checking whether `wGap < wLast`.\n", "The algorithm cannot check both conditions at the same time.\n", "\n", "With embedding relations, things are better. Text-Fabric is heavily optimized to deal with embedding\n", "relationships.\n", "\n", "In the former template, `wGap` is required to be adjacent to `wPreGap`, and the latter\n", "is embedded in the phrase. Hence there are few cases to consider for `wPreGap`, and per instance\n", "there is only one `wGap`.\n", "\n", "> **Lesson**\n", "Try to avoid *free-floating* nodes in your template that are constrained only\n", "by spatial relationships other than embedding." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### To the rescue\n", "\n", "The former template had it right.\n", "Can we rescue the latter template?\n", "\n", "We can assume that the phrase and the gap each contain a word in one and the same verse.\n", "Note that phrase and gap may belong to different clauses and sentences.\n", "We assume that a phrase cannot belong to more than two verses, so either the first or the last word\n", "of the phrase is in the same verse as a word in the gap." 
] }, { "cell_type": "code", "execution_count": 76, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T08:41:30.980164Z", "start_time": "2018-05-24T08:41:30.974422Z" } }, "outputs": [], "source": [ "query = \"\"\"\n", "p:phrase\n", " =: wFirst:word\n", " wLast:word\n", " :=\n", "\n", "wGap:word\n", "wFirst < wGap\n", "wLast > wGap\n", "\n", "p || wGap\n", "\n", "v:verse\n", "\n", "v [[ wFirst\n", "v [[ wGap\n", "\"\"\"" ] }, { "cell_type": "code", "execution_count": 77, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Checking search template ...\n", " 0.00s Setting up search space for 5 objects ...\n", " 0.21s Constraining search space with 9 relations ...\n", " 0.53s \t2 edges thinned\n", " 0.53s Setting up retrieval plan with strategy small_choice_multi ...\n", " 0.57s Ready to deliver results from 1209412 nodes\n", "Iterate over S.fetch() to get the results\n", "See S.showPlan() to interpret the results\n", "Search with 5 objects and 8 relations\n", "Results are instantiations of the following objects:\n", "node 0-phrase 253203 choices\n", "node 1-word 253203 choices\n", "node 2-word 253203 choices\n", "node 3-word 426590 choices\n", "node 4-verse 23213 choices\n", "Performance parameters:\n", "\tyarnRatio = 1.25\n", "\ttryLimitFrom = 40\n", "\ttryLimitTo = 40\n", "Instantiations are computed along the following relations:\n", "node 4-verse 23213 choices\n", "edge 4-verse [[ 1-word 12.1 choices\n", "edge 1-word ]] 0-phrase 1.0 choices\n", "edge 0-phrase =: 1-word 0 choices\n", "edge 0-phrase := 2-word 1.0 choices (thinned)\n", "edge 0-phrase [[ 2-word 0 choices\n", "edge 4-verse [[ 3-word 18.9 choices\n", "edge 1,2-word <,> 3-word 0 choices\n", "edge 3-word || 0-phrase 0 choices\n", " 0.58s The results are connected to the original search template as follows:\n", " 0 \n", " 1 R0 p:phrase\n", " 2 R1 =: wFirst:word\n", " 3 R2 wLast:word\n", " 4 :=\n", " 5 \n", " 6 R3 wGap:word\n", " 7 wFirst < wGap\n", " 8 wLast > 
wGap\n", " 9 \n", "10 p || wGap\n", "11 \n", "12 R4 v:verse\n", "13 \n", "14 v [[ wFirst\n", "15 v [[ wGap\n", "16 \n", " 0.00s Counting results per 100 up to 3000 ...\n", " | 0.08s 100\n", " | 0.17s 200\n", " | 0.23s 300\n", " | 0.26s 400\n", " | 0.29s 500\n", " | 0.35s 600\n", " | 0.37s 700\n", " | 0.44s 800\n", " | 0.46s 900\n", " | 0.50s 1000\n", " | 0.54s 1100\n", " | 0.57s 1200\n", " | 0.61s 1300\n", " | 0.70s 1400\n", " | 0.90s 1500\n", " | 0.95s 1600\n", " | 1.06s 1700\n", " | 1.11s 1800\n", " | 1.17s 1900\n", " | 1.26s 2000\n", " | 1.32s 2100\n", " | 1.37s 2200\n", " | 1.47s 2300\n", " | 1.80s 2400\n", " | 1.89s 2500\n", " | 1.98s 2600\n", " | 2.09s 2700\n", " 2.15s Done: 2739 results\n" ] } ], "source": [ "S.study(query)\n", "S.showPlan(details=True)\n", "S.count(progress=100, limit=3000)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "# ignore this\n", "# S.tweakPerformance(yarnRatio=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We are going to run this query in `shallow` mode." ] }, { "cell_type": "code", "execution_count": 78, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 4.37s 672 results\n" ] } ], "source": [ "results = A.search(query, shallow=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Shallow mode tends to be quicker, but that advantage does not always materialize.\n", "The number of results agrees with the first query.\n", "Yet we have been lucky, because we required the word in the gap to be in the same verse as the first word in the phrase.\n", "What if we instead require it to be in the same verse as the last word of the phrase?" 
] }, { "cell_type": "code", "execution_count": 79, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T08:41:30.980164Z", "start_time": "2018-05-24T08:41:30.974422Z" } }, "outputs": [], "source": [ "query = \"\"\"\n", "p:phrase\n", " =: wFirst:word\n", " wLast:word\n", " :=\n", "\n", "wGap:word\n", "wFirst < wGap\n", "wLast > wGap\n", "\n", "p || wGap\n", "\n", "v:verse\n", "\n", "v [[ wLast\n", "v [[ wGap\n", "\"\"\"" ] }, { "cell_type": "code", "execution_count": 80, "metadata": { "lines_to_next_cell": 2 }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 4.39s 661 results\n" ] } ], "source": [ "results = A.search(query, shallow=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then we would not have found all results.\n", "\n", "So, this road, although doable, is much less comfortable, performance-wise and logic-wise." ] }, { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2018-05-23T08:31:45.680786Z", "start_time": "2018-05-23T08:31:45.673210Z" }, "lines_to_next_cell": 2 }, "source": [ "## Check the gaps\n", "\n", "In this misty landscape of gaps we need some corroboration that we found the right results.\n", "\n", "1. is every node in `gapQueryResults` a phrase?\n", "1. does every phrase in the `gapQueryResults` have a gap?\n", "1. is every gapped phrase contained in `gapQueryResults`?\n", "\n", "We check all this by hand coding.\n", "\n", "Here is a function that checks whether a phrase has a gap.\n", "If the distance between its end points is greater than the number of words it contains,\n", "it must have a gap." 
] }, { "cell_type": "code", "execution_count": 81, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T08:41:51.194078Z", "start_time": "2018-05-24T08:41:50.211615Z" } }, "outputs": [], "source": [ "def hasGap(p):\n", "    # slots are consecutive integers, so a span wider than the word count means a gap\n", "    words = L.d(p, otype=\"word\")\n", "    return words[-1] - words[0] + 1 > len(words)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can perform the checks." ] }, { "cell_type": "code", "execution_count": 82, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T08:41:51.194078Z", "start_time": "2018-05-24T08:41:50.211615Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "672 nodes in query result\n", "1. all nodes are phrases\n", "2. all nodes have gaps\n", "3. all gapped phrases are contained in the results\n" ] } ], "source": [ "otypesGood = True\n", "haveGaps = True\n", "\n", "for p in gapQueryResults:\n", "    otype = F.otype.v(p)\n", "    if otype != \"phrase\":\n", "        print(f\"Non-phrase detected: {p} is a {otype}\")\n", "        otypesGood = False\n", "        break\n", "\n", "    if not hasGap(p):\n", "        print(f\"Phrase without a gap: {p}\")\n", "        A.pretty(p)\n", "        haveGaps = False\n", "        break\n", "\n", "print(f\"{len(gapQueryResults)} nodes in query result\")\n", "if otypesGood:\n", "    print(\"1. all nodes are phrases\")\n", "if haveGaps:\n", "    print(\"2. all nodes have gaps\")\n", "\n", "inResults = True\n", "for p in F.otype.s(\"phrase\"):\n", "    if hasGap(p):\n", "        if p not in gapQueryResults:\n", "            print(f\"Gapped phrase outside query results: {p}\")\n", "            A.pretty(p)\n", "            inResults = False\n", "            break\n", "\n", "if inResults:\n", "    print(\"3. all gapped phrases are contained in the results\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that by hand coding we can get the gapped phrases much more quickly and securely!" 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Custom sets for (non-)gapped phrases\n", "\n", "We have obtained a set with all gapped phrases,\n", "and we have paid a price:\n", "\n", "* either an expensive query,\n", "* or an inconvenient bit of hand coding.\n", "\n", "It would be nice if we could kick-start our queries using this set as a given.\n", "And that is exactly what we are going to do now.\n", "\n", "We make two custom sets and give them a name, `gapphrase` for gapped phrases and `conphrase` for non-gapped phrases (consecutive phrases)." ] }, { "cell_type": "code", "execution_count": 83, "metadata": {}, "outputs": [], "source": [ "customSets = dict(\n", " gapphrase=gapQueryResults,\n", " conphrase=set(F.otype.s(\"phrase\")) - gapQueryResults,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Suppose we want all verbs that occur in a gapped phrase." ] }, { "cell_type": "code", "execution_count": 84, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T08:41:53.694434Z", "start_time": "2018-05-24T08:41:53.689921Z" } }, "outputs": [], "source": [ "query = \"\"\"\n", "gapphrase\n", " word sp=verb\n", "\"\"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that we have used the foreign name `gapphrase` in our search template, instead of `phrase`.\n", "\n", "But we can still run `search()`, provided we tell it what we mean by `gapphrase`.\n", "We do that by passing the `sets` parameter to `search()`, which should be a dictionary of sets.\n", "Search will look up `gapphrase` in this dictionary, and will use its value, which should be a node set.\n", "That way, it understands that the expression `gapphrase` stands for the nodes in the given node set.\n", "\n", "Here we go:" ] }, { "cell_type": "code", "execution_count": 85, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T08:41:57.840028Z", "start_time": "2018-05-24T08:41:57.047787Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", 
"text": [ " 0.20s 94 results\n" ] } ], "source": [ "results = A.search(query, sets=customSets)" ] }, { "cell_type": "code", "execution_count": 86, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T08:05:09.044933Z", "start_time": "2018-05-24T08:05:09.005186Z" } }, "outputs": [ { "data": { "text/html": [ "

result 1" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

clause ZQtX NA
phrase VP PreO
phrase NP Subj
phrase VP PreO
phrase NP Objc
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

result 2" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

clause Way0 NA
phrase CP Conj
sp=conj
phrase VP Pred
phrase PP Time
sp=prep
sp=art
sp=art
phrase PP Objc
clause Way0 NA
phrase PP Objc
sp=conj
sp=subs
sp=prep
sp=art
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

result 3" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

clause Way0 NA
phrase CP Conj
sp=conj
phrase VP Pred
phrase PP Time
sp=prep
sp=art
sp=art
phrase PP Objc
clause Way0 NA
phrase PP Objc
sp=conj
sp=subs
sp=prep
sp=art
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "A.show(results, start=1, end=3, condenseType=\"clause\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That looks good.\n", "\n", "We can also apply feature conditions to `gapphrase`:" ] }, { "cell_type": "code", "execution_count": 87, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T08:05:41.293060Z", "start_time": "2018-05-24T08:05:41.237943Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s 177 results\n" ] }, { "data": { "text/html": [ "\n", "\n", "\n", "
npphrase
1Genesis 2:25שְׁנֵיהֶם֙ הָֽאָדָ֖ם וְאִשְׁתֹּ֑ו
2Genesis 4:4הֶ֨בֶל גַם־ה֛וּא
3Genesis 7:14הֵ֜מָּה וְכָל־הַֽחַיָּ֣ה לְמִינָ֗הּ וְכָל־הַבְּהֵמָה֙ לְמִינָ֔הּ וְכָל־הָרֶ֛מֶשׂ לְמִינֵ֑הוּ וְכָל־הָעֹ֣וף לְמִינֵ֔הוּ כֹּ֖ל צִפֹּ֥ור כָּל־כָּנָֽף׃
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "query = \"\"\"\n", "gapphrase function=Subj\n", "\"\"\"\n", "results = A.search(query, sets=customSets)\n", "A.table(results, start=1, end=3)" ] }, { "cell_type": "code", "execution_count": 88, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T08:05:41.293060Z", "start_time": "2018-05-24T08:05:41.237943Z" } }, "outputs": [ { "data": { "text/html": [ "

result 1" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

clause WayX NA
phrase CP Conj
function=Conj
phrase VP Pred
function=Pred
phrase NP Subj
function=Subj
phrase AdjP PreC
function=PreC
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

result 2" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

clause WXQt NA
phrase CP Conj
function=Conj
phrase PrNP Subj
function=Subj
phrase VP Pred
function=Pred
phrase PrNP Subj
function=Subj
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

result 3" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "A.show(results, start=1, end=3, condenseType=\"clause\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We reduce the details by setting the `baseType` to `phrase`.\n", "The highlighted phrases will now get a yellow background." ] }, { "cell_type": "code", "execution_count": 89, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T08:05:41.293060Z", "start_time": "2018-05-24T08:05:41.237943Z" } }, "outputs": [ { "data": { "text/html": [ "

result 3" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

verse
sentence 17
clause Ellp NA
phrase הֵ֜מָּה וְכָל־הַֽחַיָּ֣ה לְמִינָ֗הּ וְכָל־הַבְּהֵמָה֙ לְמִינָ֔הּ וְכָל־הָרֶ֛מֶשׂ
function=Subj
clause Ptcp Attr
phrase הָ
function=Rela
phrase רֹמֵ֥שׂ
function=PreC
phrase עַל־הָאָ֖רֶץ
function=Cmpl
clause Ellp NA
phrase לְמִינֵ֑הוּ וְכָל־הָעֹ֣וף לְמִינֵ֔הוּ כֹּ֖ל צִפֹּ֥ור כָּל־כָּנָֽף׃
function=Subj
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "A.show(results, start=3, end=3, baseTypes=\"phrase\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We reduce the details by setting the `baseType` to `phrase_atom`.\n", "The highlighted phrases will not get a yellow background now." ] }, { "cell_type": "code", "execution_count": 90, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T08:05:41.293060Z", "start_time": "2018-05-24T08:05:41.237943Z" } }, "outputs": [ { "data": { "text/html": [ "

result 3" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "A.show(results, start=3, end=3, baseTypes={\"phrase_atom\"})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Two-phrase clauses\n", "\n", "We can find the gaps, but do our minds always reckon with gaps?\n", "Gaps cause unexpected semantics.\n", "Here is a little puzzle.\n", "\n", "Suppose we want to count the clauses consisting of exactly two phrases.\n", "\n", "Here follows a little journey.\n", "We use a query to find the clauses, check the result with hand-coding, scratch our heads,\n", "refine the query, the hand-coding and our question until we are satisfied.\n", "\n", "### Attempt 1\n", "\n", "#### By query\n", "\n", "The following template should do it:\n", "a clause, starting with a phrase, followed by an adjacent phrase,\n", "which terminates the clause." ] }, { "cell_type": "code", "execution_count": 91, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T08:56:03.852429Z", "start_time": "2018-05-24T08:56:03.849179Z" } }, "outputs": [], "source": [ "query = \"\"\"\n", "clause\n", " =: phrase\n", " <: phrase\n", " :=\n", "\"\"\"" ] }, { "cell_type": "code", "execution_count": 92, "metadata": {}, "outputs": [], "source": [ "# ignore this\n", "# S.tweakPerformance(yarnRatio=1.2)" ] }, { "cell_type": "code", "execution_count": 93, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Checking search template ...\n", " 0.00s Setting up search space for 3 objects ...\n", " 0.08s Constraining search space with 5 relations ...\n", " 0.27s \t2 edges thinned\n", " 0.27s Setting up retrieval plan with strategy small_choice_multi ...\n", " 0.29s Ready to deliver results from 264393 nodes\n", "Iterate over S.fetch() to get the results\n", "See S.showPlan() to interpret the results\n" ] } ], "source": [ "S.study(query)" ] }, { "cell_type": "code", "execution_count": 94, "metadata": {}, "outputs": [ { "name": "stdout", 
"output_type": "stream", "text": [ "Search with 3 objects and 5 relations\n", "Results are instantiations of the following objects:\n", "node 0-clause 88131 choices\n", "node 1-phrase 88131 choices\n", "node 2-phrase 88131 choices\n", "Performance parameters:\n", "\tyarnRatio = 1.25\n", "\ttryLimitFrom = 40\n", "\ttryLimitTo = 40\n", "Instantiations are computed along the following relations:\n", "node 0-clause 88131 choices\n", "edge 0-clause [[ 2-phrase 1.0 choices\n", "edge 2-phrase := 0-clause 0 choices\n", "edge 2-phrase :> 1-phrase 0.3 choices\n", "edge 1-phrase ]] 0-clause 0 choices\n", "edge 0-clause =: 1-phrase 0 choices\n", " 1.53s The results are connected to the original search template as follows:\n", " 0 \n", " 1 R0 clause\n", " 2 R1 =: phrase\n", " 3 R2 <: phrase\n", " 4 :=\n", " 5 \n" ] } ], "source": [ "S.showPlan(details=True)" ] }, { "cell_type": "code", "execution_count": 95, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T08:56:06.276198Z", "start_time": "2018-05-24T08:56:05.153080Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.53s 23486 results\n" ] }, { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
npclausephrasephrase
1Genesis 1:3יְהִ֣י אֹ֑ור יְהִ֣י אֹ֑ור
2Genesis 1:4כִּי־טֹ֑וב כִּי־טֹ֑וב
3Genesis 1:7אֲשֶׁר֙ מִתַּ֣חַת לָרָקִ֔יעַ אֲשֶׁר֙ מִתַּ֣חַת לָרָקִ֔יעַ
4Genesis 1:7אֲשֶׁ֖ר מֵעַ֣ל לָרָקִ֑יעַ אֲשֶׁ֖ר מֵעַ֣ל לָרָקִ֑יעַ
5Genesis 1:10כִּי־טֹֽוב׃ כִּי־טֹֽוב׃
6Genesis 1:11מַזְרִ֣יעַ זֶ֔רַע מַזְרִ֣יעַ זֶ֔רַע
7Genesis 1:12כִּי־טֹֽוב׃ כִּי־טֹֽוב׃
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "results = A.search(query)\n", "A.table(results, end=7)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we want to have the clauses only, we run it in shallow mode:" ] }, { "cell_type": "code", "execution_count": 96, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.43s 23486 results\n" ] } ], "source": [ "clausesByQuery = sorted(A.search(query, shallow=True))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note result 3 above: it seems we have 3 phrases. Yet there are only 2. We take a closer look:" ] }, { "cell_type": "code", "execution_count": 97, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
clause NmCl Attr
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "focus = results[2][0]\n", "A.pretty(focus)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One phrase is chunked into two *phrase atoms*, which are hidden by default. Let's make that more clear:" ] }, { "cell_type": "code", "execution_count": 98, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
clause NmCl Attr
clause_atom 10
phrase CP Rela
phrase_atom CP NA
phrase PP PreC
phrase_atom PP NA
phrase_atom PP Spec
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "A.pretty(focus, hideTypes=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### By hand\n", "\n", "Let us check this with a piece of hand-written code.\n", "We want clauses that consist of exactly two phrases." ] }, { "cell_type": "code", "execution_count": 99, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T08:56:12.592108Z", "start_time": "2018-05-24T08:56:11.096022Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s counting ...\n", " 0.21s Done: found 23864\n" ] } ], "source": [ "A.indent(reset=True)\n", "A.info(\"counting ...\")\n", "\n", "clausesByHand = []\n", "for clause in F.otype.s(\"clause\"):\n", " phrases = L.d(clause, otype=\"phrase\")\n", " if len(phrases) == 2:\n", " clausesByHand.append(clause)\n", "clausesByHand = sorted(clausesByHand)\n", "A.info(f\"Done: found {len(clausesByHand)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### The difference" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "Strange, we end up with more cases. What is happening? Let us compare the results.\n", "We look at the first result where both methods diverge.\n", "\n", "We put the difference finding in a little function." 
] }, { "cell_type": "code", "execution_count": 100, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T08:56:16.255454Z", "start_time": "2018-05-24T08:56:16.244135Z" } }, "outputs": [], "source": [ "def showDiff(queryResults, handResults):\n", "    # compare the sorted lists position by position (zip stops at the shorter list)\n", "    diff = [x for x in zip(queryResults, handResults) if x[0] != x[1]]\n", "    if not diff:\n", "        print(\n", "            f\"\"\"\n", "{len(queryResults):>6} queryResults\n", " are identical with\n", "{len(handResults):>6} handResults\n", "\"\"\"\n", "        )\n", "        return\n", "    (rQuery, rHand) = diff[0]\n", "    if rQuery < rHand:\n", "        print(f\"clause {rQuery} is a query result but not found by hand\")\n", "        toShow = rQuery\n", "    else:\n", "        print(f\"clause {rHand} is not a query result but has been found by hand\")\n", "        toShow = rHand\n", "    colors = [\"aqua\", \"aquamarine\", \"khaki\", \"lavender\", \"yellow\"]\n", "    highlights = {}\n", "    for (i, phrase) in enumerate(L.d(toShow, otype=\"phrase\")):\n", "        highlights[phrase] = colors[i % len(colors)]\n", "        for atom in L.d(phrase, otype=\"phrase_atom\"):\n", "            highlights[atom] = colors[i % len(colors)]\n", "    A.pretty(\n", "        toShow,\n", "        hideTypes=False,\n", "        withNodes=True,\n", "        suppress={\"lex\", \"sp\", \"vt\", \"vs\"},\n", "        highlights=highlights,\n", "        baseTypes=\"phrase_atom\",\n", "    )" ] }, { "cell_type": "code", "execution_count": 101, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T08:56:16.255454Z", "start_time": "2018-05-24T08:56:16.244135Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "clause 427937 is not a query result but has been found by hand\n" ] }, { "data": { "text/html": [ "
clause:427937 XYqt NA
clause_atom:516080 112
phrase:652701 NP Subj
phrase_atom:905945 כָל־
clause:427937 XYqt NA
clause_atom:516082 222
phrase:652703 VP PreO
phrase_atom:905947 יַֽהַרְגֵֽנִי׃
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "showDiff(clausesByQuery, clausesByHand)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lo and behold:\n", "\n", "* the hand-written code is right in a sense: this is a clause that consists exactly of two phrases.\n", "* the query is also right in a sense: the two phrases are not adjacent: there is a gap in the clause between them!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Attempt 2\n", "\n", "#### By hand\n", "\n", "We modify the hand-written code such that only clauses qualify if the two phrases are adjacent." ] }, { "cell_type": "code", "execution_count": 102, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T08:56:12.592108Z", "start_time": "2018-05-24T08:56:11.096022Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s counting ...\n", " 0.24s Done: found 23403\n" ] } ], "source": [ "A.indent(reset=True)\n", "A.info(\"counting ...\")\n", "\n", "clausesByHand2 = []\n", "for clause in F.otype.s(\"clause\"):\n", " phrases = L.d(clause, otype=\"phrase\")\n", " if len(phrases) == 2:\n", " if L.d(phrases[0], otype=\"word\")[-1] + 1 == L.d(phrases[1], otype=\"word\")[0]:\n", " clausesByHand2.append(clause)\n", "clausesByHand2 = sorted(clausesByHand2)\n", "A.info(f\"Done: found {len(clausesByHand2)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### The difference\n", "\n", "Now we have less cases. What is going on?" ] }, { "cell_type": "code", "execution_count": 103, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T08:56:16.255454Z", "start_time": "2018-05-24T08:56:16.244135Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "clause 428698 is a query result but not found by hand\n" ] }, { "data": { "text/html": [ "
clause:428698 WxQ0 NA
clause_atom:516896 427
phrase:655130 CP Conj
phrase_atom:908523 וְ
phrase:655131 PP Objc
phrase_atom:908524 גַם֩ אֶת־לֹ֨וט
phrase_atom:908525 אָחִ֤יו
phrase_atom:908526 וּ
phrase_atom:908527 רְכֻשֹׁו֙
phrase:655132 VP Pred
phrase_atom:908528 הֵשִׁ֔יב
phrase:655131 PP Objc
phrase_atom:908529 וְ
phrase_atom:908530 גַ֥ם אֶת־הַנָּשִׁ֖ים וְאֶת־הָעָֽם׃
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "showDiff(clausesByQuery, clausesByHand2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Observe:\n", "\n", "This clause has three phrases, but the third one lies inside the second one.\n", "\n", "* the hand-written code is right in a sense: this clause has three phrases.\n", "* the query is right in a sense: it contains two adjacent phrases that together span the whole clause." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Attempt 3\n", "\n", "#### By query\n", "\n", "Can we adjust the pattern to exclude cases like this?\n", "Yes, with custom sets, see [advanced](searchAdvanced.ipynb).\n", "\n", "Instead of looking through all phrases, we can just consider non gapped phrases only.\n", "\n", "Earlier in this notebook we have constructed the set of non-gapped phrases\n", "and put it under the name `conphrase` in the custom sets." ] }, { "cell_type": "code", "execution_count": 104, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.43s 23330 results\n" ] } ], "source": [ "query = \"\"\"\n", "clause\n", " =: conphrase\n", " <: conphrase\n", " :=\n", "\"\"\"\n", "\n", "clausesByQuery2 = sorted(A.search(query, sets=customSets, shallow=True))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### The difference\n", "\n", "There is still a difference." ] }, { "cell_type": "code", "execution_count": 105, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T08:56:16.255454Z", "start_time": "2018-05-24T08:56:16.244135Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "clause 428380 is not a query result but has been found by hand\n" ] }, { "data": { "text/html": [ "
clause:428380 Ellp NA
clause_atom:516560 402
phrase:654133 CP Conj
phrase_atom:907448 וְֽ
phrase:654134 PP Objc
phrase_atom:907449 אֶת־פַּתְרֻסִ֞ים וְאֶת־כַּסְלֻחִ֗ים
clause:428380 Ellp NA
clause_atom:516562 220
phrase:654134 PP Objc
phrase_atom:907454 וְ
phrase_atom:907455 אֶת־כַּפְתֹּרִֽים׃ ס
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "showDiff(clausesByQuery2, clausesByHand2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Observe:\n", "\n", "This clause has two phrases, the second one has a gap, which coincides with a gap in the clause.\n", "\n", "* the hand-written code is right in a sense: this clause has two phrases, adjacent, and they span the whole clause, nothing left out.\n", "* the query is right in a sense: the second phrase is not consecutive." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Attempt 4\n", "\n", "#### By hand\n", "\n", "We modify the hand-written code, so that only consecutive clauses qualify." ] }, { "cell_type": "code", "execution_count": 106, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T08:56:12.592108Z", "start_time": "2018-05-24T08:56:11.096022Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s counting ...\n", " 0.33s Done: found 23330\n" ] } ], "source": [ "A.indent(reset=True)\n", "A.info(\"counting ...\")\n", "\n", "clausesByHand3 = []\n", "for clause in F.otype.s(\"clause\"):\n", " if hasGap(clause):\n", " continue\n", " phrases = L.d(clause, otype=\"phrase\")\n", " if len(phrases) == 2:\n", " if L.d(phrases[0], otype=\"word\")[-1] + 1 == L.d(phrases[1], otype=\"word\")[0]:\n", " clausesByHand3.append(clause)\n", "clausesByHand3 = sorted(clausesByHand3)\n", "A.info(f\"Done: found {len(clausesByHand3)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### The difference\n", "\n", "Now the number of results agree. But are they really the same?" 
] }, { "cell_type": "code", "execution_count": 107, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T08:56:16.255454Z", "start_time": "2018-05-24T08:56:16.244135Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", " 23330 queryResults\n", " are identical with\n", " 23330 handResults\n", "\n" ] } ], "source": [ "showDiff(clausesByQuery2, clausesByHand3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Conclusion\n", "\n", "It took four attempts to pin down precisely what we were looking for.\n", "\n", "Sometimes the search template had to be modified, sometimes the hand-written code.\n", "\n", "The interplay and systematic comparison between the attempts helped to spot all relevant\n", "configurations of phrases within clauses." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Spans\n", "\n", "Here is another cause of wrong query results: there are sentences that span multiple verses.\n", "Such sentences are not contained in any verse.\n", "As a result, they are easily missed by queries.\n", "\n", "We describe a scenario where that happens.\n", "\n", "## Mother clauses\n", "\n", "A clause and its mother do not have to be in the same verse.\n", "We are going to fetch the cases where they are in different verses.\n", "\n", "### All mother clauses\n", "\n", "But first we fetch all pairs of clauses connected by a mother edge." ] }, { "cell_type": "code", "execution_count": 108, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T08:00:06.688698Z", "start_time": "2018-05-24T08:00:05.864656Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.08s 13917 results\n" ] }, { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
npclausephrasephrase
1Genesis 1:3יְהִ֣י אֹ֑ור יְהִ֣י אֹ֑ור
2Genesis 1:4כִּי־טֹ֑וב כִּי־טֹ֑וב
3Genesis 1:7אֲשֶׁר֙ מִתַּ֣חַת לָרָקִ֔יעַ אֲשֶׁר֙ מִתַּ֣חַת לָרָקִ֔יעַ
4Genesis 1:7אֲשֶׁ֖ר מֵעַ֣ל לָרָקִ֑יעַ אֲשֶׁ֖ר מֵעַ֣ל לָרָקִ֑יעַ
5Genesis 1:10כִּי־טֹֽוב׃ כִּי־טֹֽוב׃
6Genesis 1:11מַזְרִ֣יעַ זֶ֔רַע מַזְרִ֣יעַ זֶ֔רַע
7Genesis 1:12כִּי־טֹֽוב׃ כִּי־טֹֽוב׃
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "query = \"\"\"\n", "clause\n", "-mother> clause\n", "\"\"\"\n", "allMotherPairs = A.search(query)\n", "A.table(results, end=7)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Mother in another verse\n", "\n", "Now we modify the query to the effect that mother and daughter must sit in distinct verses." ] }, { "cell_type": "code", "execution_count": 109, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T08:00:11.096751Z", "start_time": "2018-05-24T08:00:10.585477Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.12s 710 results\n" ] }, { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
nclauseclauseverseverse
1Genesis 1:18  וְלִמְשֹׁל֙ בַּיֹּ֣ום וּבַלַּ֔יְלָה Genesis 1:17  לְהָאִ֖יר עַל־הָאָֽרֶץ׃
2Genesis 2:7  וַיִּיצֶר֩ יְהוָ֨ה אֱלֹהִ֜ים אֶת־הָֽאָדָ֗ם עָפָר֙ מִן־הָ֣אֲדָמָ֔ה Genesis 2:4  בְּיֹ֗ום
3Genesis 7:3  לְחַיֹּ֥ות זֶ֖רַע עַל־פְּנֵ֥י כָל־הָאָֽרֶץ׃ Genesis 7:2  מִכֹּ֣ל׀ הַבְּהֵמָ֣ה הַטְּהֹורָ֗ה תִּֽקַּח־לְךָ֛ שִׁבְעָ֥ה שִׁבְעָ֖ה אִ֣ישׁ וְאִשְׁתֹּ֑ו
4Genesis 22:17  כִּֽי־בָרֵ֣ךְ אֲבָרֶכְךָ֗ Genesis 22:16  כִּ֗י
5Genesis 24:44  הִ֣וא הָֽאִשָּׁ֔ה Genesis 24:43  הָֽעַלְמָה֙
6Genesis 27:45  עַד־שׁ֨וּב אַף־אָחִ֜יךָ מִמְּךָ֗ Genesis 27:44  עַ֥ד אֲשֶׁר־תָּשׁ֖וּב חֲמַ֥ת אָחִֽיךָ׃
7Genesis 36:16  אַלּֽוּף־קֹ֛רַח אַלּ֥וּף גַּעְתָּ֖ם אַלּ֣וּף עֲמָלֵ֑ק Genesis 36:15  בְּנֵ֤י אֱלִיפַז֙ בְּכֹ֣ור עֵשָׂ֔ו אַלּ֤וּף תֵּימָן֙ אַלּ֣וּף אֹומָ֔ר אַלּ֥וּף צְפֹ֖ו אַלּ֥וּף קְנַֽז׃
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "query = \"\"\"\n", "cm:clause\n", "-mother> cd:clause\n", "\n", "v1:verse\n", "v2:verse\n", "v1 # v2\n", "\n", "cm ]] v1\n", "cd ]] v2\n", "\"\"\"\n", "diffMotherPairs = A.search(query)\n", "A.table(diffMotherPairs, end=7, skipCols=\"3 4\", withPassage=\"1 2\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Mother in same verse\n", "\n", "As a check,\n", "we modify the latter query and require `v1` and `v2` to be the same verse, to get the\n", "mother pairs of which both members are in the same verse." ] }, { "cell_type": "code", "execution_count": 110, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T08:00:11.096751Z", "start_time": "2018-05-24T08:00:10.585477Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.14s 13181 results\n" ] }, { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
nclauseclauseverseverse
1Genesis 1:4  כִּי־טֹ֑וב Genesis 1:4  וַיַּ֧רְא אֱלֹהִ֛ים אֶת־הָאֹ֖ור
2Genesis 1:10  כִּי־טֹֽוב׃ Genesis 1:10  וַיַּ֥רְא אֱלֹהִ֖ים
3Genesis 1:12  כִּי־טֹֽוב׃ Genesis 1:12  וַיַּ֥רְא אֱלֹהִ֖ים
4Genesis 1:14  לְהַבְדִּ֕יל בֵּ֥ין הַיֹּ֖ום וּבֵ֣ין הַלָּ֑יְלָה Genesis 1:14  יְהִ֤י מְאֹרֹת֙ בִּרְקִ֣יעַ הַשָּׁמַ֔יִם
5Genesis 1:15  לְהָאִ֖יר עַל־הָאָ֑רֶץ Genesis 1:15  וְהָי֤וּ לִמְאֹורֹת֙ בִּרְקִ֣יעַ הַשָּׁמַ֔יִם
6Genesis 1:17  לְהָאִ֖יר עַל־הָאָֽרֶץ׃ Genesis 1:17  וַיִּתֵּ֥ן אֹתָ֛ם אֱלֹהִ֖ים בִּרְקִ֣יעַ הַשָּׁמָ֑יִם
7Genesis 1:18  וּֽלֲהַבְדִּ֔יל בֵּ֥ין הָאֹ֖ור וּבֵ֣ין הַחֹ֑שֶׁךְ Genesis 1:18  וְלִמְשֹׁל֙ בַּיֹּ֣ום וּבַלַּ֔יְלָה
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "query = \"\"\"\n", "cm:clause\n", "-mother> cd:clause\n", "\n", "v1:verse\n", "v2:verse\n", "v1 = v2\n", "\n", "cm ]] v1\n", "cd ]] v2\n", "\"\"\"\n", "sameMotherPairs = A.search(query)\n", "A.table(sameMotherPairs, end=7, skipCols=\"3 4\", withPassage=\"1 2\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The difference\n", "\n", "Let's check if the numbers add up:\n", "\n", "* the first query asked for all pairs\n", "* the second query asked for pairs with members in different verses\n", "* the third query asked for pairs with members in the same verse\n", "\n", "Then the results of the second and third query combined should\n", "equal the results of the first query.\n", "\n", "That makes sense.\n", "\n", "Still, let's check:" ] }, { "cell_type": "code", "execution_count": 111, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T08:00:16.632029Z", "start_time": "2018-05-24T08:00:16.627787Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "26\n" ] } ], "source": [ "discrepancy = len(allMotherPairs) - len(diffMotherPairs) - len(sameMotherPairs)\n", "print(discrepancy)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The numbers do not add up. We are missing cases. Why?\n", "\n", "Clauses may cross verse boundaries. In that case they are not part of a verse, and hence our latter two queries\n", "do not detect them. Let's count how many verse boundary crossing clauses there are." 
] }, { "cell_type": "code", "execution_count": 112, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.59s 50 results\n" ] } ], "source": [ "query = \"\"\"\n", "clause\n", "/with/\n", "v1:verse\n", "&& ..\n", "v2:verse\n", "&& ..\n", "v1 < v2\n", "/-/\n", "\"\"\"\n", "results = A.search(query)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You might think we can speed up the query by requiring `v1 <: v2` (the two verses are adjacent).\n", "There are fewer possibilities to consider, so maybe we gain something." ] }, { "cell_type": "code", "execution_count": 113, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.56s 49 results\n" ] } ], "source": [ "query = \"\"\"\n", "clause\n", "/with/\n", "v1:verse\n", "&& ..\n", "v2:verse\n", "&& ..\n", "v1 <: v2\n", "/-/\n", "\"\"\"\n", "results = A.search(query)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Indeed, slightly faster, but one result fewer! How can that be?\n", "\n", "There must be a clause that overlaps with at least two verses and, in doing so, skips at least one verse in between.\n", "\n", "Let's find that one:" ] }, { "cell_type": "code", "execution_count": 114, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.91s 1 result\n" ] } ], "source": [ "query = \"\"\"\n", "clause\n", "/with/\n", "v1:verse\n", "&& ..\n", "v2:verse\n", "|| ..\n", "v3:verse\n", "&& ..\n", "v1 < v2\n", "v2 < v3\n", "v1 < v3\n", "/-/\n", "\"\"\"\n", "resultsX = A.search(query)" ] }, { "cell_type": "code", "execution_count": 115, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
npclause
11_Kings 8:41וְגַם֙ אֶל־הַנָּכְרִ֔י אַתָּ֞ה תִּשְׁמַ֤ע הַשָּׁמַ֨יִם֙ מְכֹ֣ון שִׁבְתֶּ֔ךָ
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

result 1" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

verse
sentence 92
clause WXYq NA
phrase וְ
phrase גַם֙ אֶל־הַנָּכְרִ֔י
clause NmCl Attr
phrase אֲשֶׁ֛ר
phrase לֹא־
phrase מֵעַמְּךָ֥ יִשְׂרָאֵ֖ל
phrase ה֑וּא
clause WQt0 Coor
phrase וּ
phrase בָ֛א
phrase מֵאֶ֥רֶץ רְחֹוקָ֖ה
phrase לְמַ֥עַן שְׁמֶֽךָ׃
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
verse
sentence 92
clause WXYq NA
phrase אַתָּ֞ה
phrase תִּשְׁמַ֤ע
phrase הַשָּׁמַ֨יִם֙ מְכֹ֣ון שִׁבְתֶּ֔ךָ
sentence 94
clause WQt0 NA
phrase וְ
phrase עָשִׂ֕יתָ
phrase כְּכֹ֛ל
clause xYqX RgRc
phrase אֲשֶׁר־
phrase יִקְרָ֥א
phrase אֵלֶ֖יךָ
phrase הַנָּכְרִ֑י
sentence 95
clause xYqX NA
phrase לְמַ֣עַן
phrase יֵדְעוּן֩
phrase כָּל־עַמֵּ֨י הָאָ֜רֶץ
phrase אֶת־שְׁמֶ֗ךָ
clause InfC Adju
phrase לְיִרְאָ֤ה
phrase אֹֽתְךָ֙
phrase כְּעַמְּךָ֣ יִשְׂרָאֵ֔ל
clause InfC Coor
phrase וְ
phrase לָדַ֕עַת
clause XQtl Objc
phrase כִּי־
phrase שִׁמְךָ֣
phrase נִקְרָ֔א
phrase עַל־הַבַּ֥יִת הַזֶּ֖ה
clause xQt0 Attr
phrase אֲשֶׁ֥ר
phrase בָּנִֽיתִי׃
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "A.table(resultsX)\n", "A.show(resultsX, baseTypes=\"clause_atom\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A more roundabout way to find the same clauses:" ] }, { "cell_type": "code", "execution_count": 116, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T08:00:20.987274Z", "start_time": "2018-05-24T08:00:17.973289Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 1.04s 50 results\n" ] } ], "source": [ "query = \"\"\"\n", "clause\n", " =: first:word\n", " last:word\n", " :=\n", "v1:verse\n", " w1:word\n", "v2:verse\n", " w2:word\n", "\n", "first = w1\n", "last = w2\n", "v1 # v2\n", "\"\"\"\n", "results = A.search(query)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some of these verse spanning clauses do not have mothers or are not mothers. Let's count the cases where two clauses\n", "are in a mother relation and at least one of them spans a verse.\n", "\n", "We need two queries for that. These queries are almost similar. One retrieves the clause pairs where the mother\n", "crosses verse boundaries, and the other where the daughter does so.\n", "\n", "But we are programmers. 
We do not have to repeat ourselves:" ] }, { "cell_type": "code", "execution_count": 117, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 26 spanners are missing\n", " 26 missing cases were detected before\n", "  0 is the resulting disagreement\n" ] } ], "source": [ "queryCommon = \"\"\"\n", "c1:clause\n", "-mother> c2:clause\n", "\n", "c3:clause\n", "/with/\n", "v1:verse\n", "&& ..\n", "v2:verse\n", "&& ..\n", "v1 < v2\n", "/-/\n", "\"\"\"\n", "\n", "query1 = f\"\"\"\n", "{queryCommon}\n", "c1 = c3\n", "\"\"\"\n", "query2 = f\"\"\"\n", "{queryCommon}\n", "c2 = c3\n", "\"\"\"\n", "\n", "results1 = A.search(query1, silent=True)\n", "results2 = A.search(query2, silent=True)\n", "spannersByQuery = {(r[0], r[1]) for r in results1 + results2}\n", "print(f\"{len(spannersByQuery):>3} spanners are missing\")\n", "print(f\"{discrepancy:>3} missing cases were detected before\")\n", "print(f\"{discrepancy - len(spannersByQuery):>3} is the resulting disagreement\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By hand-coding, we can find the mother clause pairs in which at least one member is verse spanning in an easier way:\n", "\n", "Starting with the set of all mother pairs, we pick out every pair that has a verse spanner." 
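,
"\n",
"\n",
"The hand-coded test relies on the fact that a clause crossing a verse boundary has no embedding verse,\n",
"so `L.u(clause, otype=\"verse\")` comes back empty for it.\n",
"Here is a minimal sketch of that idea in plain Python, without Text-Fabric.\n",
"The slot numbers and the helper `containing_verse` are invented for illustration; they do not come from the corpus.\n",
"\n",
"```python\n",
"# Toy model, not Text-Fabric: every node is a set of word slots.\n",
"verses = {'v1': set(range(1, 6)), 'v2': set(range(6, 11))}\n",
"clauses = {'c1': {1, 2, 3}, 'c2': {4, 5, 6, 7}, 'c3': {8, 9, 10}}\n",
"\n",
"\n",
"def containing_verse(clause_slots):\n",
"    # Return the name of the verse that holds all slots of the clause, or None.\n",
"    for name, verse_slots in verses.items():\n",
"        if clause_slots <= verse_slots:\n",
"            return name\n",
"    return None\n",
"\n",
"\n",
"embedded = {c: containing_verse(slots) for c, slots in clauses.items()}\n",
"# Clauses without an embedding verse are the verse spanners.\n",
"spanners = sorted(c for c, v in embedded.items() if v is None)\n",
"print(embedded)\n",
"print(spanners)\n",
"```\n",
"\n",
"In this toy model only `c2` straddles the boundary between the two verses, so it is the only clause without an embedding verse."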
] }, { "cell_type": "code", "execution_count": 118, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "26" ] }, "execution_count": 118, "metadata": {}, "output_type": "execute_result" } ], "source": [ "spannersByHand = set()\n", "\n", "for (c1, c2) in allMotherPairs:\n", "    # a clause that crosses a verse boundary has no embedding verse\n", "    if not (L.u(c1, otype=\"verse\") and L.u(c2, otype=\"verse\")):\n", "        spannersByHand.add((c1, c2))\n", "\n", "len(spannersByHand)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And, to be completely sure:" ] }, { "cell_type": "code", "execution_count": 119, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 119, "metadata": {}, "output_type": "execute_result" } ], "source": [ "spannersByHand == spannersByQuery" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### By custom sets\n", "\n", "If we are content with the clauses that do not span verses,\n", "we can put them in a set, and modify the queries by replacing `clause` by `conclause`\n", "and binding the right set to it.\n", "\n", "Here we go. In one cell we run the queries to get all pairs, the mother-daughter-in-separate-verses pairs,\n", "and the mother-daughter-in-same-verse pairs, and we check that the numbers add up." 
] }, { "cell_type": "code", "execution_count": 120, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "All pairs\n", " 0.07s 13891 results\n", "Different verse pairs\n", " 0.09s 710 results\n", "Same verse pairs\n", " 0.11s 13181 results\n", "Intersection same-verse/different-verse pairs: set()\n", "All pairs is union of same-verse/different-verse pairs: True\n" ] } ], "source": [ "conClauses = {c for c in F.otype.s(\"clause\") if L.u(c, otype=\"verse\")}\n", "customSets = dict(conclause=conClauses)\n", "\n", "print(\"All pairs\")\n", "allPairs = A.search(\n", " \"\"\"\n", "conclause\n", "-mother> conclause\n", "\"\"\",\n", " sets=customSets,\n", ")\n", "\n", "print(\"Different verse pairs\")\n", "diffPairs = A.search(\n", " \"\"\"\n", "cm:conclause\n", "-mother> cd:conclause\n", "\n", "v1:verse\n", "v2:verse\n", "v1 # v2\n", "\n", "cm ]] v1\n", "cd ]] v2\n", "\"\"\",\n", " sets=customSets,\n", ")\n", "\n", "print(\"Same verse pairs\")\n", "samePairs = A.search(\n", " \"\"\"\n", "cm:conclause\n", "-mother> cd:conclause\n", "\n", "v1:verse\n", "v2:verse\n", "v1 = v2\n", "\n", "cm ]] v1\n", "cd ]] v2\n", "\"\"\",\n", " sets=customSets,\n", ")\n", "\n", "allPairSet = set(allPairs)\n", "diffPairSet = {(r[0], r[1]) for r in diffPairs}\n", "samePairSet = {(r[0], r[1]) for r in samePairs}\n", "\n", "print(f\"Intersection same-verse/different-verse pairs: {samePairSet & diffPairSet}\")\n", "print(\n", " f\"All pairs is union of same-verse/different-verse pairs: {allPairSet == (samePairSet | diffPairSet)}\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Lessons\n", "\n", "* mix programming with composing queries;\n", "* a good way to do so is custom sets;\n", "* use programming for processing results;\n", "* find the balance between queries and hand-coding." 
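,
"\n",
"\n",
"The bookkeeping behind these lessons can be sketched without a corpus.\n",
"The toy data below is made up: `verse_of` maps each clause to its verse, or to `None` when the clause crosses a verse boundary.\n",
"The check is that same-verse pairs, different-verse pairs and pairs with a verse spanner are disjoint and together make up all pairs.\n",
"\n",
"```python\n",
"# Toy data, not taken from the corpus.\n",
"verse_of = {'a': 1, 'b': 1, 'c': 2, 'd': None, 'e': 3}\n",
"all_pairs = {('b', 'a'), ('c', 'a'), ('d', 'c'), ('e', 'd')}\n",
"\n",
"same = {\n",
"    (x, y) for (x, y) in all_pairs\n",
"    if verse_of[x] is not None and verse_of[x] == verse_of[y]\n",
"}\n",
"diff = {\n",
"    (x, y) for (x, y) in all_pairs\n",
"    if None not in (verse_of[x], verse_of[y]) and verse_of[x] != verse_of[y]\n",
"}\n",
"spanning = {\n",
"    (x, y) for (x, y) in all_pairs\n",
"    if verse_of[x] is None or verse_of[y] is None\n",
"}\n",
"\n",
"# The three classes must not overlap and must cover all pairs.\n",
"assert not (same & diff) and not (same & spanning) and not (diff & spanning)\n",
"assert all_pairs == same | diff | spanning\n",
"print(len(all_pairs), len(same), len(diff), len(spanning))\n",
"```\n",
"\n",
"If the third class is forgotten, the counts stop adding up, which is exactly the kind of discrepancy we ran into above."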
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# All steps\n", "\n", "* **[start](start.ipynb)** your first step in mastering the bible computationally\n", "* **[display](display.ipynb)** become an expert in creating pretty displays of your text structures\n", "* **[search](search.ipynb)** turbo charge your hand-coding with search templates\n", "\n", "---\n", "\n", "[advanced](searchAdvanced.ipynb)\n", "[sets](searchSets.ipynb)\n", "[relations](searchRelations.ipynb)\n", "[quantifiers](searchQuantifiers.ipynb)\n", "[from MQL](searchFromMQL.ipynb)\n", "[rough](searchRough.ipynb)\n", "gaps\n", "\n", "You have now finished the search tutorial.\n", "\n", "Share the work!\n", "\n", "---\n", "\n", "* **[export Excel](exportExcel.ipynb)** make tailor-made spreadsheets out of your results\n", "* **[share](share.ipynb)** draw in other people's data and let them use yours\n", "* **[export](export.ipynb)** export your dataset as an Emdros database\n", "* **[annotate](annotate.ipynb)** annotate plain text by means of other tools and import the annotations as TF features\n", "* **[map](map.ipynb)** map somebody else's annotations to a new version of the corpus\n", "* **[volumes](volumes.ipynb)** work with selected books only\n", "* **[trees](trees.ipynb)** work with the BHSA data as syntax trees\n", "\n", "CC-BY Dirk Roorda" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.0" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 4 }