{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n", "\n", "You might want to consider the [start](start.ipynb) of this tutorial.\n", "\n", "Short introductions to other TF datasets:\n", "\n", "* [Dead Sea Scrolls](https://nbviewer.jupyter.org/github/annotation/tutorials/blob/master/lorentz2020/dss.ipynb)\n", "* [Old Babylonian Letters](https://nbviewer.jupyter.org/github/annotation/tutorials/blob/master/lorentz2020/oldbabylonian.ipynb)\n", "* [Quran](https://nbviewer.jupyter.org/github/annotation/tutorials/blob/master/lorentz2020/quran.ipynb)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Trees\n", "\n", "The textual objects of the BHSA text are syntactic, but they are not syntax trees.\n", "\n", "The BHSA is the result of a data-driven parsing strategy with occasional human decisions.\n", "It results in functional objects such as sentences, clauses, and phrases,\n", "which are built from chunks called sentence-atoms, clause-atoms, and phrase-atoms.\n", "\n", "There is no deeper nesting of clauses within phrases, or even clauses within clauses or phrases within phrases.\n", "Instead, whenever objects are linguistically nested, there is an edge called `mother` between the\n", "objects in question.\n", "\n", "For people who prefer to think in trees, we have unwrapped the `mother` relationship between clauses\n", "and made tree structures out of the data.\n", "\n", "The whole generation process of trees, including the quirks along the way, is documented\n", "in the notebook\n", "[trees.ipynb](https://nbviewer.jupyter.org/github/etcbc/trees/blob/master/programs/trees.ipynb).\n", "You see it done there for version 2017.\n", "We have used an ordinary Python program to generate trees for all versions of the BHSA:\n", "[alltrees.py](https://github.com/etcbc/trees/blob/master/programs/alltrees.py)\n", "\n", "Those trees are available as features on sentence nodes, and you can load those features\n", "alongside 
the BHSA data.\n", "\n", "Here we show some examples of what you can do with it." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%load_ext autoreload\n", "%autoreload 2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Incantation\n", "\n", "The ins and outs of installing Text-Fabric, getting the corpus, and initializing a notebook are\n", "explained in the [start tutorial](start.ipynb)." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T10:06:39.818664Z", "start_time": "2018-05-24T10:06:39.796588Z" } }, "outputs": [], "source": [ "from utils import structure, layout\n", "from tf.app import use" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that we load the trees module.\n", "\n", "We also load the morphology of Open Scriptures for example usage later on." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T10:06:51.615044Z", "start_time": "2018-05-24T10:06:50.161456Z" } }, "outputs": [ { "data": { "text/markdown": [ "**Locating corpus resources ...**" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "app: ~/text-fabric-data/github/ETCBC/bhsa/app" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "data: ~/text-fabric-data/github/ETCBC/bhsa/tf/2021" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "data: ~/text-fabric-data/github/ETCBC/trees/tf/2021" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "data: ~/text-fabric-data/github/ETCBC/bridging/tf/2021" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "data: ~/text-fabric-data/github/ETCBC/phono/tf/2021" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": 
"display_data" }, { "data": { "text/html": [ "data: ~/text-fabric-data/github/ETCBC/parallels/tf/2021" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ " | 0.89s T osm from ~/text-fabric-data/github/ETCBC/bridging/tf/2021\n", " | 0.12s T osm_sf from ~/text-fabric-data/github/ETCBC/bridging/tf/2021\n", " | 0.21s T tree from ~/text-fabric-data/github/ETCBC/trees/tf/2021\n", " | 0.30s T treen from ~/text-fabric-data/github/ETCBC/trees/tf/2021\n" ] }, { "data": { "text/html": [ "\n", " Text-Fabric: Text-Fabric API 12.0.4, ETCBC/bhsa/app v3, Search Reference
\n", " Data: ETCBC - bhsa 2021, Character table, Feature docs
\n", "
Node types\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "
Name | # of nodes | # slots/node | % coverage
book | 39 | 10938.21 | 100
chapter | 929 | 459.19 | 100
lex | 9230 | 46.22 | 100
verse | 23213 | 18.38 | 100
half_verse | 45179 | 9.44 | 100
sentence | 63717 | 6.70 | 100
sentence_atom | 64514 | 6.61 | 100
clause | 88131 | 4.84 | 100
clause_atom | 90704 | 4.70 | 100
phrase | 253203 | 1.68 | 100
phrase_atom | 267532 | 1.59 | 100
subphrase | 113850 | 1.42 | 38
word | 426590 | 1.00 | 100
\n", " Sets: no custom sets
\n", " Features:
\n", "
Parallel Passages\n", "
\n", "\n", "
\n", "
\n", "crossref\n", "
\n", "
int
\n", "\n", " 🆗 links between similar passages\n", "\n", "
\n", "\n", "
\n", "
\n", "\n", "
BHSA = Biblia Hebraica Stuttgartensia Amstelodamensis\n", "
\n", "\n", "
\n", "
\n", "book\n", "
\n", "
str
\n", "\n", " ✅ book name in Latin (Genesis; Numeri; Reges1; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "book@ll\n", "
\n", "
str
\n", "\n", " ✅ book name in amharic (ኣማርኛ)\n", "\n", "
\n", "\n", "
\n", "
\n", "chapter\n", "
\n", "
int
\n", "\n", " ✅ chapter number (1; 2; 3; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "code\n", "
\n", "
int
\n", "\n", " ✅ identifier of a clause atom relationship (0; 74; 367; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "det\n", "
\n", "
str
\n", "\n", " ✅ determinedness of phrase(atom) (det; und; NA.)\n", "\n", "
\n", "\n", "
\n", "
\n", "domain\n", "
\n", "
str
\n", "\n", " ✅ text type of clause (? (Unknown); N (narrative); D (discursive); Q (Quotation).)\n", "\n", "
\n", "\n", "
\n", "
\n", "freq_lex\n", "
\n", "
int
\n", "\n", " ✅ frequency of lexemes\n", "\n", "
\n", "\n", "
\n", "
\n", "function\n", "
\n", "
str
\n", "\n", " ✅ syntactic function of phrase (Cmpl; Objc; Pred; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "g_cons\n", "
\n", "
str
\n", "\n", " ✅ word consonantal-transliterated (B R>CJT BR> >LHJM ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "g_cons_utf8\n", "
\n", "
str
\n", "\n", " ✅ word consonantal-Hebrew (ב ראשׁית ברא אלהים)\n", "\n", "
\n", "\n", "
\n", "
\n", "g_lex\n", "
\n", "
str
\n", "\n", " ✅ lexeme pointed-transliterated (B.:- R;>CIJT B.@R@> >:ELOH ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "g_lex_utf8\n", "
\n", "
str
\n", "\n", " ✅ lexeme pointed-Hebrew (בְּ רֵאשִׁית בָּרָא אֱלֹה)\n", "\n", "
\n", "\n", "
\n", "
\n", "g_word\n", "
\n", "
str
\n", "\n", " ✅ word pointed-transliterated (B.:- R;>CI73JT B.@R@74> >:ELOHI92JM)\n", "\n", "
\n", "\n", "
\n", "
\n", "g_word_utf8\n", "
\n", "
str
\n", "\n", " ✅ word pointed-Hebrew (בְּ רֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים)\n", "\n", "
\n", "\n", "
\n", "
\n", "gloss\n", "
\n", "
str
\n", "\n", " 🆗 english translation of lexeme (beginning create god(s))\n", "\n", "
\n", "\n", "
\n", "
\n", "gn\n", "
\n", "
str
\n", "\n", " ✅ grammatical gender (m; f; NA; unknown.)\n", "\n", "
\n", "\n", "
\n", "
\n", "label\n", "
\n", "
str
\n", "\n", " ✅ (half-)verse label (half verses: A; B; C; verses: GEN 01,02)\n", "\n", "
\n", "\n", "
\n", "
\n", "language\n", "
\n", "
str
\n", "\n", " ✅ of word or lexeme (Hebrew; Aramaic.)\n", "\n", "
\n", "\n", "
\n", "
\n", "lex\n", "
\n", "
str
\n", "\n", " ✅ lexeme consonantal-transliterated (B R>CJT/ BR>[ >LHJM/)\n", "\n", "
\n", "\n", "
\n", "
\n", "lex_utf8\n", "
\n", "
str
\n", "\n", " ✅ lexeme consonantal-Hebrew (ב ראשׁית֜ ברא אלהים֜)\n", "\n", "
\n", "\n", "
\n", "
\n", "ls\n", "
\n", "
str
\n", "\n", " ✅ lexical set, subclassification of part-of-speech (card; ques; mult)\n", "\n", "
\n", "\n", "
\n", "
\n", "nametype\n", "
\n", "
str
\n", "\n", " ⚠️ named entity type (pers; mens; gens; topo; ppde.)\n", "\n", "
\n", "\n", "
\n", "
\n", "nme\n", "
\n", "
str
\n", "\n", " ✅ nominal ending consonantal-transliterated (absent; n/a; JM, ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "nu\n", "
\n", "
str
\n", "\n", " ✅ grammatical number (sg; du; pl; NA; unknown.)\n", "\n", "
\n", "\n", "
\n", "
\n", "number\n", "
\n", "
int
\n", "\n", " ✅ sequence number of an object within its context\n", "\n", "
\n", "\n", "
\n", "
\n", "otype\n", "
\n", "
str
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "
\n", "pargr\n", "
\n", "
str
\n", "\n", " 🆗 hierarchical paragraph number (1; 1.2; 1.2.3.4; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "pdp\n", "
\n", "
str
\n", "\n", " ✅ phrase dependent part-of-speech (art; verb; subs; nmpr, ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "pfm\n", "
\n", "
str
\n", "\n", " ✅ preformative consonantal-transliterated (absent; n/a; J, ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "prs\n", "
\n", "
str
\n", "\n", " ✅ pronominal suffix consonantal-transliterated (absent; n/a; W; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "prs_gn\n", "
\n", "
str
\n", "\n", " ✅ pronominal suffix gender (m; f; NA; unknown.)\n", "\n", "
\n", "\n", "
\n", "
\n", "prs_nu\n", "
\n", "
str
\n", "\n", " ✅ pronominal suffix number (sg; du; pl; NA; unknown.)\n", "\n", "
\n", "\n", "
\n", "
\n", "prs_ps\n", "
\n", "
str
\n", "\n", " ✅ pronominal suffix person (p1; p2; p3; NA; unknown.)\n", "\n", "
\n", "\n", "
\n", "
\n", "ps\n", "
\n", "
str
\n", "\n", " ✅ grammatical person (p1; p2; p3; NA; unknown.)\n", "\n", "
\n", "\n", "
\n", "
\n", "qere\n", "
\n", "
str
\n", "\n", " ✅ word pointed-transliterated masoretic reading correction\n", "\n", "
\n", "\n", "
\n", "
\n", "qere_trailer\n", "
\n", "
str
\n", "\n", " ✅ interword material -pointed-transliterated (Masoretic correction)\n", "\n", "
\n", "\n", "
\n", "
\n", "qere_trailer_utf8\n", "
\n", "
str
\n", "\n", " ✅ interword material -pointed-transliterated (Masoretic correction)\n", "\n", "
\n", "\n", "
\n", "
\n", "qere_utf8\n", "
\n", "
str
\n", "\n", " ✅ word pointed-Hebrew masoretic reading correction\n", "\n", "
\n", "\n", "
\n", "
\n", "rank_lex\n", "
\n", "
int
\n", "\n", " ✅ ranking of lexemes based on freqnuecy\n", "\n", "
\n", "\n", "
\n", "
\n", "rela\n", "
\n", "
str
\n", "\n", " ✅ linguistic relation between clause/(sub)phrase(atom) (ADJ; MOD; ATR; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "sp\n", "
\n", "
str
\n", "\n", " ✅ part-of-speech (art; verb; subs; nmpr, ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "st\n", "
\n", "
str
\n", "\n", " ✅ state of a noun (a (absolute); c (construct); e (emphatic).)\n", "\n", "
\n", "\n", "
\n", "
\n", "tab\n", "
\n", "
int
\n", "\n", " ✅ clause atom: its level in the linguistic embedding\n", "\n", "
\n", "\n", "
\n", "
\n", "trailer\n", "
\n", "
str
\n", "\n", " ✅ interword material pointed-transliterated (& 00 05 00_P ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "trailer_utf8\n", "
\n", "
str
\n", "\n", " ✅ interword material pointed-Hebrew (־ ׃)\n", "\n", "
\n", "\n", "
\n", "
\n", "txt\n", "
\n", "
str
\n", "\n", " ✅ text type of clause and surrounding (repetion of ? N D Q as in feature domain)\n", "\n", "
\n", "\n", "
\n", "
\n", "typ\n", "
\n", "
str
\n", "\n", " ✅ clause/phrase(atom) type (VP; NP; Ellp; Ptcp; WayX)\n", "\n", "
\n", "\n", "
\n", "
\n", "uvf\n", "
\n", "
str
\n", "\n", " ✅ univalent final consonant consonantal-transliterated (absent; N; J; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "vbe\n", "
\n", "
str
\n", "\n", " ✅ verbal ending consonantal-transliterated (n/a; W; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "vbs\n", "
\n", "
str
\n", "\n", " ✅ root formation consonantal-transliterated (absent; n/a; H; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "verse\n", "
\n", "
int
\n", "\n", " ✅ verse number\n", "\n", "
\n", "\n", "
\n", "
\n", "voc_lex\n", "
\n", "
str
\n", "\n", " ✅ vocalized lexeme pointed-transliterated (B.: R;>CIJT BR> >:ELOHIJM)\n", "\n", "
\n", "\n", "
\n", "
\n", "voc_lex_utf8\n", "
\n", "
str
\n", "\n", " ✅ vocalized lexeme pointed-Hebrew (בְּ רֵאשִׁית ברא אֱלֹהִים)\n", "\n", "
\n", "\n", "
\n", "
\n", "vs\n", "
\n", "
str
\n", "\n", " ✅ verbal stem (qal; piel; hif; apel; pael)\n", "\n", "
\n", "\n", "
\n", "
\n", "vt\n", "
\n", "
str
\n", "\n", " ✅ verbal tense (perf; impv; wayq; infc)\n", "\n", "
\n", "\n", "
\n", "
\n", "mother\n", "
\n", "
none
\n", "\n", " ✅ linguistic dependency between textual objects\n", "\n", "
\n", "\n", "
\n", "
\n", "oslots\n", "
\n", "
none
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "
\n", "\n", "
ETCBC/bridging/tf\n", "
\n", "\n", "
\n", "
\n", "osm\n", "
\n", "
str
\n", "\n", " 🆗 morphology tag (primary morpheme) by OpenScriptures (HR HVqp3ms)\n", "\n", "
\n", "\n", "
\n", "
\n", "osm_sf\n", "
\n", "
str
\n", "\n", " 🆗 morphology tag (secundary morpheme) by OpenScriptures\n", "\n", "
\n", "\n", "
\n", "
\n", "\n", "
Phonetic Transcriptions\n", "
\n", "\n", "
\n", "
\n", "phono\n", "
\n", "
str
\n", "\n", " 🆗 phonological transcription (bᵊ rēšˌîṯ bārˈā ʔᵉlōhˈîm)\n", "\n", "
\n", "\n", "
\n", "
\n", "phono_trailer\n", "
\n", "
str
\n", "\n", " 🆗 interword material in phonological transcription\n", "\n", "
\n", "\n", "
\n", "
\n", "\n", "
ETCBC/trees/tf\n", "
\n", "\n", "
\n", "
\n", "tree\n", "
\n", "
str
\n", "\n", " 🆗 sentence: penn treebank ((VP(vb 2)))\n", "\n", "
\n", "\n", "
\n", "
\n", "treen\n", "
\n", "
str
\n", "\n", " 🆗 sentence: penn treebank with node numbers included ((VP{651574}(vb 2)))\n", "\n", "
\n", "\n", "
\n", "
\n", "\n", " Settings:
specified
  1. apiVersion: 3
  2. appName: ETCBC/bhsa
  3. appPath: /Users/me/text-fabric-data/github/ETCBC/bhsa/app
  4. commit: gd905e3fb6e80d0fa537600337614adc2af157309
  5. css: ''
  6. dataDisplay:
    • exampleSectionHtml:<code>Genesis 1:1</code> (use <a href=\"https://github.com/{org}/{repo}/blob/master/tf/{version}/book%40en.tf\" target=\"_blank\">English book names</a>)
    • excludedFeatures:
      • g_uvf_utf8
      • g_vbs
      • kq_hybrid
      • languageISO
      • g_nme
      • lex0
      • is_root
      • g_vbs_utf8
      • g_uvf
      • dist
      • root
      • suffix_person
      • g_vbe
      • dist_unit
      • suffix_number
      • distributional_parent
      • kq_hybrid_utf8
      • crossrefSET
      • instruction
      • g_prs
      • lexeme_count
      • rank_occ
      • g_pfm_utf8
      • freq_occ
      • crossrefLCS
      • functional_parent
      • g_pfm
      • g_nme_utf8
      • g_vbe_utf8
      • kind
      • g_prs_utf8
      • suffix_gender
      • mother_object_type
    • noneValues:
      • none
      • unknown
      • no value
      • NA
  7. docs:
    • docBase: {docRoot}/{repo}
    • docExt: ''
    • docPage: ''
    • docRoot: https://{org}.github.io
    • featurePage: 0_home
  8. interfaceDefaults: {}
  9. isCompatible: True
  10. local: local
  11. localDir: /Users/me/text-fabric-data/github/ETCBC/bhsa/_temp
  12. provenanceSpec:
    • corpus: BHSA = Biblia Hebraica Stuttgartensia Amstelodamensis
    • doi: 10.5281/zenodo.1007624
    • moduleSpecs:
      • :
        • backend: no value
        • corpus: Phonetic Transcriptions
        • docUrl:https://nbviewer.jupyter.org/github/etcbc/phono/blob/master/programs/phono.ipynb
        • doi: 10.5281/zenodo.1007636
        • org: ETCBC
        • relative: /tf
        • repo: phono
      • :
        • backend: no value
        • corpus: Parallel Passages
        • docUrl:https://nbviewer.jupyter.org/github/ETCBC/parallels/blob/master/programs/parallels.ipynb
        • doi: 10.5281/zenodo.1007642
        • org: ETCBC
        • relative: /tf
        • repo: parallels
    • org: ETCBC
    • relative: /tf
    • repo: bhsa
    • version: 2021
    • webBase: https://shebanq.ancient-data.org/hebrew
    • webHint: Show this on SHEBANQ
    • webLang: la
    • webLexId: True
    • webUrl:{webBase}/text?book=<1>&chapter=<2>&verse=<3>&version={version}&mr=m&qw=q&tp=txt_p&tr=hb&wget=v&qget=v&nget=vt
    • webUrlLex: {webBase}/word?version={version}&id=<lid>
  13. release: v1.8
  14. typeDisplay:
    • clause:
      • label: {typ} {rela}
      • style: ''
    • clause_atom:
      • hidden: True
      • label: {code}
      • level: 1
      • style: ''
    • half_verse:
      • hidden: True
      • label: {label}
      • style: ''
      • verselike: True
    • lex:
      • featuresBare: gloss
      • label: {voc_lex_utf8}
      • lexOcc: word
      • style: orig
      • template: {voc_lex_utf8}
    • phrase:
      • label: {typ} {function}
      • style: ''
    • phrase_atom:
      • hidden: True
      • label: {typ} {rela}
      • level: 1
      • style: ''
    • sentence:
      • label: {number}
      • style: ''
    • sentence_atom:
      • hidden: True
      • label: {number}
      • level: 1
      • style: ''
    • subphrase:
      • hidden: True
      • label: {number}
      • style: ''
    • word:
      • features: pdp vs vt
      • featuresBare: lex:gloss
  15. writing: hbo
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Text-Fabric API: names N F E L T S C TF Fs Fall Es Eall Cs Call directly usable

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "A = use(\"ETCBC/bhsa\", mod=\"ETCBC/trees/tf,ETCBC/bridging/tf\", hoist=globals())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We first inspect the nature of these features, lets pick the first, last and middle sentence of\n", "the Hebrew Bible" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "sentences = F.otype.s(\"sentence\")\n", "examples = (sentences[0], sentences[len(sentences) // 2], sentences[-1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We examine feature `tree`:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(S(C(PP(pp 0)(n 1))(VP(vb 2))(NP(n 3))(PP(U(pp 4)(dt 5)(n 6))(cj 7)(U(pp 8)(dt 9)(n 10)))))\n", "(S(C(VP(vb 0))(NP(n 1))))\n", "(S(C(CP(cj 0))(VP(vb 1))))\n" ] } ], "source": [ "for s in examples:\n", " print(F.tree.v(s))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now `treen`:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(S{1172308}(C{427559}(PP{651573}(pp 0)(n 1))(VP{651574}(vb 2))(NP{651575}(n 3))(PP{651576}(U{1300539}(pp 4)(dt 5)(n 6))(cj 7)(U{1300540}(pp 8)(dt 9)(n 10)))))\n", "(S{1204166}(C{471249}(VP{782581}(vb 0))(NP{782582}(n 1))))\n", "(S{1236024}(C{515689}(CP{904774}(cj 0))(VP{904775}(vb 1))))\n" ] } ], "source": [ "for s in examples:\n", " print(F.treen.v(s))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The structure of the trees is the same, but `treen` has numbers between braces in the tags of the nodes.\n", "These numbers are the Text-Fabric nodes of the sentences, clauses and phrases that the nodes of the tree\n", "correspond to." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Using trees\n", "\n", "These strings are not very pleasant to the eye.\n", "For one thing, we see numbers instead of words.\n", "They also seem a bit unwieldy to integrate with the usual Text-Fabric business.\n", "But nothing is farther from the truth.\n", "\n", "We show how to\n", "\n", "* produce a multiline view\n", "* see the words (in several representations)\n", "* add a gloss\n", "* add morphological data from another project (**Open Scriptures**)\n", "\n", "Honesty compels us to note that we make use of a bunch of auxiliary functions in an\n", "accompanying `utils` package:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "ExecuteTime": { "end_time": "2018-04-14T08:44:09.565512Z", "start_time": "2018-04-14T08:44:09.559819Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Job 3:16 - first word = 336990\n", "\n", "tree =\n", "(S(C(Ccoor(CP(cj 0))(PP(pp 1)(U(n 2))(U(vb 3)))(NegP(ng 4))(VP(vb 5)))(Ccoor(PP(pp 6)(n 7)(Cattr(NegP(ng 8))(VP(vb 9))(NP(n 10)))))))\n" ] } ], "source": [ "passage = (\"Job\", 3, 16)\n", "passageStr = \"{} {}:{}\".format(*passage)\n", "verse = T.nodeFromSection(passage)\n", "sentence = L.d(verse, otype=\"sentence\")[0]\n", "firstSlot = L.d(sentence, otype=\"word\")[0]\n", "stringTree = F.tree.v(sentence)\n", "print(f\"{passageStr} - first word = {firstSlot}\\n\\ntree =\\n{stringTree}\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Parsing\n", "\n", "The key to effective manipulation of tree strings is to parse them into tree structures: lists of lists.\n", "\n", "Here we use the generic utility `structure()`:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "ExecuteTime": { "end_time": "2018-04-14T08:44:11.848250Z", "start_time": "2018-04-14T08:44:11.837350Z" }, 
"slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "['S',\n", " ['C',\n", " ['Ccoor',\n", " ['CP', [('cj', 0)]],\n", " ['PP', [('pp', 1)], ['U', [('n', 2)]], ['U', [('vb', 3)]]],\n", " ['NegP', [('ng', 4)]],\n", " ['VP', [('vb', 5)]]],\n", " ['Ccoor',\n", " ['PP',\n", " [('pp', 6)],\n", " [('n', 7)],\n", " ['Cattr',\n", " ['NegP', [('ng', 8)]],\n", " ['VP', [('vb', 9)]],\n", " ['NP', [('n', 10)]]]]]]]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tree = structure(stringTree)\n", "tree" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Apply layout\n", "\n", "Having the real tree structure in hand, we can layout it in all kinds of ways.\n", "We use the generic utility `layout()` to\n", "display it a bit more friendly and to replace the numbers by real Text-Fabric slot numbers:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "ExecuteTime": { "end_time": "2018-04-14T08:44:13.740602Z", "start_time": "2018-04-14T08:44:13.736048Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " S\n", " C\n", " Ccoor\n", " CP\n", " cj 336990\n", " PP\n", " pp 336991\n", " U\n", " n 336992\n", " U\n", " vb 336993\n", " NegP\n", " ng 336994\n", " VP\n", " vb 336995\n", " Ccoor\n", " PP\n", " pp 336996\n", " n 336997\n", " Cattr\n", " NegP\n", " ng 336998\n", " VP\n", " vb 336999\n", " NP\n", " n 337000\n" ] } ], "source": [ "print(layout(tree, firstSlot, str))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That opens up the way to get the words in.\n", "The third argument of `layout()` above is `str`, which is a function that is applied to the slot numbers.\n", "It returns those numbers as string, and this is what ends up in the layout." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Filling in the words\n", "\n", "We can pass any function, why not the function that looks up the word?\n", "\n", "Remember that `F.g_word_utf8.v` is a function that returns the full Hebrew word given a slot node." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "ExecuteTime": { "end_time": "2018-04-14T08:44:13.740602Z", "start_time": "2018-04-14T08:44:13.736048Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " S\n", " C\n", " Ccoor\n", " CP\n", " cj אֹ֚ו\n", " PP\n", " pp כְ\n", " U\n", " n נֵ֣פֶל\n", " U\n", " vb טָ֭מוּן\n", " NegP\n", " ng לֹ֣א\n", " VP\n", " vb אֶהְיֶ֑ה\n", " Ccoor\n", " PP\n", " pp כְּ֝\n", " n עֹלְלִ֗ים\n", " Cattr\n", " NegP\n", " ng לֹא\n", " VP\n", " vb רָ֥אוּ\n", " NP\n", " n אֹֽור\n" ] } ], "source": [ "print(layout(tree, firstSlot, F.g_word_utf8.v))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Add a gloss" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "ExecuteTime": { "end_time": "2018-04-14T08:44:17.703769Z", "start_time": "2018-04-14T08:44:17.697447Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " S\n", " C\n", " Ccoor\n", " CP\n", " cj אֹ֚ו \"or\"\n", " PP\n", " pp כְ \"as\"\n", " U\n", " n נֵ֣פֶל \"miscarriage\"\n", " U\n", " vb טָ֭מוּן \"hide\"\n", " NegP\n", " ng לֹ֣א \"not\"\n", " VP\n", " vb אֶהְיֶ֑ה \"be\"\n", " Ccoor\n", " PP\n", " pp כְּ֝ \"as\"\n", " n עֹלְלִ֗ים \"child\"\n", " Cattr\n", " NegP\n", " ng לֹא \"not\"\n", " VP\n", " vb רָ֥אוּ \"see\"\n", " NP\n", " n אֹֽור \"light\"\n" ] } ], "source": [ "def gloss(n):\n", " lexNode = L.u(n, otype=\"lex\")[0]\n", " return f'{F.g_word_utf8.v(n)} \"{F.gloss.v(lexNode)}\"'\n", "\n", "\n", "print(layout(tree, firstSlot, gloss))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Morphology\n", "\n", "In 2018 I 
compared the morphology of Open Scriptures with that of the BHSA.\n", "See [bridging](https://nbviewer.jupyter.org/github/ETCBC/bridging/blob/master/programs/BHSAbridgeOSM.ipynb).\n", "\n", "As a by-product I saved their morphology as a Text-Fabric feature on words.\n", "So we can add it to our trees.\n", "\n", "We also show the nesting depth in the resulting tree." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "ExecuteTime": { "end_time": "2018-04-14T08:44:17.703769Z", "start_time": "2018-04-14T08:44:17.697447Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 1 S\n", " 2 C\n", " 3 Ccoor\n", " 4 CP\n", " 5 cj (HC) אֹ֚ו [ˈʔô] \"or\"\n", " 4 PP\n", " 5 pp (HR) כְ [ḵᵊ] \"as\"\n", " 5 U\n", " 6 n (HNcmsa) נֵ֣פֶל [nˈēfel] \"miscarriage\"\n", " 5 U\n", " 6 vb (HVqsmsa) טָ֭מוּן [ˈṭāmûn] \"hide\"\n", " 4 NegP\n", " 5 ng (HTn) לֹ֣א [lˈō] \"not\"\n", " 4 VP\n", " 5 vb (HVqi1cs) אֶהְיֶ֑ה [ʔehyˈeh] \"be\"\n", " 3 Ccoor\n", " 4 PP\n", " 5 pp (HR) כְּ֝ [ˈkᵊ] \"as\"\n", " 5 n (HNcmpa) עֹלְלִ֗ים [ʕōlᵊlˈîm] \"child\"\n", " 5 Cattr\n", " 6 NegP\n", " 7 ng (HTn) לֹא [lō-] \"not\"\n", " 6 VP\n", " 7 vb (HVqp3cp) רָ֥אוּ [rˌāʔû] \"see\"\n", " 6 NP\n", " 7 n (HNcbsa) אֹֽור [ʔˈôr] \"light\"\n" ] } ], "source": [ "def osmPhonoGloss(n):\n", " lexNode = L.u(n, otype=\"lex\")[0]\n", " return (\n", " f'({F.osm.v(n)}) {F.g_word_utf8.v(n)} [{F.phono.v(n)}] \"{F.gloss.v(lexNode)}\"'\n", " )\n", "\n", "\n", "print(layout(tree, firstSlot, osmPhonoGloss, withLevel=True))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Taking it further\n", "\n", "We saw how the fact that we have slot numbers in our tree structures opens up all kinds of\n", "possibilities for further processing.\n", "\n", "However, so far, we have only made use of slot nodes.\n", "\n", "What if we want to draw in side information for the non-terminal nodes?\n", "\n", "That is where the feature `treen` comes in.\n", "It has node 
information for all non-terminals between braces, so it is fairly easy to write\n", "new `structure()` and `layout()` functions that exploit them.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# All steps\n", "\n", "* **[start](start.ipynb)** your first step in mastering the bible computationally\n", "* **[display](display.ipynb)** become an expert in creating pretty displays of your text structures\n", "* **[search](search.ipynb)** turbo charge your hand-coding with search templates\n", "* **[export Excel](exportExcel.ipynb)** make tailor-made spreadsheets out of your results\n", "* **[share](share.ipynb)** draw in other people's data and let them use yours\n", "* **[export](export.ipynb)** export your dataset as an Emdros database\n", "* **[annotate](annotate.ipynb)** annotate plain text by means of other tools and import the annotations as TF features\n", "* **[map](map.ipynb)** map somebody else's annotations to a new version of the corpus\n", "* **[volumes](volumes.ipynb)** work with selected books only\n", "* **trees** work with the BHSA data as syntax trees\n", "\n", "CC-BY Dirk Roorda" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.1" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 4 }