{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n", "\n", "# Tutorial\n", "\n", "This notebook gets you started with using\n", "[Text-Fabric](https://annotation.github.io/text-fabric/) for coding in the Hebrew Bible.\n", "\n", "Familiarity with the underlying\n", "[data model](https://annotation.github.io/text-fabric/tf/about/datamodel.html)\n", "is recommended.\n", "\n", "Short introductions to other TF datasets:\n", "\n", "* [Dead Sea Scrolls](https://nbviewer.jupyter.org/github/annotation/tutorials/blob/master/lorentz2020/dss.ipynb),\n", "* [Old Babylonian Letters](https://nbviewer.jupyter.org/github/annotation/tutorials/blob/master/lorentz2020/oldbabylonian.ipynb),\n", "or the\n", "* [Quran](https://nbviewer.jupyter.org/github/annotation/tutorials/blob/master/lorentz2020/quran.ipynb)\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%load_ext autoreload\n", "%autoreload 2" ] }, { "cell_type": "markdown", "metadata": { "jp-MarkdownHeadingCollapsed": true, "tags": [] }, "source": [ "## Installing Text-Fabric\n", "\n", "See [here](https://annotation.github.io/text-fabric/tf/about/install.html)" ] }, { "cell_type": "markdown", "metadata": { "jp-MarkdownHeadingCollapsed": true, "tags": [] }, "source": [ "## Tip\n", "If you start computing with this tutorial, first copy its parent directory to somewhere else,\n", "outside your repository.\n", "If you pull changes from the repository later, your work will not be overwritten.\n", "Where you put your tutorial directory is up to you.\n", "It will work from any directory." 
] }, { "cell_type": "markdown", "metadata": { "incorrectly_encoded_metadata": "jp-MarkdownHeadingCollapsed=true", "tags": [] }, "source": [ "## BHSA data\n", "\n", "Text-Fabric will fetch a standard set of features for you from the newest GitHub release binaries.\n", "\n", "It will fetch version `2021`.\n", "\n", "The data will be stored in the `text-fabric-data` in your home directory." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Incantation\n", "\n", "The simplest way to get going is by this *incantation*:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:17:17.537171Z", "start_time": "2018-05-18T09:17:17.517809Z" } }, "outputs": [], "source": [ "from tf.app import use" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the very last version, use `hot`.\n", "\n", "For the latest release, use `latest`.\n", "\n", "If you have cloned the repos (TF app and data), use `clone`.\n", "\n", "If you do not want/need to upgrade, leave out the checkout specifiers." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "**Locating corpus resources ...**" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "app: ~/text-fabric-data/github/ETCBC/bhsa/app" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "data: ~/text-fabric-data/github/ETCBC/bhsa/tf/2021" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "data: ~/text-fabric-data/github/ETCBC/phono/tf/2021" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "data: ~/text-fabric-data/github/ETCBC/parallels/tf/2021" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", " TF: TF API 12.2.6, ETCBC/bhsa/app v3, Search Reference
\n", " Data: ETCBC - bhsa 2021, Character table, Feature docs
\n", "
Node types

Name          | # of nodes | # slots / node | % coverage
book          |         39 |       10938.21 |        100
chapter       |        929 |         459.19 |        100
lex           |       9230 |          46.22 |        100
verse         |      23213 |          18.38 |        100
half_verse    |      45179 |           9.44 |        100
sentence      |      63717 |           6.70 |        100
sentence_atom |      64514 |           6.61 |        100
clause        |      88131 |           4.84 |        100
clause_atom   |      90704 |           4.70 |        100
phrase        |     253203 |           1.68 |        100
phrase_atom   |     267532 |           1.59 |        100
subphrase     |     113850 |           1.42 |         38
word          |     426590 |           1.00 |        100
\n", " Sets: no custom sets
\n", " Features:
\n", "
Parallel Passages

  crossref (int) 🆗 links between similar passages

BHSA = Biblia Hebraica Stuttgartensia Amstelodamensis

  book (str) ✅ book name in Latin (Genesis; Numeri; Reges1; ...)
  book@ll (str) ✅ book name in Amharic (ኣማርኛ)
  chapter (int) ✅ chapter number (1; 2; 3; ...)
  code (int) ✅ identifier of a clause atom relationship (0; 74; 367; ...)
  det (str) ✅ determinedness of phrase(atom) (det; und; NA.)
  domain (str) ✅ text type of clause (? (Unknown); N (narrative); D (discursive); Q (Quotation).)
  freq_lex (int) ✅ frequency of lexemes
  function (str) ✅ syntactic function of phrase (Cmpl; Objc; Pred; ...)
  g_cons (str) ✅ word consonantal-transliterated (B R>CJT BR> >LHJM ...)
  g_cons_utf8 (str) ✅ word consonantal-Hebrew (ב ראשׁית ברא אלהים)
  g_lex (str) ✅ lexeme pointed-transliterated (B.:- R;>CIJT B.@R@> >:ELOH ...)
  g_lex_utf8 (str) ✅ lexeme pointed-Hebrew (בְּ רֵאשִׁית בָּרָא אֱלֹה)
  g_word (str) ✅ word pointed-transliterated (B.:- R;>CI73JT B.@R@74> >:ELOHI92JM)
  g_word_utf8 (str) ✅ word pointed-Hebrew (בְּ רֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים)
  gloss (str) 🆗 English translation of lexeme (beginning create god(s))
  gn (str) ✅ grammatical gender (m; f; NA; unknown.)
  label (str) ✅ (half-)verse label (half verses: A; B; C; verses: GEN 01,02)
  language (str) ✅ language of word or lexeme (Hebrew; Aramaic.)
  lex (str) ✅ lexeme consonantal-transliterated (B R>CJT/ BR>[ >LHJM/)
  lex_utf8 (str) ✅ lexeme consonantal-Hebrew (ב ראשׁית֜ ברא אלהים֜)
  ls (str) ✅ lexical set, subclassification of part-of-speech (card; ques; mult)
  nametype (str) ⚠️ named entity type (pers; mens; gens; topo; ppde.)
  nme (str) ✅ nominal ending consonantal-transliterated (absent; n/a; JM, ...)
  nu (str) ✅ grammatical number (sg; du; pl; NA; unknown.)
  number (int) ✅ sequence number of an object within its context
  otype (str)
  pargr (str) 🆗 hierarchical paragraph number (1; 1.2; 1.2.3.4; ...)
  pdp (str) ✅ phrase-dependent part-of-speech (art; verb; subs; nmpr, ...)
  pfm (str) ✅ preformative consonantal-transliterated (absent; n/a; J, ...)
  prs (str) ✅ pronominal suffix consonantal-transliterated (absent; n/a; W; ...)
  prs_gn (str) ✅ pronominal suffix gender (m; f; NA; unknown.)
  prs_nu (str) ✅ pronominal suffix number (sg; du; pl; NA; unknown.)
  prs_ps (str) ✅ pronominal suffix person (p1; p2; p3; NA; unknown.)
  ps (str) ✅ grammatical person (p1; p2; p3; NA; unknown.)
  qere (str) ✅ word pointed-transliterated masoretic reading correction
  qere_trailer (str) ✅ interword material pointed-transliterated (Masoretic correction)
  qere_trailer_utf8 (str) ✅ interword material pointed-Hebrew (Masoretic correction)
  qere_utf8 (str) ✅ word pointed-Hebrew masoretic reading correction
  rank_lex (int) ✅ ranking of lexemes based on frequency
  rela (str) ✅ linguistic relation between clause/(sub)phrase(atom) (ADJ; MOD; ATR; ...)
  sp (str) ✅ part-of-speech (art; verb; subs; nmpr, ...)
  st (str) ✅ state of a noun (a (absolute); c (construct); e (emphatic).)
  tab (int) ✅ clause atom: its level in the linguistic embedding
  trailer (str) ✅ interword material pointed-transliterated (& 00 05 00_P ...)
  trailer_utf8 (str) ✅ interword material pointed-Hebrew (־ ׃)
  txt (str) ✅ text type of clause and surrounding (repetition of ? N D Q as in feature domain)
  typ (str) ✅ clause/phrase(atom) type (VP; NP; Ellp; Ptcp; WayX)
  uvf (str) ✅ univalent final consonant consonantal-transliterated (absent; N; J; ...)
  vbe (str) ✅ verbal ending consonantal-transliterated (n/a; W; ...)
  vbs (str) ✅ root formation consonantal-transliterated (absent; n/a; H; ...)
  verse (int) ✅ verse number
  voc_lex (str) ✅ vocalized lexeme pointed-transliterated (B.: R;>CIJT BR> >:ELOHIJM)
  voc_lex_utf8 (str) ✅ vocalized lexeme pointed-Hebrew (בְּ רֵאשִׁית ברא אֱלֹהִים)
  vs (str) ✅ verbal stem (qal; piel; hif; apel; pael)
  vt (str) ✅ verbal tense (perf; impv; wayq; infc)
  mother (none) ✅ linguistic dependency between textual objects
  oslots (none)

Phonetic Transcriptions

  phono (str) 🆗 phonological transcription (bᵊ rēšˌîṯ bārˈā ʔᵉlōhˈîm)
  phono_trailer (str) 🆗 interword material in phonological transcription

\n", "\n", " Settings:
specified
  1. apiVersion: 3
  2. appName: ETCBC/bhsa
  3. appPath: /Users/me/text-fabric-data/github/ETCBC/bhsa/app
  4. commit: gb112c161cfd21eae403d51a2733740d8743460e7
  5. css: ''
  6. dataDisplay:
    • exampleSectionHtml: <code>Genesis 1:1</code> (use <a href=\"https://github.com/{org}/{repo}/blob/master/tf/{version}/book%40en.tf\" target=\"_blank\">English book names</a>)
    • excludedFeatures:
      • g_uvf_utf8
      • g_vbs
      • kq_hybrid
      • languageISO
      • g_nme
      • lex0
      • is_root
      • g_vbs_utf8
      • g_uvf
      • dist
      • root
      • suffix_person
      • g_vbe
      • dist_unit
      • suffix_number
      • distributional_parent
      • kq_hybrid_utf8
      • crossrefSET
      • instruction
      • g_prs
      • lexeme_count
      • rank_occ
      • g_pfm_utf8
      • freq_occ
      • crossrefLCS
      • functional_parent
      • g_pfm
      • g_nme_utf8
      • g_vbe_utf8
      • kind
      • g_prs_utf8
      • suffix_gender
      • mother_object_type
    • noneValues:
      • absent
      • n/a
      • none
      • unknown
      • no value
      • NA
  7. docs:
    • docBase: {docRoot}/{repo}
    • docExt: ''
    • docPage: ''
    • docRoot: https://{org}.github.io
    • featurePage: 0_home
  8. interfaceDefaults: {}
  9. isCompatible: True
  10. local: local
  11. localDir: /Users/me/text-fabric-data/github/ETCBC/bhsa/_temp
  12. provenanceSpec:
    • corpus: BHSA = Biblia Hebraica Stuttgartensia Amstelodamensis
    • doi: 10.5281/zenodo.1007624
    • extraData: ner
    • moduleSpecs:
      • :
        • backend: no value
        • corpus: Phonetic Transcriptions
        • docUrl: https://nbviewer.jupyter.org/github/etcbc/phono/blob/master/programs/phono.ipynb
        • doi: 10.5281/zenodo.1007636
        • org: ETCBC
        • relative: /tf
        • repo: phono
      • :
        • backend: no value
        • corpus: Parallel Passages
        • docUrl: https://nbviewer.jupyter.org/github/ETCBC/parallels/blob/master/programs/parallels.ipynb
        • doi: 10.5281/zenodo.1007642
        • org: ETCBC
        • relative: /tf
        • repo: parallels
    • org: ETCBC
    • relative: /tf
    • repo: bhsa
    • version: 2021
    • webBase: https://shebanq.ancient-data.org/hebrew
    • webHint: Show this on SHEBANQ
    • webLang: la
    • webLexId: True
    • webUrl: {webBase}/text?book=<1>&chapter=<2>&verse=<3>&version={version}&mr=m&qw=q&tp=txt_p&tr=hb&wget=v&qget=v&nget=vt
    • webUrlLex: {webBase}/word?version={version}&id=<lid>
  13. release: v1.8.1
  14. typeDisplay:
    • clause:
      • label: {typ} {rela}
      • style: ''
    • clause_atom:
      • hidden: True
      • label: {code}
      • level: 1
      • style: ''
    • half_verse:
      • hidden: True
      • label: {label}
      • style: ''
      • verselike: True
    • lex:
      • featuresBare: gloss
      • label: {voc_lex_utf8}
      • lexOcc: word
      • style: orig
      • template: {voc_lex_utf8}
    • phrase:
      • label: {typ} {function}
      • style: ''
    • phrase_atom:
      • hidden: True
      • label: {typ} {rela}
      • level: 1
      • style: ''
    • sentence:
      • label: {number}
      • style: ''
    • sentence_atom:
      • hidden: True
      • label: {number}
      • level: 1
      • style: ''
    • subphrase:
      • hidden: True
      • label: {number}
      • style: ''
    • word:
      • features: pdp vs vt
      • featuresBare: lex:gloss
  15. writing: hbo
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
TF API: names N F E L T S C TF Fs Fall Es Eall Cs Call directly usable

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "A = use(\"ETCBC/bhsa\", hoist=globals())" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "# Features\n", "The data of the BHSA is organized in features.\n", "They are *columns* of data.\n", "Think of the Hebrew Bible as a gigantic spreadsheet, where row 1 corresponds to the\n", "first word, row 2 to the second word, and so on, for all 425,000 words.\n", "\n", "The information which part-of-speech each word is, constitutes a column in that spreadsheet.\n", "The BHSA contains over 100 columns, not only for the 425,000 words, but also for a million more\n", "textual objects.\n", "\n", "Instead of putting that information in one big table, the data is organized in separate columns.\n", "We call those columns **features**." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can see which features have been loaded, and if you click on a feature name, you find its documentation.\n", "If you hover over a name, you see where the feature is located on your system.\n", "\n", "Edge features are marked by ***bold italic*** formatting.\n", "\n", "There are ways to tweak the set of features that is loaded. You can load more and less.\n", "\n", "See [share](share.ipynb) for examples." 
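] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The column idea can be mimicked in a few lines of plain Python (a toy sketch, *not* the real Text-Fabric storage format): each feature is a separate mapping from node numbers to values, and absent values are simply missing." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# toy model of features: one mapping (column) per feature, keyed by node number\n", "sp = {1: \"prep\", 2: \"subs\", 3: \"verb\"}\n", "gloss = {2: \"beginning\", 3: \"create\"}\n", "\n", "\n", "def v(feature, node):\n", "    # mimic F.<feature>.v(node): None when the node has no value for the feature\n", "    return feature.get(node)\n", "\n", "\n", "print(v(sp, 1))\n", "print(v(gloss, 1))"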
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Modules" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that we have `phono` features.\n", "The BHSA data has a special 1-1 transcription from Hebrew to ASCII,\n", "but not a *phonetic* transcription.\n", "\n", "I have made a\n", "[notebook](https://github.com/etcbc/phono/blob/master/programs/phono.ipynb)\n", "that tries hard to find phonological representations for all the words.\n", "The result is a *module* in Text-Fabric format.\n", "We'll encounter that later.\n", "\n", "This module and the module [etcbc/parallels](https://github.com/etcbc/parallels)\n", "are standard modules of the BHSA app." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "See the [share](share.ipynb) tutorial or [Data](https://annotation.github.io/text-fabric/tf/about/datasharing.html) for how you can add and invoke additional data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## API\n", "\n", "The result of the incantation is that we have a bunch of special variables at our disposal\n", "that give us access to the text and data of the Hebrew Bible.\n", "\n", "At this point it is helpful to take a quick glance at the Text-Fabric API documentation\n", "(see the links under **API Members** above).\n", "\n", "The most essential thing for now is that we can use `F` to access the data in the features\n", "we've loaded.\n", "But there is more, such as `N`, which helps us to walk over the text, as we will see in a minute.\n", "\n", "The **API members** above show you exactly which new names have been inserted in your namespace.\n", "If you click on these names, you go to the API documentation for them."
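] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What `hoist=globals()` does can be pictured with a toy sketch in plain Python (`toy_use` is a hypothetical stand-in; the real `use()` injects the hoisted API members listed above, such as `F` and `N`):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# toy sketch of hoisting: copy API members into the caller's global namespace\n", "def toy_use(hoist=None):\n", "    api = {\"F\": \"node features\", \"N\": \"node walker\", \"T\": \"text API\"}\n", "    if hoist is not None:\n", "        hoist.update(api)  # after this, F, N and T are directly usable names\n", "    return api\n", "\n", "\n", "toy_use(hoist=globals())\n", "print(F)"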
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Search\n", "Text-Fabric contains a flexible search engine, which works not only for the BHSA data\n", "but also for data that you add to it.\n", "\n", "**Search is the quickest way to come up to speed with your data, without too much programming.**\n", "\n", "Jump to the dedicated [search](search.ipynb) tutorial first, to whet your appetite.\n", "And if you already know MQL queries, you can build from that in\n", "[search From MQL](searchFromMQL.ipynb).\n", "\n", "The real power of search lies in the fact that it is integrated into a programming environment.\n", "You can use programming to:\n", "\n", "* compose dynamic queries\n", "* process query results\n", "\n", "Therefore, the rest of this tutorial is still important when you want to tap that power.\n", "If you continue here, you learn all the basics of data navigation with Text-Fabric." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before we start coding, we load some modules that we need along the way:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "%load_ext autoreload\n", "%autoreload 2" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:17:16.202764Z", "start_time": "2018-05-18T09:17:16.197546Z" } }, "outputs": [], "source": [ "import os\n", "import collections\n", "from itertools import chain" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "# Counting\n", "\n", "In order to get acquainted with the data, we start with the simple task of counting.\n", "\n", "## Count all nodes\n", "We use the\n", "[`N.walk()` generator](https://annotation.github.io/text-fabric/tf/core/nodes.html#tf.core.nodes.Nodes.walk)\n", "to walk through the nodes.\n", "\n", "We compared the BHSA data to a gigantic spreadsheet, where the rows correspond to the words.\n", "In Text-Fabric, we call the rows `slots`, because they 
are the textual positions that can be filled with words.\n", "\n", "We also mentioned that there are also 1,000,000 more textual objects.\n", "They are the phrases, clauses, sentences, verses, chapters and books.\n", "They also correspond to rows in the big spreadsheet.\n", "\n", "In Text-Fabric we call all these rows *nodes*, and the `N()` generator\n", "carries us through those nodes in the textual order.\n", "\n", "Just one extra thing: the `info` statements generate timed messages.\n", "If you use them instead of `print` you'll get a sense of the amount of time that\n", "the various processing steps typically need." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:17:43.894153Z", "start_time": "2018-05-18T09:17:43.597128Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Counting nodes ...\n", " 0.09s 1446831 nodes\n" ] } ], "source": [ "A.indent(reset=True)\n", "A.info(\"Counting nodes ...\")\n", "\n", "i = 0\n", "for n in N.walk():\n", " i += 1\n", "\n", "A.info(\"{} nodes\".format(i))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here you see it: 1,4 M nodes!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What are those million nodes?\n", "Every node has a type, like word, or phrase, sentence.\n", "We know that we have approximately 425,000 words and a million other nodes.\n", "But what exactly are they?\n", "\n", "Text-Fabric has two special features, `otype` and `oslots`, that must occur in every Text-Fabric data set.\n", "`otype` tells you for each node its type, and you can ask for the number of `slot`s in the text.\n", "\n", "Here we go!" 
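] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But first a toy picture of what these two features encode (plain Python, with a hypothetical mini-corpus of three word slots):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# toy: otype maps every node to its type; oslots maps the non-slot nodes\n", "# to the word positions (slots) they occupy\n", "otype = {1: \"word\", 2: \"word\", 3: \"word\", 4: \"phrase\", 5: \"clause\"}\n", "oslots = {4: (1, 2), 5: (1, 2, 3)}\n", "\n", "print(otype[4])\n", "print(oslots[5])"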
] }, { "cell_type": "code", "execution_count": 6, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:17:47.820323Z", "start_time": "2018-05-18T09:17:47.812328Z" } }, "outputs": [ { "data": { "text/plain": [ "'word'" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "F.otype.slotType" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:17:48.549430Z", "start_time": "2018-05-18T09:17:48.543371Z" } }, "outputs": [ { "data": { "text/plain": [ "426590" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "F.otype.maxSlot" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:17:49.251302Z", "start_time": "2018-05-18T09:17:49.244467Z" } }, "outputs": [ { "data": { "text/plain": [ "1446831" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "F.otype.maxNode" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:17:49.922863Z", "start_time": "2018-05-18T09:17:49.916078Z" } }, "outputs": [ { "data": { "text/plain": [ "('book',\n", " 'chapter',\n", " 'lex',\n", " 'verse',\n", " 'half_verse',\n", " 'sentence',\n", " 'sentence_atom',\n", " 'clause',\n", " 'clause_atom',\n", " 'phrase',\n", " 'phrase_atom',\n", " 'subphrase',\n", " 'word')" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "F.otype.all" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:17:51.782779Z", "start_time": "2018-05-18T09:17:51.774167Z" } }, "outputs": [ { "data": { "text/plain": [ "(('book', 10938.205128205129, 426591, 426629),\n", " ('chapter', 459.1926803013994, 426630, 427558),\n", " ('lex', 46.21776814734561, 1437602, 1446831),\n", " ('verse', 18.377202429673027, 1414389, 1437601),\n", " ('half_verse', 
9.442218729940901, 606394, 651572),\n", " ('sentence', 6.6950735282577645, 1172308, 1236024),\n", " ('sentence_atom', 6.612363207985863, 1236025, 1300538),\n", " ('clause', 4.840408028956894, 427559, 515689),\n", " ('clause_atom', 4.7031001940377495, 515690, 606393),\n", " ('phrase', 1.684774666966821, 651573, 904775),\n", " ('phrase_atom', 1.5945382234648566, 904776, 1172307),\n", " ('subphrase', 1.4213614404918753, 1300539, 1414388),\n", " ('word', 1, 1, 426590))" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "C.levels.data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is interesting: above you see all the textual objects, with the average size of their objects,\n", "the node where they start, and the node where they end." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Count individual object types\n", "This is an intuitive way to count the number of nodes in each type.\n", "Note in passing, how we use the `indent` in conjunction with `info` to produce neat timed\n", "and indented progress messages." 
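] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As an aside, the per-type tally in the next cell can also be computed with `collections.Counter` (shown here on a toy list; with the real data you would feed it `F.otype.v(n)` for every node `n` from `N.walk()`):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import collections\n", "\n", "# tally items per type with Counter instead of an explicit counting loop\n", "tally = collections.Counter([\"word\", \"word\", \"phrase\", \"clause\", \"word\"])\n", "for otype, amount in tally.most_common():\n", "    print(f\"{amount:>7} {otype}s\")"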
] }, { "cell_type": "code", "execution_count": 11, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:17:57.806821Z", "start_time": "2018-05-18T09:17:57.558523Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s counting objects ...\n", " | 0.00s 39 books\n", " | 0.00s 929 chapters\n", " | 0.00s 9230 lexs\n", " | 0.00s 23213 verses\n", " | 0.00s 45179 half_verses\n", " | 0.00s 63717 sentences\n", " | 0.00s 64514 sentence_atoms\n", " | 0.01s 88131 clauses\n", " | 0.00s 90704 clause_atoms\n", " | 0.01s 253203 phrases\n", " | 0.01s 267532 phrase_atoms\n", " | 0.01s 113850 subphrases\n", " | 0.02s 426590 words\n", " 0.08s Done\n" ] } ], "source": [ "A.indent(reset=True)\n", "A.info(\"counting objects ...\")\n", "\n", "for otype in F.otype.all:\n", " i = 0\n", "\n", " A.indent(level=1, reset=True)\n", "\n", " for n in F.otype.s(otype):\n", " i += 1\n", "\n", " A.info(\"{:>7} {}s\".format(i, otype))\n", "\n", "A.indent(level=0)\n", "A.info(\"Done\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Viewing textual objects\n", "\n", "We use the A API (the extra power) to peek into the corpus." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First some words.\n", "Node 15890 is a word with a dotless shin.\n", "\n", "Node 1002 is a word with a yod after a segol hataf.\n", "\n", "Node 100,000 is just a word slot.\n", "\n", "Let's inspect them and see where they are.\n", "\n", "First the plain view:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'word'" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "F.otype.v(1)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:18:02.282178Z", "start_time": "2018-05-18T09:18:02.274117Z" } }, "outputs": [ { "data": { "text/html": [ "
Genesis 30:18  יִשָּׂשכָֽר׃
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Genesis 2:18  הֱיֹ֥ות
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Deuteronomy 11:19  בָּ֑ם
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "wordShows = (15890, 1002, 100000)\n", "for word in wordShows:\n", " A.plain(word, withPassage=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can leave out the passage reference:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
יִשָּׂשכָֽר׃
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
הֱיֹ֥ות
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
בָּ֑ם
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "for word in wordShows:\n", " A.plain(word, withPassage=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we show other objects, both with and without passage reference." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "normalShow = dict(\n", " wordShow=wordShows[0],\n", " phraseShow=700000,\n", " clauseShow=500000,\n", " sentenceShow=1200000,\n", " lexShow=1437667,\n", ")\n", "\n", "sectionShow = dict(\n", " verseShow=1420000,\n", " chapterShow=427000,\n", " bookShow=426598,\n", ")" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "**wordShow** = node `15890`\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Genesis 30:18  יִשָּׂשכָֽר׃
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
יִשָּׂשכָֽר׃
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "\n", "---\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "**phraseShow** = node `700000`\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Numbers 22:31  אֶת־מַלְאַ֤ךְ יְהוָה֙
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
אֶת־מַלְאַ֤ךְ יְהוָה֙
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "\n", "---\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "**clauseShow** = node `500000`\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Job 36:27  יָזֹ֖קּוּ מָטָ֣ר לְאֵדֹֽו׃
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
יָזֹ֖קּוּ מָטָ֣ר לְאֵדֹֽו׃
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "\n", "---\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "**sentenceShow** = node `1200000`\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
2_Kings 6:5  אֲהָ֥הּ אֲדֹנִ֖י
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
אֲהָ֥הּ אֲדֹנִ֖י
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "\n", "---\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "**lexShow** = node `1437667`\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "\n", "---\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "for (name, n) in normalShow.items():\n", " A.dm(f\"**{name}** = node `{n}`\\n\")\n", " A.plain(n)\n", " A.plain(n, withPassage=False)\n", " A.dm(\"\\n---\\n\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that for section nodes (except verse and half-verse) the `withPassage` has little effect.\n", "The passage is the thing that is hyperlinked. The node is represented as a textual reference to the piece of text\n", "in question." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "**chapterShow** = node `427000`\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Isaiah 37   
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Isaiah 37
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "\n", "---\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "**bookShow** = node `426598`\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
1_Samuel   
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
1_Samuel
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "\n", "---\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "for (name, n) in sectionShow.items():\n", " if name == \"verseShow\":\n", " continue\n", " A.dm(f\"**{name}** = node `{n}`\\n\")\n", " A.plain(n)\n", " A.plain(n, withPassage=False)\n", " A.dm(\"\\n---\\n\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also dive into the structure of the textual objects, provided they are not too large.\n", "\n", "The function `pretty` gives a display of the object that a node stands for together with the structure below that node." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "**wordShow** = node `15890`\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "\n", "---\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "**phraseShow** = node `700000`\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "\n", "---\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "**clauseShow** = node `500000`\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "\n", "---\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "**sentenceShow** = node `1200000`\n" ], "text/plain": [ "" ] }, 
"metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
sentence
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "\n", "---\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "**lexShow** = node `1437667`\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "\n", "---\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "for (name, n) in normalShow.items():\n", " A.dm(f\"**{name}** = node `{n}`\\n\")\n", " A.pretty(n)\n", " A.dm(\"\\n---\\n\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note\n", "* if you click on a word in a pretty display\n", " you go to a page in SHEBANQ that shows a list of all occurrences of this lexeme;\n", "* if you click on the passage, you go to SHEBANQ, to exactly this verse." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you need a link to shebanq for just any node:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:18:15.022571Z", "start_time": "2018-05-18T09:18:15.016639Z" } }, "outputs": [ { "data": { "text/html": [ "1_Samuel 25:24" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "million = 1000000\n", "A.webLink(million)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can show some standard features in the display:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "**wordShow** = node `15890`\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "\n", "---\n" ], "text/plain": [ "" ] }, "metadata": {}, 
"output_type": "display_data" }, { "data": { "text/markdown": [ "**phraseShow** = node `700000`\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
phrase PP Objc
<object marker>pdp=prep
messengerpdp=subs
YHWHpdp=nmpr
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "\n", "---\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "**clauseShow** = node `500000`\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
clause ZYq0 Coor
phrase VP Pred
filterpdp=verbvs=qalvt=impf
phrase NP Objc
rainpdp=subs
phrase PP Adju
topdp=prep
<uncertain>pdp=subs
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "\n", "---\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "**sentenceShow** = node `1200000`\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
sentence 21
clause Voct NA
phrase InjP Intj
alaspdp=intj
phrase NP Voct
lordpdp=subs
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "\n", "---\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "**lexShow** = node `1437667`\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
lex קָטֹן
small
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "\n", "---\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "for (name, n) in normalShow.items():\n", " A.dm(f\"**{name}** = node `{n}`\\n\")\n", " A.pretty(n, standardFeatures=True)\n", " A.dm(\"\\n---\\n\")" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "**wordShow** = node `15890`\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "\n", "---\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "**phraseShow** = node `700000`\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
phrase PP Objc
<object marker>pdp=prep
messengerpdp=subs
YHWHpdp=nmpr
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "\n", "---\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "**clauseShow** = node `500000`\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
clause ZYq0 Coor
phrase VP Pred
filterpdp=verbvs=qalvt=impf
phrase NP Objc
rainpdp=subs
phrase PP Adju
topdp=prep
<uncertain>pdp=subs
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "\n", "---\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "**sentenceShow** = node `1200000`\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
sentence 21
clause Voct NA
phrase InjP Intj
alaspdp=intj
phrase NP Voct
lordpdp=subs
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "\n", "---\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "**lexShow** = node `1437667`\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
lex קָטֹן
small
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "\n", "---\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "for (name, n) in normalShow.items():\n", " A.dm(f\"**{name}** = node `{n}`\\n\")\n", " A.pretty(n, standardFeatures=True)\n", " A.dm(\"\\n---\\n\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For more display options, see [display](display.ipynb)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Feature statistics\n", "\n", "`F`\n", "gives access to all features.\n", "Every feature has a method\n", "`freqList()`\n", "to generate a frequency list of its values, higher frequencies first.\n", "Here are the parts of speech:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:18:18.039544Z", "start_time": "2018-05-18T09:18:17.784073Z" } }, "outputs": [ { "data": { "text/plain": [ "(('subs', 125583),\n", " ('verb', 75451),\n", " ('prep', 73298),\n", " ('conj', 62737),\n", " ('nmpr', 35607),\n", " ('art', 30387),\n", " ('adjv', 10141),\n", " ('nega', 6059),\n", " ('prps', 5035),\n", " ('advb', 4603),\n", " ('prde', 2678),\n", " ('intj', 1912),\n", " ('inrg', 1303),\n", " ('prin', 1026))" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "F.sp.freqList()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Lexeme matters\n", "\n", "## Top 10 frequent verbs\n", "\n", "If we count the frequency of words, we usually mean the frequency of their\n", "corresponding lexemes.\n", "\n", "There are several methods for working with lexemes.\n", "\n", "### Method 1: counting words" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:18:22.590359Z", "start_time": "2018-05-18T09:18:22.247265Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s 
Collecting data\n", " 0.09s Done\n", ">MR[: 5378\n", "HJH[: 3561\n", "<FH[: 2629\n", "BW>[: 2570\n", "NTN[: 2017\n", "HLK[: 1554\n", "R>H[: 1298\n", "CM<[: 1168\n", "DBR[: 1138\n", "JCB[: 1082\n", "\n" ] } ], "source": [ "verbs = collections.Counter()\n", "A.indent(reset=True)\n", "A.info(\"Collecting data\")\n", "\n", "for w in F.otype.s(\"word\"):\n", " if F.sp.v(w) != \"verb\":\n", " continue\n", " verbs[F.lex.v(w)] += 1\n", "\n", "A.info(\"Done\")\n", "print(\n", " \"\".join(\n", " \"{}: {}\\n\".format(verb, cnt)\n", " for (verb, cnt) in sorted(verbs.items(), key=lambda x: (-x[1], x[0]))[0:10]\n", " )\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Method 2: counting lexemes\n", "\n", "An alternative way to do this is to use the feature `freq_lex`, defined for `lex` nodes.\n", "Now we walk the lexemes instead of the occurrences.\n", "\n", "Note that the feature `sp` (part-of-speech) is defined for nodes of type `word` as well as `lex`.\n", "Both also have the `lex` feature." ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:18:25.695727Z", "start_time": "2018-05-18T09:18:25.667486Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Collecting data\n", " 0.00s Done\n", ">MR[: 5378\n", "HJH[: 3561\n", "<FH[: 2629\n", "BW>[: 2570\n", "NTN[: 2017\n", "HLK[: 1554\n", "R>H[: 1298\n", "CM<[: 1168\n", "DBR[: 1138\n", "JCB[: 1082\n", "\n" ] } ], "source": [ "verbs = collections.Counter()\n", "A.indent(reset=True)\n", "A.info(\"Collecting data\")\n", "for w in F.otype.s(\"lex\"):\n", " if F.sp.v(w) != \"verb\":\n", " continue\n", " verbs[F.lex.v(w)] += F.freq_lex.v(w)\n", "A.info(\"Done\")\n", "print(\n", " \"\".join(\n", " \"{}: {}\\n\".format(verb, cnt)\n", " for (verb, cnt) in sorted(verbs.items(), key=lambda x: (-x[1], x[0]))[0:10]\n", " )\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is an order of magnitude faster. 
In this case, that means the difference between a third of a second and a\n", "hundredth of a second, not a big gain in absolute terms.\n", "But suppose you need to run this 1000 times in a loop.\n", "Then it is the difference between 5 minutes and 10 seconds.\n", "A five-minute wait is not pleasant in interactive computing!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### A frequency mapping of lexemes\n", "\n", "We make a mapping between lexeme forms and the number of occurrences of those lexemes." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "lexeme_dict = {F.lex_utf8.v(n): F.freq_lex.v(n) for n in F.otype.s(\"word\")}" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('ב', 15542),\n", " ('ראשׁית', 51),\n", " ('ברא', 48),\n", " ('אלהים', 2601),\n", " ('את', 10987),\n", " ('ה', 30386),\n", " ('שׁמים', 421),\n", " ('ו', 50272),\n", " ('ארץ', 2504),\n", " ('היה', 3561)]" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(lexeme_dict.items())[0:10]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Real work\n", "\n", "As a primer of real-world work on lexeme distribution, have a look at James Cuénod's notebook on\n", "[Collocation Mutual Information Analysis of the Hebrew Bible](https://nbviewer.jupyter.org/github/jcuenod/hebrewCollocations/blob/master/Collocation%20MI%20Analysis%20of%20the%20Hebrew%20Bible.ipynb)\n", "\n", "It is a nice example of how you collect data with TF API calls, then do research with your own methods and tools, and then use TF for presenting results.\n", "\n", "In case the name has changed, the enclosing repo is\n", "[here](https://nbviewer.jupyter.org/github/jcuenod/hebrewCollocations/tree/master/)."
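The gain of method 2 over method 1 can be sketched without Text-Fabric at all: walk the short lexeme list and reuse a precomputed per-lexeme frequency instead of counting every occurrence. Below is a minimal toy sketch; the token tuples and counts are made up, and in the real dataset you would walk `F.otype.s("word")` and `F.otype.s("lex")` and read `F.sp`, `F.lex` and `F.freq_lex`:

```python
from collections import Counter

# Toy corpus: one (lexeme, part-of-speech) pair per word occurrence.
tokens = [(">MR[", "verb"), ("HJH[", "verb"), (">MR[", "verb"), ("MLK/", "subs")]

# Method 1: one pass over every occurrence.
byToken = Counter(lex for (lex, sp) in tokens if sp == "verb")

# Method 2: walk the much shorter lexeme list and reuse a
# precomputed per-lexeme frequency (the role of freq_lex).
freq_lex = Counter(lex for (lex, _) in tokens)  # precomputed once
lexemes = {lex: sp for (lex, sp) in tokens}     # one entry per lexeme
byLexeme = Counter(
    {lex: freq_lex[lex] for (lex, sp) in lexemes.items() if sp == "verb"}
)

assert byToken == byLexeme
print(byToken.most_common(1))  # [('>MR[', 2)]
```

Both methods yield the same counts; method 2 merely replaces a pass over all tokens by one lookup per lexeme, which is where the order-of-magnitude speedup comes from.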
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Lexeme distribution\n", "\n", "Let's do a bit more fancy lexeme stuff.\n", "\n", "### Hapaxes\n", "\n", "A hapax can be found by inspecting lexemes and seeing how many word nodes they are linked to.\n", "If that number is one, we have a hapax.\n", "\n", "We print 10 hapaxes with their glosses." ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:18:31.003571Z", "start_time": "2018-05-18T09:18:30.839888Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.04s 3071 hapaxes found\n", "No zeroes found\n", "\tPJCWN/ Pishon\n", "\tCWP[ bruise\n", "\tHRWN/ pregnancy\n", "\tZL/ Mehujael\n", "\tMXJJ>L/ Mehujael\n", "\tJBL=/ Jabal\n" ] } ], "source": [ "A.indent(reset=True)\n", "\n", "hapax = []\n", "zero = set()\n", "\n", "for lx in F.otype.s(\"lex\"):\n", " occs = L.d(lx, otype=\"word\")\n", " n = len(occs)\n", " if n == 0: # that's weird: should not happen\n", " zero.add(lx)\n", " elif n == 1: # hapax found!\n", " hapax.append(lx)\n", "\n", "A.info(\"{} hapaxes found\".format(len(hapax)))\n", "\n", "if zero:\n", " A.error(\"{} zeroes found\".format(len(zero)), tm=False)\n", "else:\n", " A.info(\"No zeroes found\", tm=False)\n", "for h in hapax[0:10]:\n", " print(\"\\t{:<8} {}\".format(F.lex.v(h), F.gloss.v(h)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Small occurrence base\n", "\n", "The occurrence base of a lexeme consists of the verses, chapters and books in which it occurs.\n", "Let's look for lexemes that occur in a single chapter.\n", "\n", "If a lexeme occurs in a single chapter, its slots are a subset of the slots of that chapter.\n", "So, if you go *up* from the lexeme, you encounter the chapter.\n", "\n", "Normally, lexemes occur in many chapters, and then none of those chapters contains all of its occurrences,\n", "so if you go up from such lexemes, you do not find chapters.\n", "\n", "Let's check it 
out.\n", "\n", "Oh yes, we have already found the hapaxes; we will skip them here." ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:18:36.257701Z", "start_time": "2018-05-18T09:18:36.082461Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Finding single chapter lexemes\n", " 0.05s 450 single chapter lexemes found\n", "No chapter embedders of multiple lexemes found\n", "Genesis 4:1 QJN=/ \n", "Genesis 4:2 HBL=/ \n", "Genesis 4:18 L/\n", "Genesis 4:19 YLH/ \n", "Genesis 4:22 TWBL_QJN/\n", "Genesis 10:11 KLX=/ \n", "Genesis 14:1 >MRPL/\n", "Genesis 14:1 >RJWK/\n", "Genesis 14:1 >LSR/ \n" ] } ], "source": [ "A.indent(reset=True)\n", "A.info(\"Finding single chapter lexemes\")\n", "\n", "singleCh = []\n", "multipleCh = []\n", "\n", "for lx in F.otype.s(\"lex\"):\n", " chapters = L.u(lx, \"chapter\")\n", " if len(chapters) == 1:\n", " if lx not in hapax:\n", " singleCh.append(lx)\n", " elif len(chapters) > 0: # should not happen\n", " multipleCh.append(lx)\n", "\n", "A.info(\"{} single chapter lexemes found\".format(len(singleCh)))\n", "\n", "if multipleCh:\n", " A.error(\n", " \"{} chapter embedders of multiple lexemes found\".format(len(multipleCh)),\n", " tm=False,\n", " )\n", "else:\n", " A.info(\"No chapter embedders of multiple lexemes found\", tm=False)\n", "for s in singleCh[0:10]:\n", " print(\n", " \"{:<20} {:<6}\".format(\n", " \"{} {}:{}\".format(*T.sectionFromNode(s)),\n", " F.lex.v(s),\n", " )\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Confined to books\n", "\n", "As a final exercise with lexemes, let's make a list of all books, and show their total number of lexemes and\n", "the number of lexemes that occur exclusively in that book."
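The two steps that follow (building a book-lexeme index, then counting the lexemes confined to a single book) can be sketched in plain Python. A toy sketch with made-up book and lexeme names; in the real notebook the pairs come from `L.d(book, "word")` and `L.u(word, "lex")`:

```python
from collections import defaultdict

# Toy occurrence list: (book, lexeme) pairs.
occurrences = [
    ("Genesis", "BR>["), ("Genesis", ">LHJM/"), ("Exodus", ">LHJM/"),
    ("Exodus", "MCH/"), ("Exodus", "MCH/"),
]

allBook = defaultdict(set)  # book -> the set of lexemes occurring in it
bookOf = defaultdict(set)   # lexeme -> the set of books it occurs in

for (book, lex) in occurrences:
    allBook[book].add(lex)
    bookOf[lex].add(book)

# A lexeme is "own" to a book if it occurs in that book and nowhere else.
own = {
    book: sum(1 for lex in lexs if len(bookOf[lex]) == 1)
    for (book, lexs) in allBook.items()
}
print(own)  # {'Genesis': 1, 'Exodus': 1}
```

Here `>LHJM/` occurs in both toy books, so it counts for neither; the "%own" column in the table below is just `own / len(allBook[b])` per book.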
] }, { "cell_type": "code", "execution_count": 29, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:18:43.959960Z", "start_time": "2018-05-18T09:18:39.536067Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Making book-lexeme index\n", " 1.08s Found 9230 lexemes\n" ] } ], "source": [ "A.indent(reset=True)\n", "A.info(\"Making book-lexeme index\")\n", "\n", "allBook = collections.defaultdict(set)\n", "allLex = set()\n", "\n", "for b in F.otype.s(\"book\"):\n", " for w in L.d(b, \"word\"):\n", " lx = L.u(w, \"lex\")[0]\n", " allBook[b].add(lx)\n", " allLex.add(lx)\n", "\n", "A.info(\"Found {} lexemes\".format(len(allLex)))" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:18:45.949852Z", "start_time": "2018-05-18T09:18:45.892985Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Finding single book lexemes\n", " 0.01s found 4224 single book lexemes\n" ] } ], "source": [ "A.indent(reset=True)\n", "A.info(\"Finding single book lexemes\")\n", "\n", "singleBook = collections.defaultdict(lambda: 0)\n", "for lx in F.otype.s(\"lex\"):\n", " book = L.u(lx, \"book\")\n", " if len(book) == 1:\n", " singleBook[book[0]] += 1\n", "\n", "A.info(\"found {} single book lexemes\".format(sum(singleBook.values())))" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:18:52.143337Z", "start_time": "2018-05-18T09:18:52.130385Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "book #all #own %own\n", "-----------------------------------\n", "Daniel 1122 428 38.1%\n", "1_Chronicles 2013 487 24.2%\n", "Ezra 991 199 20.1%\n", "Joshua 1175 206 17.5%\n", "Esther 472 67 14.2%\n", "Isaiah 2555 350 13.7%\n", "Numbers 1457 197 13.5%\n", "Ezekiel 1719 212 12.3%\n", "Song_of_songs 503 60 11.9%\n", "Job 1717 202 11.8%\n", "Genesis 1816 208 11.5%\n", "Nehemiah 1076 110 10.2%\n", "Psalms 
2250 216 9.6%\n", "Leviticus 960 88 9.2%\n", "Judges 1210 99 8.2%\n", "Ecclesiastes 575 46 8.0%\n", "Proverbs 1356 103 7.6%\n", "Jeremiah 1949 147 7.5%\n", "2_Samuel 1304 89 6.8%\n", "1_Samuel 1256 85 6.8%\n", "2_Kings 1266 85 6.7%\n", "Exodus 1425 92 6.5%\n", "1_Kings 1291 81 6.3%\n", "Deuteronomy 1449 80 5.5%\n", "Lamentations 592 31 5.2%\n", "2_Chronicles 1411 67 4.7%\n", "Nahum 357 16 4.5%\n", "Hosea 742 33 4.4%\n", "Ruth 319 14 4.4%\n", "Habakkuk 393 17 4.3%\n", "Amos 652 27 4.1%\n", "Joel 398 14 3.5%\n", "Zechariah 726 25 3.4%\n", "Obadiah 167 5 3.0%\n", "Micah 586 16 2.7%\n", "Zephaniah 367 10 2.7%\n", "Jonah 252 5 2.0%\n", "Haggai 208 3 1.4%\n", "Malachi 314 4 1.3%\n" ] } ], "source": [ "print(\n", " \"{:<20}{:>5}{:>5}{:>5}\\n{}\".format(\n", " \"book\",\n", " \"#all\",\n", " \"#own\",\n", " \"%own\",\n", " \"-\" * 35,\n", " )\n", ")\n", "booklist = []\n", "\n", "for b in F.otype.s(\"book\"):\n", " book = T.bookName(b)\n", " a = len(allBook[b])\n", " o = singleBook.get(b, 0)\n", " p = 100 * o / a\n", " booklist.append((book, a, o, p))\n", "\n", "for x in sorted(booklist, key=lambda e: (-e[3], -e[1], e[0])):\n", " print(\"{:<20} {:>4} {:>4} {:>4.1f}%\".format(*x))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The book names may sound a bit unfamiliar, they are in Latin here.\n", "Later we'll see that you can also get them in English, or in Swahili." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Locality API\n", "We travel upwards and downwards, forwards and backwards through the nodes.\n", "The Locality-API (`L`) provides functions: `u()` for going up, and `d()` for going down,\n", "`n()` for going to next nodes and `p()` for going to previous nodes.\n", "\n", "These directions are indirect notions: nodes are just numbers, but by means of the\n", "`oslots` feature they are linked to slots. 
One node *contains* another node, if the one is linked to a set of slots that contains the set of slots that the other is linked to.\n", "And one node is next or previous to another, if its slots follow or precede the slots of the other one.\n", "\n", "`L.u(node)` **Up** is going to nodes that embed `node`.\n", "\n", "`L.d(node)` **Down** is the opposite direction, to those that are contained in `node`.\n", "\n", "`L.n(node)` **Next** are the next *adjacent* nodes, i.e. nodes whose first slot comes immediately after the last slot of `node`.\n", "\n", "`L.p(node)` **Previous** are the previous *adjacent* nodes, i.e. nodes whose last slot comes immediately before the first slot of `node`.\n", "\n", "All these functions yield nodes of all possible node types.\n", "By passing an optional parameter, you can restrict the results to nodes of that type.\n", "\n", "The results are ordered according to the order of things in the text.\n", "\n", "The functions always return a tuple, even if there is just one node in the result.\n", "\n", "## Going up\n", "We go from the first word up to the book that contains it.\n", "Note the `[0]` at the end. You expect one book, yet `L` returns a tuple.\n", "To get the only element of that tuple, you need to do that `[0]`.\n", "\n", "If you are like me, you keep forgetting it, and that will lead to weird error messages later on."
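The containment notion above can be mimicked with plain slot sets. This is a toy model of `L.u` and `L.d`, not the real API; the node names and slot numbers are invented:

```python
# Toy model of the oslots edge feature: each node is linked to a set
# of slots; containment is proper set inclusion.
oslots = {
    "book1": {1, 2, 3, 4, 5, 6},
    "clause1": {1, 2, 3},
    "phrase1": {1, 2},
    "phrase2": {3},
}

def up(node):
    """Nodes whose slot set properly contains that of `node`."""
    s = oslots[node]
    return sorted(n for (n, t) in oslots.items() if t > s)

def down(node):
    """Nodes whose slot set is properly contained in that of `node`."""
    s = oslots[node]
    return sorted(n for (n, t) in oslots.items() if t < s)

print(up("phrase1"))    # ['book1', 'clause1']
print(down("clause1"))  # ['phrase1', 'phrase2']
```

The real `L` functions also order their results by textual position and let you filter by node type via the `otype` parameter.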
] }, { "cell_type": "code", "execution_count": 32, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:18:55.410034Z", "start_time": "2018-05-18T09:18:55.404051Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "426591\n" ] } ], "source": [ "firstBook = L.u(1, otype=\"book\")[0]\n", "print(firstBook)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And let's see all the containing objects of word 3:" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:18:56.772513Z", "start_time": "2018-05-18T09:18:56.766324Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "word 3 is contained in book 426591\n", "word 3 is contained in chapter 426630\n", "word 3 is contained in lex 1437604\n", "word 3 is contained in verse 1414389\n", "word 3 is contained in half_verse 606394\n", "word 3 is contained in sentence 1172308\n", "word 3 is contained in sentence_atom 1236025\n", "word 3 is contained in clause 427559\n", "word 3 is contained in clause_atom 515690\n", "word 3 is contained in phrase 651574\n", "word 3 is contained in phrase_atom 904777\n", "word 3 is contained in subphrase x\n" ] } ], "source": [ "w = 3\n", "for otype in F.otype.all:\n", " if otype == F.otype.slotType:\n", " continue\n", " up = L.u(w, otype=otype)\n", " upNode = \"x\" if len(up) == 0 else up[0]\n", " print(\"word {} is contained in {} {}\".format(w, otype, upNode))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Going next\n", "Let's go to the next nodes of the first book." 
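Adjacency can be modeled with slot sets in the same spirit: a node is next to another when its first slot directly follows the other's last slot. A toy sketch of the idea behind `L.n` and `L.p` (invented node names and slots, not the real API):

```python
# Toy slot sets; adjacency compares last and first slots.
oslots = {
    "bookA": {1, 2, 3},
    "verseB": {4, 5},
    "wordB": {4},
}

def nxt(node):
    """Nodes whose first slot comes immediately after `node`'s last slot."""
    last = max(oslots[node])
    return sorted(n for (n, t) in oslots.items() if min(t) == last + 1)

def prv(node):
    """Nodes whose last slot comes immediately before `node`'s first slot."""
    first = min(oslots[node])
    return sorted(n for (n, t) in oslots.items() if max(t) == first - 1)

print(nxt("bookA"))  # ['verseB', 'wordB']
print(prv("wordB"))  # ['bookA']
```

This matches what the following cell shows: several nodes of different types can all start right after the last slot of the first book.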
] }, { "cell_type": "code", "execution_count": 34, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:18:58.821681Z", "start_time": "2018-05-18T09:18:58.814893Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 28765: word first slot=28765 , last slot=28765 \n", " 923533: phrase_atom first slot=28765 , last slot=28765 \n", " 669555: phrase first slot=28765 , last slot=28765 \n", " 521826: clause_atom first slot=28765 , last slot=28769 \n", " 433546: clause first slot=28765 , last slot=28769 \n", " 609394: half_verse first slot=28765 , last slot=28772 \n", "1240671: sentence_atom first slot=28765 , last slot=28774 \n", "1176925: sentence first slot=28765 , last slot=28793 \n", "1415922: verse first slot=28765 , last slot=28778 \n", " 426680: chapter first slot=28765 , last slot=29113 \n", " 426592: book first slot=28765 , last slot=52512 \n" ] } ], "source": [ "afterFirstBook = L.n(firstBook)\n", "for n in afterFirstBook:\n", " print(\n", " \"{:>7}: {:<13} first slot={:<6}, last slot={:<6}\".format(\n", " n,\n", " F.otype.v(n),\n", " E.oslots.s(n)[0],\n", " E.oslots.s(n)[-1],\n", " )\n", " )\n", "secondBook = L.n(firstBook, otype=\"book\")[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Going previous\n", "\n", "And let's see what is right before the second book." 
] }, { "cell_type": "code", "execution_count": 35, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:19:00.163973Z", "start_time": "2018-05-18T09:19:00.154857Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 426591: book first slot=1 , last slot=28764 \n", " 426679: chapter first slot=28260 , last slot=28764 \n", "1415921: verse first slot=28747 , last slot=28764 \n", " 609393: half_verse first slot=28755 , last slot=28764 \n", "1176924: sentence first slot=28758 , last slot=28764 \n", "1240670: sentence_atom first slot=28758 , last slot=28764 \n", " 433545: clause first slot=28758 , last slot=28764 \n", " 521825: clause_atom first slot=28758 , last slot=28764 \n", " 669554: phrase first slot=28763 , last slot=28764 \n", " 923532: phrase_atom first slot=28763 , last slot=28764 \n", " 28764: word first slot=28764 , last slot=28764 \n" ] } ], "source": [ "for n in L.p(secondBook):\n", " print(\n", " \"{:>7}: {:<13} first slot={:<6}, last slot={:<6}\".format(\n", " n,\n", " F.otype.v(n),\n", " E.oslots.s(n)[0],\n", " E.oslots.s(n)[-1],\n", " )\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Going down" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We go to the chapters of the second book, and just count them." ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:19:02.530705Z", "start_time": "2018-05-18T09:19:02.475279Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "40\n" ] } ], "source": [ "chapters = L.d(secondBook, otype=\"chapter\")\n", "print(len(chapters))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The first verse\n", "We pick the first verse and the first word, and explore what is above and below them." 
] }, { "cell_type": "code", "execution_count": 37, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:19:04.024679Z", "start_time": "2018-05-18T09:19:03.995207Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Node 1\n", " | UP\n", " | | 1437602 lex\n", " | | 904776 phrase_atom\n", " | | 651573 phrase\n", " | | 606394 half_verse\n", " | | 515690 clause_atom\n", " | | 427559 clause\n", " | | 1236025 sentence_atom\n", " | | 1172308 sentence\n", " | | 1414389 verse\n", " | | 426630 chapter\n", " | | 426591 book\n", " | DOWN\n", " | | \n", "Node 1414389\n", " | UP\n", " | | 515690 clause_atom\n", " | | 427559 clause\n", " | | 1236025 sentence_atom\n", " | | 1172308 sentence\n", " | | 426630 chapter\n", " | | 426591 book\n", " | DOWN\n", " | | 1172308 sentence\n", " | | 1236025 sentence_atom\n", " | | 427559 clause\n", " | | 515690 clause_atom\n", " | | 606394 half_verse\n", " | | 651573 phrase\n", " | | 904776 phrase_atom\n", " | | 1 word\n", " | | 2 word\n", " | | 651574 phrase\n", " | | 904777 phrase_atom\n", " | | 3 word\n", " | | 651575 phrase\n", " | | 904778 phrase_atom\n", " | | 4 word\n", " | | 606395 half_verse\n", " | | 651576 phrase\n", " | | 904779 phrase_atom\n", " | | 1300539 subphrase\n", " | | 5 word\n", " | | 6 word\n", " | | 7 word\n", " | | 8 word\n", " | | 1300540 subphrase\n", " | | 9 word\n", " | | 10 word\n", " | | 11 word\n", "Done\n" ] } ], "source": [ "for n in [1, L.u(1, otype=\"verse\")[0]]:\n", " A.indent(level=0)\n", " A.info(\"Node {}\".format(n), tm=False)\n", " A.indent(level=1)\n", " A.info(\"UP\", tm=False)\n", " A.indent(level=2)\n", " A.info(\"\\n\".join([\"{:<15} {}\".format(u, F.otype.v(u)) for u in L.u(n)]), tm=False)\n", " A.indent(level=1)\n", " A.info(\"DOWN\", tm=False)\n", " A.indent(level=2)\n", " A.info(\"\\n\".join([\"{:<15} {}\".format(u, F.otype.v(u)) for u in L.d(n)]), tm=False)\n", "A.indent(level=0)\n", "A.info(\"Done\", tm=False)" ] }, { "cell_type": "markdown", "metadata": {}, 
"source": [ "# Text API\n", "\n", "So far, we have mainly seen nodes and their numbers, and the names of node types.\n", "You would almost forget that we are dealing with text.\n", "So let's try to see some text.\n", "\n", "In the same way as `F` gives access to feature data,\n", "`T` gives access to the text.\n", "That is also feature data, but you can tell Text-Fabric which features are specifically\n", "carrying the text, and in return Text-Fabric offers you\n", "a Text API: `T`.\n", "\n", "## Formats\n", "Hebrew text can be represented in a number of ways:\n", "\n", "* fully pointed (vocalized and accented), or consonantal,\n", "* in transliteration, phonetic transcription or in Hebrew characters,\n", "* showing the actual text or only the lexemes,\n", "* following the ketiv or the qere, at places where they deviate from each other.\n", "\n", "If you wonder where the information about text formats is stored:\n", "not in the program text-fabric, but in the data set.\n", "It has a feature `otext`, which specifies the formats and which features\n", "must be used to produce them. `otext` is the third special feature in a TF data set,\n", "next to `otype` and `oslots`.\n", "It is an optional feature.\n", "If it is absent, there will be no `T` API.\n", "\n", "Here is a list of all available formats in this data set." 
] }, { "cell_type": "code", "execution_count": 38, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:19:05.606582Z", "start_time": "2018-05-18T09:19:05.593486Z" } }, "outputs": [ { "data": { "text/plain": [ "['lex-default',\n", " 'lex-orig-full',\n", " 'lex-orig-plain',\n", " 'lex-trans-full',\n", " 'lex-trans-plain',\n", " 'text-orig-full',\n", " 'text-orig-full-ketiv',\n", " 'text-orig-plain',\n", " 'text-phono-full',\n", " 'text-trans-full',\n", " 'text-trans-full-ketiv',\n", " 'text-trans-plain']" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sorted(T.formats)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note the `text-phono-full` format here.\n", "It does not come from the main data source `bhsa`, but from the module `phono`.\n", "Look in your data directory, find `~/github/etcbc/phono/tf/2017/otext@phono.tf`,\n", "and you'll see this format defined there." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using the formats\n", "\n", "We can pretty display in other formats:" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "for word in wordShows:\n", " A.pretty(word, fmt=\"text-phono-full\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## T.text()\n", "\n", "This function is central to get text representations of nodes. 
Its most basic usage is\n", "\n", "```python\n", "T.text(nodes, fmt=fmt)\n", "```\n", "where `nodes` is a list or iterable of nodes, usually word nodes, and `fmt` is the name of a format.\n", "If you leave out `fmt`, the default `text-orig-full` is chosen.\n", "\n", "The result is the text in that format for all nodes specified:" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'בראשׁית ברא אלהים את השׁמים ואת הארץ׃ '" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "T.text([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], fmt=\"text-orig-plain\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is also another usage of this function:\n", "\n", "```python\n", "T.text(node, fmt=fmt)\n", "```\n", "\n", "where `node` is a single node.\n", "In this case, the default format is `ntype-orig-full` where `ntype` is the type of `node`.\n", "So for a `lex` node, the default format is `lex-orig-full`.\n", "\n", "If the format is defined in the corpus, it will be used. 
Otherwise, the word nodes contained in `node` will be looked up\n", "and represented with the default format `text-orig-full`.\n", "\n", "In this way we can sensibly represent a lot of different nodes, such as chapters, verses, sentences, words and lexemes.\n", "\n", "We compose a set of example nodes and run `T.text` on them:" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[1, 1172308, 1414389, 426630, 1437603]" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "exampleNodes = [\n", " 1,\n", " F.otype.s(\"sentence\")[0],\n", " F.otype.s(\"verse\")[0],\n", " F.otype.s(\"chapter\")[0],\n", " F.otype.s(\"lex\")[1],\n", "]\n", "exampleNodes" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "This is word 1:\n", "בְּ\n", "\n", "This is sentence 1172308:\n", "בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃ \n", "\n", "This is verse 1414389:\n", "בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃ \n", "\n", "This is chapter 426630:\n", "בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃ וְהָאָ֗רֶץ הָיְתָ֥ה תֹ֨הוּ֙ וָבֹ֔הוּ וְחֹ֖שֶׁךְ עַל־פְּנֵ֣י תְהֹ֑ום וְר֣וּחַ אֱלֹהִ֔ים מְרַחֶ֖פֶת עַל־פְּנֵ֥י הַמָּֽיִם׃ וַיֹּ֥אמֶר אֱלֹהִ֖ים יְהִ֣י אֹ֑ור וַֽיְהִי־אֹֽור׃ וַיַּ֧רְא אֱלֹהִ֛ים אֶת־הָאֹ֖ור כִּי־טֹ֑וב וַיַּבְדֵּ֣ל אֱלֹהִ֔ים בֵּ֥ין הָאֹ֖ור וּבֵ֥ין הַחֹֽשֶׁךְ׃ וַיִּקְרָ֨א אֱלֹהִ֤ים׀ לָאֹור֙ יֹ֔ום וְלַחֹ֖שֶׁךְ קָ֣רָא לָ֑יְלָה וַֽיְהִי־עֶ֥רֶב וַֽיְהִי־בֹ֖קֶר יֹ֥ום אֶחָֽד׃ פ וַיֹּ֣אמֶר אֱלֹהִ֔ים יְהִ֥י רָקִ֖יעַ בְּתֹ֣וךְ הַמָּ֑יִם וִיהִ֣י מַבְדִּ֔יל בֵּ֥ין מַ֖יִם לָמָֽיִם׃ וַיַּ֣עַשׂ אֱלֹהִים֮ אֶת־הָרָקִיעַ֒ וַיַּבְדֵּ֗ל בֵּ֤ין הַמַּ֨יִם֙ אֲשֶׁר֙ מִתַּ֣חַת לָרָקִ֔יעַ וּבֵ֣ין הַמַּ֔יִם אֲשֶׁ֖ר מֵעַ֣ל לָרָקִ֑יעַ וַֽיְהִי־כֵֽן׃ וַיִּקְרָ֧א אֱלֹהִ֛ים לָֽרָקִ֖יעַ שָׁמָ֑יִם וַֽיְהִי־עֶ֥רֶב וַֽיְהִי־בֹ֖קֶר יֹ֥ום שֵׁנִֽי׃ פ וַיֹּ֣אמֶר אֱלֹהִ֗ים יִקָּו֨וּ 
הַמַּ֜יִם מִתַּ֤חַת הַשָּׁמַ֨יִם֙ אֶל־מָקֹ֣ום אֶחָ֔ד וְתֵרָאֶ֖ה הַיַּבָּשָׁ֑ה וַֽיְהִי־כֵֽן׃ וַיִּקְרָ֨א אֱלֹהִ֤ים׀ לַיַּבָּשָׁה֙ אֶ֔רֶץ וּלְמִקְוֵ֥ה הַמַּ֖יִם קָרָ֣א יַמִּ֑ים וַיַּ֥רְא אֱלֹהִ֖ים כִּי־טֹֽוב׃ וַיֹּ֣אמֶר אֱלֹהִ֗ים תַּֽדְשֵׁ֤א הָאָ֨רֶץ֙ דֶּ֔שֶׁא עֵ֚שֶׂב מַזְרִ֣יעַ זֶ֔רַע עֵ֣ץ פְּרִ֞י עֹ֤שֶׂה פְּרִי֙ לְמִינֹ֔ו אֲשֶׁ֥ר זַרְעֹו־בֹ֖ו עַל־הָאָ֑רֶץ וַֽיְהִי־כֵֽן׃ וַתֹּוצֵ֨א הָאָ֜רֶץ דֶּ֠שֶׁא עֵ֣שֶׂב מַזְרִ֤יעַ זֶ֨רַע֙ לְמִינֵ֔הוּ וְעֵ֧ץ עֹ֥שֶׂה פְּרִ֛י אֲשֶׁ֥ר זַרְעֹו־בֹ֖ו לְמִינֵ֑הוּ וַיַּ֥רְא אֱלֹהִ֖ים כִּי־טֹֽוב׃ וַֽיְהִי־עֶ֥רֶב וַֽיְהִי־בֹ֖קֶר יֹ֥ום שְׁלִישִֽׁי׃ פ וַיֹּ֣אמֶר אֱלֹהִ֗ים יְהִ֤י מְאֹרֹת֙ בִּרְקִ֣יעַ הַשָּׁמַ֔יִם לְהַבְדִּ֕יל בֵּ֥ין הַיֹּ֖ום וּבֵ֣ין הַלָּ֑יְלָה וְהָי֤וּ לְאֹתֹת֙ וּלְמֹ֣ועֲדִ֔ים וּלְיָמִ֖ים וְשָׁנִֽים׃ וְהָי֤וּ לִמְאֹורֹת֙ בִּרְקִ֣יעַ הַשָּׁמַ֔יִם לְהָאִ֖יר עַל־הָאָ֑רֶץ וַֽיְהִי־כֵֽן׃ וַיַּ֣עַשׂ אֱלֹהִ֔ים אֶת־שְׁנֵ֥י הַמְּאֹרֹ֖ת הַגְּדֹלִ֑ים אֶת־הַמָּאֹ֤ור הַגָּדֹל֙ לְמֶמְשֶׁ֣לֶת הַיֹּ֔ום וְאֶת־הַמָּאֹ֤ור הַקָּטֹן֙ לְמֶמְשֶׁ֣לֶת הַלַּ֔יְלָה וְאֵ֖ת הַכֹּוכָבִֽים׃ וַיִּתֵּ֥ן אֹתָ֛ם אֱלֹהִ֖ים בִּרְקִ֣יעַ הַשָּׁמָ֑יִם לְהָאִ֖יר עַל־הָאָֽרֶץ׃ וְלִמְשֹׁל֙ בַּיֹּ֣ום וּבַלַּ֔יְלָה וּֽלֲהַבְדִּ֔יל בֵּ֥ין הָאֹ֖ור וּבֵ֣ין הַחֹ֑שֶׁךְ וַיַּ֥רְא אֱלֹהִ֖ים כִּי־טֹֽוב׃ וַֽיְהִי־עֶ֥רֶב וַֽיְהִי־בֹ֖קֶר יֹ֥ום רְבִיעִֽי׃ פ וַיֹּ֣אמֶר אֱלֹהִ֔ים יִשְׁרְצ֣וּ הַמַּ֔יִם שֶׁ֖רֶץ נֶ֣פֶשׁ חַיָּ֑ה וְעֹוף֙ יְעֹופֵ֣ף עַל־הָאָ֔רֶץ עַל־פְּנֵ֖י רְקִ֥יעַ הַשָּׁמָֽיִם׃ וַיִּבְרָ֣א אֱלֹהִ֔ים אֶת־הַתַּנִּינִ֖ם הַגְּדֹלִ֑ים וְאֵ֣ת כָּל־נֶ֣פֶשׁ הַֽחַיָּ֣ה׀ הָֽרֹמֶ֡שֶׂת אֲשֶׁר֩ שָׁרְצ֨וּ הַמַּ֜יִם לְמִֽינֵהֶ֗ם וְאֵ֨ת כָּל־עֹ֤וף כָּנָף֙ לְמִינֵ֔הוּ וַיַּ֥רְא אֱלֹהִ֖ים כִּי־טֹֽוב׃ וַיְבָ֧רֶךְ אֹתָ֛ם אֱלֹהִ֖ים לֵאמֹ֑ר פְּר֣וּ וּרְב֗וּ וּמִלְא֤וּ אֶת־הַמַּ֨יִם֙ בַּיַּמִּ֔ים וְהָעֹ֖וף יִ֥רֶב בָּאָֽרֶץ׃ וַֽיְהִי־עֶ֥רֶב וַֽיְהִי־בֹ֖קֶר יֹ֥ום חֲמִישִֽׁי׃ פ וַיֹּ֣אמֶר אֱלֹהִ֗ים תֹּוצֵ֨א הָאָ֜רֶץ נֶ֤פֶשׁ חַיָּה֙ לְמִינָ֔הּ בְּהֵמָ֥ה וָרֶ֛מֶשׂ וְחַֽיְתֹו־אֶ֖רֶץ לְמִינָ֑הּ וַֽיְהִי־כֵֽן׃ וַיַּ֣עַשׂ אֱלֹהִים֩ אֶת־חַיַּ֨ת הָאָ֜רֶץ לְמִינָ֗הּ וְאֶת־הַבְּהֵמָה֙ לְמִינָ֔הּ וְאֵ֛ת 
כָּל־רֶ֥מֶשׂ הָֽאֲדָמָ֖ה לְמִינֵ֑הוּ וַיַּ֥רְא אֱלֹהִ֖ים כִּי־טֹֽוב׃ וַיֹּ֣אמֶר אֱלֹהִ֔ים נַֽעֲשֶׂ֥ה אָדָ֛ם בְּצַלְמֵ֖נוּ כִּדְמוּתֵ֑נוּ וְיִרְדּוּ֩ בִדְגַ֨ת הַיָּ֜ם וּבְעֹ֣וף הַשָּׁמַ֗יִם וּבַבְּהֵמָה֙ וּבְכָל־הָאָ֔רֶץ וּבְכָל־הָרֶ֖מֶשׂ הָֽרֹמֵ֥שׂ עַל־הָאָֽרֶץ׃ וַיִּבְרָ֨א אֱלֹהִ֤ים׀ אֶת־הָֽאָדָם֙ בְּצַלְמֹ֔ו בְּצֶ֥לֶם אֱלֹהִ֖ים בָּרָ֣א אֹתֹ֑ו זָכָ֥ר וּנְקֵבָ֖ה בָּרָ֥א אֹתָֽם׃ וַיְבָ֣רֶךְ אֹתָם֮ אֱלֹהִים֒ וַיֹּ֨אמֶר לָהֶ֜ם אֱלֹהִ֗ים פְּר֥וּ וּרְב֛וּ וּמִלְא֥וּ אֶת־הָאָ֖רֶץ וְכִבְשֻׁ֑הָ וּרְד֞וּ בִּדְגַ֤ת הַיָּם֙ וּבְעֹ֣וף הַשָּׁמַ֔יִם וּבְכָל־חַיָּ֖ה הָֽרֹמֶ֥שֶׂת עַל־הָאָֽרֶץ׃ וַיֹּ֣אמֶר אֱלֹהִ֗ים הִנֵּה֩ נָתַ֨תִּי לָכֶ֜ם אֶת־כָּל־עֵ֣שֶׂב׀ זֹרֵ֣עַ זֶ֗רַע אֲשֶׁר֙ עַל־פְּנֵ֣י כָל־הָאָ֔רֶץ וְאֶת־כָּל־הָעֵ֛ץ אֲשֶׁר־בֹּ֥ו פְרִי־עֵ֖ץ זֹרֵ֣עַ זָ֑רַע לָכֶ֥ם יִֽהְיֶ֖ה לְאָכְלָֽה׃ וּֽלְכָל־חַיַּ֣ת הָ֠אָרֶץ וּלְכָל־עֹ֨וף הַשָּׁמַ֜יִם וּלְכֹ֣ל׀ רֹומֵ֣שׂ עַל־הָאָ֗רֶץ אֲשֶׁר־בֹּו֙ נֶ֣פֶשׁ חַיָּ֔ה אֶת־כָּל־יֶ֥רֶק עֵ֖שֶׂב לְאָכְלָ֑ה וַֽיְהִי־כֵֽן׃ וַיַּ֤רְא אֱלֹהִים֙ אֶת־כָּל־אֲשֶׁ֣ר עָשָׂ֔ה וְהִנֵּה־טֹ֖וב מְאֹ֑ד וַֽיְהִי־עֶ֥רֶב וַֽיְהִי־בֹ֖קֶר יֹ֥ום הַשִּׁשִּֽׁי׃ פ \n", "\n", "This is lex 1437603:\n", "רֵאשִׁית \n", "\n" ] } ], "source": [ "for n in exampleNodes:\n", " print(f\"This is {F.otype.v(n)} {n}:\")\n", " print(T.text(n))\n", " print(\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using the formats\n", "Now let's use those formats to print out the first verse of the Hebrew Bible." 
] }, { "cell_type": "code", "execution_count": 43, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:19:10.077589Z", "start_time": "2018-05-18T09:19:10.070503Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "lex-default:\n", "\tבְּ רֵאשִׁית ברא אֱלֹהִים אֵת הַ שָׁמַיִם וְ אֵת הַ אֶרֶץ \n", "lex-orig-full:\n", "\tבְּ רֵאשִׁית בָּרָא אֱלֹה אֵת הַ שָּׁמַי וְ אֵת הָ אָרֶץ \n", "lex-orig-plain:\n", "\tב ראשׁית ברא אלהים את ה שׁמים ו את ה ארץ \n", "lex-trans-full:\n", "\tB.:- R;>CIJT B.@R@> >:ELOH >;T HA- C.@MAJ W:- >;T H@- >@REY \n", "lex-trans-plain:\n", "\tB R>CJT/ BR>[ >LHJM/ >T H CMJM/ W >T H >RY/ \n", "text-orig-full:\n", "\tבְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃ \n", "text-orig-full-ketiv:\n", "\tבְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃ \n", "text-orig-plain:\n", "\tבראשׁית ברא אלהים את השׁמים ואת הארץ׃ \n", "text-phono-full:\n", "\tbᵊrēšˌîṯ bārˈā ʔᵉlōhˈîm ʔˌēṯ haššāmˌayim wᵊʔˌēṯ hāʔˈāreṣ . \n", "text-trans-full:\n", "\tB.:-R;>CI73JT B.@R@74> >:ELOHI92JM >;71T HA-C.@MA73JIM W:->;71T H@->@75REY00 \n", "text-trans-full-ketiv:\n", "\tB.:-R;>CI73JT B.@R@74> >:ELOHI92JM >;71T HA-C.@MA73JIM W:->;71T H@->@75REY00 \n", "text-trans-plain:\n", "\tBR>CJT BR> >LHJM >T HCMJM W>T H>RY00 \n" ] } ], "source": [ "for fmt in sorted(T.formats):\n", " print(\"{}:\\n\\t{}\".format(fmt, T.text(range(1, 12), fmt=fmt)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that `lex-default` is a format that only works for nodes of type `lex`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we do not specify a format, the **default** format is used (`text-orig-full`)." 
] }, { "cell_type": "code", "execution_count": 44, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:19:13.490426Z", "start_time": "2018-05-18T09:19:13.486053Z" } }, "outputs": [ { "data": { "text/plain": [ "'בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃ '" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "T.text(range(1, 12))" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃ '" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "firstVerse = F.otype.s(\"verse\")[0]\n", "T.text(firstVerse)" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'bᵊrēšˌîṯ bārˈā ʔᵉlōhˈîm ʔˌēṯ haššāmˌayim wᵊʔˌēṯ hāʔˈāreṣ . '" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "T.text(firstVerse, fmt=\"text-phono-full\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The important things to remember are:\n", "\n", "* you can supply a list of word nodes and get them represented in all formats (except `lex-default`)\n", "* you can use `T.text(lx)` for lexeme nodes `lx` and it will give the vocalized lexeme (using format `lex-default`)\n", "* you can get non-word nodes `n` in default format by `T.text(n)`\n", "* you can get non-word nodes `n` in other formats by `T.text(n, fmt=fmt, descend=True)`" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "## Whole text in all formats\n", "Part of the pleasure of working with computers is that they can crunch massive amounts of data.\n", "The text of the Hebrew Bible is a piece of cake.\n", "\n", "It takes less than ten seconds to have that cake and eat it.\n", "In nearly a dozen formats." 
] }, { "cell_type": "code", "execution_count": 47, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:19:27.839331Z", "start_time": "2018-05-18T09:19:18.526400Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s writing plain text of whole Bible in all formats ...\n", " 3.09s done 12 formats\n" ] } ], "source": [ "A.indent(reset=True)\n", "A.info(\"writing plain text of whole Bible in all formats ...\")\n", "text = collections.defaultdict(list)\n", "for v in F.otype.s(\"verse\"):\n", " for fmt in sorted(T.formats):\n", " text[fmt].append(T.text(v, fmt=fmt, descend=True))\n", "A.info(\"done {} formats\".format(len(text)))" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "lex-default\n", "בְּ רֵאשִׁית ברא אֱלֹהִים אֵת הַ שָׁמַיִם וְ אֵת הַ אֶרֶץ \n", "וְ הַ אֶרֶץ היה תֹּהוּ וְ בֹּהוּ וְ חֹשֶׁךְ עַל פָּנֶה תְּהֹום וְ רוּחַ אֱלֹהִים רחף עַל פָּנֶה הַ מַיִם \n", "וְ אמר אֱלֹהִים היה אֹור וְ היה אֹור \n", "וְ ראה אֱלֹהִים אֵת הַ אֹור כִּי טוב וְ בדל אֱלֹהִים בַּיִן הַ אֹור וְ בַּיִן הַ חֹשֶׁךְ \n", "וְ קרא אֱלֹהִים לְ הַ אֹור יֹום וְ לְ הַ חֹשֶׁךְ קרא לַיְלָה וְ היה עֶרֶב וְ היה בֹּקֶר יֹום אֶחָד \n", "\n", "lex-orig-full\n", "בְּ רֵאשִׁית בָּרָא אֱלֹה אֵת הַ שָּׁמַי וְ אֵת הָ אָרֶץ \n", "וְ הָ אָרֶץ הָי תֹהוּ וָ בֹהוּ וְ חֹשֶׁךְ עַל פְּן תְהֹום וְ רוּחַ אֱלֹה רַחֶף עַל פְּן הַ מָּי \n", "וַ אמֶר אֱלֹה הִי אֹור וַ הִי אֹור \n", "וַ רְא אֱלֹה אֶת הָ אֹור כִּי טֹוב וַ בְדֵּל אֱלֹה בֵּין הָ אֹור וּ בֵין הַ חֹשֶׁךְ \n", "וַ קְרָא אֱלֹה לָ אֹור יֹום וְ לַ חֹשֶׁךְ קָרָא לָיְלָה וַ הִי עֶרֶב וַ הִי בֹקֶר יֹום אֶחָד \n", "\n", "lex-orig-plain\n", "ב ראשׁית ברא אלהים את ה שׁמים ו את ה ארץ \n", "ו ה ארץ היה תהו ו בהו ו חשׁך על פנה תהום ו רוח אלהים רחף על פנה ה מים \n", "ו אמר אלהים היה אור ו היה אור \n", "ו ראה אלהים את ה אור כי טוב ו בדל אלהים בין ה אור ו בין ה חשׁך \n", "ו קרא אלהים ל ה אור יום ו ל ה חשׁך קרא לילה ו היה ערב ו היה בקר יום אחד \n", "\n", 
"lex-trans-full\n", "B.:- R;>CIJT B.@R@> >:ELOH >;T HA- C.@MAJ W:- >;T H@- >@REY \n", "W:- H@- >@REY H@J TOHW. W@- BOHW. W:- XOCEK: :ELOH RAXEP MER >:ELOH HIJ >OWR WA- HIJ >OWR \n", "WA- R:> >:ELOH >ET H@- >OWR K.IJ VOWB WA- B:D.;L >:ELOH B.;JN H@- >OWR W.- B;JN HA- XOCEK: \n", "WA- Q:R@> >:ELOH L@- - >OWR JOWM W:- LA- - XOCEK: Q@R@> L@J:L@H WA- HIJ EX@D \n", "\n", "lex-trans-plain\n", "B R>CJT/ BR>[ >LHJM/ >T H CMJM/ W >T H >RY/ \n", "W H >RY/ HJH[ THW/ W BHW/ W XCK/ LHJM/ RXP[ MR[ >LHJM/ HJH[ >WR/ W HJH[ >WR/ \n", "W R>H[ >LHJM/ >T H >WR/ KJ VWB[ W BDL[ >LHJM/ BJN/ H >WR/ W BJN/ H XCK/ \n", "W QR>[ >LHJM/ L H >WR/ JWM/ W L H XCK/ QR>[ LJLH/ W HJH[ XD/ \n", "\n", "text-orig-full\n", "בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃ \n", "וְהָאָ֗רֶץ הָיְתָ֥ה תֹ֨הוּ֙ וָבֹ֔הוּ וְחֹ֖שֶׁךְ עַל־פְּנֵ֣י תְהֹ֑ום וְר֣וּחַ אֱלֹהִ֔ים מְרַחֶ֖פֶת עַל־פְּנֵ֥י הַמָּֽיִם׃ \n", "וַיֹּ֥אמֶר אֱלֹהִ֖ים יְהִ֣י אֹ֑ור וַֽיְהִי־אֹֽור׃ \n", "וַיַּ֧רְא אֱלֹהִ֛ים אֶת־הָאֹ֖ור כִּי־טֹ֑וב וַיַּבְדֵּ֣ל אֱלֹהִ֔ים בֵּ֥ין הָאֹ֖ור וּבֵ֥ין הַחֹֽשֶׁךְ׃ \n", "וַיִּקְרָ֨א אֱלֹהִ֤ים׀ לָאֹור֙ יֹ֔ום וְלַחֹ֖שֶׁךְ קָ֣רָא לָ֑יְלָה וַֽיְהִי־עֶ֥רֶב וַֽיְהִי־בֹ֖קֶר יֹ֥ום אֶחָֽד׃ פ \n", "\n", "text-orig-full-ketiv\n", "בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃ \n", "וְהָאָ֗רֶץ הָיְתָ֥ה תֹ֨הוּ֙ וָבֹ֔הוּ וְחֹ֖שֶׁךְ עַל־פְּנֵ֣י תְהֹ֑ום וְר֣וּחַ אֱלֹהִ֔ים מְרַחֶ֖פֶת עַל־פְּנֵ֥י הַמָּֽיִם׃ \n", "וַיֹּ֥אמֶר אֱלֹהִ֖ים יְהִ֣י אֹ֑ור וַֽיְהִי־אֹֽור׃ \n", "וַיַּ֧רְא אֱלֹהִ֛ים אֶת־הָאֹ֖ור כִּי־טֹ֑וב וַיַּבְדֵּ֣ל אֱלֹהִ֔ים בֵּ֥ין הָאֹ֖ור וּבֵ֥ין הַחֹֽשֶׁךְ׃ \n", "וַיִּקְרָ֨א אֱלֹהִ֤ים׀ לָאֹור֙ יֹ֔ום וְלַחֹ֖שֶׁךְ קָ֣רָא לָ֑יְלָה וַֽיְהִי־עֶ֥רֶב וַֽיְהִי־בֹ֖קֶר יֹ֥ום אֶחָֽד׃ פ \n", "\n", "text-orig-plain\n", "בראשׁית ברא אלהים את השׁמים ואת הארץ׃ \n", "והארץ היתה תהו ובהו וחשׁך על־פני תהום ורוח אלהים מרחפת על־פני המים׃ \n", "ויאמר אלהים יהי אור ויהי־אור׃ \n", "וירא אלהים את־האור כי־טוב ויבדל אלהים בין האור ובין החשׁך׃ \n", "ויקרא אלהים׀ לאור יום ולחשׁך קרא לילה ויהי־ערב ויהי־בקר יום 
אחד׃ פ \n", "\n", "text-phono-full\n", "bᵊrēšˌîṯ bārˈā ʔᵉlōhˈîm ʔˌēṯ haššāmˌayim wᵊʔˌēṯ hāʔˈāreṣ . \n", "wᵊhāʔˈāreṣ hāyᵊṯˌā ṯˈōhû wāvˈōhû wᵊḥˌōšeḵ ʕal-pᵊnˈê ṯᵊhˈôm wᵊrˈûₐḥ ʔᵉlōhˈîm mᵊraḥˌefeṯ ʕal-pᵊnˌê hammˈāyim . \n", "wayyˌōmer ʔᵉlōhˌîm yᵊhˈî ʔˈôr wˈayᵊhî-ʔˈôr . \n", "wayyˈar ʔᵉlōhˈîm ʔeṯ-hāʔˌôr kî-ṭˈôv wayyavdˈēl ʔᵉlōhˈîm bˌên hāʔˌôr ûvˌên haḥˈōšeḵ . \n", "wayyiqrˌā ʔᵉlōhˈîm lāʔôr yˈôm wᵊlaḥˌōšeḵ qˈārā lˈāyᵊlā wˈayᵊhî-ʕˌerev wˈayᵊhî-vˌōqer yˌôm ʔeḥˈāḏ . f \n", "\n", "text-trans-full\n", "B.:-R;>CI73JT B.@R@74> >:ELOHI92JM >;71T HA-C.@MA73JIM W:->;71T H@->@75REY00 \n", "W:-H@->@81REY H@J:T@71H TO33HW.03 W@-BO80HW. W:-XO73CEK: :ELOHI80JM M:RAXE73PET MER >:ELOHI73JM J:HI74J >O92WR WA45-J:HIJ&>O75WR00 \n", "WA-J.A94R:> >:ELOHI91JM >ET&H@->O73WR K.IJ&VO92WB WA-J.AB:D.;74L >:ELOHI80JM B.;71JN H@->O73WR W.-B;71JN HA-XO75CEK:00 \n", "WA-J.IQ:R@63> >:ELOHI70JM05 L@-->OWR03 JO80WM W:-LA--XO73CEK: Q@74R@> L@92J:L@H WA45-J:HIJ&EX@75D00_P \n", "\n", "text-trans-full-ketiv\n", "B.:-R;>CI73JT B.@R@74> >:ELOHI92JM >;71T HA-C.@MA73JIM W:->;71T H@->@75REY00 \n", "W:-H@->@81REY H@J:T@71H TO33HW.03 W@-BO80HW. W:-XO73CEK: :ELOHI80JM M:RAXE73PET MER >:ELOHI73JM J:HI74J >O92WR WA45-J:HIJ&>O75WR00 \n", "WA-J.A94R:> >:ELOHI91JM >ET&H@->O73WR K.IJ&VO92WB WA-J.AB:D.;74L >:ELOHI80JM B.;71JN H@->O73WR W.-B;71JN HA-XO75CEK:00 \n", "WA-J.IQ:R@63> >:ELOHI70JM05 L@-->OWR03 JO80WM W:-LA--XO73CEK: Q@74R@> L@92J:L@H WA45-J:HIJ&EX@75D00_P \n", "\n", "text-trans-plain\n", "BR>CJT BR> >LHJM >T HCMJM W>T H>RY00 \n", "WH>RY HJTH THW WBHW WXCK LHJM MRXPT MR >LHJM JHJ >WR WJHJ&>WR00 \n", "WJR> >LHJM >T&H>WR KJ&VWB WJBDL >LHJM BJN H>WR WBJN HXCK00 \n", "WJQR> >LHJM05 L>WR JWM WLXCK QR> LJLH WJHJ&XD00_P \n", "\n" ] } ], "source": [ "for fmt in sorted(text):\n", " print(\"{}\\n{}\\n\".format(fmt, \"\\n\".join(text[fmt][0:5])))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The full plain text\n", "We write a few formats to file, in your Downloads folder." 
] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [], "source": [ "for fmt in \"\"\"\n", " text-orig-full\n", " text-phono-full\n", "\"\"\".strip().split():\n", " with open(os.path.expanduser(f\"~/Downloads/{fmt}.txt\"), \"w\") as f:\n", " f.write(\"\\n\".join(text[fmt]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Book names\n", "\n", "For Bible book names, we can use several languages.\n", "\n", "### Languages\n", "Here are the languages that we can use for book names.\n", "These languages come from the features `book@ll`, where `ll` is a two letter\n", "ISO language code. Have a look in your data directory, you can't miss them." ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:19:36.977529Z", "start_time": "2018-05-18T09:19:36.969202Z" } }, "outputs": [ { "data": { "text/plain": [ "{'': {'language': 'default', 'languageEnglish': 'default'},\n", " 'am': {'language': 'ኣማርኛ', 'languageEnglish': 'amharic'},\n", " 'ar': {'language': 'العَرَبِية', 'languageEnglish': 'arabic'},\n", " 'bn': {'language': 'বাংলা', 'languageEnglish': 'bengali'},\n", " 'da': {'language': 'Dansk', 'languageEnglish': 'danish'},\n", " 'de': {'language': 'Deutsch', 'languageEnglish': 'german'},\n", " 'el': {'language': 'Ελληνικά', 'languageEnglish': 'greek'},\n", " 'en': {'language': 'English', 'languageEnglish': 'english'},\n", " 'es': {'language': 'Español', 'languageEnglish': 'spanish'},\n", " 'fa': {'language': 'فارسی', 'languageEnglish': 'farsi'},\n", " 'fr': {'language': 'Français', 'languageEnglish': 'french'},\n", " 'he': {'language': 'עברית', 'languageEnglish': 'hebrew'},\n", " 'hi': {'language': 'हिन्दी', 'languageEnglish': 'hindi'},\n", " 'id': {'language': 'Bahasa Indonesia', 'languageEnglish': 'indonesian'},\n", " 'ja': {'language': '日本語', 'languageEnglish': 'japanese'},\n", " 'ko': {'language': '한국어', 'languageEnglish': 'korean'},\n", " 'la': {'language': 'Latina', 
'languageEnglish': 'latin'},\n", " 'nl': {'language': 'Nederlands', 'languageEnglish': 'dutch'},\n", " 'pa': {'language': 'ਪੰਜਾਬੀ', 'languageEnglish': 'punjabi'},\n", " 'pt': {'language': 'Português', 'languageEnglish': 'portuguese'},\n", " 'ru': {'language': 'Русский', 'languageEnglish': 'russian'},\n", " 'sw': {'language': 'Kiswahili', 'languageEnglish': 'swahili'},\n", " 'syc': {'language': 'ܠܫܢܐ ܣܘܪܝܝܐ', 'languageEnglish': 'syriac'},\n", " 'tr': {'language': 'Türkçe', 'languageEnglish': 'turkish'},\n", " 'ur': {'language': 'اُردُو', 'languageEnglish': 'urdu'},\n", " 'yo': {'language': 'èdè Yorùbá', 'languageEnglish': 'yoruba'},\n", " 'zh': {'language': '中文', 'languageEnglish': 'chinese'}}" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "T.languages" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Book names in Swahili\n", "Get the book names in Swahili." ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:19:38.495048Z", "start_time": "2018-05-18T09:19:38.488011Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "426591 = Mwanzo\n", "426592 = Kutoka\n", "426593 = Mambo_ya_Walawi\n", "426594 = Hesabu\n", "426595 = Kumbukumbu_la_Torati\n", "426596 = Yoshua\n", "426597 = Waamuzi\n", "426598 = 1_Samweli\n", "426599 = 2_Samweli\n", "426600 = 1_Wafalme\n", "426601 = 2_Wafalme\n", "426602 = Isaya\n", "426603 = Yeremia\n", "426604 = Ezekieli\n", "426605 = Hosea\n", "426606 = Yoeli\n", "426607 = Amosi\n", "426608 = Obadia\n", "426609 = Yona\n", "426610 = Mika\n", "426611 = Nahumu\n", "426612 = Habakuki\n", "426613 = Sefania\n", "426614 = Hagai\n", "426615 = Zekaria\n", "426616 = Malaki\n", "426617 = Zaburi\n", "426618 = Ayubu\n", "426619 = Mithali\n", "426620 = Ruthi\n", "426621 = Wimbo_Ulio_Bora\n", "426622 = Mhubiri\n", "426623 = Maombolezo\n", "426624 = Esta\n", "426625 = Danieli\n", "426626 = Ezra\n", "426627 = Nehemia\n", 
"426628 = 1_Mambo_ya_Nyakati\n", "426629 = 2_Mambo_ya_Nyakati\n", "\n" ] } ], "source": [ "nodeToSwahili = \"\"\n", "for b in F.otype.s(\"book\"):\n", " nodeToSwahili += \"{} = {}\\n\".format(b, T.bookName(b, lang=\"sw\"))\n", "print(nodeToSwahili)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Book nodes from Swahili\n", "OK, there they are. We copy them into a string, and do the opposite: get the nodes back.\n", "We check whether we get exactly the same nodes as the ones we started with." ] }, { "cell_type": "code", "execution_count": 52, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:19:40.311912Z", "start_time": "2018-05-18T09:19:40.302946Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Going from nodes to booknames and back yields the original nodes\n" ] } ], "source": [ "swahiliNames = \"\"\"\n", "Mwanzo\n", "Kutoka\n", "Mambo_ya_Walawi\n", "Hesabu\n", "Kumbukumbu_la_Torati\n", "Yoshua\n", "Waamuzi\n", "1_Samweli\n", "2_Samweli\n", "1_Wafalme\n", "2_Wafalme\n", "Isaya\n", "Yeremia\n", "Ezekieli\n", "Hosea\n", "Yoeli\n", "Amosi\n", "Obadia\n", "Yona\n", "Mika\n", "Nahumu\n", "Habakuki\n", "Sefania\n", "Hagai\n", "Zekaria\n", "Malaki\n", "Zaburi\n", "Ayubu\n", "Mithali\n", "Ruthi\n", "Wimbo_Ulio_Bora\n", "Mhubiri\n", "Maombolezo\n", "Esta\n", "Danieli\n", "Ezra\n", "Nehemia\n", "1_Mambo_ya_Nyakati\n", "2_Mambo_ya_Nyakati\n", "\"\"\".strip().split()\n", "\n", "swahiliToNode = \"\"\n", "for nm in swahiliNames:\n", " swahiliToNode += \"{} = {}\\n\".format(T.bookNode(nm, lang=\"sw\"), nm)\n", "\n", "if swahiliToNode != nodeToSwahili:\n", " print(\"Something is not right with the book names\")\n", "else:\n", " print(\"Going from nodes to booknames and back yields the original nodes\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Sections\n", "\n", "A section in the Hebrew bible is a book, a chapter or a verse.\n", "Knowledge of sections is not baked into Text-Fabric.\n", "The config feature 
`otext.tf` may specify three section levels, and tell\n", "what the corresponding node types and features are.\n", "\n", "From that knowledge it can construct mappings from nodes to sections, e.g. from verse\n", "nodes to tuples of the form:\n", "\n", " `(bookName, chapterNumber, verseNumber)`\n", "\n", "You can get the section of a node as a tuple of relevant book, chapter, and verse nodes.\n", "Or you can get it as a passage label, a string.\n", "\n", "You can ask for the passage corresponding to the first slot of a node, or the one corresponding to the last slot.\n", "\n", "If you are dealing with book and chapter nodes, you can ask to fill out the verse and chapter parts as well.\n", "\n", "Here are examples of getting the section that corresponds to a node and vice versa.\n", "\n", "**NB:** `sectionFromNode` always delivers a verse specification, either from the\n", "first slot belonging to that node, or, if `lastSlot`, from the last slot\n", "belonging to that node." ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:19:43.056511Z", "start_time": "2018-05-18T09:19:43.043552Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 15890 wordShow en - Genesis 30:18 Genesis 30:18 (426591, 426659, 1415237)\n", " la - Genesis 30:18 Genesis 30:18 (426591, 426659, 1415237)\n", " sw - Mwanzo 30:18 Mwanzo 30:18 \n", " 700000 phraseShow en - Numbers 22:31 Numbers 22:31 (426594, 426768, 1418795)\n", " la - Numeri 22:31 Numeri 22:31 (426594, 426768, 1418795)\n", " sw - Hesabu 22:31 Hesabu 22:31 \n", " 500000 clauseShow en - Job 36:27 Job 36:27 (426618, 427382, 1432958)\n", " la - Iob 36:27 Iob 36:27 (426618, 427382, 1432958)\n", " sw - Ayubu 36:27 Ayubu 36:27 \n", "1200000 sentenceShow en - 2_Kings 6:5 2_Kings 6:5 (426601, 426944, 1423986)\n", " la - Reges_II 6:5 Reges_II 6:5 (426601, 426944, 1423986)\n", " sw - 2_Wafalme 6:5 2_Wafalme 6:5 \n", "1437667 lexShow en - Genesis 1:16 2_Chronicles 
22:1 (426591, 426630, 1414404)\n", " la - Genesis 1:16 Chronica_II 22:1 (426629, 427544, 1437230)\n", " sw - Mwanzo 1:16 2_Mambo_ya_Nyakati 22:1 \n", "1420000 verseShow en - Deuteronomy 27:25 Deuteronomy 27:25 (426595, 426809, 1420000)\n", " la - Deuteronomium 27:25 Deuteronomium 27:25 (426595, 426809, 1420000)\n", " sw - Kumbukumbu_la_Torati 27:25 Kumbukumbu_la_Torati 27:25 \n", " 427000 chapterShow en - Isaiah 37 Isaiah 37:38 (426602, 427000)\n", " la - Jesaia 37 Jesaia 37:38 (426602, 427000, 1425295)\n", " sw - Isaya 37 Isaya 37:38 \n", " 426598 bookShow en - 1_Samuel 1_Samuel 31:13 (426598,)\n", " la - Samuel_I Samuel_I 31:13 (426598, 426892, 1422328)\n", " sw - 1_Samweli 1_Samweli 31:13 \n" ] } ], "source": [ "\n", "for (desc, n) in chain(normalShow.items(), sectionShow.items()):\n", " for lang in \"en la sw\".split():\n", " d = f\"{n:>7} {desc}\" if lang == \"en\" else \"\"\n", " first = A.sectionStrFromNode(n, lang=lang)\n", " last = A.sectionStrFromNode(n, lang=lang, lastSlot=True, fillup=True)\n", " tup = (\n", " T.sectionTuple(n)\n", " if lang == \"en\"\n", " else T.sectionTuple(n, lastSlot=True, fillup=True)\n", " if lang == \"la\"\n", " else \"\"\n", " )\n", " print(f\"{d:<20} {lang} - {first:<30} {last:<30} {tup}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And here are examples to get back:" ] }, { "cell_type": "code", "execution_count": 54, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:19:43.056511Z", "start_time": "2018-05-18T09:19:43.043552Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Ezekiel en book 426604\n", "Ezechiel la book 426604\n", "Ezekieli sw book 426604\n", "Isaiah 43 en chapter 427006\n", "Jesaia 43 la chapter 427006\n", "Isaya 43 sw chapter 427006\n", "Deuteronomy 28:34 en verse 1420035\n", "Deuteronomium 28:34 la verse 1420035\n", "Kumbukumbu_la_Torati 28:34 sw verse 1420035\n", "Job 37:3 en verse 1432967\n", "Iob 37:3 la verse 1432967\n", "Ayubu 37:3 sw verse 1432967\n", 
"Numbers 22:33 en verse 1418797\n", "Numeri 22:33 la verse 1418797\n", "Hesabu 22:33 sw verse 1418797\n", "Genesis 30:18 en verse 1415237\n", "Genesis 30:18 la verse 1415237\n", "Mwanzo 30:18 sw verse 1415237\n", "Genesis 1:30 en verse 1414418\n", "Genesis 1:30 la verse 1414418\n", "Mwanzo 1:30 sw verse 1414418\n", "Psalms 37:2 en verse 1430067\n", "Psalmi 37:2 la verse 1430067\n", "Zaburi 37:2 sw verse 1430067\n" ] } ], "source": [ "for (lang, section) in (\n", " (\"en\", \"Ezekiel\"),\n", " (\"la\", \"Ezechiel\"),\n", " (\"sw\", \"Ezekieli\"),\n", " (\"en\", \"Isaiah 43\"),\n", " (\"la\", \"Jesaia 43\"),\n", " (\"sw\", \"Isaya 43\"),\n", " (\"en\", \"Deuteronomy 28:34\"),\n", " (\"la\", \"Deuteronomium 28:34\"),\n", " (\"sw\", \"Kumbukumbu_la_Torati 28:34\"),\n", " (\"en\", \"Job 37:3\"),\n", " (\"la\", \"Iob 37:3\"),\n", " (\"sw\", \"Ayubu 37:3\"),\n", " (\"en\", \"Numbers 22:33\"),\n", " (\"la\", \"Numeri 22:33\"),\n", " (\"sw\", \"Hesabu 22:33\"),\n", " (\"en\", \"Genesis 30:18\"),\n", " (\"la\", \"Genesis 30:18\"),\n", " (\"sw\", \"Mwanzo 30:18\"),\n", " (\"en\", \"Genesis 1:30\"),\n", " (\"la\", \"Genesis 1:30\"),\n", " (\"sw\", \"Mwanzo 1:30\"),\n", " (\"en\", \"Psalms 37:2\"),\n", " (\"la\", \"Psalmi 37:2\"),\n", " (\"sw\", \"Zaburi 37:2\"),\n", "):\n", " n = A.nodeFromSectionStr(section, lang=lang)\n", " nType = F.otype.v(n)\n", " print(f\"{section:<30} {lang} {nType:<20} {n}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Sentences spanning multiple verses\n", "If you go up from a sentence node, you expect to find a verse node.\n", "But some sentences span multiple verses, and in that case, you will not find the enclosing\n", "verse node, because it is not there.\n", "\n", "Here is a piece of code to detect and list all cases where sentences span multiple verses.\n", "\n", "The idea is to pick the first and the last word of a sentence, use `T.sectionFromNode` to\n", "discover the verse in which that word occurs, and if they are 
different: bingo!\n", "\n", "We show the first 10 of ca. 900 cases." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By the way: doing this in the `2016` version of the data yields 915 results.\n", "The splitting up of the text into sentences is not carved in stone!" ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:19:53.984718Z", "start_time": "2018-05-18T09:19:49.190240Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Get sentences that span multiple verses\n", " 1.09s Found 887 cases\n", " 1.09s \n", "Genesis 1:17-18\n", "Genesis 1:29-30\n", "Genesis 2:4-7\n", "Genesis 7:2-3\n", "Genesis 7:8-9\n", "Genesis 7:13-14\n", "Genesis 9:9-10\n", "Genesis 10:11-12\n", "Genesis 10:13-14\n", "Genesis 10:15-18\n" ] } ], "source": [ "A.indent(reset=True)\n", "A.info(\"Get sentences that span multiple verses\")\n", "\n", "spanSentences = []\n", "for s in F.otype.s(\"sentence\"):\n", " fs = T.sectionFromNode(s, lastSlot=False)\n", " ls = T.sectionFromNode(s, lastSlot=True)\n", " if fs != ls:\n", " spanSentences.append(\"{} {}:{}-{}\".format(fs[0], fs[1], fs[2], ls[2]))\n", "\n", "A.info(\"Found {} cases\".format(len(spanSentences)))\n", "A.info(\"\\n{}\".format(\"\\n\".join(spanSentences[0:10])))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A different way, with better display, is:" ] }, { "cell_type": "code", "execution_count": 56, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:19:59.897561Z", "start_time": "2018-05-18T09:19:58.291284Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Get sentences that span multiple verses\n", " 0.38s Found 887 cases\n" ] }, { "data": { "text/html": [ "\n", "
npsentenceverseverse
1Genesis 1:17וַיִּתֵּ֥ן אֹתָ֛ם אֱלֹהִ֖ים בִּרְקִ֣יעַ הַשָּׁמָ֑יִם לְהָאִ֖יר עַל־הָאָֽרֶץ׃ וְלִמְשֹׁל֙ בַּיֹּ֣ום וּבַלַּ֔יְלָה וּֽלֲהַבְדִּ֔יל בֵּ֥ין הָאֹ֖ור וּבֵ֣ין הַחֹ֑שֶׁךְ
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "A.indent(reset=True)\n", "A.info(\"Get sentences that span multiple verses\")\n", "\n", "spanSentences = []\n", "for s in F.otype.s(\"sentence\"):\n", " words = L.d(s, otype=\"word\")\n", " fw = words[0]\n", " lw = words[-1]\n", " fVerse = L.u(fw, otype=\"verse\")[0]\n", " lVerse = L.u(lw, otype=\"verse\")[0]\n", " if fVerse != lVerse:\n", " spanSentences.append((s, fVerse, lVerse))\n", "\n", "A.info(\"Found {} cases\".format(len(spanSentences)))\n", "A.table(spanSentences, end=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Wait a second, the columns with the verses are empty.\n", "In tables, the content of a verse is not shown.\n", "And by default, the passage that is relevant to a row is computed from one of the columns.\n", "\n", "But here, we definitely want the passage of columns 2 and 3, so:" ] }, { "cell_type": "code", "execution_count": 57, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:19:59.897561Z", "start_time": "2018-05-18T09:19:58.291284Z" } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
nsentenceverseverse
1וַיִּתֵּ֥ן אֹתָ֛ם אֱלֹהִ֖ים בִּרְקִ֣יעַ הַשָּׁמָ֑יִם לְהָאִ֖יר עַל־הָאָֽרֶץ׃ וְלִמְשֹׁל֙ בַּיֹּ֣ום וּבַלַּ֔יְלָה וּֽלֲהַבְדִּ֔יל בֵּ֥ין הָאֹ֖ור וּבֵ֣ין הַחֹ֑שֶׁךְ Genesis 1:17  Genesis 1:18  
2הִנֵּה֩ נָתַ֨תִּי לָכֶ֜ם אֶת־כָּל־עֵ֣שֶׂב׀ זֹרֵ֣עַ זֶ֗רַע אֲשֶׁר֙ עַל־פְּנֵ֣י כָל־הָאָ֔רֶץ וְאֶת־כָּל־הָעֵ֛ץ אֲשֶׁר־בֹּ֥ו פְרִי־עֵ֖ץ זֹרֵ֣עַ זָ֑רַע וּֽלְכָל־חַיַּ֣ת הָ֠אָרֶץ וּלְכָל־עֹ֨וף הַשָּׁמַ֜יִם וּלְכֹ֣ל׀ רֹומֵ֣שׂ עַל־הָאָ֗רֶץ אֲשֶׁר־בֹּו֙ נֶ֣פֶשׁ חַיָּ֔ה אֶת־כָּל־יֶ֥רֶק עֵ֖שֶׂב לְאָכְלָ֑ה Genesis 1:29  Genesis 1:30  
3בְּיֹ֗ום עֲשֹׂ֛ות יְהוָ֥ה אֱלֹהִ֖ים אֶ֥רֶץ וְשָׁמָֽיִם׃ וַיִּיצֶר֩ יְהוָ֨ה אֱלֹהִ֜ים אֶת־הָֽאָדָ֗ם עָפָר֙ מִן־הָ֣אֲדָמָ֔ה Genesis 2:4  Genesis 2:7  
4מִכֹּ֣ל׀ הַבְּהֵמָ֣ה הַטְּהֹורָ֗ה תִּֽקַּח־לְךָ֛ שִׁבְעָ֥ה שִׁבְעָ֖ה אִ֣ישׁ וְאִשְׁתֹּ֑ו וּמִן־הַבְּהֵמָ֡ה אֲ֠שֶׁר לֹ֣א טְהֹרָ֥ה הִ֛וא שְׁנַ֖יִם אִ֥ישׁ וְאִשְׁתֹּֽו׃ גַּ֣ם מֵעֹ֧וף הַשָּׁמַ֛יִם שִׁבְעָ֥ה שִׁבְעָ֖ה זָכָ֣ר וּנְקֵבָ֑ה לְחַיֹּ֥ות זֶ֖רַע עַל־פְּנֵ֥י כָל־הָאָֽרֶץ׃ Genesis 7:2  Genesis 7:3  
5מִן־הַבְּהֵמָה֙ הַטְּהֹורָ֔ה וּמִן־הַ֨בְּהֵמָ֔ה אֲשֶׁ֥ר אֵינֶ֖נָּה טְהֹרָ֑ה וּמִ֨ן־הָעֹ֔וף וְכֹ֥ל אֲשֶׁר־רֹמֵ֖שׂ עַל־הָֽאֲדָמָֽה׃ שְׁנַ֨יִם שְׁנַ֜יִם בָּ֧אוּ אֶל־נֹ֛חַ אֶל־הַתֵּבָ֖ה זָכָ֣ר וּנְקֵבָ֑ה כַּֽאֲשֶׁ֛ר צִוָּ֥ה אֱלֹהִ֖ים אֶת־נֹֽחַ׃ Genesis 7:8  Genesis 7:9  
6בְּעֶ֨צֶם הַיֹּ֤ום הַזֶּה֙ בָּ֣א נֹ֔חַ וְשֵׁם־וְחָ֥ם וָיֶ֖פֶת בְּנֵי־נֹ֑חַ וְאֵ֣שֶׁת נֹ֗חַ וּשְׁלֹ֧שֶׁת נְשֵֽׁי־בָנָ֛יו אִתָּ֖ם אֶל־הַתֵּבָֽה׃ הֵ֜מָּה וְכָל־הַֽחַיָּ֣ה לְמִינָ֗הּ וְכָל־הַבְּהֵמָה֙ לְמִינָ֔הּ וְכָל־הָרֶ֛מֶשׂ הָרֹמֵ֥שׂ עַל־הָאָ֖רֶץ לְמִינֵ֑הוּ וְכָל־הָעֹ֣וף לְמִינֵ֔הוּ כֹּ֖ל צִפֹּ֥ור כָּל־כָּנָֽף׃ Genesis 7:13  Genesis 7:14  
7וַאֲנִ֕י הִנְנִ֥י מֵקִ֛ים אֶת־בְּרִיתִ֖י אִתְּכֶ֑ם וְאֶֽת־זַרְעֲכֶ֖ם אַֽחֲרֵיכֶֽם׃ וְאֵ֨ת כָּל־נֶ֤פֶשׁ הַֽחַיָּה֙ אֲשֶׁ֣ר אִתְּכֶ֔ם בָּעֹ֧וף בַּבְּהֵמָ֛ה וּֽבְכָל־חַיַּ֥ת הָאָ֖רֶץ אִתְּכֶ֑ם מִכֹּל֙ יֹצְאֵ֣י הַתֵּבָ֔ה לְכֹ֖ל חַיַּ֥ת הָאָֽרֶץ׃ Genesis 9:9  Genesis 9:10  
8וַיִּ֨בֶן֙ אֶת־נִ֣ינְוֵ֔ה וְאֶת־רְחֹבֹ֥ת עִ֖יר וְאֶת־כָּֽלַח׃ וְֽאֶת־רֶ֔סֶן בֵּ֥ין נִֽינְוֵ֖ה וּבֵ֣ין כָּ֑לַח Genesis 10:11  Genesis 10:12  
9וּמִצְרַ֡יִם יָלַ֞ד אֶת־לוּדִ֧ים וְאֶת־עֲנָמִ֛ים וְאֶת־לְהָבִ֖ים וְאֶת־נַפְתֻּחִֽים׃ וְֽאֶת־פַּתְרֻסִ֞ים וְאֶת־כַּסְלֻחִ֗ים אֲשֶׁ֨ר יָצְא֥וּ מִשָּׁ֛ם פְּלִשְׁתִּ֖ים וְאֶת־כַּפְתֹּרִֽים׃ ס Genesis 10:13  Genesis 10:14  
10וּכְנַ֗עַן יָלַ֛ד אֶת־צִידֹ֥ן בְּכֹרֹ֖ו וְאֶת־חֵֽת׃ וְאֶת־הַיְבוּסִי֙ וְאֶת־הָ֣אֱמֹרִ֔י וְאֵ֖ת הַגִּרְגָּשִֽׁי׃ וְאֶת־הַֽחִוִּ֥י וְאֶת־הַֽעַרְקִ֖י וְאֶת־הַסִּינִֽי׃ וְאֶת־הָֽאַרְוָדִ֥י וְאֶת־הַצְּמָרִ֖י וְאֶת־הַֽחֲמָתִ֑י Genesis 10:15  Genesis 10:18  
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "A.table(spanSentences, end=10, withPassage={2, 3})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can zoom in:" ] }, { "cell_type": "code", "execution_count": 58, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:20:03.251841Z", "start_time": "2018-05-18T09:20:03.227631Z" } }, "outputs": [ { "data": { "text/html": [ "

result 6

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
verse
sentence
clause בְּעֶ֨צֶם הַיֹּ֤ום הַזֶּה֙ בָּ֣א נֹ֔חַ וְשֵׁם־וְחָ֥ם וָיֶ֖פֶת בְּנֵי־נֹ֑חַ וְאֵ֣שֶׁת נֹ֗חַ וּשְׁלֹ֧שֶׁת נְשֵֽׁי־בָנָ֛יו אִתָּ֖ם אֶל־הַתֵּבָֽה׃
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
verse
sentence
clause הֵ֜מָּה וְכָל־הַֽחַיָּ֣ה לְמִינָ֗הּ וְכָל־הַבְּהֵמָה֙ לְמִינָ֔הּ וְכָל־הָרֶ֛מֶשׂ
clause הָרֹמֵ֥שׂ עַל־הָאָ֖רֶץ
clause לְמִינֵ֑הוּ וְכָל־הָעֹ֣וף לְמִינֵ֔הוּ כֹּ֖ל צִפֹּ֥ור כָּל־כָּנָֽף׃
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "A.show(spanSentences, condensed=False, start=6, end=6, baseTypes={\"sentence_atom\"})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Ketiv Qere\n", "Let us explore where Ketiv/Qere pairs are and how they render." ] }, { "cell_type": "code", "execution_count": 59, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:20:09.687854Z", "start_time": "2018-05-18T09:20:09.498982Z" }, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1892 qeres\n", "3897: ketiv = \"*HWY>\"+\" \" qere = \"HAJ:Y;74>\"+\" \"\n", "4420: ketiv = \"*>HLH\"+\"00 \" qere = \">@H:@LO75W\"+\"00\"\n", "5645: ketiv = \"*>HLH\"+\" \" qere = \">@H:@LO92W\"+\" \"\n", "5912: ketiv = \"*>HLH\"+\" \" qere = \">@95H:@LOW03\"+\" \"\n", "6246: ketiv = \"*YBJJM\"+\" \" qere = \"Y:BOWJI80m\"+\" \"\n", "6354: ketiv = \"*YBJJM\"+\" \" qere = \"Y:BOWJI80m\"+\" \"\n", "11762: ketiv = \"*W-\"+\"\" qere = \"WA\"+\"\"\n", "11763: ketiv = \"*JJFM\"+\" \" qere = \"J.W.FA70m\"+\" \"\n", "12784: ketiv = \"*GJJM\"+\" \" qere = \"GOWJIm03\"+\" \"\n", "13685: ketiv = \"*YJDH\"+\"00 \" qere = \"Y@75JID\"+\"00\"\n" ] } ], "source": [ "qeres = [w for w in F.otype.s(\"word\") if F.qere.v(w) is not None]\n", "print(\"{} qeres\".format(len(qeres)))\n", "for w in qeres[0:10]:\n", " print(\n", " '{}: ketiv = \"{}\"+\"{}\" qere = \"{}\"+\"{}\"'.format(\n", " w,\n", " F.g_word.v(w),\n", " F.trailer.v(w),\n", " F.qere.v(w),\n", " F.qere_trailer.v(w),\n", " )\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Show a ketiv-qere pair\n", "Let us print all text representations of the verse in which the second qere occurs." 
] }, { "cell_type": "code", "execution_count": 60, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:20:11.158371Z", "start_time": "2018-05-18T09:20:11.149950Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Reference word is 4420\n", "Genesis 9:21\n", "text-orig-full וַיֵּ֥שְׁתְּ מִן־הַיַּ֖יִן וַיִּשְׁכָּ֑ר וַיִּתְגַּ֖ל בְּתֹ֥וךְ אָהֳלֹֽו׃\n", "text-orig-full-ketiv וַיֵּ֥שְׁתְּ מִן־הַיַּ֖יִן וַיִּשְׁכָּ֑ר וַיִּתְגַּ֖ל בְּתֹ֥וךְ אהלה׃ \n", "text-orig-plain וישׁת מן־היין וישׁכר ויתגל בתוך אהלה׃ \n", "text-phono-full wayyˌēšt min-hayyˌayin wayyiškˈār wayyiṯgˌal bᵊṯˌôḵ *ʔohᵒlˈô .\n", "text-trans-full WA-J.;71C:T.: MIN&HA-J.A73JIN WA-J.IC:K.@92R WA-J.IT:G.A73L B.:-TO71WK: >@H:@LO75W00\n", "text-trans-full-ketiv WA-J.;71C:T.: MIN&HA-J.A73JIN WA-J.IC:K.@92R WA-J.IT:G.A73L B.:-TO71WK: *>HLH00 \n", "text-trans-plain WJCT MN&HJJN WJCKR WJTGL BTWK >HLH00 \n" ] } ], "source": [ "refWord = qeres[1]\n", "print(f\"Reference word is {refWord}\")\n", "vn = L.u(refWord, otype=\"verse\")[0]\n", "print(\"{} {}:{}\".format(*T.sectionFromNode(refWord)))\n", "for fmt in sorted(T.formats):\n", " if fmt.startswith(\"text-\"):\n", " print(\"{:<25} {}\".format(fmt, T.text(vn, fmt=fmt, descend=True)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Edge features: mother\n", "\n", "We have not talked about edges much. 
If the nodes correspond to the rows in the big spreadsheet,\n", "the edges point from one row to another.\n", "\n", "One edge feature we have encountered already: the special feature `oslots`.\n", "Each non-slot node is linked by `oslots` to all of its slot nodes.\n", "\n", "An edge is really a feature as well.\n", "Whereas a node feature is a column of information,\n", "one cell per node,\n", "an edge feature is also a column of information, one cell per pair of nodes.\n", "\n", "Linguists use many more relationships between textual objects, for example\n", "linguistic dependency.\n", "In the BHSA all cases of linguistic dependency are coded in the edge feature `mother`.\n", "\n", "Let us make a few basic enquiries into an edge feature:\n", "[mother](https://etcbc.github.io/bhsa/features/hebrew/2017/mother).\n", "\n", "We count how many mothers a node can have (it turns out to be 0 or 1).\n", "We walk through all nodes, per node retrieve its mother nodes, and\n", "store the lengths (if non-zero) in a dictionary (`motherLen`).\n", "\n", "We see that nodes have at most one mother.\n", "\n", "We also count the inverse relationship: daughters.
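Conceptually, an edge feature is just a mapping from nodes to target nodes, together with its inverse: `E.mother.f(n)` gives the outgoing edges of `n`, `E.mother.t(n)` the incoming ones. A minimal plain-Python sketch of such a forward/inverse pair (hypothetical toy data, not the actual TF implementation):

```python
from collections import defaultdict

# Hypothetical toy edge feature: node -> tuple of its "mothers".
mother = {2: (1,), 3: (1,), 4: (2,)}

# Build the inverse relationship ("daughters") once, up front.
daughters = defaultdict(tuple)
for node, targets in mother.items():
    for target in targets:
        daughters[target] += (node,)

def mother_f(node):
    # Outgoing edges: roughly what E.mother.f(node) gives
    return mother.get(node, ())

def mother_t(node):
    # Incoming edges: roughly what E.mother.t(node) gives
    return daughters.get(node, ())

print(mother_f(4))  # the mothers of node 4: (2,)
print(mother_t(1))  # the daughters of node 1: (2, 3)
```

Precomputing the inverse once is what makes the `daughters` direction as cheap to query as the `mothers` direction, which the next cell exploits.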
] }, { "cell_type": "code", "execution_count": 61, "metadata": { "ExecuteTime": { "end_time": "2018-05-18T09:20:24.066854Z", "start_time": "2018-05-18T09:20:20.609907Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Counting mothers\n", " 0.73s 182269 nodes have mothers\n", " 0.73s 144112 nodes have daughters\n", "mothers Counter({1: 182269})\n", "daughters Counter({1: 117986, 2: 17370, 3: 6284, 4: 1851, 5: 470, 6: 125, 7: 21, 8: 5})\n" ] } ], "source": [ "A.indent(reset=True)\n", "A.info(\"Counting mothers\")\n", "\n", "motherLen = {}\n", "daughterLen = {}\n", "\n", "for c in N.walk():\n", " lms = E.mother.f(c) or []\n", " lds = E.mother.t(c) or []\n", " nms = len(lms)\n", " nds = len(lds)\n", " if nms:\n", " motherLen[c] = nms\n", " if nds:\n", " daughterLen[c] = nds\n", "\n", "A.info(\"{} nodes have mothers\".format(len(motherLen)))\n", "A.info(\"{} nodes have daughters\".format(len(daughterLen)))\n", "\n", "motherCount = collections.Counter()\n", "daughterCount = collections.Counter()\n", "\n", "for (n, lm) in motherLen.items():\n", " motherCount[lm] += 1\n", "for (n, ld) in daughterLen.items():\n", " daughterCount[ld] += 1\n", "\n", "print(\"mothers\", motherCount)\n", "print(\"daughters\", daughterCount)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Clean caches\n", "\n", "Text-Fabric pre-computes data for you, so that it can be loaded faster.\n", "If the original data is updated, Text-Fabric detects it, and will recompute that data.\n", "\n", "But there are cases, when the algorithms of Text-Fabric have changed, without any changes in the data, that you might\n", "want to clear the cache of precomputed results.\n", "\n", "There are two ways to do that:\n", "\n", "* Locate the `.tf` directory of your dataset, and remove all `.tfx` files in it.\n", " This might be a bit awkward to do, because the `.tf` directory is hidden on Unix-like systems.\n", "* Call `TF.clearCache()`, which does exactly the same.\n", 
"\n", "It is not handy to execute the following cell all the time, that's why I have commented it out.\n", "So if you really want to clear the cache, remove the comment sign below." ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [], "source": [ "# TF.clearCache()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# All steps\n", "\n", "By now you have an impression how to compute around in the Hebrew Bible.\n", "While this is still the beginning, I hope you already sense the power of unlimited programmatic access\n", "to all the bits and bytes in the data set.\n", "\n", "Here are a few directions for unleashing that power.\n", "\n", "* **start** your first step in mastering the bible computationally\n", "* **[display](display.ipynb)** become an expert in creating pretty displays of your text structures\n", "* **[search](search.ipynb)** turbo charge your hand-coding with search templates\n", "* **[export Excel](exportExcel.ipynb)** make tailor-made spreadsheets out of your results\n", "* **[share](share.ipynb)** draw in other people's data and let them use yours\n", "* **[export](export.ipynb)** export your dataset as an Emdros database\n", "* **[annotate](annotate.ipynb)** annotate plain text by means of other tools and import the annotations as TF features\n", "* **[map](map.ipynb)** map somebody else's annotations to a new version of the corpus\n", "* **[volumes](volumes.ipynb)** work with selected books only\n", "* **[trees](trees.ipynb)** work with the BHSA data as syntax trees\n", "\n", "CC-BY Dirk Roorda" ] } ], "metadata": { "jupytext": { "encoding": "# -*- coding: utf-8 -*-" }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.1" }, "toc": { 
"base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": "block", "toc_window_display": false }, "toc-autonumbering": false, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 4 }