{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n", "\n", "You might want to consider the [start](start.ipynb) of this tutorial.\n", "\n", "Short introductions to other TF datasets:\n", "\n", "* [Dead Sea Scrolls](https://nbviewer.jupyter.org/github/annotation/tutorials/blob/master/lorentz2020/dss.ipynb),\n", "* [Old Babylonian Letters](https://nbviewer.jupyter.org/github/annotation/tutorials/blob/master/lorentz2020/oldbabylonian.ipynb),\n", "or the\n", "* [Quran](https://nbviewer.jupyter.org/github/annotation/tutorials/blob/master/lorentz2020/quran.ipynb)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Export\n", "\n", "Text-Fabric is not a world to stay in forever.\n", "When you go to other worlds, you can travel with the corpus data in your backpack.\n", "\n", "Here we show two destinations (and one of them is also an origin):\n", "Pandas and Emdros.\n", "\n", "Before we go there, we load the corpus." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%load_ext autoreload\n", "%autoreload 2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Incantation\n", "\n", "The ins and outs of installing Text-Fabric, getting the corpus, and initializing a notebook are\n", "explained in the [start tutorial](start.ipynb)."
] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T10:06:39.818664Z", "start_time": "2018-05-24T10:06:39.796588Z" } }, "outputs": [], "source": [ "from tf.app import use" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2018-05-24T10:06:51.615044Z", "start_time": "2018-05-24T10:06:50.161456Z" } }, "outputs": [ { "data": { "text/markdown": [ "**Locating corpus resources ...**" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "app: ~/text-fabric-data/github/ETCBC/bhsa/app" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "data: ~/text-fabric-data/github/ETCBC/bhsa/tf/2021" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "data: ~/text-fabric-data/github/ETCBC/phono/tf/2021" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "data: ~/text-fabric-data/github/ETCBC/parallels/tf/2021" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", " TF: TF API 12.1.4, ETCBC/bhsa/app v3, Search Reference
\n", " Data: ETCBC - bhsa 2021, Character table, Feature docs
\n", "
Node types\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "
Name | # of nodes | # slots / node | % coverage
book | 39 | 10938.21 | 100
chapter | 929 | 459.19 | 100
lex | 9230 | 46.22 | 100
verse | 23213 | 18.38 | 100
half_verse | 45179 | 9.44 | 100
sentence | 63717 | 6.70 | 100
sentence_atom | 64514 | 6.61 | 100
clause | 88131 | 4.84 | 100
clause_atom | 90704 | 4.70 | 100
phrase | 253203 | 1.68 | 100
phrase_atom | 267532 | 1.59 | 100
subphrase | 113850 | 1.42 | 38
word | 426590 | 1.00 | 100
\n", " Sets: no custom sets
\n", " Features:
\n", "
Parallel Passages\n", "
\n", "\n", "
\n", "
\n", "crossref\n", "
\n", "
int
\n", "\n", " 🆗 links between similar passages\n", "\n", "
\n", "\n", "
\n", "
\n", "\n", "
BHSA = Biblia Hebraica Stuttgartensia Amstelodamensis\n", "
\n", "\n", "
\n", "
\n", "book\n", "
\n", "
str
\n", "\n", " ✅ book name in Latin (Genesis; Numeri; Reges1; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "book@ll\n", "
\n", "
str
\n", "\n", " ✅ book name in amharic (ኣማርኛ)\n", "\n", "
\n", "\n", "
\n", "
\n", "chapter\n", "
\n", "
int
\n", "\n", " ✅ chapter number (1; 2; 3; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "code\n", "
\n", "
int
\n", "\n", " ✅ identifier of a clause atom relationship (0; 74; 367; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "det\n", "
\n", "
str
\n", "\n", " ✅ determinedness of phrase(atom) (det; und; NA.)\n", "\n", "
\n", "\n", "
\n", "
\n", "domain\n", "
\n", "
str
\n", "\n", " ✅ text type of clause (? (Unknown); N (narrative); D (discursive); Q (Quotation).)\n", "\n", "
\n", "\n", "
\n", "
\n", "freq_lex\n", "
\n", "
int
\n", "\n", " ✅ frequency of lexemes\n", "\n", "
\n", "\n", "
\n", "
\n", "function\n", "
\n", "
str
\n", "\n", " ✅ syntactic function of phrase (Cmpl; Objc; Pred; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "g_cons\n", "
\n", "
str
\n", "\n", " ✅ word consonantal-transliterated (B R>CJT BR> >LHJM ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "g_cons_utf8\n", "
\n", "
str
\n", "\n", " ✅ word consonantal-Hebrew (ב ראשׁית ברא אלהים)\n", "\n", "
\n", "\n", "
\n", "
\n", "g_lex\n", "
\n", "
str
\n", "\n", " ✅ lexeme pointed-transliterated (B.:- R;>CIJT B.@R@> >:ELOH ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "g_lex_utf8\n", "
\n", "
str
\n", "\n", " ✅ lexeme pointed-Hebrew (בְּ רֵאשִׁית בָּרָא אֱלֹה)\n", "\n", "
\n", "\n", "
\n", "
\n", "g_word\n", "
\n", "
str
\n", "\n", " ✅ word pointed-transliterated (B.:- R;>CI73JT B.@R@74> >:ELOHI92JM)\n", "\n", "
\n", "\n", "
\n", "
\n", "g_word_utf8\n", "
\n", "
str
\n", "\n", " ✅ word pointed-Hebrew (בְּ רֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים)\n", "\n", "
\n", "\n", "
\n", "
\n", "gloss\n", "
\n", "
str
\n", "\n", " 🆗 English translation of lexeme (beginning create god(s))\n", "\n", "
\n", "\n", "
\n", "
\n", "gn\n", "
\n", "
str
\n", "\n", " ✅ grammatical gender (m; f; NA; unknown.)\n", "\n", "
\n", "\n", "
\n", "
\n", "label\n", "
\n", "
str
\n", "\n", " ✅ (half-)verse label (half verses: A; B; C; verses: GEN 01,02)\n", "\n", "
\n", "\n", "
\n", "
\n", "language\n", "
\n", "
str
\n", "\n", " ✅ of word or lexeme (Hebrew; Aramaic.)\n", "\n", "
\n", "\n", "
\n", "
\n", "lex\n", "
\n", "
str
\n", "\n", " ✅ lexeme consonantal-transliterated (B R>CJT/ BR>[ >LHJM/)\n", "\n", "
\n", "\n", "
\n", "
\n", "lex_utf8\n", "
\n", "
str
\n", "\n", " ✅ lexeme consonantal-Hebrew (ב ראשׁית֜ ברא אלהים֜)\n", "\n", "
\n", "\n", "
\n", "
\n", "ls\n", "
\n", "
str
\n", "\n", " ✅ lexical set, subclassification of part-of-speech (card; ques; mult)\n", "\n", "
\n", "\n", "
\n", "
\n", "nametype\n", "
\n", "
str
\n", "\n", " ⚠️ named entity type (pers; mens; gens; topo; ppde.)\n", "\n", "
\n", "\n", "
\n", "
\n", "nme\n", "
\n", "
str
\n", "\n", " ✅ nominal ending consonantal-transliterated (absent; n/a; JM, ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "nu\n", "
\n", "
str
\n", "\n", " ✅ grammatical number (sg; du; pl; NA; unknown.)\n", "\n", "
\n", "\n", "
\n", "
\n", "number\n", "
\n", "
int
\n", "\n", " ✅ sequence number of an object within its context\n", "\n", "
\n", "\n", "
\n", "
\n", "otype\n", "
\n", "
str
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "
\n", "pargr\n", "
\n", "
str
\n", "\n", " 🆗 hierarchical paragraph number (1; 1.2; 1.2.3.4; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "pdp\n", "
\n", "
str
\n", "\n", " ✅ phrase dependent part-of-speech (art; verb; subs; nmpr, ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "pfm\n", "
\n", "
str
\n", "\n", " ✅ preformative consonantal-transliterated (absent; n/a; J, ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "prs\n", "
\n", "
str
\n", "\n", " ✅ pronominal suffix consonantal-transliterated (absent; n/a; W; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "prs_gn\n", "
\n", "
str
\n", "\n", " ✅ pronominal suffix gender (m; f; NA; unknown.)\n", "\n", "
\n", "\n", "
\n", "
\n", "prs_nu\n", "
\n", "
str
\n", "\n", " ✅ pronominal suffix number (sg; du; pl; NA; unknown.)\n", "\n", "
\n", "\n", "
\n", "
\n", "prs_ps\n", "
\n", "
str
\n", "\n", " ✅ pronominal suffix person (p1; p2; p3; NA; unknown.)\n", "\n", "
\n", "\n", "
\n", "
\n", "ps\n", "
\n", "
str
\n", "\n", " ✅ grammatical person (p1; p2; p3; NA; unknown.)\n", "\n", "
\n", "\n", "
\n", "
\n", "qere\n", "
\n", "
str
\n", "\n", " ✅ word pointed-transliterated masoretic reading correction\n", "\n", "
\n", "\n", "
\n", "
\n", "qere_trailer\n", "
\n", "
str
\n", "\n", " ✅ interword material -pointed-transliterated (Masoretic correction)\n", "\n", "
\n", "\n", "
\n", "
\n", "qere_trailer_utf8\n", "
\n", "
str
\n", "\n", " ✅ interword material -pointed-transliterated (Masoretic correction)\n", "\n", "
\n", "\n", "
\n", "
\n", "qere_utf8\n", "
\n", "
str
\n", "\n", " ✅ word pointed-Hebrew masoretic reading correction\n", "\n", "
\n", "\n", "
\n", "
\n", "rank_lex\n", "
\n", "
int
\n", "\n", " ✅ ranking of lexemes based on frequency\n", "\n", "
\n", "\n", "
\n", "
\n", "rela\n", "
\n", "
str
\n", "\n", " ✅ linguistic relation between clause/(sub)phrase(atom) (ADJ; MOD; ATR; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "sp\n", "
\n", "
str
\n", "\n", " ✅ part-of-speech (art; verb; subs; nmpr, ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "st\n", "
\n", "
str
\n", "\n", " ✅ state of a noun (a (absolute); c (construct); e (emphatic).)\n", "\n", "
\n", "\n", "
\n", "
\n", "tab\n", "
\n", "
int
\n", "\n", " ✅ clause atom: its level in the linguistic embedding\n", "\n", "
\n", "\n", "
\n", "
\n", "trailer\n", "
\n", "
str
\n", "\n", " ✅ interword material pointed-transliterated (& 00 05 00_P ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "trailer_utf8\n", "
\n", "
str
\n", "\n", " ✅ interword material pointed-Hebrew (־ ׃)\n", "\n", "
\n", "\n", "
\n", "
\n", "txt\n", "
\n", "
str
\n", "\n", " ✅ text type of clause and surrounding (repetition of ? N D Q as in feature domain)\n", "\n", "
\n", "\n", "
\n", "
\n", "typ\n", "
\n", "
str
\n", "\n", " ✅ clause/phrase(atom) type (VP; NP; Ellp; Ptcp; WayX)\n", "\n", "
\n", "\n", "
\n", "
\n", "uvf\n", "
\n", "
str
\n", "\n", " ✅ univalent final consonant consonantal-transliterated (absent; N; J; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "vbe\n", "
\n", "
str
\n", "\n", " ✅ verbal ending consonantal-transliterated (n/a; W; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "vbs\n", "
\n", "
str
\n", "\n", " ✅ root formation consonantal-transliterated (absent; n/a; H; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "verse\n", "
\n", "
int
\n", "\n", " ✅ verse number\n", "\n", "
\n", "\n", "
\n", "
\n", "voc_lex\n", "
\n", "
str
\n", "\n", " ✅ vocalized lexeme pointed-transliterated (B.: R;>CIJT BR> >:ELOHIJM)\n", "\n", "
\n", "\n", "
\n", "
\n", "voc_lex_utf8\n", "
\n", "
str
\n", "\n", " ✅ vocalized lexeme pointed-Hebrew (בְּ רֵאשִׁית ברא אֱלֹהִים)\n", "\n", "
\n", "\n", "
\n", "
\n", "vs\n", "
\n", "
str
\n", "\n", " ✅ verbal stem (qal; piel; hif; apel; pael)\n", "\n", "
\n", "\n", "
\n", "
\n", "vt\n", "
\n", "
str
\n", "\n", " ✅ verbal tense (perf; impv; wayq; infc)\n", "\n", "
\n", "\n", "
\n", "
\n", "mother\n", "
\n", "
none
\n", "\n", " ✅ linguistic dependency between textual objects\n", "\n", "
\n", "\n", "
\n", "
\n", "oslots\n", "
\n", "
none
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "
\n", "\n", "
Phonetic Transcriptions\n", "
\n", "\n", "
\n", "
\n", "phono\n", "
\n", "
str
\n", "\n", " 🆗 phonological transcription (bᵊ rēšˌîṯ bārˈā ʔᵉlōhˈîm)\n", "\n", "
\n", "\n", "
\n", "
\n", "phono_trailer\n", "
\n", "
str
\n", "\n", " 🆗 interword material in phonological transcription\n", "\n", "
\n", "\n", "
\n", "
\n", "\n", " Settings:
specified
  1. apiVersion: 3
  2. appName: ETCBC/bhsa
  3. appPath: /Users/me/text-fabric-data/github/ETCBC/bhsa/app
  4. commit: gb112c161cfd21eae403d51a2733740d8743460e7
  5. css: ''
  6. dataDisplay:
    • exampleSectionHtml:<code>Genesis 1:1</code> (use <a href=\"https://github.com/{org}/{repo}/blob/master/tf/{version}/book%40en.tf\" target=\"_blank\">English book names</a>)
    • excludedFeatures:
      • g_uvf_utf8
      • g_vbs
      • kq_hybrid
      • languageISO
      • g_nme
      • lex0
      • is_root
      • g_vbs_utf8
      • g_uvf
      • dist
      • root
      • suffix_person
      • g_vbe
      • dist_unit
      • suffix_number
      • distributional_parent
      • kq_hybrid_utf8
      • crossrefSET
      • instruction
      • g_prs
      • lexeme_count
      • rank_occ
      • g_pfm_utf8
      • freq_occ
      • crossrefLCS
      • functional_parent
      • g_pfm
      • g_nme_utf8
      • g_vbe_utf8
      • kind
      • g_prs_utf8
      • suffix_gender
      • mother_object_type
    • noneValues:
      • absent
      • n/a
      • none
      • unknown
      • no value
      • NA
  7. docs:
    • docBase: {docRoot}/{repo}
    • docExt: ''
    • docPage: ''
    • docRoot: https://{org}.github.io
    • featurePage: 0_home
  8. interfaceDefaults: {}
  9. isCompatible: True
  10. local: local
  11. localDir: /Users/me/text-fabric-data/github/ETCBC/bhsa/_temp
  12. provenanceSpec:
    • corpus: BHSA = Biblia Hebraica Stuttgartensia Amstelodamensis
    • doi: 10.5281/zenodo.1007624
    • extraData: ner
    • moduleSpecs:
      • :
        • backend: no value
        • corpus: Phonetic Transcriptions
        • docUrl:https://nbviewer.jupyter.org/github/etcbc/phono/blob/master/programs/phono.ipynb
        • doi: 10.5281/zenodo.1007636
        • org: ETCBC
        • relative: /tf
        • repo: phono
      • :
        • backend: no value
        • corpus: Parallel Passages
        • docUrl:https://nbviewer.jupyter.org/github/ETCBC/parallels/blob/master/programs/parallels.ipynb
        • doi: 10.5281/zenodo.1007642
        • org: ETCBC
        • relative: /tf
        • repo: parallels
    • org: ETCBC
    • relative: /tf
    • repo: bhsa
    • version: 2021
    • webBase: https://shebanq.ancient-data.org/hebrew
    • webHint: Show this on SHEBANQ
    • webLang: la
    • webLexId: True
    • webUrl:{webBase}/text?book=<1>&chapter=<2>&verse=<3>&version={version}&mr=m&qw=q&tp=txt_p&tr=hb&wget=v&qget=v&nget=vt
    • webUrlLex: {webBase}/word?version={version}&id=<lid>
  13. release: v1.8.1
  14. typeDisplay:
    • clause:
      • label: {typ} {rela}
      • style: ''
    • clause_atom:
      • hidden: True
      • label: {code}
      • level: 1
      • style: ''
    • half_verse:
      • hidden: True
      • label: {label}
      • style: ''
      • verselike: True
    • lex:
      • featuresBare: gloss
      • label: {voc_lex_utf8}
      • lexOcc: word
      • style: orig
      • template: {voc_lex_utf8}
    • phrase:
      • label: {typ} {function}
      • style: ''
    • phrase_atom:
      • hidden: True
      • label: {typ} {rela}
      • level: 1
      • style: ''
    • sentence:
      • label: {number}
      • style: ''
    • sentence_atom:
      • hidden: True
      • label: {number}
      • level: 1
      • style: ''
    • subphrase:
      • hidden: True
      • label: {number}
      • style: ''
    • word:
      • features: pdp vs vt
      • featuresBare: lex:gloss
  15. writing: hbo
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
TF API: names N F E L T S C TF Fs Fall Es Eall Cs Call directly usable

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "A = use(\"ETCBC/bhsa\", hoist=globals())" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "# Pandas\n", "\n", "The first journey is to \n", "[Pandas](https://pandas.pydata.org).\n", "\n", "We convert the data to a data frame, via a tab-separated text file.\n", "\n", "The nodes are exported as rows; they correspond to the text objects such as word, phrase, clause, sentence, verse, chapter, book and a few others.\n", "\n", "The BHSA features become the columns, so each row tells what values the features have for the corresponding node.\n", "\n", "The edges corresponding to the BHSA features `mother`, `functional_parent`, `distributional_parent` are\n", "exported as extra columns. For each row, such a column indicates the target of a corresponding outgoing edge.\n", "\n", "We also write the data that says which objects are contained in which.\n", "To each row we add the following columns:\n", "\n", "* for each node type there is a column named `in_` followed by that node type (for example `in_verse`);\n", " the value in that column is the node of this type that contains the row node (if any).\n", "\n", "Extra data such as lexicon (including frequency and rank features), phonetic transcription, and ketiv-qere are also included." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "While exporting the data to Pandas format, the program\n", "composes the big table and saves it as a tab-delimited file.\n", "This is stored in a temporary directory (not visible on GitHub).\n", "\n", "This temporary file can also be read by R, but we proceed with Pandas.\n", "Pandas offers functions in the same spirit as R, but is more Pythonic and also faster."
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Create tsv file ...\n", " | 2.96s 5% 72342 nodes written\n", " | 5.89s 10% 144684 nodes written\n", " | 8.80s 15% 217026 nodes written\n", " | 12s 20% 289368 nodes written\n", " | 15s 25% 361710 nodes written\n", " | 18s 30% 434052 nodes written\n", " | 21s 35% 506394 nodes written\n", " | 24s 40% 578736 nodes written\n", " | 27s 45% 651078 nodes written\n", " | 30s 50% 723420 nodes written\n", " | 33s 55% 795762 nodes written\n", " | 36s 60% 868104 nodes written\n", " | 39s 65% 940446 nodes written\n", " | 42s 70% 1012788 nodes written\n", " | 45s 75% 1085130 nodes written\n", " | 48s 80% 1157472 nodes written\n", " | 50s 85% 1229814 nodes written\n", " | 53s 90% 1302156 nodes written\n", " | 56s 95% 1374498 nodes written\n", " | 59s 95% 1446831 nodes written and done\n", " 59s TSV file is ~/text-fabric-data/github/ETCBC/bhsa/_temp/data-2021.tsv\n", " 59s Columns 72:\n", " 59s \tnd\n", " 59s \totype\n", " 59s \tg_cons\n", " 59s \tg_cons_utf8\n", " 59s \tg_lex\n", " 59s \tg_lex_utf8\n", " 59s \tg_word\n", " 59s \tg_word_utf8\n", " 59s \tlex\n", " 59s \tlex_utf8\n", " 59s \tphono\n", " 59s \tphono_trailer\n", " 59s \tqere\n", " 59s \tqere_trailer\n", " 59s \tqere_trailer_utf8\n", " 59s \tqere_utf8\n", " 59s \ttrailer\n", " 59s \ttrailer_utf8\n", " 59s \tvoc_lex_utf8\n", " 59s \tin_book\n", " 59s \tin_chapter\n", " 59s \tin_verse\n", " 59s \tin_lex\n", " 59s \tin_half_verse\n", " 59s \tin_sentence\n", " 59s \tin_sentence_atom\n", " 59s \tin_clause\n", " 59s \tin_clause_atom\n", " 59s \tin_phrase\n", " 59s \tin_phrase_atom\n", " 59s \tin_subphrase\n", " 59s \tin_word\n", " 59s \tcrossref\n", " 59s \tmother\n", " 59s \tbook\n", " 59s \tchapter\n", " 59s \tcode\n", " 59s \tdet\n", " 59s \tdomain\n", " 59s \tfreq_lex\n", " 59s \tfunction\n", " 59s \tgloss\n", " 59s \tgn\n", " 59s \tlabel\n", " 59s \tlanguage\n", " 59s \tls\n", " 
59s \tnametype\n", " 59s \tnme\n", " 59s \tnu\n", " 59s \tnumber\n", " 59s \tpargr\n", " 59s \tpdp\n", " 59s \tpfm\n", " 59s \tprs\n", " 59s \tprs_gn\n", " 59s \tprs_nu\n", " 59s \tprs_ps\n", " 59s \tps\n", " 59s \trank_lex\n", " 59s \trela\n", " 59s \tsp\n", " 59s \tst\n", " 59s \ttab\n", " 59s \ttxt\n", " 59s \ttyp\n", " 59s \tuvf\n", " 59s \tvbe\n", " 59s \tvbs\n", " 59s \tverse\n", " 59s \tvoc_lex\n", " 59s \tvs\n", " 59s \tvt\n", "\n", " 1m 00s \t1446832 rows\n", " 1m 00s \t273843208 characters\n", " 1m 00s Importing into Pandas ...\n", " | 0.00s Reading tsv file ...\n", " | 13s Done. Size = 104171832\n", " | 13s Saving as Parquet file ...\n", " | 19s Saved\n", " 1m 19s PD in ~/text-fabric-data/github/ETCBC/bhsa/pandas/data-2021.pd\n" ] } ], "source": [ "A.exportPandas()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## How to use the Pandas file\n", "\n", "See\n", "[pandas](pandas.ipynb)\n", "for a tutorial on how to work with the BHSA as a data frame.\n", "\n", "We collect a few pieces of data that will come in handy.\n", "\n", "Here is the first verse node:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1414389" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "F.otype.s(\"verse\")[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# MQL\n", "\n", "The next journey is to MQL, a text-database format not unlike SQL, supported by the Emdros software.\n", "\n", "[EMDROS](http://emdros.org), written by Ulrik Petersen,\n", "is a text database system with the powerful *topographic* query language MQL.\n", "The ideas are based on a model devised by Christ-Jan Doedens in\n", "[Text Databases: One Database Model and Several Retrieval Languages](https://books.google.nl/books?id=9ggOBRz1dO4C).\n", "\n", "Text-Fabric's model of slots, nodes and edges is a fairly straightforward translation of the models of Christ-Jan Doedens and Ulrik
Petersen.\n", "\n", "[SHEBANQ](https://shebanq.ancient-data.org) uses EMDROS to let users run and save MQL queries against the Hebrew Text Database of the ETCBC.\n", "\n", "So it is logical and convenient to be able to work with a Text-Fabric resource through MQL.\n", "\n", "If you have obtained an MQL dataset somehow, you can turn it into a Text-Fabric data set by `importMQL()`,\n", "which we will not show here.\n", "\n", "And if you want to export a Text-Fabric data set to MQL, that is also possible.\n", "\n", "Once the corpus is loaded, you can call `A.exportMQL()` in order to save all features of the\n", "loaded modules into a big MQL dump, which can be imported by an EMDROS database." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "ExecuteTime": { "end_time": "2018-02-15T09:27:12.673630Z", "start_time": "2018-02-15T09:25:52.241804Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Checking features of dataset mybhsa\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ " | 4m 45s feature \"book@am\" => \"book_am\"\n", " | 4m 45s feature \"book@ar\" => \"book_ar\"\n", " | 4m 45s feature \"book@bn\" => \"book_bn\"\n", " | 4m 45s feature \"book@da\" => \"book_da\"\n", " | 4m 45s feature \"book@de\" => \"book_de\"\n", " | 4m 45s feature \"book@el\" => \"book_el\"\n", " | 4m 45s feature \"book@en\" => \"book_en\"\n", " | 4m 45s feature \"book@es\" => \"book_es\"\n", " | 4m 45s feature \"book@fa\" => \"book_fa\"\n", " | 4m 45s feature \"book@fr\" => \"book_fr\"\n", " | 4m 45s feature \"book@he\" => \"book_he\"\n", " | 4m 45s feature \"book@hi\" => \"book_hi\"\n", " | 4m 45s feature \"book@id\" => \"book_id\"\n", " | 4m 45s feature \"book@ja\" => \"book_ja\"\n", " | 4m 45s feature \"book@ko\" => \"book_ko\"\n", " | 4m 45s feature \"book@la\" => \"book_la\"\n", " | 4m 45s feature \"book@nl\" => \"book_nl\"\n", " | 4m 45s feature \"book@pa\" => \"book_pa\"\n", " | 4m 45s feature
\"book@pt\" => \"book_pt\"\n", " | 4m 45s feature \"book@ru\" => \"book_ru\"\n", " | 4m 45s feature \"book@sw\" => \"book_sw\"\n", " | 4m 45s feature \"book@syc\" => \"book_syc\"\n", " | 4m 45s feature \"book@tr\" => \"book_tr\"\n", " | 4m 45s feature \"book@ur\" => \"book_ur\"\n", " | 4m 45s feature \"book@yo\" => \"book_yo\"\n", " | 4m 45s feature \"book@zh\" => \"book_zh\"\n", " | 4m 45s feature \"omap@2017-2021\" => \"omap_2017_2021\"\n", " | 4m 45s feature \"omap@c-2021\" => \"omap_c_2021\"\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ " 0.02s 118 features to export to MQL ...\n", " 0.02s Loading 118 features\n", " | 0.07s T crossrefLCS from ~/text-fabric-data/github/ETCBC/parallels/tf/2021\n", " | 0.04s T crossrefSET from ~/text-fabric-data/github/ETCBC/parallels/tf/2021\n", " | 1.20s T dist_unit from ~/text-fabric-data/github/ETCBC/bhsa/tf/2021\n", " | 3.18s T distributional_parent from ~/text-fabric-data/github/ETCBC/bhsa/tf/2021\n", " | 0.77s T freq_occ from ~/text-fabric-data/github/ETCBC/bhsa/tf/2021\n", " | 3.98s T functional_parent from ~/text-fabric-data/github/ETCBC/bhsa/tf/2021\n", " | 0.77s T g_nme from ~/text-fabric-data/github/ETCBC/bhsa/tf/2021\n", " | 0.79s T g_nme_utf8 from ~/text-fabric-data/github/ETCBC/bhsa/tf/2021\n", " | 0.71s T g_pfm from ~/text-fabric-data/github/ETCBC/bhsa/tf/2021\n", " | 0.73s T g_pfm_utf8 from ~/text-fabric-data/github/ETCBC/bhsa/tf/2021\n", " | 0.71s T g_prs from ~/text-fabric-data/github/ETCBC/bhsa/tf/2021\n", " | 0.71s T g_prs_utf8 from ~/text-fabric-data/github/ETCBC/bhsa/tf/2021\n", " | 0.68s T g_uvf from ~/text-fabric-data/github/ETCBC/bhsa/tf/2021\n", " | 0.68s T g_uvf_utf8 from ~/text-fabric-data/github/ETCBC/bhsa/tf/2021\n", " | 0.72s T g_vbe from ~/text-fabric-data/github/ETCBC/bhsa/tf/2021\n", " | 0.70s T g_vbe_utf8 from ~/text-fabric-data/github/ETCBC/bhsa/tf/2021\n", " | 0.70s T g_vbs from ~/text-fabric-data/github/ETCBC/bhsa/tf/2021\n", " | 0.68s T g_vbs_utf8 from 
~/text-fabric-data/github/ETCBC/bhsa/tf/2021\n", " | 0.18s T instruction from ~/text-fabric-data/github/ETCBC/bhsa/tf/2021\n", " | 0.18s T is_root from ~/text-fabric-data/github/ETCBC/bhsa/tf/2021\n", " | 0.17s T kind from ~/text-fabric-data/github/ETCBC/bhsa/tf/2021\n", " | 0.68s T kq_hybrid from ~/text-fabric-data/github/ETCBC/bhsa/tf/2021\n", " | 0.69s T kq_hybrid_utf8 from ~/text-fabric-data/github/ETCBC/bhsa/tf/2021\n", " | 0.84s T languageISO from ~/text-fabric-data/github/ETCBC/bhsa/tf/2021\n", " | 0.94s T lex0 from ~/text-fabric-data/github/ETCBC/bhsa/tf/2021\n", " | 0.76s T lexeme_count from ~/text-fabric-data/github/ETCBC/bhsa/tf/2021\n", " | 0.40s T mother_object_type from ~/text-fabric-data/github/ETCBC/bhsa/tf/2021\n", " | 6.39s T omap@2017-2021 from ~/text-fabric-data/github/ETCBC/bhsa/tf/2021\n", " | 6.31s T omap@c-2021 from ~/text-fabric-data/github/ETCBC/bhsa/tf/2021\n", " | 0.75s T rank_occ from ~/text-fabric-data/github/ETCBC/bhsa/tf/2021\n", " | 0.17s T root from ~/text-fabric-data/github/ETCBC/bhsa/tf/2021\n", " | 0.83s T suffix_gender from ~/text-fabric-data/github/ETCBC/bhsa/tf/2021\n", " | 0.83s T suffix_number from ~/text-fabric-data/github/ETCBC/bhsa/tf/2021\n", " | 0.82s T suffix_person from ~/text-fabric-data/github/ETCBC/bhsa/tf/2021\n", " 39s Writing enumerations\n", "\tbook_am : 39 values, 39 not a name, e.g. «መኃልየ_መኃልይ_ዘሰሎሞን»\n", "\tbook_ar : 39 values, 39 not a name, e.g. «1_اخبار»\n", "\tbook_bn : 39 values, 39 not a name, e.g. «আদিপুস্তক»\n", "\tbook_da : 39 values, 13 not a name, e.g. «1.Kongebog»\n", "\tbook_de : 39 values, 7 not a name, e.g. «1_Chronik»\n", "\tbook_el : 39 values, 39 not a name, e.g. «Άσμα_Ασμάτων»\n", "\tbook_en : 39 values, 6 not a name, e.g. «1_Chronicles»\n", "\tbook_es : 39 values, 22 not a name, e.g. «1_Crónicas»\n", "\tbook_fa : 39 values, 39 not a name, e.g. «استر»\n", "\tbook_fr : 39 values, 19 not a name, e.g. «1_Chroniques»\n", "\tbook_he : 39 values, 39 not a name, e.g. 
«איוב»\n", "\tbook_hi : 39 values, 39 not a name, e.g. «1_इतिहास»\n", "\tbook_id : 39 values, 7 not a name, e.g. «1_Raja-raja»\n", "\tbook_ja : 39 values, 39 not a name, e.g. «アモス書»\n", "\tbook_ko : 39 values, 39 not a name, e.g. «나훔»\n", "\tbook_nl : 39 values, 8 not a name, e.g. «1_Koningen»\n", "\tbook_pa : 39 values, 39 not a name, e.g. «1_ਇਤਹਾਸ»\n", "\tbook_pt : 39 values, 21 not a name, e.g. «1_Crônicas»\n", "\tbook_ru : 39 values, 39 not a name, e.g. «1-я_Паралипоменон»\n", "\tbook_sw : 39 values, 6 not a name, e.g. «1_Mambo_ya_Nyakati»\n", "\tbook_syc : 39 values, 39 not a name, e.g. «ܐ_ܒܪܝܡܝܢ»\n", "\tbook_tr : 39 values, 16 not a name, e.g. «1_Krallar»\n", "\tbook_ur : 39 values, 39 not a name, e.g. «احبار»\n", "\tbook_yo : 39 values, 8 not a name, e.g. «Amọsi»\n", "\tbook_zh : 38 values, 37 not a name, e.g. «以斯帖记»\n", "\tdomain : 4 values, 1 not a name, e.g. «?»\n", "\tg_nme : 108 values, 108 not a name, e.g. «»\n", "\tg_nme_utf8 : 106 values, 106 not a name, e.g. «»\n", "\tg_pfm : 87 values, 87 not a name, e.g. «»\n", "\tg_pfm_utf8 : 86 values, 86 not a name, e.g. «»\n", "\tg_prs : 127 values, 127 not a name, e.g. «»\n", "\tg_prs_utf8 : 126 values, 126 not a name, e.g. «»\n", "\tg_uvf : 19 values, 19 not a name, e.g. «»\n", "\tg_uvf_utf8 : 17 values, 17 not a name, e.g. «»\n", "\tg_vbe : 101 values, 101 not a name, e.g. «»\n", "\tg_vbe_utf8 : 97 values, 97 not a name, e.g. «»\n", "\tg_vbs : 66 values, 66 not a name, e.g. «»\n", "\tg_vbs_utf8 : 65 values, 65 not a name, e.g. «»\n", "\tinstruction : 35 values, 20 not a name, e.g. «.#»\n", "\tnametype : 10 values, 5 not a name, e.g. «gens,topo»\n", "\tnme : 20 values, 7 not a name, e.g. «»\n", "\tpfm : 11 values, 4 not a name, e.g. «»\n", "\tphono_trailer : 4 values, 4 not a name, e.g. «»\n", "\tprs : 22 values, 4 not a name, e.g. «H=»\n", "\tqere_trailer : 5 values, 5 not a name, e.g. «»\n", "\tqere_trailer_utf8: 5 values, 5 not a name, e.g. «»\n", "\troot : 757 values, 212 not a name, e.g. 
«»\n", "\ttrailer : 13 values, 13 not a name, e.g. «»\n", "\ttrailer_utf8 : 13 values, 13 not a name, e.g. «»\n", "\ttxt : 136 values, 59 not a name, e.g. «?»\n", "\tuvf : 6 values, 1 not a name, e.g. «>»\n", "\tvbe : 19 values, 6 not a name, e.g. «»\n", "\tvbs : 11 values, 3 not a name, e.g. «>»\n", " | 0.36s Writing an all-in-one enum with 232 values\n", " 39s Mapping 118 features onto 13 object types\n", " 42s Writing 118 features as data in 13 object types\n", " | 0.00s word data ...\n", " | | 1.24s batch of size 49.9MB with 50000 of 50000 words\n", " | | 2.49s batch of size 50.0MB with 50000 of 100000 words\n", " | | 3.74s batch of size 50.2MB with 50000 of 150000 words\n", " | | 4.99s batch of size 50.2MB with 50000 of 200000 words\n", " | | 6.24s batch of size 50.4MB with 50000 of 250000 words\n", " | | 7.50s batch of size 50.4MB with 50000 of 300000 words\n", " | | 8.76s batch of size 50.5MB with 50000 of 350000 words\n", " | | 10s batch of size 50.4MB with 50000 of 400000 words\n", " | | 11s batch of size 26.8MB with 26590 of 426590 words\n", " | 11s word data: 426590 objects\n", " | 0.00s subphrase data ...\n", " | | 0.18s batch of size 8.6MB with 50000 of 50000 subphrases\n", " | | 0.35s batch of size 8.5MB with 50000 of 100000 subphrases\n", " | | 0.40s batch of size 2.4MB with 13850 of 113850 subphrases\n", " | 0.40s subphrase data: 113850 objects\n", " | 0.00s phrase_atom data ...\n", " | | 0.26s batch of size 12.0MB with 50000 of 50000 phrase_atoms\n", " | | 0.51s batch of size 12.0MB with 50000 of 100000 phrase_atoms\n", " | | 0.77s batch of size 12.2MB with 50000 of 150000 phrase_atoms\n", " | | 1.03s batch of size 12.2MB with 50000 of 200000 phrase_atoms\n", " | | 1.28s batch of size 12.1MB with 50000 of 250000 phrase_atoms\n", " | | 1.37s batch of size 4.3MB with 17532 of 267532 phrase_atoms\n", " | 1.37s phrase_atom data: 267532 objects\n", " | 0.00s phrase data ...\n", " | | 0.23s batch of size 10.9MB with 50000 of 50000 phrases\n", " | | 0.45s 
batch of size 11.0MB with 50000 of 100000 phrases\n", " | | 0.68s batch of size 11.0MB with 50000 of 150000 phrases\n", " | | 0.92s batch of size 11.0MB with 50000 of 200000 phrases\n", " | | 1.15s batch of size 11.0MB with 50000 of 250000 phrases\n", " | | 1.16s batch of size 724.2KB with 3203 of 253203 phrases\n", " | 1.16s phrase data: 253203 objects\n", " | 0.00s clause_atom data ...\n", " | | 0.32s batch of size 14.4MB with 50000 of 50000 clause_atoms\n", " | | 0.59s batch of size 11.7MB with 40704 of 90704 clause_atoms\n", " | 0.59s clause_atom data: 90704 objects\n", " | 0.00s clause data ...\n", " | | 0.28s batch of size 13.3MB with 50000 of 50000 clauses\n", " | | 0.49s batch of size 10.2MB with 38131 of 88131 clauses\n", " | 0.49s clause data: 88131 objects\n", " | 0.00s sentence_atom data ...\n", " | | 0.18s batch of size 7.8MB with 50000 of 50000 sentence_atoms\n", " | | 0.23s batch of size 2.3MB with 14514 of 64514 sentence_atoms\n", " | 0.23s sentence_atom data: 64514 objects\n", " | 0.00s sentence data ...\n", " | | 0.14s batch of size 6.3MB with 50000 of 50000 sentences\n", " | | 0.18s batch of size 1.7MB with 13717 of 63717 sentences\n", " | 0.18s sentence data: 63717 objects\n", " | 0.00s half_verse data ...\n", " | | 0.13s batch of size 5.5MB with 45179 of 45179 half_verses\n", " | 0.13s half_verse data: 45179 objects\n", " | 0.00s verse data ...\n", " | | 0.12s batch of size 4.8MB with 23213 of 23213 verses\n", " | 0.12s verse data: 23213 objects\n", " | 0.00s lex data ...\n", " | | 0.16s batch of size 5.5MB with 9230 of 9230 lexs\n", " | 0.16s lex data: 9230 objects\n", " | 0.00s chapter data ...\n", " | | 0.02s batch of size 131.1KB with 929 of 929 chapters\n", " | 0.02s chapter data: 929 objects\n", " | 0.00s book data ...\n", " | | 0.02s batch of size 29.2KB with 39 of 39 books\n", " | 0.02s book data: 39 objects\n", " 57s MQL in ~/Downloads/mql\n", " 57s Done\n" ] } ], "source": [ "A.exportMQL(\"mybhsa\", exportDir=\"~/Downloads/mql\")" ] 
}, { "cell_type": "markdown", "metadata": {}, "source": [ "Now you have a file `~/Downloads/mql/mybhsa.mql` of 530 MB.\n", "You can import it into an Emdros database by saying:\n", "\n", "```\n", "cd ~/Downloads/mql\n", "rm -f mybhsa\n", "mql -b 3 < mybhsa.mql\n", "```\n", "\n", "The `rm` step removes a database left over from an earlier import; the dump file `mybhsa.mql` itself must stay in place.\n", "The result is an SQLite3 database `mybhsa` in the same directory (168 MB).\n", "You can run a query against it by creating a text file `test.mql` with these contents:\n", "\n", "```\n", "select all objects where\n", "[lex gloss ~ 'make'\n", " [word FOCUS]\n", "]\n", "```\n", "\n", "And then say\n", "\n", "```\n", "mql -b 3 -d mybhsa test.mql\n", "```\n", "\n", "You will see raw query results: all word occurrences that belong to lexemes with `make` in their gloss.\n", "\n", "It is not very pretty, and you should probably use a more visual Emdros tool to run those queries.\n", "You see a lot of node numbers, but the good thing is that you can look those node numbers up in Text-Fabric." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# All steps\n", "\n", "* **[start](start.ipynb)** your first step in mastering the bible computationally\n", "* **[display](display.ipynb)** become an expert in creating pretty displays of your text structures\n", "* **[search](search.ipynb)** turbo charge your hand-coding with search templates\n", "* **[export Excel](exportExcel.ipynb)** make tailor-made spreadsheets out of your results\n", "* **[share](share.ipynb)** draw in other people's data and let them use yours\n", "* **export** export your dataset as an Emdros database\n", "* **[annotate](annotate.ipynb)** annotate plain text by means of other tools and import the annotations as TF features\n", "* **[map](map.ipynb)** map somebody else's annotations to a new version of the corpus\n", "* **[volumes](volumes.ipynb)** work with selected books only\n", "* **[trees](trees.ipynb)** work with the BHSA data as syntax trees\n", "\n", "CC-BY Dirk Roorda" ] } ], "metadata": { "kernelspec": { "display_name":
"Python 3.11 (ipykernel)", "language": "python", "name": "p311" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.5" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 4 }