{ "cells": [ { "cell_type": "markdown", "id": "9ed82600-1b03-483a-96ae-a5445e1c8856", "metadata": {}, "source": [ "# A minimalistic BHSA\n", "\n", "We create BHSA-min out of BHSA by removing certain nodetypes and features.\n", "\n", "We use the [modify()](https://annotation.github.io/text-fabric/tf/dataset/modify.html) function." ] }, { "cell_type": "code", "execution_count": 1, "id": "1556d8a4-d94f-449e-a022-a43d18151b23", "metadata": {}, "outputs": [], "source": [ "%load_ext autoreload\n", "%autoreload 2" ] }, { "cell_type": "code", "execution_count": 2, "id": "b61aa2c9-7fbb-4364-8c60-bb69693c1b88", "metadata": {}, "outputs": [], "source": [ "import collections\n", "from tf.app import use\n", "from tf.dataset import modify\n", "from tf.core.files import initTree" ] }, { "cell_type": "code", "execution_count": 3, "id": "b7e47c0a-0389-4256-86fd-9252a27d2da5", "metadata": {}, "outputs": [], "source": [ "BASE = \"~/github\"\n", "ORG = \"ETCBC\"\n", "REPO = \"bhsa\"\n", "REPO_MIN = \"bhsa-min\"\n", "RELATIVE = \"/tf\"\n", "VERSION = \"2021\"" ] }, { "cell_type": "markdown", "id": "ca805274-692a-4091-a29e-dd323f9d0f6a", "metadata": {}, "source": [ "We remove a big part of the hierarchy." ] }, { "cell_type": "code", "execution_count": 4, "id": "3af43a9e-69c8-4f5b-9d68-4a760606004c", "metadata": {}, "outputs": [], "source": [ "deleteTypes = \"\"\"\n", " lex\n", " subphrase\n", " phrase_atom\n", " clause_atom\n", " sentence_atom\n", " half_verse\n", "\"\"\".strip().split()" ] }, { "cell_type": "markdown", "id": "9c141924-036c-4558-98da-6c0aa855653f", "metadata": {}, "source": [ "We remove a large number of relatively obscure features." 
] }, { "cell_type": "code", "execution_count": 5, "id": "812d0ce4-4f9a-4375-9d47-7d8cbdcab92e", "metadata": {}, "outputs": [], "source": [ "deleteFeatures = \"\"\"\n", " book@am\n", " book@ar\n", " book@bn\n", " book@da\n", " book@de\n", " book@el\n", " book@es\n", " book@fa\n", " book@fr\n", " book@he\n", " book@hi\n", " book@id\n", " book@ja\n", " book@ko\n", " book@la\n", " book@nl\n", " book@pa\n", " book@pt\n", " book@ru\n", " book@sw\n", " book@syc\n", " book@tr\n", " book@ur\n", " book@yo\n", " book@zh\n", " dist\n", " dist_unit\n", " distributional_parent\n", " freq_occ\n", " functional_parent\n", " g_lex\n", " g_lex_utf8\n", " g_nme\n", " g_nme_utf8\n", " g_pfm\n", " g_pfm_utf8\n", " g_prs\n", " g_prs_utf8\n", " g_uvf\n", " g_uvf_utf8\n", " g_vbe\n", " g_vbe_utf8\n", " g_vbs\n", " g_vbs_utf8\n", " is_root\n", " kq_hybrid\n", " kq_hybrid_utf8\n", " languageISO\n", " lex0\n", " lexeme_count\n", " mother_object_type\n", " number\n", " omap@2017-2021\n", " omap@c-2021\n", " rank_lex\n", " rank_occ\n", " suffix_gender\n", " suffix_number\n", " suffix_person\n", " voc_lex\n", " voc_lex_utf8\n", "\"\"\".strip().split()\n" ] }, { "cell_type": "markdown", "id": "94010b71-cf44-419c-8bc6-7ed7b4e290e1", "metadata": {}, "source": [ "We replace pseudo none values by real None values in all remaining features.\n", "\n", "For this we load the original BHSA:" ] }, { "cell_type": "code", "execution_count": 6, "id": "24c1bc88-87d3-4eee-8991-00b1189367a3", "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "**Locating corpus resources ...**" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "app: ~/github/ETCBC/bhsa/app" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "data: ~/github/ETCBC/bhsa/tf/2021" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "data: ~/github/ETCBC/phono/tf/2021" ], "text/plain": [ 
"" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "data: ~/github/ETCBC/parallels/tf/2021" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", " Text-Fabric: Text-Fabric API 11.4.9, ETCBC/bhsa/app v3, Search Reference
\n", " Data: ETCBC - bhsa 2021, Character table, Feature docs
\n", "
Node types\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "
Name# of nodes# slots/node% coverage
book3910938.21100
chapter929459.19100
lex923046.22100
verse2321318.38100
half_verse451799.44100
sentence637176.70100
sentence_atom645146.61100
clause881314.84100
clause_atom907044.70100
phrase2532031.68100
phrase_atom2675321.59100
subphrase1138501.4238
word4265901.00100
\n", " Sets: no custom sets
\n", " Features:
\n", "
Parallel Passages\n", "
\n", "\n", "
\n", "
\n", "crossref\n", "
\n", "
int
\n", "\n", " 🆗 links between similar passages\n", "\n", "
\n", "\n", "
\n", "
\n", "\n", "
BHSA = Biblia Hebraica Stuttgartensia Amstelodamensis\n", "
\n", "\n", "
\n", "
\n", "book\n", "
\n", "
str
\n", "\n", " ✅ book name in Latin (Genesis; Numeri; Reges1; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "book@ll\n", "
\n", "
str
\n", "\n", " ✅ book name in amharic (አማርኛ)\n", "\n", "
\n", "\n", "
\n", "
\n", "chapter\n", "
\n", "
int
\n", "\n", " ✅ chapter number (1; 2; 3; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "code\n", "
\n", "
int
\n", "\n", " ✅ identifier of a clause atom relationship (0; 74; 367; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "det\n", "
\n", "
str
\n", "\n", " ✅ determinedness of phrase(atom) (det; und; NA.)\n", "\n", "
\n", "\n", "
\n", "
\n", "domain\n", "
\n", "
str
\n", "\n", " ✅ text type of clause (? (Unknown); N (narrative); D (discursive); Q (Quotation).)\n", "\n", "
\n", "\n", "
\n", "
\n", "freq_lex\n", "
\n", "
int
\n", "\n", " ✅ frequency of lexemes\n", "\n", "
\n", "\n", "
\n", "
\n", "function\n", "
\n", "
str
\n", "\n", " ✅ syntactic function of phrase (Cmpl; Objc; Pred; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "g_cons\n", "
\n", "
str
\n", "\n", " ✅ word consonantal-transliterated (B R>CJT BR> >LHJM ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "g_cons_utf8\n", "
\n", "
str
\n", "\n", " ✅ word consonantal-Hebrew (ב ראשׁית ברא אלהים)\n", "\n", "
\n", "\n", "
\n", "
\n", "g_lex\n", "
\n", "
str
\n", "\n", " ✅ lexeme pointed-transliterated (B.:- R;>CIJT B.@R@> >:ELOH ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "g_lex_utf8\n", "
\n", "
str
\n", "\n", " ✅ lexeme pointed-Hebrew (בְּ רֵאשִׁית בָּרָא אֱלֹה)\n", "\n", "
\n", "\n", "
\n", "
\n", "g_word\n", "
\n", "
str
\n", "\n", " ✅ word pointed-transliterated (B.:- R;>CI73JT B.@R@74> >:ELOHI92JM)\n", "\n", "
\n", "\n", "
\n", "
\n", "g_word_utf8\n", "
\n", "
str
\n", "\n", " ✅ word pointed-Hebrew (בְּ רֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים)\n", "\n", "
\n", "\n", "
\n", "
\n", "gloss\n", "
\n", "
str
\n", "\n", " 🆗 english translation of lexeme (beginning create god(s))\n", "\n", "
\n", "\n", "
\n", "
\n", "gn\n", "
\n", "
str
\n", "\n", " ✅ grammatical gender (m; f; NA; unknown.)\n", "\n", "
\n", "\n", "
\n", "
\n", "label\n", "
\n", "
str
\n", "\n", " ✅ (half-)verse label (half verses: A; B; C; verses: GEN 01,02)\n", "\n", "
\n", "\n", "
\n", "
\n", "language\n", "
\n", "
str
\n", "\n", " ✅ of word or lexeme (Hebrew; Aramaic.)\n", "\n", "
\n", "\n", "
\n", "
\n", "lex\n", "
\n", "
str
\n", "\n", " ✅ lexeme consonantal-transliterated (B R>CJT/ BR>[ >LHJM/)\n", "\n", "
\n", "\n", "
\n", "
\n", "lex_utf8\n", "
\n", "
str
\n", "\n", " ✅ lexeme consonantal-Hebrew (ב ראשׁית֜ ברא אלהים֜)\n", "\n", "
\n", "\n", "
\n", "
\n", "ls\n", "
\n", "
str
\n", "\n", " ✅ lexical set, subclassification of part-of-speech (card; ques; mult)\n", "\n", "
\n", "\n", "
\n", "
\n", "nametype\n", "
\n", "
str
\n", "\n", " ⚠️ named entity type (pers; mens; gens; topo; ppde.)\n", "\n", "
\n", "\n", "
\n", "
\n", "nme\n", "
\n", "
str
\n", "\n", " ✅ nominal ending consonantal-transliterated (absent; n/a; JM, ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "nu\n", "
\n", "
str
\n", "\n", " ✅ grammatical number (sg; du; pl; NA; unknown.)\n", "\n", "
\n", "\n", "
\n", "
\n", "number\n", "
\n", "
int
\n", "\n", " ✅ sequence number of an object within its context\n", "\n", "
\n", "\n", "
\n", "
\n", "otype\n", "
\n", "
str
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "
\n", "pargr\n", "
\n", "
str
\n", "\n", " 🆗 hierarchical paragraph number (1; 1.2; 1.2.3.4; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "pdp\n", "
\n", "
str
\n", "\n", " ✅ phrase dependent part-of-speech (art; verb; subs; nmpr, ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "pfm\n", "
\n", "
str
\n", "\n", " ✅ preformative consonantal-transliterated (absent; n/a; J, ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "prs\n", "
\n", "
str
\n", "\n", " ✅ pronominal suffix consonantal-transliterated (absent; n/a; W; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "prs_gn\n", "
\n", "
str
\n", "\n", " ✅ pronominal suffix gender (m; f; NA; unknown.)\n", "\n", "
\n", "\n", "
\n", "
\n", "prs_nu\n", "
\n", "
str
\n", "\n", " ✅ pronominal suffix number (sg; du; pl; NA; unknown.)\n", "\n", "
\n", "\n", "
\n", "
\n", "prs_ps\n", "
\n", "
str
\n", "\n", " ✅ pronominal suffix person (p1; p2; p3; NA; unknown.)\n", "\n", "
\n", "\n", "
\n", "
\n", "ps\n", "
\n", "
str
\n", "\n", " ✅ grammatical person (p1; p2; p3; NA; unknown.)\n", "\n", "
\n", "\n", "
\n", "
\n", "qere\n", "
\n", "
str
\n", "\n", " ✅ word pointed-transliterated masoretic reading correction\n", "\n", "
\n", "\n", "
\n", "
\n", "qere_trailer\n", "
\n", "
str
\n", "\n", " ✅ interword material -pointed-transliterated (Masoretic correction)\n", "\n", "
\n", "\n", "
\n", "
\n", "qere_trailer_utf8\n", "
\n", "
str
\n", "\n", " ✅ interword material -pointed-transliterated (Masoretic correction)\n", "\n", "
\n", "\n", "
\n", "
\n", "qere_utf8\n", "
\n", "
str
\n", "\n", " ✅ word pointed-Hebrew masoretic reading correction\n", "\n", "
\n", "\n", "
\n", "
\n", "rank_lex\n", "
\n", "
int
\n", "\n", " ✅ ranking of lexemes based on freqnuecy\n", "\n", "
\n", "\n", "
\n", "
\n", "rela\n", "
\n", "
str
\n", "\n", " ✅ linguistic relation between clause/(sub)phrase(atom) (ADJ; MOD; ATR; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "sp\n", "
\n", "
str
\n", "\n", " ✅ part-of-speech (art; verb; subs; nmpr, ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "st\n", "
\n", "
str
\n", "\n", " ✅ state of a noun (a (absolute); c (construct); e (emphatic).)\n", "\n", "
\n", "\n", "
\n", "
\n", "tab\n", "
\n", "
int
\n", "\n", " ✅ clause atom: its level in the linguistic embedding\n", "\n", "
\n", "\n", "
\n", "
\n", "trailer\n", "
\n", "
str
\n", "\n", " ✅ interword material pointed-transliterated (& 00 05 00_P ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "trailer_utf8\n", "
\n", "
str
\n", "\n", " ✅ interword material pointed-Hebrew (־ ׃)\n", "\n", "
\n", "\n", "
\n", "
\n", "txt\n", "
\n", "
str
\n", "\n", " ✅ text type of clause and surrounding (repetion of ? N D Q as in feature domain)\n", "\n", "
\n", "\n", "
\n", "
\n", "typ\n", "
\n", "
str
\n", "\n", " ✅ clause/phrase(atom) type (VP; NP; Ellp; Ptcp; WayX)\n", "\n", "
\n", "\n", "
\n", "
\n", "uvf\n", "
\n", "
str
\n", "\n", " ✅ univalent final consonant consonantal-transliterated (absent; N; J; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "vbe\n", "
\n", "
str
\n", "\n", " ✅ verbal ending consonantal-transliterated (n/a; W; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "vbs\n", "
\n", "
str
\n", "\n", " ✅ root formation consonantal-transliterated (absent; n/a; H; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "verse\n", "
\n", "
int
\n", "\n", " ✅ verse number\n", "\n", "
\n", "\n", "
\n", "
\n", "voc_lex\n", "
\n", "
str
\n", "\n", " ✅ vocalized lexeme pointed-transliterated (B.: R;>CIJT BR> >:ELOHIJM)\n", "\n", "
\n", "\n", "
\n", "
\n", "voc_lex_utf8\n", "
\n", "
str
\n", "\n", " ✅ vocalized lexeme pointed-Hebrew (בְּ רֵאשִׁית ברא אֱלֹהִים)\n", "\n", "
\n", "\n", "
\n", "
\n", "vs\n", "
\n", "
str
\n", "\n", " ✅ verbal stem (qal; piel; hif; apel; pael)\n", "\n", "
\n", "\n", "
\n", "
\n", "vt\n", "
\n", "
str
\n", "\n", " ✅ verbal tense (perf; impv; wayq; infc)\n", "\n", "
\n", "\n", "
\n", "
\n", "mother\n", "
\n", "
none
\n", "\n", " ✅ linguistic dependency between textual objects\n", "\n", "
\n", "\n", "
\n", "
\n", "oslots\n", "
\n", "
none
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "
\n", "\n", "
Phonetic Transcriptions\n", "
\n", "\n", "
\n", "
\n", "phono\n", "
\n", "
str
\n", "\n", " 🆗 phonological transcription (bᵊ rēšˌîṯ bārˈā ʔᵉlōhˈîm)\n", "\n", "
\n", "\n", "
\n", "
\n", "phono_trailer\n", "
\n", "
str
\n", "\n", " 🆗 interword material in phonological transcription\n", "\n", "
\n", "\n", "
\n", "
\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "Aorig = use(f\"{ORG}/{REPO}:clone\", checkout=\"clone\")" ] }, { "cell_type": "code", "execution_count": 8, "id": "7e7e5082-d940-4a8b-89e2-1e7b8f38d4d6", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "det 280219 pseudo None values\n", "gn 225693 pseudo None values\n", "ls 385975 pseudo None values\n", "nme 245354 pseudo None values\n", "nu 188676 pseudo None values\n", "pfm 381594 pseudo None values\n", "prs 381432 pseudo None values\n", "prs_gn 390964 pseudo None values\n", "prs_nu 381432 pseudo None values\n", "prs_ps 381432 pseudo None values\n", "ps 365071 pseudo None values\n", "rela 630059 pseudo None values\n", "st 245354 pseudo None values\n", "uvf 423044 pseudo None values\n", "vbe 352880 pseudo None values\n", "vbs 411184 pseudo None values\n", "vs 352880 pseudo None values\n", "vt 352880 pseudo None values\n", "TOTAL: 6376123 pseudo None values, distributed as follows:\n", "NA 3713865x\n", "n/a 1392890x\n", "absent 802598x\n", "none 385975x\n", "unknown 80795x\n" ] } ], "source": [ "noneValues = {\"n/a\", \"none\", \"NA\", \"absent\", \"unknown\"}\n", "deleteFeaturesSet = set(deleteFeatures)\n", "modifiedFeatures = {}\n", "nNones = collections.Counter()\n", "\n", "for feat in Aorig.api.Fall():\n", " if feat == \"otype\" or feat in deleteFeaturesSet:\n", " continue\n", " newData = {}\n", " for (n, v) in Aorig.api.Fs(feat).items():\n", " if v in noneValues:\n", " newData[n] = None\n", " nNones[v] += 1\n", " nData = len(newData)\n", " if nData:\n", " print(f\"{feat:<20} {nData:>7} pseudo None values\")\n", " modifiedFeatures[feat] = newData\n", " \n", "print(f\"TOTAL: {sum(nNones.values())} pseudo None 
values, distributed as follows:\")\n", "\n", "for (v, n) in sorted(nNones.items(), key=lambda x: (-x[1], x[0])):\n", " print(f\"{v:<12} {n:>7}x\")\n", " " ] }, { "cell_type": "markdown", "id": "aed9fb5a-05d6-4acb-b4b3-52c98c011717", "metadata": {}, "source": [ "We remove the text formats that we can no longer furnish with the thinned feature set." ] }, { "cell_type": "code", "execution_count": 9, "id": "a751a812-a5bb-4d1c-9ab4-f790c2218245", "metadata": {}, "outputs": [], "source": [ "featureMeta = dict(\n", " otext={\n", " \"dataset\": \"BHSA-min\",\n", " \"datasetName\": \"Biblia Hebraica Stuttgartensia Amstelodamensis (minimalistic)\",\n", " \"fmt:lex-default\": None,\n", " \"fmt:lex-orig-full\": None,\n", " \"fmt:lex-orig-plain\": None,\n", " \"fmt:lex-trans-full\": None,\n", " \"fmt:lex-trans-plain\": None,\n", " }\n", ")" ] }, { "cell_type": "markdown", "id": "4073ffdd-f13f-4591-90cf-551c0d5361a1", "metadata": {}, "source": [ "We clean the target location." ] }, { "cell_type": "code", "execution_count": 10, "id": "dfb1ac7c-c51d-46d1-b82d-db7f9ccd97ad", "metadata": {}, "outputs": [], "source": [ "bhsaLocation = f\"{BASE}/{ORG}/{REPO}{RELATIVE}/{VERSION}\"\n", "bhsaMinLocation = f\"{BASE}/{ORG}/{REPO_MIN}{RELATIVE}/{VERSION}\"\n", "initTree(bhsaMinLocation, fresh=True)" ] }, { "cell_type": "markdown", "id": "ac43ef84-eda8-4a84-8529-303ed2fbd706", "metadata": {}, "source": [ "This was all the preparation. Now we are going to run the modification." 
] }, { "cell_type": "code", "execution_count": 11, "id": "aee0c67b-8b64-4fb3-94d2-4bb5f5295f58", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " | WARNING: Missing for text API: features: g_lex, g_lex_utf8, voc_lex_utf8\n", " 0.01s Feature overview: 109 for nodes; 6 for edges; 1 configs; 9 computed\n" ] }, { "data": { "text/plain": [ "True" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "modify(\n", " bhsaLocation,\n", " bhsaMinLocation,\n", " addFeatures=dict(nodeFeatures=modifiedFeatures),\n", " deleteFeatures=deleteFeatures,\n", " deleteTypes=deleteTypes,\n", " featureMeta=featureMeta,\n", " silent=\"terse\",\n", ")" ] }, { "cell_type": "markdown", "id": "12beb601-c1d1-40c0-b2d6-d712178505d7", "metadata": {}, "source": [ "# Test the new dataset" ] }, { "cell_type": "code", "execution_count": 12, "id": "d4471ec4-57bb-4569-bebe-24b50012b03c", "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "**Locating corpus resources ...**" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "app: ~/github/ETCBC/bhsa-min/app" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "data: ~/github/ETCBC/bhsa-min/tf/2021" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ " | 0.37s T otype from ~/github/ETCBC/bhsa-min/tf/2021\n", " | 6.02s T oslots from ~/github/ETCBC/bhsa-min/tf/2021\n", " | 0.01s T qere from ~/github/ETCBC/bhsa-min/tf/2021\n", " | 1.00s T g_cons_utf8 from ~/github/ETCBC/bhsa-min/tf/2021\n", " | 0.83s T trailer from ~/github/ETCBC/bhsa-min/tf/2021\n", " | 0.00s T qere_trailer_utf8 from ~/github/ETCBC/bhsa-min/tf/2021\n", " | 0.99s T g_cons from ~/github/ETCBC/bhsa-min/tf/2021\n", " | 0.04s T chapter from ~/github/ETCBC/bhsa-min/tf/2021\n", " | 0.05s T book from 
~/github/ETCBC/bhsa-min/tf/2021\n", " | 0.04s T verse from ~/github/ETCBC/bhsa-min/tf/2021\n", " | 0.00s T book@en from ~/github/ETCBC/bhsa-min/tf/2021\n", " | 0.83s T trailer_utf8 from ~/github/ETCBC/bhsa-min/tf/2021\n", " | 0.01s T qere_utf8 from ~/github/ETCBC/bhsa-min/tf/2021\n", " | 1.13s T g_word_utf8 from ~/github/ETCBC/bhsa-min/tf/2021\n", " | 0.00s T qere_trailer from ~/github/ETCBC/bhsa-min/tf/2021\n", " | 1.08s T g_word from ~/github/ETCBC/bhsa-min/tf/2021\n", " | | 0.11s C __levels__ from otype, oslots, otext\n", " | | 3.05s C __order__ from otype, oslots, __levels__\n", " | | 0.15s C __rank__ from otype, __order__\n", " | | 4.92s C __levUp__ from otype, oslots, __rank__\n", " | | 3.58s C __levDown__ from otype, __levUp__, __rank__\n", " | | 0.41s C __characters__ from otext\n", " | | 2.00s C __boundary__ from otype, oslots, __rank__\n", " | | 0.06s C __sections__ from otype, oslots, otext, __levUp__, __levels__, book, chapter, verse\n", " | 0.00s T code from ~/github/ETCBC/bhsa-min/tf/2021\n", " | 0.27s T det from ~/github/ETCBC/bhsa-min/tf/2021\n", " | 0.18s T domain from ~/github/ETCBC/bhsa-min/tf/2021\n", " | 0.84s T freq_lex from ~/github/ETCBC/bhsa-min/tf/2021\n", " | 0.54s T function from ~/github/ETCBC/bhsa-min/tf/2021\n", " | 0.95s T gloss from ~/github/ETCBC/bhsa-min/tf/2021\n", " | 0.44s T gn from ~/github/ETCBC/bhsa-min/tf/2021\n", " | 0.00s T instruction from ~/github/ETCBC/bhsa-min/tf/2021\n", " | 0.18s T kind from ~/github/ETCBC/bhsa-min/tf/2021\n", " | 0.06s T label from ~/github/ETCBC/bhsa-min/tf/2021\n", " | 0.88s T language from ~/github/ETCBC/bhsa-min/tf/2021\n", " | 0.93s T lex from ~/github/ETCBC/bhsa-min/tf/2021\n", " | 0.96s T lex_utf8 from ~/github/ETCBC/bhsa-min/tf/2021\n", " | 0.10s T ls from ~/github/ETCBC/bhsa-min/tf/2021\n", " | 0.09s T mother from ~/github/ETCBC/bhsa-min/tf/2021\n", " | 0.09s T nametype from ~/github/ETCBC/bhsa-min/tf/2021\n", " | 0.40s T nme from ~/github/ETCBC/bhsa-min/tf/2021\n", " | 0.56s T nu from 
~/github/ETCBC/bhsa-min/tf/2021\n", " | 0.00s T pargr from ~/github/ETCBC/bhsa-min/tf/2021\n", " | 0.91s T pdp from ~/github/ETCBC/bhsa-min/tf/2021\n", " | 0.11s T pfm from ~/github/ETCBC/bhsa-min/tf/2021\n", " | 0.11s T prs from ~/github/ETCBC/bhsa-min/tf/2021\n", " | 0.09s T prs_gn from ~/github/ETCBC/bhsa-min/tf/2021\n", " | 0.11s T prs_nu from ~/github/ETCBC/bhsa-min/tf/2021\n", " | 0.11s T prs_ps from ~/github/ETCBC/bhsa-min/tf/2021\n", " | 0.15s T ps from ~/github/ETCBC/bhsa-min/tf/2021\n", " | 0.05s T rela from ~/github/ETCBC/bhsa-min/tf/2021\n", " | 0.18s T root from ~/github/ETCBC/bhsa-min/tf/2021\n", " | 0.91s T sp from ~/github/ETCBC/bhsa-min/tf/2021\n", " | 0.42s T st from ~/github/ETCBC/bhsa-min/tf/2021\n", " | 0.00s T tab from ~/github/ETCBC/bhsa-min/tf/2021\n", " | 0.18s T txt from ~/github/ETCBC/bhsa-min/tf/2021\n", " | 0.71s T typ from ~/github/ETCBC/bhsa-min/tf/2021\n", " | 0.01s T uvf from ~/github/ETCBC/bhsa-min/tf/2021\n", " | 0.17s T vbe from ~/github/ETCBC/bhsa-min/tf/2021\n", " | 0.04s T vbs from ~/github/ETCBC/bhsa-min/tf/2021\n", " | 0.18s T vs from ~/github/ETCBC/bhsa-min/tf/2021\n", " | 0.18s T vt from ~/github/ETCBC/bhsa-min/tf/2021\n" ] }, { "data": { "text/html": [ "\n", " Text-Fabric: Text-Fabric API 11.4.9, ETCBC/bhsa-min/app v3, Search Reference
\n", " Data: ETCBC - bhsa-min 2021, Character table, Feature docs
\n", "
Node types\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "
Name# of nodes# slots/node% coverage
book3910938.21100
chapter929459.19100
verse2321318.38100
sentence637176.70100
clause881314.84100
phrase2532031.68100
word4265901.00100
\n", " Sets: no custom sets
\n", " Features:
\n", "
BHSA = Biblia Hebraica Stuttgartensia Amstelodamensis (minimalistic)\n", "
\n", "\n", "
\n", "
\n", "book\n", "
\n", "
str
\n", "\n", " ✅ book name in Latin (Genesis; Numeri; Reges1; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "book@ll\n", "
\n", "
str
\n", "\n", " ✅ book name in english (English)\n", "\n", "
\n", "\n", "
\n", "
\n", "chapter\n", "
\n", "
int
\n", "\n", " ✅ chapter number (1; 2; 3; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "code\n", "
\n", "
int
\n", "\n", " ✅ identifier of a clause atom relationship (0; 74; 367; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "det\n", "
\n", "
str
\n", "\n", " ✅ determinedness of phrase(atom) (det; und; NA.)\n", "\n", "
\n", "\n", "
\n", "
\n", "domain\n", "
\n", "
str
\n", "\n", " ✅ text type of clause (? (Unknown); N (narrative); D (discursive); Q (Quotation).)\n", "\n", "
\n", "\n", "
\n", "
\n", "freq_lex\n", "
\n", "
int
\n", "\n", " ✅ frequency of lexemes\n", "\n", "
\n", "\n", "
\n", "
\n", "function\n", "
\n", "
str
\n", "\n", " ✅ syntactic function of phrase (Cmpl; Objc; Pred; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "g_cons\n", "
\n", "
str
\n", "\n", " ✅ word consonantal-transliterated (B R>CJT BR> >LHJM ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "g_cons_utf8\n", "
\n", "
str
\n", "\n", " ✅ word consonantal-Hebrew (ב ראשׁית ברא אלהים)\n", "\n", "
\n", "\n", "
\n", "
\n", "g_word\n", "
\n", "
str
\n", "\n", " ✅ word pointed-transliterated (B.:- R;>CI73JT B.@R@74> >:ELOHI92JM)\n", "\n", "
\n", "\n", "
\n", "
\n", "g_word_utf8\n", "
\n", "
str
\n", "\n", " ✅ word pointed-Hebrew (בְּ רֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים)\n", "\n", "
\n", "\n", "
\n", "
\n", "gloss\n", "
\n", "
str
\n", "\n", " 🆗 english translation of lexeme (beginning create god(s))\n", "\n", "
\n", "\n", "
\n", "
\n", "gn\n", "
\n", "
str
\n", "\n", " ✅ grammatical gender (m; f; NA; unknown.)\n", "\n", "
\n", "\n", "
\n", "
\n", "instruction\n", "
\n", "
str
\n", "\n", " ❓ change the set of actors (.e; d.; ..; .#; .q; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "kind\n", "
\n", "
str
\n", "\n", " ✅ clause kind (VC (verbal); NC (nominal); WP (without predication))\n", "\n", "
\n", "\n", "
\n", "
\n", "label\n", "
\n", "
str
\n", "\n", " ✅ (half-)verse label (half verses: A; B; C; verses: GEN 01,02)\n", "\n", "
\n", "\n", "
\n", "
\n", "language\n", "
\n", "
str
\n", "\n", " ✅ of word or lexeme (Hebrew; Aramaic.)\n", "\n", "
\n", "\n", "
\n", "
\n", "lex\n", "
\n", "
str
\n", "\n", " ✅ lexeme consonantal-transliterated (B R>CJT/ BR>[ >LHJM/)\n", "\n", "
\n", "\n", "
\n", "
\n", "lex_utf8\n", "
\n", "
str
\n", "\n", " ✅ lexeme consonantal-Hebrew (ב ראשׁית֜ ברא אלהים֜)\n", "\n", "
\n", "\n", "
\n", "
\n", "ls\n", "
\n", "
str
\n", "\n", " ✅ lexical set, subclassification of part-of-speech (card; ques; mult)\n", "\n", "
\n", "\n", "
\n", "
\n", "nametype\n", "
\n", "
str
\n", "\n", " ⚠️ named entity type (pers; mens; gens; topo; ppde.)\n", "\n", "
\n", "\n", "
\n", "
\n", "nme\n", "
\n", "
str
\n", "\n", " ✅ nominal ending consonantal-transliterated (absent; n/a; JM, ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "nu\n", "
\n", "
str
\n", "\n", " ✅ grammatical number (sg; du; pl; NA; unknown.)\n", "\n", "
\n", "\n", "
\n", "
\n", "otype\n", "
\n", "
str
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "
\n", "pargr\n", "
\n", "
str
\n", "\n", " 🆗 hierarchical paragraph number (1; 1.2; 1.2.3.4; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "pdp\n", "
\n", "
str
\n", "\n", " ✅ phrase dependent part-of-speech (art; verb; subs; nmpr, ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "pfm\n", "
\n", "
str
\n", "\n", " ✅ preformative consonantal-transliterated (absent; n/a; J, ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "prs\n", "
\n", "
str
\n", "\n", " ✅ pronominal suffix consonantal-transliterated (absent; n/a; W; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "prs_gn\n", "
\n", "
str
\n", "\n", " ✅ pronominal suffix gender (m; f; NA; unknown.)\n", "\n", "
\n", "\n", "
\n", "
\n", "prs_nu\n", "
\n", "
str
\n", "\n", " ✅ pronominal suffix number (sg; du; pl; NA; unknown.)\n", "\n", "
\n", "\n", "
\n", "
\n", "prs_ps\n", "
\n", "
str
\n", "\n", " ✅ pronominal suffix person (p1; p2; p3; NA; unknown.)\n", "\n", "
\n", "\n", "
\n", "
\n", "ps\n", "
\n", "
str
\n", "\n", " ✅ grammatical person (p1; p2; p3; NA; unknown.)\n", "\n", "
\n", "\n", "
\n", "
\n", "qere\n", "
\n", "
str
\n", "\n", " ✅ word pointed-transliterated masoretic reading correction\n", "\n", "
\n", "\n", "
\n", "
\n", "qere_trailer\n", "
\n", "
str
\n", "\n", " ✅ interword material -pointed-transliterated (Masoretic correction)\n", "\n", "
\n", "\n", "
\n", "
\n", "qere_trailer_utf8\n", "
\n", "
str
\n", "\n", " ✅ interword material -pointed-transliterated (Masoretic correction)\n", "\n", "
\n", "\n", "
\n", "
\n", "qere_utf8\n", "
\n", "
str
\n", "\n", " ✅ word pointed-Hebrew masoretic reading correction\n", "\n", "
\n", "\n", "
\n", "
\n", "rela\n", "
\n", "
str
\n", "\n", " ✅ linguistic relation between clause/(sub)phrase(atom) (ADJ; MOD; ATR; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "root\n", "
\n", "
str
\n", "\n", " ⚠️ root of word or lexeme (R>C CMH XCK )\n", "\n", "
\n", "\n", "
\n", "
\n", "sp\n", "
\n", "
str
\n", "\n", " ✅ part-of-speech (art; verb; subs; nmpr, ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "st\n", "
\n", "
str
\n", "\n", " ✅ state of a noun (a (absolute); c (construct); e (emphatic).)\n", "\n", "
\n", "\n", "
\n", "
\n", "tab\n", "
\n", "
int
\n", "\n", " ✅ clause atom: its level in the linguistic embedding\n", "\n", "
\n", "\n", "
\n", "
\n", "trailer\n", "
\n", "
str
\n", "\n", " ✅ interword material pointed-transliterated (& 00 05 00_P ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "trailer_utf8\n", "
\n", "
str
\n", "\n", " ✅ interword material pointed-Hebrew (־ ׃)\n", "\n", "
\n", "\n", "
\n", "
\n", "txt\n", "
\n", "
str
\n", "\n", " ✅ text type of clause and surrounding (repetion of ? N D Q as in feature domain)\n", "\n", "
\n", "\n", "
\n", "
\n", "typ\n", "
\n", "
str
\n", "\n", " ✅ clause/phrase(atom) type (VP; NP; Ellp; Ptcp; WayX)\n", "\n", "
\n", "\n", "
\n", "
\n", "uvf\n", "
\n", "
str
\n", "\n", " ✅ univalent final consonant consonantal-transliterated (absent; N; J; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "vbe\n", "
\n", "
str
\n", "\n", " ✅ verbal ending consonantal-transliterated (n/a; W; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "vbs\n", "
\n", "
str
\n", "\n", " ✅ root formation consonantal-transliterated (absent; n/a; H; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "verse\n", "
\n", "
int
\n", "\n", " ✅ verse number\n", "\n", "
\n", "\n", "
\n", "
\n", "vs\n", "
\n", "
str
\n", "\n", " ✅ verbal stem (qal; piel; hif; apel; pael)\n", "\n", "
\n", "\n", "
\n", "
\n", "vt\n", "
\n", "
str
\n", "\n", " ✅ verbal tense (perf; impv; wayq; infc)\n", "\n", "
\n", "\n", "
\n", "
\n", "mother\n", "
\n", "
none
\n", "\n", " ✅ linguistic dependency between textual objects\n", "\n", "
\n", "\n", "
\n", "
\n", "oslots\n", "
\n", "
none
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "
\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "A = use(f\"{ORG}/{REPO_MIN}:clone\", checkout=\"clone\")" ] }, { "cell_type": "code", "execution_count": 13, "id": "891c8562-bb02-41b8-a754-605b7fb6f628", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
sentence
clause WayX
phrase CP Conj
וַ
phrase VP Pred
יִּתֵּ֥ן
phrase PP Objc
אֹתָ֛ם
phrase NP Subj
אֱלֹהִ֖ים
phrase PP Cmpl
בִּ
רְקִ֣יעַ
הַ
שָּׁמָ֑יִם
clause InfC Adju
phrase VP Pred
לְ
הָאִ֖יר
phrase PP Cmpl
עַל־
הָ
אָֽרֶץ׃
clause InfC Coor
phrase CP Conj
וְ
phrase VP Pred
לִ
מְשֹׁל֙
phrase PP Cmpl
בַּ
יֹּ֣ום
וּ
בַ
לַּ֔יְלָה
clause InfC Coor
phrase CP Conj
וּֽ
phrase VP Pred
לֲ
הַבְדִּ֔יל
phrase PP Cmpl
בֵּ֥ין
הָ
אֹ֖ור
וּ
בֵ֣ין
הַ
חֹ֑שֶׁךְ
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "s = A.api.F.otype.s(\"sentence\")[45]\n", "\n", "A.pretty(s, multiFeatures=False)" ] }, { "cell_type": "markdown", "id": "27260c6e-5a11-482f-9e57-bf450cdf8813", "metadata": {}, "source": [ "# Memory footprint" ] }, { "cell_type": "code", "execution_count": 14, "id": "cbe9bf41-8fc0-4d73-ac44-4a7ae8b1b922", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " " ] }, { "data": { "text/markdown": [ "\n", "# 62 features\n", "\n", "feature | members | size in bytes\n", "--- | --- | ---\n", "__levUp__ | 855,822 | 90,952,232\n", "oslots | 3 | 48,010,736\n", "__boundary__ | 2 | 45,968,624\n", "__levDown__ | 429,232 | 45,465,556\n", "g_word_utf8 | 426,590 | 43,108,023\n", "g_word | 426,590 | 38,366,975\n", "g_cons_utf8 | 426,590 | 35,235,354\n", "g_cons | 426,590 | 34,229,846\n", "lex_utf8 | 426,590 | 33,511,928\n", "lex | 426,590 | 33,390,506\n", "gloss | 426,590 | 33,261,517\n", "freq_lex | 426,590 | 32,920,796\n", "trailer_utf8 | 426,590 | 32,917,176\n", "pdp | 426,590 | 32,916,861\n", "sp | 426,590 | 32,916,861\n", "trailer | 426,590 | 32,916,804\n", "language | 426,590 | 32,916,231\n", "__order__ | 855,822 | 30,809,632\n", "typ | 341,334 | 20,046,107\n", "function | 253,203 | 17,577,069\n", "nu | 237,914 | 17,147,593\n", "gn | 200,897 | 16,111,064\n", "nme | 181,236 | 15,561,372\n", "st | 181,236 | 15,560,606\n", "det | 113,892 | 8,432,040\n", "txt | 88,131 | 7,717,905\n", "domain | 88,131 | 7,710,828\n", "kind | 88,131 | 7,710,781\n", "mother | 21,403 | 6,124,096\n", "vs | 73,710 | 4,686,725\n", "vbe | 73,710 | 4,686,321\n", "vt | 73,710 | 4,685,832\n", "root | 72,051 | 4,678,358\n", "ps | 61,519 | 4,344,213\n", "prs | 45,158 | 3,886,968\n", "prs_ps | 45,158 | 3,886,105\n", "prs_nu | 45,158 | 3,886,054\n", "pfm | 44,996 | 3,881,866\n", "__rank__ | 855,822 | 3,624,652\n", "otype | 4 | 3,434,407\n", "label | 23,213 | 3,330,331\n", "ls | 40,615 | 
2,448,762\n", "prs_gn | 35,626 | 2,308,428\n", "nametype | 35,506 | 2,305,527\n", "chapter | 24,142 | 1,990,976\n", "book | 24,181 | 1,990,047\n", "verse | 23,213 | 1,965,692\n", "__sections__ | 2 | 1,703,884\n", "rela | 21,403 | 1,189,832\n", "vbs | 15,406 | 1,021,735\n", "qere_utf8 | 1,892 | 263,785\n", "uvf | 3,546 | 247,082\n", "qere | 1,892 | 200,631\n", "qere_trailer_utf8 | 1,892 | 127,126\n", "qere_trailer | 1,892 | 127,037\n", "__characters__ | 6 | 42,824\n", "book@en | 39 | 4,454\n", "__levels__ | 7 | 1,519\n", "code | 0 | 64\n", "instruction | 0 | 64\n", "pargr | 0 | 64\n", "tab | 0 | 64\n", "TOTAL | 11,127,528 | 916,466,548" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "A.footprint()" ] }, { "cell_type": "markdown", "id": "237712cf-5130-4b00-8efa-a24e95fa1423", "metadata": {}, "source": [ "For comparison, we show the footprint of the complete BHSA:" ] }, { "cell_type": "code", "execution_count": 26, "id": "fd3ab848-bbc3-4f7a-853b-23185e0061d9", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " " ] }, { "data": { "text/markdown": [ "\n", "# 93 features\n", "\n", "feature | members | size in bytes\n", "--- | --- | ---\n", "__levUp__ | 1,446,831 | 549,464,428\n", "__levDown__ | 1,020,241 | 136,906,600\n", "oslots | 3 | 121,886,820\n", "__boundary__ | 2 | 107,077,088\n", "number | 1,254,391 | 99,760,440\n", "rela | 722,716 | 62,180,387\n", "typ | 699,570 | 61,534,048\n", "__order__ | 1,446,831 | 52,085,956\n", "mother | 182,269 | 50,752,708\n", "freq_lex | 435,820 | 41,925,288\n", "g_word_utf8 | 426,590 | 41,387,515\n", "g_word | 426,590 | 38,366,975\n", "phono | 426,590 | 38,354,715\n", "rank_lex | 435,820 | 36,117,916\n", "det | 520,735 | 35,552,335\n", "g_cons_utf8 | 426,590 | 34,967,885\n", "g_lex_utf8 | 426,590 | 34,738,775\n", "g_cons | 426,590 | 34,229,846\n", "lex_utf8 | 435,820 | 34,190,000\n", "g_lex | 426,590 | 34,079,807\n", "voc_lex_utf8 | 435,820 | 33,881,594\n", 
"lex | 435,820 | 33,648,946\n", "voc_lex | 435,820 | 33,624,242\n", "gloss | 435,820 | 33,519,957\n", "sp | 435,820 | 33,175,301\n", "language | 435,820 | 33,174,671\n", "ls | 426,975 | 32,927,695\n", "vs | 426,590 | 32,917,488\n", "prs | 426,590 | 32,917,243\n", "nme | 426,590 | 32,917,143\n", "trailer_utf8 | 426,590 | 32,917,109\n", "vbe | 426,590 | 32,917,085\n", "pdp | 426,590 | 32,916,861\n", "trailer | 426,590 | 32,916,804\n", "vbs | 426,590 | 32,916,682\n", "pfm | 426,590 | 32,916,677\n", "vt | 426,590 | 32,916,595\n", "uvf | 426,590 | 32,916,425\n", "nu | 426,590 | 32,916,380\n", "ps | 426,590 | 32,916,380\n", "gn | 426,590 | 32,916,327\n", "prs_gn | 426,590 | 32,916,327\n", "prs_ps | 426,590 | 32,916,324\n", "phono_trailer | 426,590 | 32,916,322\n", "st | 426,590 | 32,916,321\n", "prs_nu | 426,590 | 32,916,273\n", "function | 253,203 | 17,577,069\n", "code | 90,704 | 8,877,720\n", "otype | 4 | 8,162,830\n", "pargr | 90,704 | 7,977,817\n", "tab | 90,704 | 7,783,508\n", "txt | 88,131 | 7,717,905\n", "domain | 88,131 | 7,710,828\n", "__rank__ | 1,446,831 | 6,149,120\n", "label | 68,392 | 5,906,221\n", "crossref | 3,783 | 2,812,004\n", "nametype | 38,117 | 2,378,635\n", "chapter | 24,142 | 1,990,976\n", "book | 24,181 | 1,990,047\n", "verse | 23,213 | 1,965,692\n", "__sections__ | 2 | 1,704,976\n", "qere_utf8 | 1,892 | 241,179\n", "qere | 1,892 | 200,631\n", "qere_trailer_utf8 | 1,892 | 127,115\n", "qere_trailer | 1,892 | 127,037\n", "__characters__ | 12 | 76,835\n", "book@am | 39 | 5,940\n", "book@bn | 39 | 5,748\n", "book@ru | 39 | 5,720\n", "book@el | 39 | 5,716\n", "book@hi | 39 | 5,676\n", "book@pa | 39 | 5,654\n", "book@fa | 39 | 5,648\n", "book@ur | 39 | 5,628\n", "book@syc | 39 | 5,616\n", "book@he | 39 | 5,570\n", "book@ar | 39 | 5,564\n", "book@ja | 39 | 5,472\n", "book@ko | 39 | 5,382\n", "book@zh | 39 | 5,325\n", "book@es | 39 | 4,877\n", "book@pt | 39 | 4,861\n", "book@tr | 39 | 4,814\n", "book@fr | 39 | 4,767\n", "book@yo | 39 | 4,742\n", 
"book@da | 39 | 4,557\n", "book@de | 39 | 4,511\n", "book@nl | 39 | 4,496\n", "book@sw | 39 | 4,474\n", "book@en | 39 | 4,454\n", "book@id | 39 | 4,442\n", "book@la | 39 | 4,439\n", "__levels__ | 13 | 2,830\n", "TOTAL | 25,073,133 | 2,596,543,772" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "Aorig.footprint()" ] }, { "cell_type": "markdown", "id": "50bb752d-cb74-4f07-92b0-710bcc733321", "metadata": {}, "source": [ "The memory footprint drops from **2.6** GB to **0.9** GB, a reduction of about 65%." ] }, { "cell_type": "markdown", "id": "385007f5-433e-4b14-a6d2-9c16515b550e", "metadata": {}, "source": [ "# Publish\n", "\n", "Make zip files of both datasets, ready to attach to a release." ] }, { "cell_type": "code", "execution_count": 21, "id": "c70579dc-a1c9-4c0c-87b0-77a02511c49f", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Data to be zipped:\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "WARNING: no local release info found.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Maybe you have to do go to this repo and do `git pull --tags`\n", "We'll fetch the local commit info anyway.\n", "\tOK app (v?? ca8267) : ~/github/ETCBC/bhsa-min/app\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "WARNING: no local release info found.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Maybe you have to do go to this repo and do `git pull --tags`\n", "We'll fetch the local commit info anyway.\n", "\tOK main data (v?? 
ca8267) : ~/github/ETCBC/bhsa-min/tf/2021\n", "Writing zip file ...\n", "Result: ~/Downloads/github/ETCBC/bhsa-min/complete.zip\n", "Data to be zipped:\n", "\tOK app (v1.8 157309) : ~/github/ETCBC/bhsa/app\n", "\tOK main data (v1.8 157309) : ~/github/ETCBC/bhsa/tf/2021\n", "\tOK module phono (v2.1 bd97bc) : ~/github/ETCBC/phono/tf/2021\n", "\tOK module parallels (v2.1 cf333f) : ~/github/ETCBC/parallels/tf/2021\n", "Writing zip file ...\n", "Result: ~/Downloads/github/ETCBC/bhsa/complete.zip\n" ] } ], "source": [ "A.zipAll()\n", "Aorig.zipAll()" ] }, { "cell_type": "code", "execution_count": 24, "id": "1990b8f7-7a35-46a4-a298-049996d61fca", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-rw-r--r-- 1 me staff 12580582 May 9 14:31 /Users/me/Downloads/github/ETCBC/bhsa-min/complete.zip\n", "-rw-r--r-- 1 me staff 33954656 May 9 14:31 /Users/me/Downloads/github/ETCBC/bhsa/complete.zip\n" ] } ], "source": [ "!ls -l ~/Downloads/github/ETCBC/bhsa-min/complete.zip\n", "!ls -l ~/Downloads/github/ETCBC/bhsa/complete.zip" ] }, { "cell_type": "markdown", "id": "657c8814-a797-4c07-8e5d-b31104c8eea3", "metadata": { "tags": [] }, "source": [ "# Browse\n", "\n", "Press `i` twice to quit the browser." ] }, { "cell_type": "code", "execution_count": 25, "id": "9270d742-821f-459e-a92b-3bcfa05ea094", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "This is Text-Fabric 11.4.9\n", "Starting new kernel listening on 14907\n", "Loading data for ETCBC/bhsa-min. Please wait ...\n", "Setting up TF kernel for ETCBC/bhsa-min \n", "**Locating corpus resources ...**\n", "Using app in ~/github/ETCBC/bhsa-min/app:\n", "\trepo clone offline under ~/github (local github)\n", "Using data in ~/github/ETCBC/bhsa-min/tf/2021:\n", "\trepo clone offline under ~/github (local github)\n", "TF setup done.\n", "Starting new webserver listening on 24907\n", "\u001b[31m\u001b[1mWARNING: This is a development server. 
Do not use it in a production deployment. Use a production WSGI server instead.\u001b[0m\n", " * Running on http://localhost:24907\n", "\u001b[33mPress CTRL+C to quit\u001b[0m\n", "Opening ETCBC/bhsa-min in browser\n", "Press to stop the TF browser\n", "Kernel listening at port 14907\n", "127.0.0.1 - - [09/May/2023 14:33:34] \"GET / HTTP/1.1\" 200 -\n", "127.0.0.1 - - [09/May/2023 14:33:34] \"GET /server/static/base.css HTTP/1.1\" 200 -\n", "127.0.0.1 - - [09/May/2023 14:33:34] \"GET /server/static/display.css HTTP/1.1\" 200 -\n", "127.0.0.1 - - [09/May/2023 14:33:34] \"GET /server/static/highlight.css HTTP/1.1\" 200 -\n", "127.0.0.1 - - [09/May/2023 14:33:34] \"GET /server/static/fonts.css HTTP/1.1\" 200 -\n", "127.0.0.1 - - [09/May/2023 14:33:34] \"GET /server/static/index.css HTTP/1.1\" 200 -\n", "127.0.0.1 - - [09/May/2023 14:33:34] \"GET /server/static/fontawesome.css HTTP/1.1\" 200 -\n", "127.0.0.1 - - [09/May/2023 14:33:34] \"GET /server/static/jquery.js HTTP/1.1\" 200 -\n", "127.0.0.1 - - [09/May/2023 14:33:34] \"GET /server/static/tf3.0.js HTTP/1.1\" 200 -\n", "127.0.0.1 - - [09/May/2023 14:33:34] \"GET /server/static/icon.png HTTP/1.1\" 200 -\n", "127.0.0.1 - - [09/May/2023 14:33:34] \"GET /server/static/huc.png HTTP/1.1\" 200 -\n", "127.0.0.1 - - [09/May/2023 14:33:34] \"GET /data/static/logo.png HTTP/1.1\" 200 -\n", "127.0.0.1 - - [09/May/2023 14:33:34] \"GET /server/static/fonts/fa-solid-900.woff2 HTTP/1.1\" 200 -\n", "127.0.0.1 - - [09/May/2023 14:33:34] \"GET /server/static/fonts/SILEOT.woff HTTP/1.1\" 200 -\n", "127.0.0.1 - - [09/May/2023 14:33:34] \"GET /server/static/fonts/fa-regular-400.woff2 HTTP/1.1\" 200 -\n", "127.0.0.1 - - [09/May/2023 14:33:34] \"GET /server/static/favicon.ico HTTP/1.1\" 200 -\n", "127.0.0.1 - - [09/May/2023 14:33:34] \"POST /passage HTTP/1.1\" 200 -\n", "127.0.0.1 - - [09/May/2023 14:33:38] \"POST /passage HTTP/1.1\" 200 -\n", "127.0.0.1 - - [09/May/2023 14:33:40] \"POST /passage/3 HTTP/1.1\" 200 -\n", "^C\n", "keyboard 
interrupt!\n", "\n", "TF web server has stopped\n", "TF kernel has stopped\n" ] } ], "source": [ "!tf" ] }, { "cell_type": "code", "execution_count": null, "id": "34518fbf-5e07-49e0-8df0-ff5a153fed1c", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.1" } }, "nbformat": 4, "nbformat_minor": 5 }