{ "cells": [ { "cell_type": "code", "execution_count": 1, "id": "a9b3b3db-b82a-41da-971e-a10c51b61b0d", "metadata": {}, "outputs": [], "source": [ "%load_ext autoreload\n", "%autoreload 2" ] }, { "cell_type": "code", "execution_count": 2, "id": "18f23ba6-2e6f-4bc0-82bb-1d4a15f648be", "metadata": {}, "outputs": [], "source": [ "from tf.app import use\n", "from tf.browser.ner.annotate import Annotate" ] }, { "cell_type": "markdown", "id": "0db5ab44-5de2-473f-aa82-e60f3591de92", "metadata": {}, "source": [ "# Mark all occurrences of אלהים in Genesis as a named entity, but no others\n", "\n", "The same name often refers to different people, locations, or other entities in different books.\n", "\n", "Here we show how to mark up the occurrences of a name, but only insofar as they occur in the same book.\n", "\n", "More concretely, we mark up all occurrences of אלהים in Genesis.\n", "\n", "We use the basic Annotate API for this." ] }, { "cell_type": "code", "execution_count": 3, "id": "687d8bfe-5230-45a6-a780-c6ed6b7adf21", "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "**Locating corpus resources ...**" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "app: ~/text-fabric-data/github/ETCBC/bhsa/app" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "data: ~/text-fabric-data/github/ETCBC/bhsa/tf/2021" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "data: ~/text-fabric-data/github/ETCBC/phono/tf/2021" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "data: ~/text-fabric-data/github/ETCBC/parallels/tf/2021" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", " TF: TF API 12.1.2, ETCBC/bhsa/app v3, Search Reference
\n", " Data: ETCBC - bhsa 2021, Character table, Feature docs
\n", "
Node types\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "
Name# of nodes# slots / node% coverage
book3910938.21100
chapter929459.19100
lex923046.22100
verse2321318.38100
half_verse451799.44100
sentence637176.70100
sentence_atom645146.61100
clause881314.84100
clause_atom907044.70100
phrase2532031.68100
phrase_atom2675321.59100
subphrase1138501.4238
word4265901.00100
\n", " Sets: no custom sets
\n", " Features:
\n", "
Parallel Passages\n", "
\n", "\n", "
\n", "
\n", "crossref\n", "
\n", "
int
\n", "\n", " 🆗 links between similar passages\n", "\n", "
\n", "\n", "
\n", "
\n", "\n", "
BHSA = Biblia Hebraica Stuttgartensia Amstelodamensis\n", "
\n", "\n", "
\n", "
\n", "book\n", "
\n", "
str
\n", "\n", " ✅ book name in Latin (Genesis; Numeri; Reges1; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "book@ll\n", "
\n", "
str
\n", "\n", " ✅ book name in amharic (ኣማርኛ)\n", "\n", "
\n", "\n", "
\n", "
\n", "chapter\n", "
\n", "
int
\n", "\n", " ✅ chapter number (1; 2; 3; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "code\n", "
\n", "
int
\n", "\n", " ✅ identifier of a clause atom relationship (0; 74; 367; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "det\n", "
\n", "
str
\n", "\n", " ✅ determinedness of phrase(atom) (det; und; NA.)\n", "\n", "
\n", "\n", "
\n", "
\n", "domain\n", "
\n", "
str
\n", "\n", " ✅ text type of clause (? (Unknown); N (narrative); D (discursive); Q (Quotation).)\n", "\n", "
\n", "\n", "
\n", "
\n", "freq_lex\n", "
\n", "
int
\n", "\n", " ✅ frequency of lexemes\n", "\n", "
\n", "\n", "
\n", "
\n", "function\n", "
\n", "
str
\n", "\n", " ✅ syntactic function of phrase (Cmpl; Objc; Pred; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "g_cons\n", "
\n", "
str
\n", "\n", " ✅ word consonantal-transliterated (B R>CJT BR> >LHJM ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "g_cons_utf8\n", "
\n", "
str
\n", "\n", " ✅ word consonantal-Hebrew (ב ראשׁית ברא אלהים)\n", "\n", "
\n", "\n", "
\n", "
\n", "g_lex\n", "
\n", "
str
\n", "\n", " ✅ lexeme pointed-transliterated (B.:- R;>CIJT B.@R@> >:ELOH ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "g_lex_utf8\n", "
\n", "
str
\n", "\n", " ✅ lexeme pointed-Hebrew (בְּ רֵאשִׁית בָּרָא אֱלֹה)\n", "\n", "
\n", "\n", "
\n", "
\n", "g_word\n", "
\n", "
str
\n", "\n", " ✅ word pointed-transliterated (B.:- R;>CI73JT B.@R@74> >:ELOHI92JM)\n", "\n", "
\n", "\n", "
\n", "
\n", "g_word_utf8\n", "
\n", "
str
\n", "\n", " ✅ word pointed-Hebrew (בְּ רֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים)\n", "\n", "
\n", "\n", "
\n", "
\n", "gloss\n", "
\n", "
str
\n", "\n", " 🆗 english translation of lexeme (beginning create god(s))\n", "\n", "
\n", "\n", "
\n", "
\n", "gn\n", "
\n", "
str
\n", "\n", " ✅ grammatical gender (m; f; NA; unknown.)\n", "\n", "
\n", "\n", "
\n", "
\n", "label\n", "
\n", "
str
\n", "\n", " ✅ (half-)verse label (half verses: A; B; C; verses: GEN 01,02)\n", "\n", "
\n", "\n", "
\n", "
\n", "language\n", "
\n", "
str
\n", "\n", " ✅ of word or lexeme (Hebrew; Aramaic.)\n", "\n", "
\n", "\n", "
\n", "
\n", "lex\n", "
\n", "
str
\n", "\n", " ✅ lexeme consonantal-transliterated (B R>CJT/ BR>[ >LHJM/)\n", "\n", "
\n", "\n", "
\n", "
\n", "lex_utf8\n", "
\n", "
str
\n", "\n", " ✅ lexeme consonantal-Hebrew (ב ראשׁית֜ ברא אלהים֜)\n", "\n", "
\n", "\n", "
\n", "
\n", "ls\n", "
\n", "
str
\n", "\n", " ✅ lexical set, subclassification of part-of-speech (card; ques; mult)\n", "\n", "
\n", "\n", "
\n", "
\n", "nametype\n", "
\n", "
str
\n", "\n", " ⚠️ named entity type (pers; mens; gens; topo; ppde.)\n", "\n", "
\n", "\n", "
\n", "
\n", "nme\n", "
\n", "
str
\n", "\n", " ✅ nominal ending consonantal-transliterated (absent; n/a; JM, ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "nu\n", "
\n", "
str
\n", "\n", " ✅ grammatical number (sg; du; pl; NA; unknown.)\n", "\n", "
\n", "\n", "
\n", "
\n", "number\n", "
\n", "
int
\n", "\n", " ✅ sequence number of an object within its context\n", "\n", "
\n", "\n", "
\n", "
\n", "otype\n", "
\n", "
str
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "
\n", "pargr\n", "
\n", "
str
\n", "\n", " 🆗 hierarchical paragraph number (1; 1.2; 1.2.3.4; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "pdp\n", "
\n", "
str
\n", "\n", " ✅ phrase dependent part-of-speech (art; verb; subs; nmpr, ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "pfm\n", "
\n", "
str
\n", "\n", " ✅ preformative consonantal-transliterated (absent; n/a; J, ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "prs\n", "
\n", "
str
\n", "\n", " ✅ pronominal suffix consonantal-transliterated (absent; n/a; W; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "prs_gn\n", "
\n", "
str
\n", "\n", " ✅ pronominal suffix gender (m; f; NA; unknown.)\n", "\n", "
\n", "\n", "
\n", "
\n", "prs_nu\n", "
\n", "
str
\n", "\n", " ✅ pronominal suffix number (sg; du; pl; NA; unknown.)\n", "\n", "
\n", "\n", "
\n", "
\n", "prs_ps\n", "
\n", "
str
\n", "\n", " ✅ pronominal suffix person (p1; p2; p3; NA; unknown.)\n", "\n", "
\n", "\n", "
\n", "
\n", "ps\n", "
\n", "
str
\n", "\n", " ✅ grammatical person (p1; p2; p3; NA; unknown.)\n", "\n", "
\n", "\n", "
\n", "
\n", "qere\n", "
\n", "
str
\n", "\n", " ✅ word pointed-transliterated masoretic reading correction\n", "\n", "
\n", "\n", "
\n", "
\n", "qere_trailer\n", "
\n", "
str
\n", "\n", " ✅ interword material -pointed-transliterated (Masoretic correction)\n", "\n", "
\n", "\n", "
\n", "
\n", "qere_trailer_utf8\n", "
\n", "
str
\n", "\n", " ✅ interword material -pointed-transliterated (Masoretic correction)\n", "\n", "
\n", "\n", "
\n", "
\n", "qere_utf8\n", "
\n", "
str
\n", "\n", " ✅ word pointed-Hebrew masoretic reading correction\n", "\n", "
\n", "\n", "
\n", "
\n", "rank_lex\n", "
\n", "
int
\n", "\n", " ✅ ranking of lexemes based on freqnuecy\n", "\n", "
\n", "\n", "
\n", "
\n", "rela\n", "
\n", "
str
\n", "\n", " ✅ linguistic relation between clause/(sub)phrase(atom) (ADJ; MOD; ATR; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "sp\n", "
\n", "
str
\n", "\n", " ✅ part-of-speech (art; verb; subs; nmpr, ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "st\n", "
\n", "
str
\n", "\n", " ✅ state of a noun (a (absolute); c (construct); e (emphatic).)\n", "\n", "
\n", "\n", "
\n", "
\n", "tab\n", "
\n", "
int
\n", "\n", " ✅ clause atom: its level in the linguistic embedding\n", "\n", "
\n", "\n", "
\n", "
\n", "trailer\n", "
\n", "
str
\n", "\n", " ✅ interword material pointed-transliterated (& 00 05 00_P ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "trailer_utf8\n", "
\n", "
str
\n", "\n", " ✅ interword material pointed-Hebrew (־ ׃)\n", "\n", "
\n", "\n", "
\n", "
\n", "txt\n", "
\n", "
str
\n", "\n", " ✅ text type of clause and surrounding (repetion of ? N D Q as in feature domain)\n", "\n", "
\n", "\n", "
\n", "
\n", "typ\n", "
\n", "
str
\n", "\n", " ✅ clause/phrase(atom) type (VP; NP; Ellp; Ptcp; WayX)\n", "\n", "
\n", "\n", "
\n", "
\n", "uvf\n", "
\n", "
str
\n", "\n", " ✅ univalent final consonant consonantal-transliterated (absent; N; J; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "vbe\n", "
\n", "
str
\n", "\n", " ✅ verbal ending consonantal-transliterated (n/a; W; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "vbs\n", "
\n", "
str
\n", "\n", " ✅ root formation consonantal-transliterated (absent; n/a; H; ...)\n", "\n", "
\n", "\n", "
\n", "
\n", "verse\n", "
\n", "
int
\n", "\n", " ✅ verse number\n", "\n", "
\n", "\n", "
\n", "
\n", "voc_lex\n", "
\n", "
str
\n", "\n", " ✅ vocalized lexeme pointed-transliterated (B.: R;>CIJT BR> >:ELOHIJM)\n", "\n", "
\n", "\n", "
\n", "
\n", "voc_lex_utf8\n", "
\n", "
str
\n", "\n", " ✅ vocalized lexeme pointed-Hebrew (בְּ רֵאשִׁית ברא אֱלֹהִים)\n", "\n", "
\n", "\n", "
\n", "
\n", "vs\n", "
\n", "
str
\n", "\n", " ✅ verbal stem (qal; piel; hif; apel; pael)\n", "\n", "
\n", "\n", "
\n", "
\n", "vt\n", "
\n", "
str
\n", "\n", " ✅ verbal tense (perf; impv; wayq; infc)\n", "\n", "
\n", "\n", "
\n", "
\n", "mother\n", "
\n", "
none
\n", "\n", " ✅ linguistic dependency between textual objects\n", "\n", "
\n", "\n", "
\n", "
\n", "oslots\n", "
\n", "
none
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "
\n", "\n", "
Phonetic Transcriptions\n", "
\n", "\n", "
\n", "
\n", "phono\n", "
\n", "
str
\n", "\n", " 🆗 phonological transcription (bᵊ rēšˌîṯ bārˈā ʔᵉlōhˈîm)\n", "\n", "
\n", "\n", "
\n", "
\n", "phono_trailer\n", "
\n", "
str
\n", "\n", " 🆗 interword material in phonological transcription\n", "\n", "
\n", "\n", "
\n", "
\n", "\n", " Settings:
specified
  1. apiVersion: 3
  2. appName: ETCBC/bhsa
  3. appPath: /Users/me/text-fabric-data/github/ETCBC/bhsa/app
  4. commit: gb112c161cfd21eae403d51a2733740d8743460e7
  5. css: ''
  6. dataDisplay:
    • exampleSectionHtml:<code>Genesis 1:1</code> (use <a href=\"https://github.com/{org}/{repo}/blob/master/tf/{version}/book%40en.tf\" target=\"_blank\">English book names</a>)
    • excludedFeatures:
      • g_uvf_utf8
      • g_vbs
      • kq_hybrid
      • languageISO
      • g_nme
      • lex0
      • is_root
      • g_vbs_utf8
      • g_uvf
      • dist
      • root
      • suffix_person
      • g_vbe
      • dist_unit
      • suffix_number
      • distributional_parent
      • kq_hybrid_utf8
      • crossrefSET
      • instruction
      • g_prs
      • lexeme_count
      • rank_occ
      • g_pfm_utf8
      • freq_occ
      • crossrefLCS
      • functional_parent
      • g_pfm
      • g_nme_utf8
      • g_vbe_utf8
      • kind
      • g_prs_utf8
      • suffix_gender
      • mother_object_type
    • noneValues:
      • absent
      • n/a
      • none
      • unknown
      • no value
      • NA
  7. docs:
    • docBase: {docRoot}/{repo}
    • docExt: ''
    • docPage: ''
    • docRoot: https://{org}.github.io
    • featurePage: 0_home
  8. interfaceDefaults: {}
  9. isCompatible: True
  10. local: local
  11. localDir: /Users/me/text-fabric-data/github/ETCBC/bhsa/_temp
  12. provenanceSpec:
    • corpus: BHSA = Biblia Hebraica Stuttgartensia Amstelodamensis
    • doi: 10.5281/zenodo.1007624
    • extraData: ner
    • moduleSpecs:
      • :
        • backend: no value
        • corpus: Phonetic Transcriptions
        • docUrl:https://nbviewer.jupyter.org/github/etcbc/phono/blob/master/programs/phono.ipynb
        • doi: 10.5281/zenodo.1007636
        • org: ETCBC
        • relative: /tf
        • repo: phono
      • :
        • backend: no value
        • corpus: Parallel Passages
        • docUrl:https://nbviewer.jupyter.org/github/ETCBC/parallels/blob/master/programs/parallels.ipynb
        • doi: 10.5281/zenodo.1007642
        • org: ETCBC
        • relative: /tf
        • repo: parallels
    • org: ETCBC
    • relative: /tf
    • repo: bhsa
    • version: 2021
    • webBase: https://shebanq.ancient-data.org/hebrew
    • webHint: Show this on SHEBANQ
    • webLang: la
    • webLexId: True
    • webUrl:{webBase}/text?book=<1>&chapter=<2>&verse=<3>&version={version}&mr=m&qw=q&tp=txt_p&tr=hb&wget=v&qget=v&nget=vt
    • webUrlLex: {webBase}/word?version={version}&id=<lid>
  13. release: v1.8.1
  14. typeDisplay:
    • clause:
      • label: {typ} {rela}
      • style: ''
    • clause_atom:
      • hidden: True
      • label: {code}
      • level: 1
      • style: ''
    • half_verse:
      • hidden: True
      • label: {label}
      • style: ''
      • verselike: True
    • lex:
      • featuresBare: gloss
      • label: {voc_lex_utf8}
      • lexOcc: word
      • style: orig
      • template: {voc_lex_utf8}
    • phrase:
      • label: {typ} {function}
      • style: ''
    • phrase_atom:
      • hidden: True
      • label: {typ} {rela}
      • level: 1
      • style: ''
    • sentence:
      • label: {number}
      • style: ''
    • sentence_atom:
      • hidden: True
      • label: {number}
      • level: 1
      • style: ''
    • subphrase:
      • hidden: True
      • label: {number}
      • style: ''
    • word:
      • features: pdp vs vt
      • featuresBare: lex:gloss
  15. writing: hbo
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "A = use(\"ETCBC/bhsa\")" ] }, { "cell_type": "markdown", "id": "98c2c8b0-faab-4bb7-a73e-d848bf7a0064", "metadata": {}, "source": [ "## Get the desired occurrences by normal TF methods\n", "\n", "We look for the words whose lexeme is `>LHJM/` in Genesis." ] }, { "cell_type": "code", "execution_count": 4, "id": "b9a5407b-c9ee-4e6c-8877-f9255d8c8365", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.16s 219 results\n" ] } ], "source": [ "words = [result[1] for result in A.search(\"\"\"\n", "book book=Genesis\n", " word lex=>LHJM/\n", "\"\"\")]" ] }, { "cell_type": "markdown", "id": "eb9e9ea5-8378-456d-be68-c2495015be52", "metadata": {}, "source": [ "## Add the entity with suitable values" ] }, { "cell_type": "code", "execution_count": 5, "id": "4d3f1723-d402-4706-954c-a90a018944a2", "metadata": {}, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "NE = A.makeNer()" ] }, { "cell_type": "markdown", "id": "e30151d4-184d-46b5-b550-b766a28459a8", "metadata": {}, "source": [ "We set up a set to add the entities to:" ] }, { "cell_type": "code", "execution_count": 6, "id": "844d19ad-e149-4d32-8ca2-9b98d7591e0e", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Annotation set elohim has 215 annotations\n" ] } ], "source": [ "NE.setSet(\"elohim\")" ] }, { "cell_type": "code", "execution_count": 7, "id": "b8ee13f2-0d22-4581-96f1-4a8100ecf51a", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Already present: 0 x\n", "Added: 219 x\n" ] } ], "source": [ "NE.addEntity((\"elohim\", 
\"per\"), tuple((word,) for word in words), silent=False)" ] }, { "cell_type": "markdown", "id": "44f6cc13-ee92-4731-be31-c51fd1197260", "metadata": {}, "source": [ "## Show the new entities\n", "\n", "Now we can retrieve the verses where these entities occur:" ] }, { "cell_type": "code", "execution_count": 8, "id": "b562c667-9d5f-4b0a-ab70-3a9f368ccd26", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "190 verses\n" ] } ], "source": [ "content = NE.filterContent(eVals=(\"elohim\", \"per\"))" ] }, { "cell_type": "markdown", "id": "d44a14e4-f2c1-4700-b0f6-600d76920f32", "metadata": {}, "source": [ "Let's show the first and last 5 verses:" ] }, { "cell_type": "code", "execution_count": 9, "id": "e863f8df-4ac2-4823-8d85-05194725db22", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Genesis 1:1 בראשׁית ברא 1elohim per 219אלהים 1את השׁמים ואת הארץ׃
Genesis 1:2 והארץ היתה תהו ובהו וחשׁך על־פני תהום ורוח 1elohim per 219אלהים 1מרחפת על־פני המים׃
Genesis 1:3 ויאמר 1elohim div 2151elohim per 219אלהים 11יהי אור ויהי־אור׃
Genesis 1:4 וירא 1elohim div 2151elohim per 219אלהים 11את־האור כי־טוב ויבדל 1elohim div 2151elohim per 219אלהים 11בין האור ובין החשׁך׃
Genesis 1:5 ויקרא 1elohim div 2151elohim per 219אלהים׀ 11לאור יום ולחשׁך קרא לילה ויהי־ערב ויהי־בקר יום אחד׃ פ
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "NE.showContent(content, end=5)" ] }, { "cell_type": "code", "execution_count": 10, "id": "b3ec8aab-41d5-44ac-baf4-28451425d375", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Genesis 48:21 ויאמר ישׂראל אל־יוסף הנה אנכי מת והיה 1elohim div 2151elohim per 219אלהים 11עמכם והשׁיב אתכם אל־ארץ אבתיכם׃
Genesis 50:17 כה־תאמרו ליוסף אנא שׂא נא פשׁע אחיך וחטאתם כי־רעה גמלוך ועתה שׂא נא לפשׁע עבדי 1elohim div 2151elohim per 219אלהי 11אביך ויבך יוסף בדברם אליו׃
Genesis 50:19 ויאמר אלהם יוסף אל־תיראו כי התחת 1elohim div 2151elohim per 219אלהים 11אני׃
Genesis 50:20 ואתם חשׁבתם עלי רעה 1elohim div 2151elohim per 219אלהים 11חשׁבה לטבה למען עשׂה כיום הזה להחית עם־רב׃
Genesis 50:24 ויאמר יוסף אל־אחיו אנכי מת ו1elohim per 219אלהים 1פקד יפקד אתכם והעלה אתכם מן־הארץ הזאת אל־הארץ אשׁר נשׁבע לאברהם ליצחק וליעקב׃
Genesis 50:25 וישׁבע יוסף את־בני ישׂראל לאמר פקד יפקד 1elohim per 219אלהים 1אתכם והעלתם את־עצמתי מזה׃
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "NE.showContent(content, start=185)" ] }, { "cell_type": "markdown", "id": "68612e50-5c45-4180-b5b9-a276d61c642c", "metadata": {}, "source": [ "## Correct the new entities\n", "\n", "If we want to change the entity kind from `per` to `div`, we remove these entities first: " ] }, { "cell_type": "code", "execution_count": 11, "id": "6dca0297-91aa-4c7e-a0d3-ace6dcd7f22f", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Not present: 0 x\n", "Deleted: 219 x\n" ] } ], "source": [ "NE.delEntity((\"elohim\", \"per\"), tuple((word,) for word in words), silent=False)" ] }, { "cell_type": "markdown", "id": "d20d02dc-5c3a-400b-a00e-8ab320012af1", "metadata": {}, "source": [ "And then add them with the right kind:" ] }, { "cell_type": "code", "execution_count": 12, "id": "320dca86-1c22-43ab-90f5-a751a1684e3f", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Already present: 215 x\n", "Added: 4 x\n" ] } ], "source": [ "NE.addEntity((\"elohim\", \"div\"), tuple((word,) for word in words), silent=False)" ] }, { "cell_type": "markdown", "id": "f7e3833b-654b-4749-8117-6746f9863f17", "metadata": {}, "source": [ "With a check:" ] }, { "cell_type": "code", "execution_count": 13, "id": "8533099e-0364-49a1-9340-039a558fbe72", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "190 verses\n" ] } ], "source": [ "content = NE.filterContent(eVals=(\"elohim\", \"div\"))" ] }, { "cell_type": "code", "execution_count": 14, "id": "6f8fb474-47fc-4c7d-a898-dec8b0a2db2a", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Genesis 1:1 בראשׁית ברא 1elohim div 219אלהים 1את השׁמים ואת הארץ׃
Genesis 1:2 והארץ היתה תהו ובהו וחשׁך על־פני תהום ורוח 1elohim div 219אלהים 1מרחפת על־פני המים׃
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "NE.showContent(content, end=2)" ] }, { "cell_type": "markdown", "id": "1796e514-52d6-4b5b-93e6-c264e7569a69", "metadata": {}, "source": [ "## Fine tune the new entities\n", "\n", "If you start the TF browser and switch to the annotate tool, you'll find the new set `elohim` there, and you\n", "can make manual adjustments to individual occurrences." ] }, { "cell_type": "code", "execution_count": 15, "id": "ab98f324-a69d-44ea-8189-6ddf64e53d71", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "This is Text-Fabric 12.1.2\n", "appName='ETCBC/bhsa'\n", "slug='ETCBC/bhsa'\n", "Loading TF corpus data. Please wait ...\n", "Setting up TF browser for ETCBC/bhsa \n", "**Locating corpus resources ...**\n", "Using app in ~/text-fabric-data/github/ETCBC/bhsa/app:\n", "\trv1.8.1=#gb112c161cfd21eae403d51a2733740d8743460e7 offline under ~/text-fabric-data/github (local release)\n", "Using data in ~/text-fabric-data/github/ETCBC/bhsa/tf/2021:\n", "\trv1.8.1=#gb112c161cfd21eae403d51a2733740d8743460e7 offline under ~/text-fabric-data/github (local release)\n", "Using data in ~/text-fabric-data/github/ETCBC/phono/tf/2021:\n", "\trv2.1=#gaba4367b49750089e4e4122415a77cac43bd97bc offline under ~/text-fabric-data/github (local release)\n", "Using data in ~/text-fabric-data/github/ETCBC/parallels/tf/2021:\n", "\trv2.1=#gf45f6cc3c4f933dba6e649f49cdb14a40dcf333f offline under ~/text-fabric-data/github (local release)\n", "TF setup done.\n", "\u001b[31m\u001b[1mWARNING: This is a development server. Do not use it in a production deployment. 
Use a production WSGI server instead.\u001b[0m\n", " * Running on http://localhost:16650\n", "\u001b[33mPress CTRL+C to quit\u001b[0m\n", "o-o-o\n", "Opening corpus in Chrome browser\n", "o-o-o\n", "Press to stop the TF browser\n", "127.0.0.1 - - [01/Nov/2023 11:23:49] \"GET /ner/index HTTP/1.1\" 200 -\n", "127.0.0.1 - - [01/Nov/2023 11:23:52] \"\u001b[36mGET /browser/static/colors.css HTTP/1.1\u001b[0m\" 304 -\n", "127.0.0.1 - - [01/Nov/2023 11:23:52] \"\u001b[36mGET /browser/static/highlight.css HTTP/1.1\u001b[0m\" 304 -\n", "127.0.0.1 - - [01/Nov/2023 11:23:52] \"\u001b[36mGET /ner/static/index.css HTTP/1.1\u001b[0m\" 304 -\n", "127.0.0.1 - - [01/Nov/2023 11:23:52] \"\u001b[36mGET /browser/static/fonts.css HTTP/1.1\u001b[0m\" 304 -\n", "127.0.0.1 - - [01/Nov/2023 11:23:52] \"\u001b[36mGET /ner/static/base.css HTTP/1.1\u001b[0m\" 304 -\n", "127.0.0.1 - - [01/Nov/2023 11:23:52] \"\u001b[36mGET /browser/static/display.css HTTP/1.1\u001b[0m\" 304 -\n", "127.0.0.1 - - [01/Nov/2023 11:23:52] \"\u001b[36mGET /browser/static/jquery.js HTTP/1.1\u001b[0m\" 304 -\n", "127.0.0.1 - - [01/Nov/2023 11:23:52] \"\u001b[36mGET /ner/static/tool.js HTTP/1.1\u001b[0m\" 304 -\n", "127.0.0.1 - - [01/Nov/2023 11:23:59] \"POST /ner/index HTTP/1.1\" 200 -\n", "127.0.0.1 - - [01/Nov/2023 11:23:59] \"\u001b[36mGET /browser/static/colors.css HTTP/1.1\u001b[0m\" 304 -\n", "127.0.0.1 - - [01/Nov/2023 11:23:59] \"\u001b[36mGET /browser/static/display.css HTTP/1.1\u001b[0m\" 304 -\n", "127.0.0.1 - - [01/Nov/2023 11:23:59] \"\u001b[36mGET /browser/static/highlight.css HTTP/1.1\u001b[0m\" 304 -\n", "127.0.0.1 - - [01/Nov/2023 11:23:59] \"\u001b[36mGET /browser/static/fonts.css HTTP/1.1\u001b[0m\" 304 -\n", "127.0.0.1 - - [01/Nov/2023 11:23:59] \"\u001b[36mGET /ner/static/base.css HTTP/1.1\u001b[0m\" 304 -\n", "127.0.0.1 - - [01/Nov/2023 11:23:59] \"\u001b[36mGET /ner/static/index.css HTTP/1.1\u001b[0m\" 304 -\n", "127.0.0.1 - - [01/Nov/2023 11:23:59] \"\u001b[36mGET /browser/static/jquery.js 
HTTP/1.1\u001b[0m\" 304 -\n", "127.0.0.1 - - [01/Nov/2023 11:23:59] \"\u001b[36mGET /ner/static/tool.js HTTP/1.1\u001b[0m\" 304 -\n", "127.0.0.1 - - [01/Nov/2023 11:24:02] \"POST /ner/index HTTP/1.1\" 200 -\n", "127.0.0.1 - - [01/Nov/2023 11:24:02] \"\u001b[36mGET /browser/static/colors.css HTTP/1.1\u001b[0m\" 304 -\n", "127.0.0.1 - - [01/Nov/2023 11:24:02] \"\u001b[36mGET /browser/static/display.css HTTP/1.1\u001b[0m\" 304 -\n", "127.0.0.1 - - [01/Nov/2023 11:24:02] \"\u001b[36mGET /browser/static/highlight.css HTTP/1.1\u001b[0m\" 304 -\n", "127.0.0.1 - - [01/Nov/2023 11:24:02] \"\u001b[36mGET /browser/static/fonts.css HTTP/1.1\u001b[0m\" 304 -\n", "127.0.0.1 - - [01/Nov/2023 11:24:02] \"\u001b[36mGET /ner/static/base.css HTTP/1.1\u001b[0m\" 304 -\n", "127.0.0.1 - - [01/Nov/2023 11:24:02] \"\u001b[36mGET /ner/static/index.css HTTP/1.1\u001b[0m\" 304 -\n", "127.0.0.1 - - [01/Nov/2023 11:24:02] \"\u001b[36mGET /browser/static/jquery.js HTTP/1.1\u001b[0m\" 304 -\n", "127.0.0.1 - - [01/Nov/2023 11:24:02] \"\u001b[36mGET /ner/static/tool.js HTTP/1.1\u001b[0m\" 304 -\n", "127.0.0.1 - - [01/Nov/2023 11:24:22] \"POST /ner/index HTTP/1.1\" 200 -\n", "127.0.0.1 - - [01/Nov/2023 11:24:22] \"\u001b[36mGET /browser/static/colors.css HTTP/1.1\u001b[0m\" 304 -\n", "127.0.0.1 - - [01/Nov/2023 11:24:22] \"\u001b[36mGET /browser/static/display.css HTTP/1.1\u001b[0m\" 304 -\n", "127.0.0.1 - - [01/Nov/2023 11:24:22] \"\u001b[36mGET /browser/static/highlight.css HTTP/1.1\u001b[0m\" 304 -\n", "127.0.0.1 - - [01/Nov/2023 11:24:22] \"\u001b[36mGET /browser/static/fonts.css HTTP/1.1\u001b[0m\" 304 -\n", "127.0.0.1 - - [01/Nov/2023 11:24:22] \"\u001b[36mGET /ner/static/base.css HTTP/1.1\u001b[0m\" 304 -\n", "127.0.0.1 - - [01/Nov/2023 11:24:22] \"\u001b[36mGET /ner/static/index.css HTTP/1.1\u001b[0m\" 304 -\n", "127.0.0.1 - - [01/Nov/2023 11:24:22] \"\u001b[36mGET /browser/static/jquery.js HTTP/1.1\u001b[0m\" 304 -\n", "127.0.0.1 - - [01/Nov/2023 11:24:22] \"\u001b[36mGET 
/ner/static/tool.js HTTP/1.1\u001b[0m\" 304 -\n", "^C\n", "TF web server has stopped\n" ] } ], "source": [ "!tf --chrome --tool=ner ETCBC/bhsa " ] }, { "cell_type": "markdown", "id": "98577ec1-d402-4915-9d75-7e805a63fd51", "metadata": {}, "source": [ "And then you see something like this (after selecting the `elohim` set and clicking the `elohim` entity\n", "in the left column):\n", "\n", "![ner1](images/ner1.png)" ] }, { "cell_type": "markdown", "id": "ecb1f8d7-6f21-40c1-ac0e-6412dd753518", "metadata": {}, "source": [ "You can click on individual occurrences and delete them. Let's delete the first two and the last two occurrences:\n", "\n", "![ner2](images/ner2.png)" ] }, { "cell_type": "markdown", "id": "09f55e7e-ee39-4252-a06a-fdc62eaec907", "metadata": {}, "source": [ "Now, back in our notebook, we retrieve the occurrences of the `elohim` entity after this change.\n", "But first we have to stop the TF browser and reload the `elohim` set.\n", "\n", "The fact that we need to stop the TF browser before we can continue with our notebook is an indication\n", "that this is not the recommended way to start the TF browser.\n", "\n", "It is better to start it independently from a command prompt.\n", "Anyway, you can stop the browser by interrupting the kernel (the black square button in the toolbar at the top\n", "of the notebook interface)." 
] }, { "cell_type": "code", "execution_count": 16, "id": "1dbe353d-69dd-4875-98d2-29b63f3c39e4", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Annotation set elohim has 215 annotations\n" ] } ], "source": [ "NE.setSet(\"elohim\")" ] }, { "cell_type": "code", "execution_count": 17, "id": "4b62ce31-ce99-472d-9642-c6b4ab44052e", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "186 verses\n" ] } ], "source": [ "content = NE.filterContent(eVals=(\"elohim\", \"div\"))" ] }, { "cell_type": "markdown", "id": "fa714bc3-f0fc-4dae-800a-4df7be925c51", "metadata": {}, "source": [ "Can we find the exact occurrences that we have deleted?\n", "\n", "Yes, but there is no dedicated method (yet) to do it.\n", "\n", "Instead, we have to dig into the various representations of the entity data that the tool maintains under the hood,\n", "see\n", "[processed data](https://annotation.github.io/text-fabric/tf/browser/ner/data.html#tf.browser.ner.data.Data.process).\n", "\n", "We want `entityVal`." 
] }, { "cell_type": "code", "execution_count": 18, "id": "4ef21376-d2e7-49ff-b03f-cd2734162b60", "metadata": {}, "outputs": [], "source": [ "occurrences = NE.getSetData().entityVal[(\"elohim\", \"div\")]" ] }, { "cell_type": "code", "execution_count": 19, "id": "13180a3e-cd40-4b65-a940-7c3c5246ba78", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "set" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(occurrences)" ] }, { "cell_type": "code", "execution_count": 20, "id": "49ebd58f-0f9f-4b9d-a954-c42c1bd83858", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "215" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(occurrences)" ] }, { "cell_type": "markdown", "id": "87c41a64-59bd-4162-880c-42cd805a7645", "metadata": {}, "source": [ "Each occurrence is a tuple of slots.\n", "In general, occurrences of entities can consist of multiple words, hence *tuples* of slots and not\n", "individual slots.\n", "But in our case, the entities *are* single words, so we can get the words as follows:" ] }, { "cell_type": "code", "execution_count": 21, "id": "2ea39684-95ce-4639-ba3f-357c4f6d630d", "metadata": {}, "outputs": [], "source": [ "currentWords = {occ[0] for occ in occurrences}" ] }, { "cell_type": "markdown", "id": "08b49284-9296-4075-9965-da4c3f4d6091", "metadata": {}, "source": [ "Finally, we can detect which entities we have deleted:" ] }, { "cell_type": "code", "execution_count": 22, "id": "24c1928e-b730-4345-aa65-ffb36fb13915", "metadata": {}, "outputs": [], "source": [ "deletedWords = set(words) - currentWords" ] }, { "cell_type": "code", "execution_count": 23, "id": "240dc99d-38d7-43e0-9f3a-3a63229f27a2", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{4, 26, 28705, 28739}" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "deletedWords" ] }, { "cell_type": "markdown", "id": 
"c8aa8292-e972-4a5b-af16-98e3124f3ba8", "metadata": {}, "source": [ "That looks an awful lot like the first and last two occurrences of Elohim in Genesis.\n", "\n", "We can show them in the usual Text-Fabric way:" ] }, { "cell_type": "code", "execution_count": 24, "id": "97e5afa9-0de2-4320-ba77-431cb1b97949", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "
npword
1Genesis 1:1אֱלֹהִ֑ים
2Genesis 1:2אֱלֹהִ֔ים
3Genesis 50:24אלֹהִ֞ים
4Genesis 50:25אֱלֹהִים֙
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "A.table([(w,) for w in sorted(deletedWords)])" ] }, { "cell_type": "markdown", "id": "e137c99d-1c92-4b60-b597-3d86a155e881", "metadata": {}, "source": [ "We can also show these verses in the annotator view:" ] }, { "cell_type": "code", "execution_count": 26, "id": "dfbacddd-550d-4930-ad44-6d19d8f3c121", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "4 verses\n" ] }, { "data": { "text/html": [ "
Genesis 1:1 בראשׁית ברא אלהים את השׁמים ואת הארץ׃
Genesis 1:2 והארץ היתה תהו ובהו וחשׁך על־פני תהום ורוח אלהים מרחפת על־פני המים׃
Genesis 50:24 ויאמר יוסף אל־אחיו אנכי מת ואלהים פקד יפקד אתכם והעלה אתכם מן־הארץ הזאת אל־הארץ אשׁר נשׁבע לאברהם ליצחק וליעקב׃
Genesis 50:25 וישׁבע יוסף את־בני ישׂראל לאמר פקד יפקד אלהים אתכם והעלתם את־עצמתי מזה׃
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "L = A.api.L\n", "verses = {L.u(w, otype=\"verse\")[0] for w in deletedWords}\n", "content = NE.filterContent(buckets=verses)\n", "NE.showContent(content)" ] }, { "cell_type": "markdown", "id": "f9e3878d-be9e-4cc6-ae36-5c1bc28058e5", "metadata": {}, "source": [ "No entity is underlined, and in this case absence of evidence is evidence of absence.\n", "\n", "Because if we show a verse with entities, we'll see them:" ] }, { "cell_type": "code", "execution_count": 27, "id": "6208a013-4565-45de-856e-7f9a05c02163", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "5 verses\n" ] }, { "data": { "text/html": [ "
Genesis 1:1 בראשׁית ברא אלהים את השׁמים ואת הארץ׃
Genesis 1:2 והארץ היתה תהו ובהו וחשׁך על־פני תהום ורוח אלהים מרחפת על־פני המים׃
Genesis 1:3 ויאמר 1elohim div 215אלהים 1יהי אור ויהי־אור׃
Genesis 1:4 וירא 1elohim div 215אלהים 1את־האור כי־טוב ויבדל 1elohim div 215אלהים 1בין האור ובין החשׁך׃
Genesis 1:5 ויקרא 1elohim div 215אלהים׀ 1לאור יום ולחשׁך קרא לילה ויהי־ערב ויהי־בקר יום אחד׃ פ
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "vInit = {A.nodeFromSectionStr(f\"Genesis 1:{v}\") for v in range(1, 6)}\n", "content = NE.filterContent(buckets=vInit, qTokens=None)\n", "NE.showContent(content)" ] }, { "cell_type": "markdown", "id": "16902e0e-09c9-4838-955a-44abc11193f2", "metadata": {}, "source": [ "# Conclusion\n", "\n", "You have two interfaces to annotate named entities in your corpus, and both interfaces work\n", "with the same data.\n", "\n", "The browser interface is good for mass-marking occurrences of single entities, and also\n", "for fine-tuning by making small sets of exceptions.\n", "\n", "However, if you need to treat broad classes of occurrences of the same name in different ways,\n", "it is better to go over to a Jupyter Notebook and mark the entities programmatically.\n", "\n", "Then you can still return to the browser interface and make individual exceptions.\n", "\n", "What we did not show here is that in the programmatic interface you can also assign lots of different entities\n", "much more quickly than by hand in the browser interface." ] }, { "cell_type": "code", "execution_count": null, "id": "56011a45-7569-474e-abad-af351c2d19de", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.0" } }, "nbformat": 4, "nbformat_minor": 5 }