{ "cells": [ { "cell_type": "markdown", "id": "1cf27c95-0b45-4d97-a62d-9950654eb386", "metadata": {}, "source": [ "# Some corpus statistics (Nestle1904LFT)\n", "\n", "**Work in progress!**" ] }, { "cell_type": "markdown", "id": "1495a021-daa1-4c2e-80d5-ab7d2d75bc3f", "metadata": { "jp-MarkdownHeadingCollapsed": true, "tags": [] }, "source": [ "## Table of contents\n", "* 1 - Introduction\n", "* 2 - Load Text-Fabric app and data\n", "* 3 - Performing the queries\n", " * 3.1 - The 25 most frequent words in the corpus\n", " * 3.2 - Frequency of characters in corpus\n", " * 3.3 - Some stats on node types \n", " * 3.4 - The available text formats \n", " * 3.5 - List of feature frequencies \n", " * 3.6 - Frequency list of punctuations\n", " * 3.7 - Node number ranges\n", " * 3.8 - Count the objects per type\n", " * 3.9 - Obtain meta data for a feature" ] }, { "cell_type": "markdown", "id": "e6830070-1e97-4bdf-aa0c-5eda4e624a84", "metadata": {}, "source": [ "# 1 - Introduction \n", "##### [Back to TOC](#TOC)\n", "\n", "This Jupyter Notebook showcases several examples of statistical analysis performed on a Text-Fabric corpus. For demonstration purposes various methods of collecting and presenting the data are employed. 
" ] }, { "cell_type": "markdown", "id": "a1b900e2-995f-4f36-ad74-d821092ca02c", "metadata": {}, "source": [ "# 2 - Load Text-Fabric app and data \n", "##### [Back to TOC](#TOC)" ] }, { "cell_type": "code", "execution_count": 1, "id": "6bd6c621-361d-487f-a8df-c27fb1ec9de2", "metadata": { "tags": [] }, "outputs": [], "source": [ "%load_ext autoreload\n", "%autoreload 2" ] }, { "cell_type": "code", "execution_count": 2, "id": "0071a0db-916c-4357-88bd-6b3255af0764", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Loading the Text-Fabric code\n", "# Note: it is assumed Text-Fabric is installed in your environment\n", "from tf.fabric import Fabric\n", "from tf.app import use" ] }, { "cell_type": "code", "execution_count": 3, "id": "ed76db5d-5463-4bf1-99ca-7f14b3a0f277", "metadata": { "scrolled": true, "tags": [] }, "outputs": [ { "data": { "text/markdown": [ "**Locating corpus resources ...**" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "The requested app is not available offline\n", "\t~/text-fabric-data/github/tonyjurg/Nestle1904LFT/app not found\n" ] }, { "data": { "text/html": [ "Status: latest release online v0.6 versus None locally" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "downloading app, main data and requested additions ..." 
], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "app: ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/app" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "The requested data is not available offline\n", "\t~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6 not found\n" ] }, { "data": { "text/html": [ "Status: latest release online v0.6 versus None locally" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "downloading app, main data and requested additions ..." ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "data: ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ " | 0.21s T otype from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6\n", " | 2.31s T oslots from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6\n", " | 0.56s T wordtranslit from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6\n", " | 0.48s T after from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6\n", " | 0.59s T normalized from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6\n", " | 0.49s T chapter from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6\n", " | 0.61s T unicode from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6\n", " | 0.56s T book from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6\n", " | 0.46s T verse from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6\n", " | 0.59s T wordunacc from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6\n", " | 0.59s T word from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6\n", " | | 0.06s C __levels__ from otype, oslots, otext\n", " | | 1.79s C __order__ from otype, oslots, 
__levels__\n", " | | 0.07s C __rank__ from otype, __order__\n", " | | 3.35s C __levUp__ from otype, oslots, __rank__\n", " | | 1.94s C __levDown__ from otype, __levUp__, __rank__\n", " | | 0.21s C __characters__ from otext\n", " | | 0.92s C __boundary__ from otype, oslots, __rank__\n", " | | 0.04s C __sections__ from otype, oslots, otext, __levUp__, __levels__, book, chapter, verse\n", " | | 0.23s C __structure__ from otype, oslots, otext, __rank__, __levUp__, book, chapter, verse\n", " | 0.43s T booknumber from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6\n", " | 0.51s T bookshort from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6\n", " | 0.48s T case from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6\n", " | 0.33s T clausetype from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6\n", " | 0.57s T containedclause from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6\n", " | 0.42s T degree from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6\n", " | 0.57s T gloss from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6\n", " | 0.46s T gn from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6\n", " | 0.03s T headverse from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6\n", " | 0.32s T junction from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6\n", " | 0.57s T lemma from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6\n", " | 0.52s T lex_dom from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6\n", " | 0.53s T ln from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6\n", " | 0.41s T markafter from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6\n", " | 0.41s T markbefore from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6\n", " | 0.41s T markorder from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6\n", " | 0.45s T monad from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6\n", " | 0.44s T mood from 
~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6\n", " | 0.52s T morph from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6\n", " | 0.53s T nodeID from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6\n", " | 0.48s T nu from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6\n", " | 0.49s T number from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6\n", " | 0.43s T person from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6\n", " | 0.44s T punctuation from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6\n", " | 0.65s T ref from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6\n", " | 0.67s T reference from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6\n", " | 0.49s T roleclausedistance from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6\n", " | 0.47s T sentence from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6\n", " | 0.50s T sp from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6\n", " | 0.51s T sp_full from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6\n", " | 0.53s T strongs from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6\n", " | 0.45s T subj_ref from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6\n", " | 0.44s T tense from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6\n", " | 0.46s T type from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6\n", " | 0.45s T voice from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6\n", " | 0.40s T wgclass from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6\n", " | 0.35s T wglevel from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6\n", " | 0.43s T wgnum from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6\n", " | 0.36s T wgrole from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6\n", " | 0.35s T wgrolelong from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6\n", " | 0.41s T wgrule from 
~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6\n", " | 0.33s T wgtype from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6\n", " | 0.52s T wordlevel from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6\n", " | 0.49s T wordrole from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6\n", " | 0.51s T wordrolelong from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6\n" ] }, { "data": { "text/html": [ "\n", " TF: TF API 12.1.5, tonyjurg/Nestle1904LFT/app v3, Search Reference
\n", " Data: tonyjurg - Nestle1904LFT 0.6, Character table, Feature docs
\n", "
Node types\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "
Name | # of nodes | # slots / node | % coverage
--- | --- | --- | ---
book | 27 | 5102.93 | 100
chapter | 260 | 529.92 | 100
verse | 7943 | 17.35 | 100
sentence | 8011 | 17.20 | 100
wg | 105430 | 6.85 | 524
word | 137779 | 1.00 | 100
\n", " Sets: no custom sets
\n", " Features:
\n", "
Nestle 1904 (Low Fat Tree)\n", "
\n", "\n", "
\n", "
\n", "after\n", "
\n", "
str
\n", "\n", " ✅ Characters (eg. punctuations) following the word\n", "\n", "
\n", "\n", "
\n", "
\n", "book\n", "
\n", "
str
\n", "\n", " ✅ Book name (in English language)\n", "\n", "
\n", "\n", "
\n", "
\n", "booknumber\n", "
\n", "
int
\n", "\n", " ✅ NT book number (Matthew=1, Mark=2, ..., Revelation=27)\n", "\n", "
\n", "\n", "
\n", "
\n", "bookshort\n", "
\n", "
str
\n", "\n", " ✅ Book name (abbreviated)\n", "\n", "
\n", "\n", "
\n", "
\n", "case\n", "
\n", "
str
\n", "\n", " ✅ Gramatical case (Nominative, Genitive, Dative, Accusative, Vocative)\n", "\n", "
\n", "\n", "
\n", "
\n", "chapter\n", "
\n", "
int
\n", "\n", " ✅ Chapter number inside book\n", "\n", "
\n", "\n", "
\n", "
\n", "clausetype\n", "
\n", "
str
\n", "\n", " ✅ Clause type details (e.g. Verbless, Minor)\n", "\n", "
\n", "\n", "
\n", "
\n", "containedclause\n", "
\n", "
str
\n", "\n", " 🆗 Contained clause (WG number)\n", "\n", "
\n", "\n", "
\n", "
\n", "degree\n", "
\n", "
str
\n", "\n", " ✅ Degree (e.g. Comparitative, Superlative)\n", "\n", "
\n", "\n", "
\n", "
\n", "gloss\n", "
\n", "
str
\n", "\n", " ✅ English gloss\n", "\n", "
\n", "\n", "
\n", "
\n", "gn\n", "
\n", "
str
\n", "\n", " ✅ Gramatical gender (Masculine, Feminine, Neuter)\n", "\n", "
\n", "\n", "
\n", "
\n", "headverse\n", "
\n", "
str
\n", "\n", " ✅ Start verse number of a sentence\n", "\n", "
\n", "\n", "
\n", "
\n", "junction\n", "
\n", "
str
\n", "\n", " ✅ Junction data related to a wordgroup\n", "\n", "
\n", "\n", "
\n", "
\n", "lemma\n", "
\n", "
str
\n", "\n", " ✅ Lexeme (lemma)\n", "\n", "
\n", "\n", "
\n", "
\n", "lex_dom\n", "
\n", "
str
\n", "\n", " ✅ Lexical domain according to Semantic Dictionary of Biblical Greek, SDBG (not present everywhere?)\n", "\n", "
\n", "\n", "
\n", "
\n", "ln\n", "
\n", "
str
\n", "\n", " ✅ Lauw-Nida lexical classification (not present everywhere?)\n", "\n", "
\n", "\n", "
\n", "
\n", "markafter\n", "
\n", "
str
\n", "\n", " 🆗 Text critical marker after word\n", "\n", "
\n", "\n", "
\n", "
\n", "markbefore\n", "
\n", "
str
\n", "\n", " 🆗 Text critical marker before word\n", "\n", "
\n", "\n", "
\n", "
\n", "markorder\n", "
\n", "
str
\n", "\n", "  Order of punctuation and text critical marker\n", "\n", "
\n", "\n", "
\n", "
\n", "monad\n", "
\n", "
int
\n", "\n", " ✅ Monad (smallest token matching word order in the corpus)\n", "\n", "
\n", "\n", "
\n", "
\n", "mood\n", "
\n", "
str
\n", "\n", " ✅ Gramatical mood of the verb (passive, etc)\n", "\n", "
\n", "\n", "
\n", "
\n", "morph\n", "
\n", "
str
\n", "\n", " ✅ Morphological tag (Sandborg-Petersen morphology)\n", "\n", "
\n", "\n", "
\n", "
\n", "nodeID\n", "
\n", "
str
\n", "\n", " ✅ Node ID (as in the XML source data)\n", "\n", "
\n", "\n", "
\n", "
\n", "normalized\n", "
\n", "
str
\n", "\n", " ✅ Surface word with accents normalized and trailing punctuations removed\n", "\n", "
\n", "\n", "
\n", "
\n", "nu\n", "
\n", "
str
\n", "\n", " ✅ Gramatical number (Singular, Plural)\n", "\n", "
\n", "\n", "
\n", "
\n", "number\n", "
\n", "
str
\n", "\n", " ✅ Gramatical number of the verb (e.g. singular, plural)\n", "\n", "
\n", "\n", "
\n", "
\n", "otype\n", "
\n", "
str
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "
\n", "person\n", "
\n", "
str
\n", "\n", " ✅ Gramatical person of the verb (first, second, third)\n", "\n", "
\n", "\n", "
\n", "
\n", "punctuation\n", "
\n", "
str
\n", "\n", " ✅ Punctuation after word\n", "\n", "
\n", "\n", "
\n", "
\n", "ref\n", "
\n", "
str
\n", "\n", " ✅ Value of the ref ID (taken from XML sourcedata)\n", "\n", "
\n", "\n", "
\n", "
\n", "reference\n", "
\n", "
str
\n", "\n", " ✅ Reference (to nodeID in XML source data, not yet post-processes)\n", "\n", "
\n", "\n", "
\n", "
\n", "roleclausedistance\n", "
\n", "
str
\n", "\n", " ⚠️ Distance to the wordgroup defining the syntactical role of this word\n", "\n", "
\n", "\n", "
\n", "
\n", "sentence\n", "
\n", "
int
\n", "\n", " ✅ Sentence number (counted per chapter)\n", "\n", "
\n", "\n", "
\n", "
\n", "sp\n", "
\n", "
str
\n", "\n", " ✅ Part of Speech (abbreviated)\n", "\n", "
\n", "\n", "
\n", "
\n", "sp_full\n", "
\n", "
str
\n", "\n", " ✅ Part of Speech (long description)\n", "\n", "
\n", "\n", "
\n", "
\n", "strongs\n", "
\n", "
str
\n", "\n", " ✅ Strongs number\n", "\n", "
\n", "\n", "
\n", "
\n", "subj_ref\n", "
\n", "
str
\n", "\n", " 🆗 Subject reference (to nodeID in XML source data, not yet post-processes)\n", "\n", "
\n", "\n", "
\n", "
\n", "tense\n", "
\n", "
str
\n", "\n", " ✅ Gramatical tense of the verb (e.g. Present, Aorist)\n", "\n", "
\n", "\n", "
\n", "
\n", "type\n", "
\n", "
str
\n", "\n", " ✅ Gramatical type of noun or pronoun (e.g. Common, Personal)\n", "\n", "
\n", "\n", "
\n", "
\n", "unicode\n", "
\n", "
str
\n", "\n", " ✅ Word as it apears in the text in Unicode (incl. punctuations)\n", "\n", "
\n", "\n", "
\n", "
\n", "verse\n", "
\n", "
int
\n", "\n", " ✅ Verse number inside chapter\n", "\n", "
\n", "\n", "
\n", "
\n", "voice\n", "
\n", "
str
\n", "\n", " ✅ Gramatical voice of the verb (e.g. active,passive)\n", "\n", "
\n", "\n", "
\n", "
\n", "wgclass\n", "
\n", "
str
\n", "\n", " ✅ Class of the wordgroup (e.g. cl, np, vp)\n", "\n", "
\n", "\n", "
\n", "
\n", "wglevel\n", "
\n", "
int
\n", "\n", " 🆗 Number of the parent wordgroups for a wordgroup\n", "\n", "
\n", "\n", "
\n", "
\n", "wgnum\n", "
\n", "
int
\n", "\n", " ✅ Wordgroup number (counted per book)\n", "\n", "
\n", "\n", "
\n", "
\n", "wgrole\n", "
\n", "
str
\n", "\n", " ✅ Syntactical role of the wordgroup (abbreviated)\n", "\n", "
\n", "\n", "
\n", "
\n", "wgrolelong\n", "
\n", "
str
\n", "\n", " ✅ Syntactical role of the wordgroup (full)\n", "\n", "
\n", "\n", "
\n", "
\n", "wgrule\n", "
\n", "
str
\n", "\n", " ✅ Wordgroup rule information (e.g. Np-Appos, ClCl2, PrepNp)\n", "\n", "
\n", "\n", "
\n", "
\n", "wgtype\n", "
\n", "
str
\n", "\n", " ✅ Wordgroup type details (e.g. group, apposition)\n", "\n", "
\n", "\n", "
\n", "
\n", "word\n", "
\n", "
str
\n", "\n", " ✅ Word as it appears in the text (excl. punctuations)\n", "\n", "
\n", "\n", "
\n", "
\n", "wordlevel\n", "
\n", "
str
\n", "\n", " 🆗 Number of the parent wordgroups for a word\n", "\n", "
\n", "\n", "
\n", "
\n", "wordrole\n", "
\n", "
str
\n", "\n", " ✅ Syntactical role of the word (abbreviated)\n", "\n", "
\n", "\n", "
\n", "
\n", "wordrolelong\n", "
\n", "
str
\n", "\n", " ✅ Syntactical role of the word (full)\n", "\n", "
\n", "\n", "
\n", "
\n", "wordtranslit\n", "
\n", "
str
\n", "\n", " 🆗 Transliteration of the text (in latin letters, excl. punctuations)\n", "\n", "
\n", "\n", "
\n", "
\n", "wordunacc\n", "
\n", "
str
\n", "\n", " ✅ Word without accents (excl. punctuations)\n", "\n", "
\n", "\n", "
\n", "
\n", "oslots\n", "
\n", "
none
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "
\n", "\n", " Settings:
specified
  1. apiVersion: 3
  2. appName: tonyjurg/Nestle1904LFT
  3. appPath:C:/Users/tonyj/text-fabric-data/github/tonyjurg/Nestle1904LFT/app
  4. commit: no value
  5. css: ''
  6. dataDisplay:
    • excludedFeatures:
      • orig_order
      • verse
      • book
      • chapter
    • noneValues:
      • none
      • unknown
      • no value
      • NA
      • ''
    • showVerseInTuple: 0
    • textFormat: text-orig-full
  7. docs:
    • docBase: https://github.com/tonyjurg/Nestle1904LFT/blob/main/docs/
    • docPage: about
    • docRoot: https://github.com/tonyjurg/Nestle1904LFT
    • featureBase:https://github.com/tonyjurg/Nestle1904LFT/blob/main/docs/features/<feature>.md
  8. interfaceDefaults: {fmt: layout-orig-full}
  9. isCompatible: True
  10. local: no value
  11. localDir:C:/Users/tonyj/text-fabric-data/github/tonyjurg/Nestle1904LFT/_temp
  12. provenanceSpec:
    • corpus: Nestle 1904 (Low Fat Tree)
    • doi: notyet
    • org: tonyjurg
    • relative: /tf
    • repo: Nestle1904LFT
    • repro: Nestle1904LFT
    • version: 0.6
    • webBase: https://learner.bible/text/show_text/nestle1904/
    • webHint: Show this on the Bible Online Learner website
    • webLang: en
    • webUrl:https://learner.bible/text/show_text/nestle1904/<1>/<2>/<3>
    • webUrlLex: {webBase}/word?version={version}&id=<lid>
  13. release: no value
  14. typeDisplay:
    • book:
      • condense: True
      • hidden: True
      • label: {book}
      • style: ''
    • chapter:
      • condense: True
      • hidden: True
      • label: {chapter}
      • style: ''
    • sentence:
      • hidden: 0
      • label: #{sentence} (start: {book} {chapter}:{headverse})
      • style: ''
    • verse:
      • condense: True
      • excludedFeatures: chapter verse
      • label: {book} {chapter}:{verse}
      • style: ''
    • wg:
      • hidden: 0
      • label:#{wgnum}: {wgtype} {wgclass} {clausetype} {wgrole} {wgrule} {junction}
      • style: ''
    • word:
      • base: True
      • features: lemma
      • featuresBare: gloss
      • surpress: chapter verse
  15. writing: grc
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
TF API: names N F E L T S C TF Fs Fall Es Eall Cs Call directly usable

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# load the N1904 app and data\n", "N1904 = use (\"tonyjurg/Nestle1904LFT\", version=\"0.6\", hoist=globals())" ] }, { "cell_type": "code", "execution_count": 6, "id": "d5da5d1a-6827-49b3-ad37-7ca29ba59b45", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# The following will push the Text-Fabric stylesheet to this notebook (to facilitate proper display with notebook viewer)\n", "N1904.dh(N1904.getCss())" ] }, { "cell_type": "code", "execution_count": 12, "id": "80c5a250-0785-46ed-bd51-c8e3e29205f6", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Set default view in a way to limit noise as much as possible.\n", "N1904.displaySetup(condensed=True, multiFeatures=False, queryFeatures=False)" ] }, { "cell_type": "markdown", "id": "58ef1678-a19d-4c0c-80f3-84f8471a90e2", "metadata": { "tags": [] }, "source": [ "# 3 - Performing the queries \n", "##### [Back to TOC](#TOC)" ] }, { "cell_type": "markdown", "id": "b59c83bd-329d-4820-8bcc-ca92e1c55f6d", "metadata": {}, "source": [ "## 3.1 - The 25 most frequent words in the corpus\n", "##### [Back to TOC](#TOC)\n", "\n", "The method [`freqList`](https://annotation.github.io/text-fabric/tf/core/nodefeature.html#tf.core.nodefeature.NodeFeature.freqList) returns A tuple of (value, frequency), items, ordered by frequency, highest frequencies first." 
] }, { "cell_type": "code", "execution_count": 4, "id": "1d4b1b93-08e5-41f4-a587-66e444a3e271", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Amount\tword\n", "8545\tκαὶ\n", "2769\tὁ\n", "2684\tἐν\n", "2620\tδὲ\n", "2497\tτοῦ\n", "1755\tεἰς\n", "1658\tτὸ\n", "1556\tτὸν\n", "1518\tτὴν\n", "1411\tαὐτοῦ\n", "1300\tτῆς\n", "1281\tὅτι\n", "1221\tτῷ\n", "1201\tτῶν\n", "1069\tοἱ\n", "941\tἡ\n", "921\tγὰρ\n", "902\tμὴ\n", "859\tτῇ\n", "849\tαὐτῷ\n", "817\tτὰ\n", "767\tοὐκ\n", "722\tτοὺς\n", "689\tΘεοῦ\n", "670\tπρὸς\n" ] } ], "source": [ "print(\"Amount\\tword\")\n", "for (w, amount) in F.word.freqList({\"word\"})[0:25]:\n", "    print(f\"{amount}\\t{w}\")" ] }, { "cell_type": "markdown", "id": "211b2bde-002b-4243-87c9-4bd850868354", "metadata": { "jupyter": { "outputs_hidden": true }, "tags": [] }, "source": [ "## 3.2 - Frequency of characters in corpus \n", "##### [Back to TOC](#TOC)\n", "\n", "This code generates a table that displays the frequency of characters within the Text-Fabric corpus. The API call 'C.characters.data' produces a Python dictionary, keyed by text format, that contains the data. The remaining code unpacks and sorts this structure to present the results in a formatted table. 
\n", "\n", "Note that each table in the output is preceded by a line such as 'Format: text-critical', which names the text format the character counts refer to."
] }, { "cell_type": "code", "execution_count": 5, "id": "b8e8ce2d-43db-48dd-ace9-2156c7046692", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Format: text-critical\n", "╒═════════════╤═════════════╕\n", "│ character │ frequency │\n", "╞═════════════╪═════════════╡\n", "│ ν │ 56230 │\n", "├─────────────┼─────────────┤\n", "│ α │ 51892 │\n", "├─────────────┼─────────────┤\n", "│ τ │ 50599 │\n", "├─────────────┼─────────────┤\n", "│ ο │ 45151 │\n", "├─────────────┼─────────────┤\n", "│ ε │ 38597 │\n", "├─────────────┼─────────────┤\n", "│ ς │ 27090 │\n", "├─────────────┼─────────────┤\n", "│ ι │ 26131 │\n", "├─────────────┼─────────────┤\n", "│ σ │ 24095 │\n", "├─────────────┼─────────────┤\n", "│ ρ │ 22871 │\n", "├─────────────┼─────────────┤\n", "│ κ │ 22630 │\n", "├─────────────┼─────────────┤\n", "│ π │ 20308 │\n", "├─────────────┼─────────────┤\n", "│ μ │ 19218 │\n", "├─────────────┼─────────────┤\n", "│ λ │ 18228 │\n", "├─────────────┼─────────────┤\n", "│ δ │ 12476 │\n", "├─────────────┼─────────────┤\n", "│ ἐ │ 12116 │\n", "╘═════════════╧═════════════╛\n" ] }, { "data": { "text/markdown": [ "**Warning: table truncated!**" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Format: text-normalized\n", "╒═════════════╤═════════════╕\n", "│ character │ frequency │\n", "╞═════════════╪═════════════╡\n", "│ │ 137779 │\n", "├─────────────┼─────────────┤\n", "│ ν │ 56230 │\n", "├─────────────┼─────────────┤\n", "│ α │ 52127 │\n", "├─────────────┼─────────────┤\n", "│ τ │ 50599 │\n", "├─────────────┼─────────────┤\n", "│ ο │ 45516 │\n", "├─────────────┼─────────────┤\n", "│ ε │ 38807 │\n", "├─────────────┼─────────────┤\n", "│ ς │ 27090 │\n", "├─────────────┼─────────────┤\n", "│ ι │ 26404 │\n", "├─────────────┼─────────────┤\n", "│ σ │ 24095 │\n", "├─────────────┼─────────────┤\n", "│ ρ │ 22871 │\n", 
"├─────────────┼─────────────┤\n", "│ κ │ 22630 │\n", "├─────────────┼─────────────┤\n", "│ ί │ 21518 │\n", "├─────────────┼─────────────┤\n", "│ π │ 20308 │\n", "├─────────────┼─────────────┤\n", "│ μ │ 19218 │\n", "├─────────────┼─────────────┤\n", "│ λ │ 18228 │\n", "╘═════════════╧═════════════╛\n" ] }, { "data": { "text/markdown": [ "**Warning: table truncated!**" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Format: text-orig-full\n", "╒═════════════╤═════════════╕\n", "│ character │ frequency │\n", "╞═════════════╪═════════════╡\n", "│ │ 137779 │\n", "├─────────────┼─────────────┤\n", "│ ν │ 56230 │\n", "├─────────────┼─────────────┤\n", "│ α │ 51892 │\n", "├─────────────┼─────────────┤\n", "│ τ │ 50599 │\n", "├─────────────┼─────────────┤\n", "│ ο │ 45151 │\n", "├─────────────┼─────────────┤\n", "│ ε │ 38597 │\n", "├─────────────┼─────────────┤\n", "│ ς │ 27090 │\n", "├─────────────┼─────────────┤\n", "│ ι │ 26131 │\n", "├─────────────┼─────────────┤\n", "│ σ │ 24095 │\n", "├─────────────┼─────────────┤\n", "│ ρ │ 22871 │\n", "├─────────────┼─────────────┤\n", "│ κ │ 22630 │\n", "├─────────────┼─────────────┤\n", "│ π │ 20308 │\n", "├─────────────┼─────────────┤\n", "│ μ │ 19218 │\n", "├─────────────┼─────────────┤\n", "│ λ │ 18228 │\n", "├─────────────┼─────────────┤\n", "│ δ │ 12476 │\n", "╘═════════════╧═════════════╛\n" ] }, { "data": { "text/markdown": [ "**Warning: table truncated!**" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Format: text-transliterated\n", "╒═════════════╤═════════════╕\n", "│ character │ frequency │\n", "╞═════════════╪═════════════╡\n", "│ │ 137779 │\n", "├─────────────┼─────────────┤\n", "│ e │ 93371 │\n", "├─────────────┼─────────────┤\n", "│ o │ 87008 │\n", "├─────────────┼─────────────┤\n", "│ a │ 75119 │\n", "├─────────────┼─────────────┤\n", "│ i │ 
62778 │\n", "├─────────────┼─────────────┤\n", "│ t │ 60011 │\n", "├─────────────┼─────────────┤\n", "│ n │ 56230 │\n", "├─────────────┼─────────────┤\n", "│ s │ 52132 │\n", "├─────────────┼─────────────┤\n", "│ u │ 39287 │\n", "├─────────────┼─────────────┤\n", "│ k │ 27300 │\n", "├─────────────┼─────────────┤\n", "│ p │ 25081 │\n", "├─────────────┼─────────────┤\n", "│ r │ 22871 │\n", "├─────────────┼─────────────┤\n", "│ h │ 20033 │\n", "├─────────────┼─────────────┤\n", "│ m │ 19218 │\n", "├─────────────┼─────────────┤\n", "│ l │ 18228 │\n", "╘═════════════╧═════════════╛\n" ] }, { "data": { "text/markdown": [ "**Warning: table truncated!**" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Format: text-unaccented\n", "╒═════════════╤═════════════╕\n", "│ character │ frequency │\n", "╞═════════════╪═════════════╡\n", "│ │ 137779 │\n", "├─────────────┼─────────────┤\n", "│ α │ 75119 │\n", "├─────────────┼─────────────┤\n", "│ ε │ 66656 │\n", "├─────────────┼─────────────┤\n", "│ ο │ 65731 │\n", "├─────────────┼─────────────┤\n", "│ ι │ 62834 │\n", "├─────────────┼─────────────┤\n", "│ ν │ 56230 │\n", "├─────────────┼─────────────┤\n", "│ τ │ 50599 │\n", "├─────────────┼─────────────┤\n", "│ υ │ 39287 │\n", "├─────────────┼─────────────┤\n", "│ ς │ 27090 │\n", "├─────────────┼─────────────┤\n", "│ η │ 26715 │\n", "├─────────────┼─────────────┤\n", "│ σ │ 24095 │\n", "├─────────────┼─────────────┤\n", "│ ρ │ 23046 │\n", "├─────────────┼─────────────┤\n", "│ κ │ 22630 │\n", "├─────────────┼─────────────┤\n", "│ ω │ 21277 │\n", "├─────────────┼─────────────┤\n", "│ π │ 20308 │\n", "╘═════════════╧═════════════╛\n" ] }, { "data": { "text/markdown": [ "**Warning: table truncated!**" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Library to format table\n", "from tabulate import tabulate\n", "\n", "# The following API call will result in a 
Python dictionary structure\n", "FrequencyDictionary = C.characters.data\n", "\n", "# Present the results\n", "KeyList = list(FrequencyDictionary.keys())\n", "for Key in KeyList:\n", "    print('Format:', Key)\n", "    # 'Key' names one of the pre-defined formats in which the text can be displayed\n", "    FrequencyList = FrequencyDictionary[Key]\n", "    SortedFrequencyList = sorted(FrequencyList, key=lambda x: x[1], reverse=True)\n", "\n", "    # In this example the table will be truncated to the first 15 entries\n", "    max_rows = 15  # Set your desired number of rows here\n", "    TruncatedTable = SortedFrequencyList[:max_rows]\n", "\n", "    headers = [\"character\", \"frequency\"]\n", "    print(tabulate(TruncatedTable, headers=headers, tablefmt='fancy_grid'))\n", "\n", "    # Add a warning using markdown (API call A.dm) so that it is printed in bold type\n", "    N1904.dm(\"**Warning: table truncated!**\")" ] }, { "cell_type": "markdown", "id": "75627859-1d9c-4d99-9020-d2302f6de408", "metadata": {}, "source": [ "## 3.3 - Some stats on node types \n", "##### [Back to TOC](#TOC)\n", "\n", "Each tuple in the output lists the node type, the average number of slots per node, and the first and last node number of that type." ] }, { "cell_type": "code", "execution_count": 8, "id": "b5ce40f1-9a22-444f-955a-c5545797a056", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "(('book', 5102.925925925926, 137780, 137806),\n", " ('chapter', 529.9192307692308, 137807, 138066),\n", " ('verse', 17.345965000629484, 146078, 154020),\n", " ('sentence', 17.198726750717764, 138067, 146077),\n", " ('wg', 7.583849727185382, 154021, 267467),\n", " ('word', 1, 1, 137779))" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "C.levels.data" ] }, { "cell_type": "markdown", "id": "f6ad9acc-92e3-47b9-bfaf-8c06ed33ada4", "metadata": { "tags": [] }, "source": [ "## 3.4 - The available text formats \n", "##### [Back to TOC](#TOC)\n", "\n", "Not strictly a statistical function, but still important in relation to the corpus. 
The output of this command provides details on the formats available for presenting the text of the corpus. See also [module tf.advanced.options\n", "Display Settings](https://annotation.github.io/text-fabric/tf/advanced/options.html)." ] }, { "cell_type": "code", "execution_count": 8, "id": "97137d58-68cb-4383-a545-5668e603493f", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/markdown": [ "format | level | template\n", "--- | --- | ---\n", "`text-critical` | **word** | `{unicode} `\n", "`text-normalized` | **word** | `{normalized}{after}`\n", "`text-orig-full` | **word** | `{word}{after}`\n", "`text-transliterated` | **word** | `{wordtranslit}{after}`\n", "`text-unaccented` | **word** | `{wordunacc}{after}`\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "N1904.showFormats()" ] }, { "cell_type": "markdown", "id": "1bae482a-9abb-4280-a52c-b3011037fded", "metadata": {}, "source": [ "The format names and their levels (without the templates) can also be obtained with the following call:" ] }, { "cell_type": "code", "execution_count": 9, "id": "acaaf356-eeae-4101-b5ef-090607dca5fc", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "{'text-critical': 'word',\n", " 'text-normalized': 'word',\n", " 'text-orig-full': 'word',\n", " 'text-transliterated': 'word',\n", " 'text-unaccented': 'word'}" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "T.formats" ] }, { "cell_type": "markdown", "id": "76294b50-192f-47e0-95c2-09a1ca79fe17", "metadata": {}, "source": [ "Note that this data originates from the file `otext.tf`:\n", "\n", "> \n", "```\n", "@config\n", "...\n", "@fmt:text-orig-full={word}{after}\n", "...\n", "```\n" ] }, { "cell_type": "markdown", "id": "d23c6817", "metadata": {}, "source": [ "## 3.5 - List of feature frequencies \n", "##### [Back to TOC](#TOC)\n", "\n", "This code generates a lot of output! For that reason the list is cut off after 5 values per feature." 
] }, { "cell_type": "code", "execution_count": 10, "id": "75b2827e-81e1-4e28-a46c-bd50bc56a5aa", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Feature: after \n", "\n", "\t value\t frequency\n", "\t \t 119270\n", "\t , \t 9462\n", "\t . \t 5717\n", "\t · \t 2359\n", "\t ; \t 971\n", "\n", "\n", "Feature: appos \n", "\n", "\t value\t frequency\n", "\t \t 100949\n", "\t group \t 9699\n", "\t apposition \t 2799\n", "\n", "\n", "Feature: book \n", "\n", "\t value\t frequency\n", "\t Luke \t 19457\n", "\t Acts \t 18394\n", "\t Matthew \t 18300\n", "\t John \t 15644\n", "\t Mark \t 11278\n", "\n", "\n", "Feature: booknumber \n", "\n", "\t value\t frequency\n", "\t 3 \t 19457\n", "\t 5 \t 18394\n", "\t 1 \t 18300\n", "\t 4 \t 15644\n", "\t 2 \t 11278\n", "\n", "\n", "Feature: bookshort \n", "\n", "\t value\t frequency\n", "\t Luke \t 19457\n", "\t Acts \t 18394\n", "\t Matt \t 18300\n", "\t John \t 15644\n", "\t Mark \t 11278\n", "\n", "\n", "Feature: case \n", "\n", "\t value\t frequency\n", "\t \t 58261\n", "\t nominative \t 24197\n", "\t accusative \t 23031\n", "\t genitive \t 19515\n", "\t dative \t 12126\n", "\n", "\n", "Feature: chapter \n", "\n", "\t value\t frequency\n", "\t 1 \t 12868\n", "\t 2 \t 10923\n", "\t 3 \t 9652\n", "\t 4 \t 9631\n", "\t 5 \t 8788\n", "\n", "\n", "Feature: clausetype \n", "\n", "\t value\t frequency\n", "\t \t 110679\n", "\t VerbElided \t 1009\n", "\t Verbless \t 929\n", "\t Minor \t 830\n", "\n", "\n", "Feature: containedclause \n", "\n", "\t value\t frequency\n", "\t 2 \t 338\n", "\t 2036 \t 167\n", "\t 97 \t 82\n", "\t 172 \t 81\n", "\t 1083 \t 79\n", "\n", "\n", "Feature: degree \n", "\n", "\t value\t frequency\n", "\t \t 137266\n", "\t comparative \t 313\n", "\t superlative \t 200\n", "\n", "\n", "Feature: gloss \n", "\n", "\t value\t frequency\n", "\t the \t 9857\n", "\t and \t 6212\n", "\t - \t 5496\n", "\t in \t 2320\n", "\t And \t 2218\n", "\n", "\n", "Feature: gn \n", "\n", "\t 
value\t frequency\n", "\t \t 63804\n", "\t masculine \t 41486\n", "\t feminine \t 18736\n", "\t neuter \t 13753\n", "\n", "\n", "Feature: junction \n", "\n", "\t value\t frequency\n", "\t \t 93392\n", "\t coordinate \t 9178\n", "\t subordinate \t 8491\n", "\t apposition \t 2386\n", "\n", "\n", "Feature: lemma \n", "\n", "\t value\t frequency\n", "\t ὁ \t 19783\n", "\t καί \t 8978\n", "\t αὐτός \t 5561\n", "\t σύ \t 2892\n", "\t δέ \t 2787\n", "\n", "\n", "Feature: lex_dom \n", "\n", "\t value\t frequency\n", "\t 092004 \t 26322\n", "\t \t 10487\n", "\t 089017 \t 4370\n", "\t 093001 \t 3672\n", "\t 033006 \t 3225\n", "\n", "\n", "Feature: ln \n", "\n", "\t value\t frequency\n", "\t 92.24 \t 19781\n", "\t \t 10488\n", "\t 92.11 \t 4718\n", "\t 89.92 \t 2903\n", "\t 89.87 \t 2756\n", "\n", "\n", "Feature: markafter \n", "\n", "\t value\t frequency\n", "\t \t 137728\n", "\t — \t 31\n", "\t ) \t 11\n", "\t ]] \t 7\n", "\t ( \t 1\n", "\n", "\n", "Feature: markbefore \n", "\n", "\t value\t frequency\n", "\t \t 137745\n", "\t — \t 16\n", "\t ( \t 10\n", "\t [[ \t 7\n", "\t [ \t 1\n", "\n", "\n", "Feature: markorder \n", "\n", "\t value\t frequency\n", "\t \t 137694\n", "\t 0 \t 34\n", "\t 3 \t 32\n", "\t 2 \t 10\n", "\t 1 \t 9\n", "\n", "\n", "Feature: monad \n", "\n", "\t value\t frequency\n", "\t 1 \t 1\n", "\t 2 \t 1\n", "\t 3 \t 1\n", "\t 4 \t 1\n", "\t 5 \t 1\n", "\n", "\n", "Feature: mood \n", "\n", "\t value\t frequency\n", "\t \t 109422\n", "\t indicative \t 15617\n", "\t participle \t 6653\n", "\t infinitive \t 2285\n", "\t imperative \t 1877\n", "\n", "\n", "Feature: morph \n", "\n", "\t value\t frequency\n", "\t CONJ \t 16316\n", "\t PREP \t 10568\n", "\t ADV \t 3808\n", "\t N-NSM \t 3475\n", "\t N-GSM \t 2935\n", "\n", "\n", "Feature: nodeID \n", "\n", "\t value\t frequency\n", "\t \t 52046\n", "\t common \t 14186\n", "\t personal \t 6040\n", "\t proper \t 2192\n", "\t relative \t 885\n", "\n", "\n", "Feature: normalized \n", "\n", "\t value\t frequency\n", "\t 
καί \t 8576\n", "\t ὁ \t 2769\n", "\t δέ \t 2764\n", "\t ἐν \t 2684\n", "\t τοῦ \t 2497\n", "\n", "\n", "Feature: nu \n", "\n", "\t value\t frequency\n", "\t singular \t 69846\n", "\t \t 38842\n", "\t plural \t 29091\n", "\n", "\n", "Feature: number \n", "\n", "\t value\t frequency\n", "\t singular \t 69846\n", "\t \t 38842\n", "\t plural \t 29091\n", "\n", "\n", "Feature: orig_order \n", "\n", "\t value\t frequency\n", "\t 1 \t 1\n", "\t 2 \t 1\n", "\t 3 \t 1\n", "\t 4 \t 1\n", "\t 5 \t 1\n", "\n", "\n", "Feature: person \n", "\n", "\t value\t frequency\n", "\t \t 118360\n", "\t third \t 12747\n", "\t second \t 3729\n", "\t first \t 2943\n", "\n", "\n", "Feature: punctuation \n", "\n", "\t value\t frequency\n", "\t \t 119270\n", "\t , \t 9462\n", "\t . \t 5717\n", "\t · \t 2359\n", "\t ; \t 971\n", "\n", "\n", "Feature: ref \n", "\n", "\t value\t frequency\n", "\t 1CO 10:1!1 \t 1\n", "\t 1CO 10:1!10 \t 1\n", "\t 1CO 10:1!11 \t 1\n", "\t 1CO 10:1!12 \t 1\n", "\t 1CO 10:1!13 \t 1\n", "\n", "\n", "Feature: roleclausedistance \n", "\n", "\t value\t frequency\n", "\t 0 \t 56129\n", "\t 1 \t 37597\n", "\t 2 \t 22297\n", "\t 3 \t 12084\n", "\t 4 \t 5277\n", "\n", "\n", "Feature: sentence \n", "\n", "\t value\t frequency\n", "\t 3 \t 1103\n", "\t 4 \t 960\n", "\t 1 \t 810\n", "\t 5 \t 747\n", "\t 6 \t 680\n", "\n", "\n", "Feature: sp \n", "\n", "\t value\t frequency\n", "\t noun \t 28455\n", "\t verb \t 28357\n", "\t det \t 19786\n", "\t conj \t 18227\n", "\t pron \t 16177\n", "\n", "\n", "Feature: sp_full \n", "\n", "\t value\t frequency\n", "\t Noun \t 28455\n", "\t Verb \t 28357\n", "\t Determiner \t 19786\n", "\t Conjunction \t 18227\n", "\t Pronoun \t 16177\n", "\n", "\n", "Feature: strongs \n", "\n", "\t value\t frequency\n", "\t 3588 \t 19783\n", "\t 2532 \t 8978\n", "\t 846 \t 5561\n", "\t 4771 \t 2892\n", "\t 1161 \t 2787\n", "\n", "\n", "Feature: subj_ref \n", "\n", "\t value\t frequency\n", "\t \t 121204\n", "\t n46003022002 \t 172\n", "\t n66001009002 \t 
131\n", "\t n45001001001 \t 104\n", "\t n47010001004 \t 104\n", "\n", "\n", "Feature: tense \n", "\n", "\t value\t frequency\n", "\t \t 109422\n", "\t aorist \t 11803\n", "\t present \t 11579\n", "\t imperfect \t 1689\n", "\t future \t 1626\n", "\n", "\n", "Feature: type \n", "\n", "\t value\t frequency\n", "\t \t 93321\n", "\t common \t 23644\n", "\t personal \t 11521\n", "\t proper \t 4639\n", "\t demonstrative \t 1722\n", "\n", "\n", "Feature: unicode \n", "\n", "\t value\t frequency\n", "\t καὶ \t 8541\n", "\t ὁ \t 2768\n", "\t ἐν \t 2683\n", "\t δὲ \t 2619\n", "\t τοῦ \t 2497\n", "\n", "\n", "Feature: verse \n", "\n", "\t value\t frequency\n", "\t 10 \t 5180\n", "\t 12 \t 5177\n", "\t 1 \t 5064\n", "\t 9 \t 5064\n", "\t 4 \t 5024\n", "\n", "\n", "Feature: voice \n", "\n", "\t value\t frequency\n", "\t \t 109422\n", "\t active \t 20742\n", "\t passive \t 3493\n", "\t middle \t 2408\n", "\t middlepassive \t 1714\n", "\n", "\n", "Feature: wgclass \n", "\n", "\t value\t frequency\n", "\t np \t 33710\n", "\t cl \t 30857\n", "\t cl* \t 16378\n", "\t \t 12760\n", "\t pp \t 11169\n", "\n", "\n", "Feature: wglevel \n", "\n", "\t value\t frequency\n", "\t 5 \t 16862\n", "\t 4 \t 16527\n", "\t 6 \t 15520\n", "\t 7 \t 12163\n", "\t 3 \t 10447\n", "\n", "\n", "Feature: wgnum \n", "\n", "\t value\t frequency\n", "\t 1 \t 27\n", "\t 2 \t 27\n", "\t 3 \t 27\n", "\t 4 \t 27\n", "\t 5 \t 27\n", "\n", "\n", "Feature: wgrole \n", "\n", "\t value\t frequency\n", "\t \t 77251\n", "\t adv \t 16710\n", "\t o \t 9329\n", "\t s \t 6710\n", "\t p \t 1770\n", "\n", "\n", "Feature: wgrolelong \n", "\n", "\t value\t frequency\n", "\t \t 77280\n", "\t Adverbial \t 16710\n", "\t Object \t 9329\n", "\t Subject \t 6710\n", "\t Predicate \t 1770\n", "\n", "\n", "Feature: wgrule \n", "\n", "\t value\t frequency\n", "\t \t 22718\n", "\t DetNP \t 15696\n", "\t PrepNp \t 11044\n", "\t NPofNP \t 6819\n", "\t Conj-CL \t 5571\n", "\n", "\n", "Feature: wgtype \n", "\n", "\t value\t frequency\n", "\t \t 
100949\n", "\t group \t 9699\n", "\t apposition \t 2799\n", "\n", "\n", "Feature: word \n", "\n", "\t value\t frequency\n", "\t καὶ \t 8545\n", "\t ὁ \t 2769\n", "\t ἐν \t 2684\n", "\t δὲ \t 2620\n", "\t τοῦ \t 2497\n", "\n", "\n", "Feature: wordlevel \n", "\n", "\t value\t frequency\n", "\t 6 \t 21857\n", "\t 7 \t 20984\n", "\t 5 \t 20538\n", "\t 8 \t 16755\n", "\t 9 \t 12772\n", "\n", "\n", "Feature: wordrole \n", "\n", "\t value\t frequency\n", "\t adv \t 41598\n", "\t v \t 25817\n", "\t s \t 22908\n", "\t o \t 21929\n", "\t \t 9347\n", "\n", "\n", "Feature: wordrolelong \n", "\n", "\t value\t frequency\n", "\t Adverbial \t 41598\n", "\t Verbal \t 25817\n", "\t Subject \t 22908\n", "\t Object \t 21929\n", "\t \t 9347\n", "\n", "\n", "Feature: wordtranslit \n", "\n", "\t value\t frequency\n", "\t kai \t 8576\n", "\t en \t 3152\n", "\t o \t 3149\n", "\t to \t 2885\n", "\t de \t 2769\n", "\n", "\n", "Feature: wordunacc \n", "\n", "\t value\t frequency\n", "\t και \t 8576\n", "\t ο \t 3019\n", "\t δε \t 2764\n", "\t εν \t 2752\n", "\t του \t 2497\n", "\n", "\n" ] } ], "source": [ "# Print the most frequent values of every node feature (except 'otype')\n", "FeatureList=Fall()\n", "LinesToPrint=5\n", "for Feature in FeatureList:\n", "    if Feature!='otype':\n", "        print ('Feature:',Feature,'\\n\\n\\t value\\t frequency')\n", "        FeatureFrequencyList=Fs(Feature).freqList()\n", "        # freqList() is ordered by descending frequency; show only the top entries\n", "        for item, freq in FeatureFrequencyList[:LinesToPrint]:\n", "            print ('\\t',item,'\\t',freq)\n", "        print ('\\n')" ] }, { "cell_type": "markdown", "id": "cba64820-a3e6-4b40-8a25-e0f95f2fd66e", "metadata": { "tags": [] }, "source": [ "## 3.6 - Frequency list of punctuations \n", "##### [Back to TOC](#TOC)\n", "\n", "The following cell lists the punctuation marks with their Unicode code points and frequencies. The rows are printed as Markdown-formatted strings with `N1904.dm`; since every row is emitted by a separate call, the output is not yet rendered as a single table."
] }, { "cell_type": "code", "execution_count": 11, "id": "c797fa57-d536-4471-b44d-d3a45653f34a", "metadata": {}, "outputs": [ { "data": { "text/markdown": [ " String | Unicode | Frequency\n", "--- | --- | ---" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ " ` ` | 32 | 119272 " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ " `,` | 44 | 9441 " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ " `.` | 46 | 5712 " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ " `·` | 183 | 2355 " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ " `;` | 59 | 969 " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ " `—` | 8212 | 30 " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "result = F.after.freqList()\n", "N1904.dm(\" String | Unicode | Frequency\\n--- | --- | ---\")\n", "for (string, freq) in result:\n", "    # Note: for punctuation the value holds two characters; only the first one is used\n", "    frequency=str(freq) # convert the count to a string\n", "    unicode_value = str(ord(string[0])) # code point of the first character, as a string\n", "    N1904.dm(\" `{}` | {} | {} \".format(string[0],unicode_value,frequency))" ] }, { "cell_type": "markdown", "id": "b3cbf04f", "metadata": {}, "source": [ "## 3.7 - Node number ranges \n", "##### [Back to TOC](#TOC)\n", "\n", "`F.otype.all` returns all node types in the corpus; calling `F.otype.sInterval` for each of them yields the first and last node number of that type.
" ] }, { "cell_type": "code", "execution_count": 26, "id": "20dd1920", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "book (137780, 137806)\n", "chapter (137807, 138066)\n", "verse (146078, 154020)\n", "sentence (138067, 146077)\n", "wg (154021, 268899)\n", "word (1, 137779)\n" ] } ], "source": [ "for NodeType in F.otype.all:\n", "    print (NodeType, F.otype.sInterval(NodeType))" ] }, { "cell_type": "markdown", "id": "86e62381-0fdd-4e56-8855-11e8c73aec7e", "metadata": {}, "source": [ "## 3.8 - Count the objects per type \n", "##### [Back to TOC](#TOC)\n", "\n", "Iterating over `F.otype.all` again, we can also count the number of nodes of each type." ] }, { "cell_type": "code", "execution_count": 27, "id": "dc4b5cae-9f19-4a42-aa9e-6decf3df4c2f", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 27 books\n", " 260 chapters\n", " 7943 verses\n", " 8011 sentences\n", " 114879 wgs\n", " 137779 words\n" ] } ], "source": [ "for otype in F.otype.all:\n", "    # F.otype.s(otype) holds all nodes of this type, so its length is the count\n", "    i = len(F.otype.s(otype))\n", "    print (\"{:>7} {}s\".format(i, otype))" ] }, { "cell_type": "code", "execution_count": 7, "id": "c5730f29-e9d8-4483-9493-b31b7efbdafd", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
Job: Ellipsis\n", "Author: program author\n", "Created: 2023-07-28T23:07:21+02:00\n", "Data: Nestle 1904\n", "  version: 0.5\n", "  release: none\n", "  DOI: no DOI\n", "Tool: Text-Fabric 11.4.10 10.5281/zenodo.592193\n", "TF App: tonyjurg/Nestle1904LFT on GitHub\n", "  commit:\n", "
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "N1904.showProvenance(...)" ] }, { "cell_type": "markdown", "id": "68b0b53c-fc49-4ad1-8945-5f26dfd818dc", "metadata": {}, "source": [ "## 3.9 - Obtain meta data for a feature \n", "##### [Back to TOC](#TOC)" ] }, { "cell_type": "markdown", "id": "6de718fc-1823-413a-9c09-b9f70f014c7a", "metadata": {}, "source": [ "The metadata of a feature is available as a dictionary via `Fs(FeatureName).meta`. This can be useful if you want to process all features in a script." ] }, { "cell_type": "code", "execution_count": 12, "id": "07370a48-c263-4910-bd1a-bc4e46c73a07", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'Availability': 'Creative Commons Attribution 4.0 International (CC BY 4.0)', 'Converter_author': 'Tony Jurg, ReMa Student Vrije Universiteit Amsterdam, Netherlands', 'Converter_execution': 'Tony Jurg, ReMa Student Vrije Universiteit Amsterdam, Netherlands', 'Converter_version': '0.3', 'Convertor_source': 'https://github.com/tonyjurg/Nestle1904LFT/tree/main/tools', 'Data source': 'MACULA Greek Linguistic Datasets, available at https://github.com/Clear-Bible/macula-greek/tree/main/Nestle1904/lowfat', 'Editors': 'Eberhard Nestle', 'Name': 'Greek New Testament (Nestle 1904 based on Low Fat Tree)', 'TextFabric version': '11.4.10', 'description': 'Word as it appears in the text (excl. 
punctuations)', 'valueType': 'str', 'writtenBy': 'Text-Fabric', 'dateWritten': '2023-06-19T15:13:46Z'}\n" ] } ], "source": [ "# Print the metadata dictionary returned for the feature\n", "FeatureName='word'\n", "MetaData=Fs(FeatureName).meta\n", "print (MetaData)" ] }, { "cell_type": "markdown", "id": "c9fc2cac-b1f6-430e-b900-82fd7fece295", "metadata": {}, "source": [ "Now perform a very basic check on the metadata:" ] }, { "cell_type": "code", "execution_count": 13, "id": "cbe58101-e241-44e0-9a36-aeef8aa47bc6", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "feature word is of type str.\n" ] } ], "source": [ "print ('feature ',FeatureName, end='')\n", "if MetaData['valueType']=='str':\n", "    print (' is of type str.')\n", "else:\n", "    print (' is not of type str.')" ] }, { "cell_type": "markdown", "id": "08c67b53-bd6c-42e6-a0cf-b7f609cd9879", "metadata": {}, "source": [ "# Trying the various formats" ] }, { "cell_type": "code", "execution_count": null, "id": "cf68d1b9-cbec-470a-8726-31ef9a475603", "metadata": {}, "outputs": [], "source": [ "# Pick a sample node to render (assumption: the first verse node of the corpus)\n", "node=F.otype.s('verse')[0]\n", "origText=T.text(node,fmt='text-orig-full')\n", "critText=T.text(node,fmt='text-critical-signs')\n", "\n", "# Format definitions, kept here as comments for reference (the last entry is incomplete):\n", "#  'fmt:text-orig-full': '{word}{after}',\n", "#  'fmt:text-normalized': '{normalized}{after}',\n", "#  'fmt:text-unaccented': '{wordunacc}{after}',\n", "#  'fmt:text-transliterated':'{wordtranslit}{after}',\n", "#  'fmt:text-critical': " ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.5" } }, "nbformat": 4, "nbformat_minor": 5 }