{ "cells": [ { "cell_type": "markdown", "id": "1cf27c95-0b45-4d97-a62d-9950654eb386", "metadata": {}, "source": [ "# Some corpus statistics (Nestle1904GBI)" ] }, { "cell_type": "markdown", "id": "1495a021-daa1-4c2e-80d5-ab7d2d75bc3f", "metadata": { "jp-MarkdownHeadingCollapsed": true, "tags": [] }, "source": [ "## Table of contents \n", "* 1 - Introduction\n", "* 2 - Load Text-Fabric app and data\n", "* 3 - Performing the queries\n", " * 3.1 - The 25 most frequent words in the corpus\n", " * 3.2 - Frequency of characters in corpus\n", " * 3.3 - Some stats on node types \n", " * 3.4 - The available text formats \n", " * 3.5 - List of feature frequencies \n", " * 3.6 - Frequency list of punctuations\n", " * 3.7 - Node number ranges\n", " * 3.8 - Count the objects per type\n", "* 4 - Required libraries" ] }, { "cell_type": "markdown", "id": "e6830070-1e97-4bdf-aa0c-5eda4e624a84", "metadata": {}, "source": [ "# 1 - Introduction \n", "##### [Back to TOC](#TOC)\n", "\n", "This Jupyter Notebook showcases several examples of statistical analysis performed on a Text-Fabric corpus. For demonstration purposes, various methods of collecting and presenting the data are employed. 
" ] }, { "cell_type": "markdown", "id": "a1b900e2-995f-4f36-ad74-d821092ca02c", "metadata": {}, "source": [ "# 2 - Load Text-Fabric app and data \n", "##### [Back to TOC](#TOC)" ] }, { "cell_type": "code", "execution_count": 1, "id": "6bd6c621-361d-487f-a8df-c27fb1ec9de2", "metadata": { "tags": [] }, "outputs": [], "source": [ "%load_ext autoreload\n", "%autoreload 2" ] }, { "cell_type": "code", "execution_count": 1, "id": "0071a0db-916c-4357-88bd-6b3255af0764", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Loading the Text-Fabric code\n", "# Note: it is assumed Text-Fabric is installed in your environment\n", "from tf.fabric import Fabric\n", "from tf.app import use" ] }, { "cell_type": "code", "execution_count": 2, "id": "ed76db5d-5463-4bf1-99ca-7f14b3a0f277", "metadata": { "scrolled": true, "tags": [] }, "outputs": [ { "data": { "text/markdown": [ "**Locating corpus resources ...**" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "The requested app is not available offline\n", "\t~/text-fabric-data/github/tonyjurg/Nestle1904GBI/app not found\n" ] }, { "data": { "text/html": [ "Status: latest release online 0.4 versus None locally" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "downloading app, main data and requested additions ..." 
], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "app: ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/app" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "The requested data is not available offline\n", "\t~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4 not found\n" ] }, { "data": { "text/html": [ "Status: latest release online 0.4 versus None locally" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "downloading app, main data and requested additions ..." ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "data: ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ " | 0.19s T otype from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4\n", " | 1.85s T oslots from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4\n", " | 0.68s T book from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4\n", " | 0.53s T chapter from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4\n", " | 0.51s T verse from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4\n", " | 0.64s T word from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4\n", " | 0.50s T after from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4\n", " | | 0.05s C __levels__ from otype, oslots, otext\n", " | | 1.62s C __order__ from otype, oslots, __levels__\n", " | | 0.07s C __rank__ from otype, __order__\n", " | | 2.23s C __levUp__ from otype, oslots, __rank__\n", " | | 1.46s C __levDown__ from otype, __levUp__, __rank__\n", " | | 0.06s C __characters__ from otext\n", " | | 0.88s C __boundary__ from otype, oslots, __rank__\n", " | | 0.04s C __sections__ from otype, oslots, otext, __levUp__, 
__levels__, book, chapter, verse\n", " | | 0.21s C __structure__ from otype, oslots, otext, __rank__, __levUp__, book, chapter, verse\n", " | 0.50s T booknum from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4\n", " | 0.58s T bookshort from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4\n", " | 0.48s T case from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4\n", " | 0.50s T clause from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4\n", " | 0.07s T clauserule from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4\n", " | 0.02s T clausetype from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4\n", " | 0.43s T degree from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4\n", " | 0.53s T formaltag from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4\n", " | 0.54s T functionaltag from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4\n", " | 0.57s T gloss from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4\n", " | 0.47s T gn from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4\n", " | 0.55s T lemma from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4\n", " | 0.51s T lex_dom from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4\n", " | 0.53s T ln from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4\n", " | 0.44s T monad from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4\n", " | 0.43s T mood from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4\n", " | 0.64s T nodeID from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4\n", " | 0.59s T normalized from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4\n", " | 0.49s T nu from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4\n", " | 0.50s T number from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4\n", " | 0.43s T person from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4\n", " | 0.70s T phrase from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4\n", " | 
0.26s T phrasefunction from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4\n", " | 0.28s T phrasefunctionlong from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4\n", " | 0.27s T phrasetype from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4\n", " | 0.46s T sentence from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4\n", " | 0.51s T sp from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4\n", " | 0.51s T splong from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4\n", " | 0.54s T strongs from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4\n", " | 0.45s T subj_ref from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4\n", " | 0.45s T tense from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4\n", " | 0.45s T type from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4\n", " | 0.45s T voice from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4\n" ] }, { "data": { "text/html": [ "\n", " TF: TF API 12.1.5, tonyjurg/Nestle1904GBI/app v3, Search Reference
\n", " Data: tonyjurg - Nestle1904GBI 0.4, Character table, Feature docs
\n", "
Node types\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "
Name | # of nodes | # slots / node | % coverage
--- | --- | --- | ---
book | 27 | 5102.93 | 100
chapter | 260 | 529.92 | 100
sentence | 5720 | 24.09 | 100
verse | 7943 | 17.35 | 100
clause | 16124 | 8.54 | 100
phrase | 72674 | 1.90 | 100
word | 137779 | 1.00 | 100
\n", " Sets: no custom sets
\n", " Features:
\n", "
Nestle 1904 (GBI nodes)\n", "
\n", "\n", "
\n", "
\n", "after\n", "
\n", "
str
\n", "\n", " Character after the word (space or punctuation)\n", "\n", "
\n", "\n", "
\n", "
\n", "book\n", "
\n", "
str
\n", "\n", " Book name (fully spelled out)\n", "\n", "
\n", "\n", "
\n", "
\n", "booknum\n", "
\n", "
int
\n", "\n", " NT book number (Matthew=1, Mark=2, ..., Revelation=27)\n", "\n", "
\n", "\n", "
\n", "
\n", "bookshort\n", "
\n", "
str
\n", "\n", " Book name (abbreviated)\n", "\n", "
\n", "\n", "
\n", "
\n", "case\n", "
\n", "
str
\n", "\n", " Gramatical case (Nominative, Genitive, Dative, Accusative, Vocative)\n", "\n", "
\n", "\n", "
\n", "
\n", "chapter\n", "
\n", "
int
\n", "\n", " Chapter number inside book\n", "\n", "
\n", "\n", "
\n", "
\n", "clause\n", "
\n", "
int
\n", "\n", " Clause number (counted per chapter)\n", "\n", "
\n", "\n", "
\n", "
\n", "clauserule\n", "
\n", "
str
\n", "\n", " Clause rule\n", "\n", "
\n", "\n", "
\n", "
\n", "clausetype\n", "
\n", "
str
\n", "\n", " Clause type\n", "\n", "
\n", "\n", "
\n", "
\n", "degree\n", "
\n", "
str
\n", "\n", " Degree (e.g. Comparitative, Superlative)\n", "\n", "
\n", "\n", "
\n", "
\n", "formaltag\n", "
\n", "
str
\n", "\n", " Formal tag (Sandborg-Petersen morphology)\n", "\n", "
\n", "\n", "
\n", "
\n", "functionaltag\n", "
\n", "
str
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "
\n", "gloss\n", "
\n", "
str
\n", "\n", " English gloss\n", "\n", "
\n", "\n", "
\n", "
\n", "gn\n", "
\n", "
str
\n", "\n", " Gramatical gender (Masculine, Feminine, Neuter)\n", "\n", "
\n", "\n", "
\n", "
\n", "lemma\n", "
\n", "
str
\n", "\n", " Lexeme (lemma)\n", "\n", "
\n", "\n", "
\n", "
\n", "lex_dom\n", "
\n", "
str
\n", "\n", " Lexical domain according to Semantic Dictionary of Biblical Greek, SDBG\n", "\n", "
\n", "\n", "
\n", "
\n", "ln\n", "
\n", "
str
\n", "\n", " Lauw-Nida lexical classification\n", "\n", "
\n", "\n", "
\n", "
\n", "monad\n", "
\n", "
int
\n", "\n", " Sequence number of the smallest meaningful unit of text (single word)\n", "\n", "
\n", "\n", "
\n", "
\n", "mood\n", "
\n", "
str
\n", "\n", " Gramatical mood of the verb (passive, etc)\n", "\n", "
\n", "\n", "
\n", "
\n", "nodeID\n", "
\n", "
str
\n", "\n", " Node ID (as in the XML source data)\n", "\n", "
\n", "\n", "
\n", "
\n", "normalized\n", "
\n", "
str
\n", "\n", " Surface word stripped of punctations\n", "\n", "
\n", "\n", "
\n", "
\n", "nu\n", "
\n", "
str
\n", "\n", " Gramatical number (Singular, Plural)\n", "\n", "
\n", "\n", "
\n", "
\n", "number\n", "
\n", "
str
\n", "\n", " Gramatical number of the verb\n", "\n", "
\n", "\n", "
\n", "
\n", "otype\n", "
\n", "
str
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "
\n", "person\n", "
\n", "
str
\n", "\n", " Gramatical person of the verb (first, second, third)\n", "\n", "
\n", "\n", "
\n", "
\n", "phrase\n", "
\n", "
int
\n", "\n", " Phrase number (counted per chapter)\n", "\n", "
\n", "\n", "
\n", "
\n", "phrasefunction\n", "
\n", "
str
\n", "\n", " Phrase function (abbreviated)\n", "\n", "
\n", "\n", "
\n", "
\n", "phrasefunctionlong\n", "
\n", "
str
\n", "\n", " Phrase function (long description)\n", "\n", "
\n", "\n", "
\n", "
\n", "phrasetype\n", "
\n", "
str
\n", "\n", " Phrase type information\n", "\n", "
\n", "\n", "
\n", "
\n", "sentence\n", "
\n", "
int
\n", "\n", " Sentence number (counted per chapter)\n", "\n", "
\n", "\n", "
\n", "
\n", "sp\n", "
\n", "
str
\n", "\n", " Speech Part (abbreviated)\n", "\n", "
\n", "\n", "
\n", "
\n", "splong\n", "
\n", "
str
\n", "\n", " Speech Part (long description)\n", "\n", "
\n", "\n", "
\n", "
\n", "strongs\n", "
\n", "
str
\n", "\n", " Strongs number\n", "\n", "
\n", "\n", "
\n", "
\n", "subj_ref\n", "
\n", "
str
\n", "\n", " Subject reference (to nodeID in XML source data)\n", "\n", "
\n", "\n", "
\n", "
\n", "tense\n", "
\n", "
str
\n", "\n", " Gramatical tense of the verb (e.g. Present, Aorist)\n", "\n", "
\n", "\n", "
\n", "
\n", "type\n", "
\n", "
str
\n", "\n", " Gramatical type of noun or pronoun (e.g. Common, Personal)\n", "\n", "
\n", "\n", "
\n", "
\n", "verse\n", "
\n", "
int
\n", "\n", " Verse number inside chapter\n", "\n", "
\n", "\n", "
\n", "
\n", "voice\n", "
\n", "
str
\n", "\n", " Gramatical voice of the verb\n", "\n", "
\n", "\n", "
\n", "
\n", "word\n", "
\n", "
str
\n", "\n", " Word as it appears in the text\n", "\n", "
\n", "\n", "
\n", "
\n", "oslots\n", "
\n", "
none
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "
\n", "\n", " Settings:
specified
  1. apiVersion: 3
  2. appName: tonyjurg/Nestle1904GBI
  3. appPath:C:/Users/tonyj/text-fabric-data/github/tonyjurg/Nestle1904GBI/app
  4. commit: no value
  5. css:
  6. dataDisplay:
    • excludedFeatures: [reference]
    • noneValues:
      • none
      • unknown
      • no value
      • NA
      • ''
    • textFormat: text-orig-full
  7. interfaceDefaults: {fmt: layout-orig-full}
  8. isCompatible: True
  9. local: no value
  10. localDir:C:/Users/tonyj/text-fabric-data/github/tonyjurg/Nestle1904GBI/_temp
  11. provenanceSpec:
    • corpus: Nestle 1904 (GBI nodes)
    • org: tonyjurg
    • relative: /tf
    • repo: Nestle1904GBI
    • repro: Nestle1904GBI
    • version: 0.4
    • webUrl:https://bibleol.3bmoodle.dk/text/show_text/nestle1904/<1>/<2>/<3>
  12. release: no value
  13. typeDisplay:
    • book:
      • label: {book}
      • style: ''
    • clause:
      • label: #{clause}
      • style: ''
    • phrase:
      • label: #{phrase}
      • style: ''
    • word:
      • features:
        • lemma
        • strongs
      • featuresBare: [gloss]
  14. writing: grc
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
TF API: names N F E L T S C TF Fs Fall Es Eall Cs Call directly usable

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Load the N1904 app and data\n", "N1904 = use(\"tonyjurg/Nestle1904GBI\", version=\"0.4\", hoist=globals())" ] }, { "cell_type": "code", "execution_count": 3, "id": "820ae775-642a-48fb-b349-262354a8a218", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# The following will push the Text-Fabric stylesheet to this notebook (to facilitate proper display with notebook viewer)\n", "N1904.dh(N1904.getCss())" ] }, { "cell_type": "markdown", "id": "58ef1678-a19d-4c0c-80f3-84f8471a90e2", "metadata": { "tags": [] }, "source": [ "# 3 - Performing the queries \n", "##### [Back to TOC](#TOC)" ] }, { "cell_type": "markdown", "id": "b59c83bd-329d-4820-8bcc-ca92e1c55f6d", "metadata": {}, "source": [ "## 3.1 - The 25 most frequent words in the corpus\n", "##### [Back to TOC](#TOC)\n", "\n", "The method [`freqList`](https://annotation.github.io/text-fabric/tf/core/nodefeature.html#tf.core.nodefeature.NodeFeature.freqList) returns a tuple of (value, frequency) items, ordered by frequency, highest first." 
] }, { "cell_type": "code", "execution_count": 4, "id": "1d4b1b93-08e5-41f4-a587-66e444a3e271", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Amount\tword\n", "8541\tκαὶ\n", "2768\tὁ\n", "2683\tἐν\n", "2620\tδὲ\n", "2497\tτοῦ\n", "1755\tεἰς\n", "1657\tτὸ\n", "1556\tτὸν\n", "1518\tτὴν\n", "1410\tαὐτοῦ\n", "1300\tτῆς\n", "1281\tὅτι\n", "1221\tτῷ\n", "1201\tτῶν\n", "1068\tοἱ\n", "941\tἡ\n", "921\tγὰρ\n", "902\tμὴ\n", "859\tτῇ\n", "849\tαὐτῷ\n", "817\tτὰ\n", "767\tοὐκ\n", "722\tτοὺς\n", "688\tΘεοῦ\n", "670\tπρὸς\n" ] } ], "source": [ "print(\"Amount\\tword\")\n", "for (w, amount) in F.word.freqList(\"word\")[0:25]:\n", " print(f\"{amount}\\t{w}\")" ] }, { "cell_type": "markdown", "id": "211b2bde-002b-4243-87c9-4bd850868354", "metadata": { "jupyter": { "outputs_hidden": true }, "tags": [] }, "source": [ "## 3.2 - Frequency of characters in corpus \n", "##### [Back to TOC](#TOC)\n", "\n", "This code generates a table that displays the frequency of characters within the Text-Fabric corpus. The API call `C.characters.data` produces a Python dictionary structure that contains the data. The remaining code unpacks and sorts this structure to present the results in a formatted table. \n", "\n", "Note that the first line of the output is 'Format: text-orig-full'. 
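As a rough illustration of the unpack-and-sort step described above, the sketch below applies it to a small hand-made dictionary. The name `toy_characters` and its counts are invented for this example, and its shape (a format name mapped to a list of (character, frequency) pairs) is only assumed from how the notebook code handles `C.characters.data`. The helper `top_characters` is hypothetical and not part of Text-Fabric.

```python
# Toy stand-in for C.characters.data; the counts are made up for illustration.
toy_characters = {
    'text-orig-full': [('α', 5), (' ', 9), ('ν', 7)],
}


def top_characters(frequency_dict, max_rows=2):
    '''Return, per text format, the max_rows most frequent (character, count) pairs.'''
    result = {}
    for text_format, pairs in frequency_dict.items():
        # Sort on the count (second item of each pair), highest first
        ranked = sorted(pairs, key=lambda pair: pair[1], reverse=True)
        result[text_format] = ranked[:max_rows]
    return result


for text_format, rows in top_characters(toy_characters).items():
    print('Format:', text_format)
    for character, count in rows:
        print(repr(character), count)
```

With the corpus loaded, the same helper could presumably be called as `top_characters(C.characters.data, max_rows=15)`, provided the real data has the assumed shape.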
" ] }, { "cell_type": "code", "execution_count": 6, "id": "b8e8ce2d-43db-48dd-ace9-2156c7046692", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Format: text-orig-full\n", "╒═════════════╤═════════════╕\n", "│ character │ frequency │\n", "╞═════════════╪═════════════╡\n", "│ │ 137779 │\n", "├─────────────┼─────────────┤\n", "│ ν │ 56230 │\n", "├─────────────┼─────────────┤\n", "│ α │ 51892 │\n", "├─────────────┼─────────────┤\n", "│ τ │ 50599 │\n", "├─────────────┼─────────────┤\n", "│ ο │ 45151 │\n", "├─────────────┼─────────────┤\n", "│ ε │ 38597 │\n", "├─────────────┼─────────────┤\n", "│ ς │ 27090 │\n", "├─────────────┼─────────────┤\n", "│ ι │ 26131 │\n", "├─────────────┼─────────────┤\n", "│ σ │ 24095 │\n", "├─────────────┼─────────────┤\n", "│ ρ │ 22871 │\n", "├─────────────┼─────────────┤\n", "│ κ │ 22630 │\n", "├─────────────┼─────────────┤\n", "│ π │ 20308 │\n", "├─────────────┼─────────────┤\n", "│ μ │ 19218 │\n", "├─────────────┼─────────────┤\n", "│ λ │ 18228 │\n", "├─────────────┼─────────────┤\n", "│ δ │ 12476 │\n", "╘═════════════╧═════════════╛\n" ] }, { "data": { "text/markdown": [ "**Warning: table truncated!**" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Library to format table\n", "from tabulate import tabulate\n", "\n", "# The following API call will result in a Python dictionary structure\n", "FrequencyDictionary=C.characters.data\n", "\n", "# Present the results\n", "KeyList = list(FrequencyDictionary.keys())\n", "for Key in KeyList:\n", " print('Format: ',Key)\n", " # Key is the name of a predefined format in which the text can be displayed\n", " FrequencyList=FrequencyDictionary[Key]\n", " SortedFrequencyList=sorted(FrequencyList, key=lambda x: x[1], reverse=True)\n", " \n", " # In this example the table will be truncated to the first 15 entries\n", " max_rows = 15 # Set your desired number of rows here\n", " TruncatedTable = 
SortedFrequencyList[:max_rows]\n", " \n", " headers = [\"character\", \"frequency\"]\n", " print(tabulate(TruncatedTable, headers=headers, tablefmt='fancy_grid'))\n", " \n", " # Add a warning using markdown (app method dm), so it can be printed in bold type\n", " N1904.dm(\"**Warning: table truncated!**\")" ] }, { "cell_type": "markdown", "id": "75627859-1d9c-4d99-9020-d2302f6de408", "metadata": {}, "source": [ "## 3.3 - Some stats on node types \n", "##### [Back to TOC](#TOC)\n", "\n", "`C.levels.data` returns one tuple per node type: the type name, the average number of slots (words) per node, and the first and last node number of that type." ] }, { "cell_type": "code", "execution_count": 44, "id": "b5ce40f1-9a22-444f-955a-c5545797a056", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(('book', 5102.925925925926, 137780, 137806),\n", " ('chapter', 529.9192307692308, 137807, 138066),\n", " ('sentence', 24.087237762237763, 226865, 232584),\n", " ('verse', 17.345965000629484, 232585, 240527),\n", " ('clause', 8.54496402877698, 138067, 154190),\n", " ('phrase', 1.8958499600957701, 154191, 226864),\n", " ('word', 1, 1, 137779))" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "C.levels.data" ] }, { "cell_type": "markdown", "id": "f6ad9acc-92e3-47b9-bfaf-8c06ed33ada4", "metadata": { "tags": [] }, "source": [ "## 3.4 - The available text formats \n", "##### [Back to TOC](#TOC)\n", "\n", "Not strictly a statistical function, but still relevant to the corpus. The output of this command lists the formats available for presenting the text of the corpus. See also [module tf.advanced.options\n", "Display Settings](https://annotation.github.io/text-fabric/tf/advanced/options.html)." 
] }, { "cell_type": "code", "execution_count": 19, "id": "97137d58-68cb-4383-a545-5668e603493f", "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "format | level | template\n", "--- | --- | ---\n", "`text-orig-full` | **word** | `{word}{after}`\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "N1904.showFormats()" ] }, { "cell_type": "markdown", "id": "1bae482a-9abb-4280-a52c-b3011037fded", "metadata": {}, "source": [ "Much the same information (formatted differently, since a dictionary mapping each format to its level is returned, without the template) can be obtained with the following call:" ] }, { "cell_type": "code", "execution_count": 8, "id": "acaaf356-eeae-4101-b5ef-090607dca5fc", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'text-orig-full': 'word'}" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "T.formats" ] }, { "cell_type": "markdown", "id": "76294b50-192f-47e0-95c2-09a1ca79fe17", "metadata": {}, "source": [ "Note that this data originates from file `otext.tf`:\n", "\n", "> \n", "```\n", "@config\n", "...\n", "@fmt:text-orig-full={word}{after}\n", "...\n", "```\n" ] }, { "cell_type": "markdown", "id": "d23c6817", "metadata": {}, "source": [ "## 3.5 - List of feature frequencies \n", "##### [Back to TOC](#TOC)\n", "\n", "This code generates a lot of output, so only the five most frequent values of each feature are printed." ] }, { "cell_type": "code", "execution_count": 7, "id": "75b2827e-81e1-4e28-a46c-bd50bc56a5aa", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Feature: after \n", "\n", "\t value\t frequency\n", "\t \t 119272\n", "\t , \t 9441\n", "\t . 
\t 5712\n", "\t · \t 2355\n", "\t ; \t 969\n", "\n", "\n", "Feature: book \n", "\n", "\t value\t frequency\n", "\t Luke \t 22801\n", "\t Matthew \t 21334\n", "\t Acts \t 21290\n", "\t John \t 18389\n", "\t Mark \t 13247\n", "\n", "\n", "Feature: booknum \n", "\n", "\t value\t frequency\n", "\t 3 \t 22801\n", "\t 1 \t 21334\n", "\t 5 \t 21290\n", "\t 4 \t 18389\n", "\t 2 \t 13247\n", "\n", "\n", "Feature: bookshort \n", "\n", "\t value\t frequency\n", "\t Luke \t 22801\n", "\t Matt \t 21334\n", "\t Acts \t 21290\n", "\t John \t 18389\n", "\t Mark \t 13247\n", "\n", "\n", "Feature: case \n", "\n", "\t value\t frequency\n", "\t \t 58261\n", "\t Nominative \t 24197\n", "\t Accusative \t 23031\n", "\t Genitive \t 19515\n", "\t Dative \t 12126\n", "\n", "\n", "Feature: chapter \n", "\n", "\t value\t frequency\n", "\t 1 \t 13795\n", "\t 2 \t 11590\n", "\t 3 \t 10239\n", "\t 4 \t 10187\n", "\t 5 \t 9270\n", "\n", "\n", "Feature: clause \n", "\n", "\t value\t frequency\n", "\t 1 \t 481\n", "\t 6 \t 347\n", "\t 44 \t 314\n", "\t 35 \t 310\n", "\t 4 \t 301\n", "\n", "\n", "Feature: clauserule \n", "\n", "\t value\t frequency\n", "\t CLaCL \t 1841\n", "\t Conj-CL \t 1740\n", "\t sub-CL \t 1525\n", "\t V-O \t 690\n", "\t V2CL \t 653\n", "\n", "\n", "Feature: clausetype \n", "\n", "\t value\t frequency\n", "\t VerbElided \t 1355\n", "\t Verbless \t 1330\n", "\t Minor \t 1161\n", "\n", "\n", "Feature: degree \n", "\n", "\t value\t frequency\n", "\t \t 137266\n", "\t Comparative \t 313\n", "\t Superlative \t 200\n", "\n", "\n", "Feature: formaltag \n", "\n", "\t value\t frequency\n", "\t CONJ \t 16316\n", "\t PREP \t 10568\n", "\t ADV \t 3808\n", "\t N-NSM \t 3475\n", "\t N-GSM \t 2935\n", "\n", "\n", "Feature: functionaltag \n", "\n", "\t value\t frequency\n", "\t CONJ \t 16316\n", "\t PREP \t 10568\n", "\t ADV \t 3808\n", "\t N-NSM \t 3475\n", "\t N-GSM \t 2935\n", "\n", "\n", "Feature: gloss \n", "\n", "\t value\t frequency\n", "\t the \t 9857\n", "\t and \t 6212\n", "\t - \t 
5496\n", "\t in \t 2320\n", "\t And \t 2218\n", "\n", "\n", "Feature: gn \n", "\n", "\t value\t frequency\n", "\t \t 63804\n", "\t Masculine \t 41486\n", "\t Feminine \t 18736\n", "\t Neuter \t 13753\n", "\n", "\n", "Feature: lemma \n", "\n", "\t value\t frequency\n", "\t ὁ \t 19783\n", "\t καί \t 8978\n", "\t αὐτός \t 5561\n", "\t σύ \t 2892\n", "\t δέ \t 2787\n", "\n", "\n", "Feature: lex_dom \n", "\n", "\t value\t frequency\n", "\t 092004 \t 26322\n", "\t \t 10487\n", "\t 089017 \t 4370\n", "\t 093001 \t 3672\n", "\t 033006 \t 3225\n", "\n", "\n", "Feature: ln \n", "\n", "\t value\t frequency\n", "\t 92.24 \t 19781\n", "\t \t 10488\n", "\t 92.11 \t 4718\n", "\t 89.92 \t 2903\n", "\t 89.87 \t 2756\n", "\n", "\n", "Feature: monad \n", "\n", "\t value\t frequency\n", "\t 1 \t 1\n", "\t 2 \t 1\n", "\t 3 \t 1\n", "\t 4 \t 1\n", "\t 5 \t 1\n", "\n", "\n", "Feature: mood \n", "\n", "\t value\t frequency\n", "\t \t 109422\n", "\t Indicative \t 15617\n", "\t Participle \t 6653\n", "\t Infinitive \t 2285\n", "\t Imperative \t 1877\n", "\n", "\n", "Feature: nodeID \n", "\n", "\t value\t frequency\n", "\t n40001001001 \t 1\n", "\t n40001001002 \t 1\n", "\t n40001001003 \t 1\n", "\t n40001001004 \t 1\n", "\t n40001001005 \t 1\n", "\n", "\n", "Feature: normalized \n", "\n", "\t value\t frequency\n", "\t καί \t 8576\n", "\t ὁ \t 2769\n", "\t δέ \t 2764\n", "\t ἐν \t 2684\n", "\t τοῦ \t 2497\n", "\n", "\n", "Feature: nu \n", "\n", "\t value\t frequency\n", "\t Singular \t 69846\n", "\t \t 38842\n", "\t Plural \t 29091\n", "\n", "\n", "Feature: number \n", "\n", "\t value\t frequency\n", "\t Singular \t 69846\n", "\t \t 38842\n", "\t Plural \t 29091\n", "\n", "\n" ] } ], "source": [ "FeatureList=Fall()\n", "LinesToPrint=5\n", "for Feature in FeatureList:\n", "    if Feature=='otype': continue # skip this feature; continue (rather than break) so the features after it are still listed\n", "    print ('Feature:',Feature,'\\n\\n\\t value\\t frequency')\n", "    FeatureFrequenceLists=Fs(Feature).freqList()\n", "    PrintedLine=0\n", "    for item, freq in FeatureFrequenceLists:\n", "        PrintedLine+=1\n", "        print ('\\t',item,'\\t',freq)\n", "        if PrintedLine==LinesToPrint: break\n", "    print ('\\n')" ] }, { "cell_type": "markdown", "id": "cba64820-a3e6-4b40-8a25-e0f95f2fd66e", "metadata": { "tags": [] }, "source": [ "## 3.6 - Frequency list of punctuations \n", "##### [Back to TOC](#TOC)\n", "\n", "Make a list of punctuation marks with their Unicode code points. The rows are printed with the markdown display function `dm`; because every row is emitted by a separate call, the output is not yet rendered as a single table." ] }, { "cell_type": "code", "execution_count": 7, "id": "c797fa57-d536-4471-b44d-d3a45653f34a", "metadata": {}, "outputs": [ { "data": { "text/markdown": [ " String | Unicode | Frequency\n", "--- | --- | ---" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ " ` ` | 32 | 119272 " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ " `,` | 44 | 9441 " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ " `.` | 46 | 5712 " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ " `·` | 183 | 2355 " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ " `;` | 59 | 969 " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ " `—` | 8212 | 30 " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "result = F.after.freqList()\n", "N1904.dm(\" String | Unicode | Frequency\\n--- | --- | ---\")\n", "for (string, freq) in result:\n", " # Note: for punctuation the string holds two characters; only the first one is shown\n", " frequency=str(freq) # convert the count to a string\n", " unicode_value = str(ord(string[0])) # code point of the first character, as a string\n", " N1904.dm(\" `{}` | {} | {} \".format(string[0],unicode_value,frequency)) " ] }, { "cell_type": "markdown", "id": "b3cbf04f", "metadata": {}, "source": [ "## 3.7 - Node number ranges \n", "##### [Back to TOC](#TOC)\n", "\n", "The node number range of each node type is available by calling `F.otype.sInterval()` for every type in `F.otype.all`, which holds all node types. " ] }, { "cell_type": "code", "execution_count": 8, "id": "20dd1920", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "book (137780, 137806)\n", "chapter (137807, 138066)\n", "sentence (226865, 232584)\n", "verse (232585, 240527)\n", "clause (138067, 154190)\n", "phrase (154191, 226864)\n", "word (1, 137779)\n" ] } ], "source": [ "for NodeType in F.otype.all:\n", " print (NodeType, F.otype.sInterval(NodeType))" ] }, { "cell_type": "markdown", "id": "86e62381-0fdd-4e56-8855-11e8c73aec7e", "metadata": {}, "source": [ "## 3.8 - Count the objects per type \n", "##### [Back to TOC](#TOC)\n", "\n", "Iterating over the same `F.otype.all`, we can also count the number of nodes of each type." ] }, { "cell_type": "code", "execution_count": 9, "id": "dc4b5cae-9f19-4a42-aa9e-6decf3df4c2f", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 27 books\n", " 260 chapters\n", " 5720 sentences\n", " 7943 verses\n", " 16124 clauses\n", " 72674 phrases\n", " 137779 words\n" ] } ], "source": [ "for otype in F.otype.all:\n", "    i = 0\n", "    for n in F.otype.s(otype):\n", "        i += 1\n", "    print (\"{:>7} {}s\".format(i, otype))" ] }, { "cell_type": "code", "execution_count": 8, "id": "c5730f29-e9d8-4483-9493-b31b7efbdafd", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
Job:
Ellipsis
\n", "
\n", "
\n", "
Author:
program author
\n", "
\n", "
\n", "
Created:
2023-07-24T17:00:59+02:00
\n", "
\n", "
\n", "
Data:
\n", "
Nestle 1904 (GBI nodes)
\n", "
\n", "
\n", "
version
\n", "
0.4
\n", "
\n", "
\n", "
release
\n", "
0.4
\n", "
\n", " \n", "
\n", "
DOI
\n", "
no DOI
\n", "
\n", " \n", "
\n", "
Tool:
\n", "
Text-Fabric 11.4.10 10.5281/zenodo.592193
\n", "
\n", "
\n", "
TF App:
\n", "
tonyjurg/Nestle1904GBI on GitHub
\n", "
\n", "
\n", "
commit
\n", " \n", "
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "N1904.showProvenance(...)" ] }, { "cell_type": "markdown", "id": "5c320845-6984-4939-961b-69189baa3cb8", "metadata": {}, "source": [ "# 4 - Required libraries \n", "##### [Back to TOC](#TOC)\n", "\n", "The scripts in this notebook require (besides `text-fabric`) the following Python libraries to be installed in the environment:\n", "\n", " tabulate\n", "\n", "You can install any missing library from within Jupyter Notebook using either `pip` or `pip3`." ] }, { "cell_type": "code", "execution_count": null, "id": "c858fc76-e203-48a2-8c8c-4dc90f35effc", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.5" } }, "nbformat": 4, "nbformat_minor": 5 }