# Some corpus statistics (Nestle1904GBI)

## Table of contents <a class="anchor" id="TOC"></a>
* <a href="#bullet1">1 - Introduction</a>
* <a href="#bullet2">2 - Load Text-Fabric app and data</a>
* <a href="#bullet3">3 - Performing the queries</a>
    * <a href="#bullet3x1">3.1 - The 25 most frequent words in the corpus</a>
    * <a href="#bullet3x2">3.2 - Frequency of characters in corpus</a>
    * <a href="#bullet3x3">3.3 - Some stats on node types</a>    
    * <a href="#bullet3x4">3.4 - The available text formats</a>    
    * <a href="#bullet3x5">3.5 - List of feature frequencies</a> 
    * <a href="#bullet3x6">3.6 - Frequency list of punctuations</a>
    * <a href="#bullet3x7">3.7 - Node number ranges</a>
    * <a href="#bullet3x8">3.8 - Count the objects per type</a>
* <a href="#bullet4">4 - Required libraries</a>

# 1 - Introduction <a class="anchor" id="bullet1"></a>
##### [Back to TOC](#TOC)

This Jupyter Notebook showcases several examples of statistical analysis performed on a Text-Fabric corpus. For demonstration purposes, various methods of collecting and presenting the data are used.

# 2 - Load Text-Fabric app and data <a class="anchor" id="bullet2"></a>
##### [Back to TOC](#TOC)

In [1]:
%load_ext autoreload
%autoreload 2

In [1]:
# Loading the Text-Fabric code
# Note: it is assumed Text-Fabric is installed in your environment
from tf.fabric import Fabric
from tf.app import use

In [2]:
# load the N1904 app and data
N1904 = use ("tonyjurg/Nestle1904GBI", version="0.4", hoist=globals())

**Locating corpus resources ...**

The requested app is not available offline
	~/text-fabric-data/github/tonyjurg/Nestle1904GBI/app not found


The requested data is not available offline
	~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4 not found


   |     0.19s T otype                from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
   |     1.85s T oslots               from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
   |     0.68s T book                 from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
   |     0.53s T chapter              from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
   |     0.51s T verse                from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
   |     0.64s T word                 from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
   |     0.50s T after                from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
   |      |     0.05s C __levels__           from otype, oslots, otext
   |      |     1.62s C __order__            from otype, oslots, __levels__
   |      |     0.07s C __rank__             from otype, __order__
   |      |     2.23s C __levUp__            from otype, oslots, __rank__
   |      |     1.4

Name | # of nodes | # slots / node | % coverage
--- | --- | --- | ---
book | 27 | 5102.93 | 100
chapter | 260 | 529.92 | 100
sentence | 5720 | 24.09 | 100
verse | 7943 | 17.35 | 100
clause | 16124 | 8.54 | 100
phrase | 72674 | 1.9 | 100
word | 137779 | 1.0 | 100


In [3]:
# The following will push the Text-Fabric stylesheet to this notebook (to facilitate proper display with notebook viewer)
N1904.dh(N1904.getCss())

# 3 - Performing the queries <a class="anchor" id="bullet3"></a>
##### [Back to TOC](#TOC)

## 3.1 - The 25 most frequent words in the corpus<a class="anchor" id="bullet3x1"></a>
##### [Back to TOC](#TOC)

The method [`freqList`](https://annotation.github.io/text-fabric/tf/core/nodefeature.html#tf.core.nodefeature.NodeFeature.freqList) returns a tuple of (value, frequency) items, ordered by frequency, highest frequencies first.

In [4]:
print("Amount\tword")
for (w, amount) in F.word.freqList({"word"})[0:25]:
    print(f"{amount}\t{w}")

Amount	word
8541	καὶ
2768	ὁ
2683	ἐν
2620	δὲ
2497	τοῦ
1755	εἰς
1657	τὸ
1556	τὸν
1518	τὴν
1410	αὐτοῦ
1300	τῆς
1281	ὅτι
1221	τῷ
1201	τῶν
1068	οἱ
941	ἡ
921	γὰρ
902	μὴ
859	τῇ
849	αὐτῷ
817	τὰ
767	οὐκ
722	τοὺς
688	Θεοῦ
670	πρὸς
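
As a rough illustration of the shape of such a result, the following plain-Python sketch builds a comparable frequency list with `collections.Counter`. The helper name `freq_list` and the sample words are made up for this example and are not part of Text-Fabric; breaking ties by value is an assumption of the sketch.

```python
from collections import Counter


def freq_list(values):
    """Return (value, frequency) pairs, highest frequency first.

    Ties are broken by value (an assumption made for this sketch).
    """
    counts = Counter(values)
    return tuple(sorted(counts.items(), key=lambda item: (-item[1], item[0])))


# Hypothetical sample, not taken from the corpus
sample = ["καὶ", "ὁ", "καὶ", "ἐν", "ὁ", "καὶ"]

# Print only the two most frequent values, similar to the [0:25] slice above
for value, amount in freq_list(sample)[:2]:
    print(f"{amount}\t{value}")
```

Slicing the returned tuple is what limits the listing to the top entries, so the full list stays available for other uses.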


## 3.2 - Frequency of characters in corpus <a class="anchor" id="bullet3x2"></a>
##### [Back to TOC](#TOC)

This code generates a table that displays the frequency of characters within the Text-Fabric corpus. The API call `C.characters.data` produces a Python dictionary that contains the data. The remaining code unpacks and sorts this structure to present the results in a formatted table.

Note that the first line of the output is 'Format:  text-orig-full': this is the name of the text format for which the characters were counted.

In [6]:
# Library to format table
from tabulate import tabulate

# The following API call will result in a Python dictionary structure
FrequencyDictionary=C.characters.data

# Present the results
KeyList = list(FrequencyDictionary.keys())
for Key in KeyList:
    print('Format: ',Key)
    # 'Key' refers to the pre-defined format in which the text is displayed
    FrequencyList=FrequencyDictionary[Key]
    SortedFrequencyList=sorted(FrequencyList, key=lambda x: x[1], reverse=True)
    
    # In this example the table will be truncated to the first 15 entries
    max_rows = 15  # Set your desired number of rows here
    TruncatedTable = SortedFrequencyList[:max_rows]
    
    headers = ["character", "frequency"]
    print(tabulate(TruncatedTable, headers=headers, tablefmt='fancy_grid'))
    
    # Add a warning using markdown (API call N1904.dm) so that it is printed in bold type
    N1904.dm("**Warning: table truncated!**")

Format:  text-orig-full
╒═════════════╤═════════════╕
│ character   │   frequency │
╞═════════════╪═════════════╡
│             │      137779 │
├─────────────┼─────────────┤
│ ν           │       56230 │
├─────────────┼─────────────┤
│ α           │       51892 │
├─────────────┼─────────────┤
│ τ           │       50599 │
├─────────────┼─────────────┤
│ ο           │       45151 │
├─────────────┼─────────────┤
│ ε           │       38597 │
├─────────────┼─────────────┤
│ ς           │       27090 │
├─────────────┼─────────────┤
│ ι           │       26131 │
├─────────────┼─────────────┤
│ σ           │       24095 │
├─────────────┼─────────────┤
│ ρ           │       22871 │
├─────────────┼─────────────┤
│ κ           │       22630 │
├─────────────┼─────────────┤
│ π           │       20308 │
├─────────────┼─────────────┤
│ μ           │       19218 │
├─────────────┼─────────────┤
│ λ           │       18228 │
├─────────────┼─────────────┤
│ δ           │       12476 │
╘═════════════╧═════════════╛

**Warning: table truncated!**
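
The sorting and truncation steps can be tried without a loaded corpus. The sketch below uses a small hand-made dictionary with the same shape as the one used above (format name mapped to a list of (character, frequency) pairs); the counts are invented and `top_rows` is a hypothetical helper. Unlike the cell above, it only prints the warning when rows were actually dropped.

```python
def top_rows(frequency_list, max_rows):
    """Sort (character, frequency) pairs by frequency and keep the first max_rows.

    Returns the kept rows and a flag telling whether anything was cut off.
    """
    ordered = sorted(frequency_list, key=lambda pair: pair[1], reverse=True)
    return ordered[:max_rows], len(ordered) > max_rows


# Invented data with the same shape as the dictionary used above
toy_data = {"text-orig-full": [("α", 5), (" ", 9), ("ν", 7), ("τ", 2)]}

for format_name, frequency_list in toy_data.items():
    rows, truncated = top_rows(frequency_list, max_rows=3)
    print("Format:", format_name)
    for character, frequency in rows:
        print(repr(character), frequency)
    if truncated:
        print("Warning: table truncated!")
```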

## 3.3 - Some stats on node types <a class="anchor" id="bullet3x3"></a>
##### [Back to TOC](#TOC)

In [44]:
C.levels.data

(('book', 5102.925925925926, 137780, 137806),
 ('chapter', 529.9192307692308, 137807, 138066),
 ('sentence', 24.087237762237763, 226865, 232584),
 ('verse', 17.345965000629484, 232585, 240527),
 ('clause', 8.54496402877698, 138067, 154190),
 ('phrase', 1.8958499600957701, 154191, 226864),
 ('word', 1, 1, 137779))
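
Each tuple above appears to hold the node type, the average number of slots per node, and the first and last node number of that type; this reading is an assumption based on the values shown here and in the table printed when the corpus was loaded. Under that assumption, the sketch below gives the fields names and lists the node types in node-number order. The names `Level`, `levels` and `by_node_number` are introduced for illustration only.

```python
from collections import namedtuple

# Field names are chosen for this sketch; the output above shows plain tuples
Level = namedtuple("Level", ["node_type", "avg_slots", "first_node", "last_node"])

# Values copied from the C.levels.data output shown above
levels = [
    Level("book", 5102.925925925926, 137780, 137806),
    Level("chapter", 529.9192307692308, 137807, 138066),
    Level("sentence", 24.087237762237763, 226865, 232584),
    Level("verse", 17.345965000629484, 232585, 240527),
    Level("clause", 8.54496402877698, 138067, 154190),
    Level("phrase", 1.8958499600957701, 154191, 226864),
    Level("word", 1, 1, 137779),
]

# The listing runs from the largest to the smallest average size,
# whereas the node numbers follow a different order
by_node_number = sorted(levels, key=lambda level: level.first_node)

for level in by_node_number:
    print(f"{level.node_type:<9}{level.first_node:>7} - {level.last_node:>7}")
```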

## 3.4 - The available text formats <a class="anchor" id="bullet3x4"></a>
##### [Back to TOC](#TOC)

Although not strictly a statistical function, this is still important in relation to the corpus. The output of this command lists the formats available for presenting the text of the corpus. See also [module tf.advanced.options, Display Settings](https://annotation.github.io/text-fabric/tf/advanced/options.html).

In [19]:
N1904.showFormats()

format | level | template
--- | --- | ---
`text-orig-full` | **word** | `{word}{after}`


The same result (although formatted differently, since a dictionary is returned) can be obtained with the following call:

In [8]:
T.formats

{'text-orig-full': 'word'}

Note that this data originates from file `otext.tf`:

```
@config
...
@fmt:text-orig-full={word}{after}
...
```
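
How a template such as `{word}{after}` turns feature values into running text can be sketched in plain Python with `str.format`. The tokens below are invented placeholder data, and `render` is a hypothetical helper rather than a Text-Fabric function.

```python
TEMPLATE = "{word}{after}"  # the template defined for text-orig-full


def render(tokens, template=TEMPLATE):
    """Fill the template for every token and concatenate the results."""
    return "".join(template.format(**token) for token in tokens)


# Invented placeholder tokens; `after` holds the trailing space or punctuation
tokens = [
    {"word": "alpha", "after": " "},
    {"word": "beta", "after": ", "},
    {"word": "gamma", "after": ". "},
]

print(render(tokens))
```

Because the spacing and punctuation live in a separate feature, the words themselves stay clean for counting while the full text can still be rebuilt.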


## 3.5 - List of feature frequencies <a class="anchor" id="bullet3x5"></a>
##### [Back to TOC](#TOC)

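This code loops over all node features returned by `Fall()` and prints the most frequent values of each one (at most `LinesToPrint`, here 5). Even so, it generates a lot of output!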

In [7]:
FeatureList=Fall()
LinesToPrint=5
for Feature in FeatureList:
    if Feature=='otype': continue # this feature needs to be skipped.
    print ('Feature:',Feature,'\n\n\t value\t frequency')
    FeatureFrequenceLists=Fs(Feature).freqList()
    PrintedLine=0
    for item, freq in FeatureFrequenceLists:
        PrintedLine+=1
        print ('\t',item,'\t',freq)
        if PrintedLine==LinesToPrint: break
    print ('\n')

Feature: after 

	 value	 frequency
	   	 119272
	 ,  	 9441
	 .  	 5712
	 ·  	 2355
	 ;  	 969


Feature: book 

	 value	 frequency
	 Luke 	 22801
	 Matthew 	 21334
	 Acts 	 21290
	 John 	 18389
	 Mark 	 13247


Feature: booknum 

	 value	 frequency
	 3 	 22801
	 1 	 21334
	 5 	 21290
	 4 	 18389
	 2 	 13247


Feature: bookshort 

	 value	 frequency
	 Luke 	 22801
	 Matt 	 21334
	 Acts 	 21290
	 John 	 18389
	 Mark 	 13247


Feature: case 

	 value	 frequency
	  	 58261
	 Nominative 	 24197
	 Accusative 	 23031
	 Genitive 	 19515
	 Dative 	 12126


Feature: chapter 

	 value	 frequency
	 1 	 13795
	 2 	 11590
	 3 	 10239
	 4 	 10187
	 5 	 9270


Feature: clause 

	 value	 frequency
	 1 	 481
	 6 	 347
	 44 	 314
	 35 	 310
	 4 	 301


Feature: clauserule 

	 value	 frequency
	 CLaCL 	 1841
	 Conj-CL 	 1740
	 sub-CL 	 1525
	 V-O 	 690
	 V2CL 	 653


Feature: clausetype 

	 value	 frequency
	 VerbElided 	 1355
	 Verbless 	 1330
	 Minor 	 1161


Feature: degree 

	 value	 frequency
	  	 
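
A variant of the loop above that uses slicing, and that keeps processing the remaining features after one has been skipped, can be sketched as follows. It runs on a hand-made dictionary of frequency lists instead of a loaded corpus; `toy_frequencies` and `top_values` are hypothetical names and the numbers are invented.

```python
def top_values(frequency_lists, lines_to_print=5, skip=("otype",)):
    """Return the first few (value, frequency) pairs for every feature not skipped."""
    return {
        feature: list(freq_list[:lines_to_print])
        for feature, freq_list in frequency_lists.items()
        if feature not in skip  # leave this feature out, but keep the others
    }


# Invented frequency lists, already ordered by descending frequency
toy_frequencies = {
    "after": ((" ", 6), (",", 3), (".", 1)),
    "otype": (("word", 10),),
    "case": (("", 5), ("Nominative", 3), ("Dative", 2)),
}

for feature, rows in top_values(toy_frequencies, lines_to_print=2).items():
    print("Feature:", feature, "\n\n\t value\t frequency")
    for value, frequency in rows:
        print("\t", value, "\t", frequency)
    print("\n")
```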

## 3.6 - Frequency list of punctuations <a class="anchor" id="bullet3x6"></a>
##### [Back to TOC](#TOC)

Make a list of punctuation marks with their Unicode code points (in decimal). The function `N1904.dm` prints markdown-formatted strings. The rows are collected first and passed to it in a single call, so that they are rendered as one table.

In [7]:
result = F.after.freqList()
# Collect all rows first: calling dm() once per row would render every row as a separate paragraph
rows = [" String | Unicode | Frequency", "--- | --- | ---"]
for (string, freq) in result:
    # important: for punctuation the string holds two characters (the mark and a trailing space),
    # so only the first character is used
    unicode_value = ord(string[0])  # decimal Unicode code point
    rows.append(" `{}` | {} | {} ".format(string[0], unicode_value, freq))
N1904.dm("\n".join(rows))

 String | Unicode | Frequency
--- | --- | ---
 ` ` | 32 | 119272 
 `,` | 44 | 9441 
 `.` | 46 | 5712 
 `·` | 183 | 2355 
 `;` | 59 | 969 
 `—` | 8212 | 30 
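
The decimal values in this list can be cross-checked with Python's standard `unicodedata` module, which also gives the official character names. This is a small stand-alone sketch; the character list is copied by hand from the output above, and `describe` is a helper name introduced here.

```python
import unicodedata

# The characters from the frequency list above, written as escapes where needed
characters = [" ", ",", ".", "\u00b7", ";", "\u2014"]


def describe(character):
    """Return (decimal code point, U+XXXX notation, official Unicode name)."""
    code_point = ord(character)
    return code_point, f"U+{code_point:04X}", unicodedata.name(character)


for character in characters:
    decimal, notation, name = describe(character)
    print(f"{character!r}\t{decimal}\t{notation}\t{name}")
```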

## 3.7 - Node number ranges <a class="anchor" id="bullet3x7"></a>
##### [Back to TOC](#TOC)

The node number range of each node type is readily available: `F.otype.all` holds all node types, and `F.otype.sInterval` returns the first and last node number for a given type.

In [8]:
for NodeType in F.otype.all:
    print (NodeType, F.otype.sInterval(NodeType))

book (137780, 137806)
chapter (137807, 138066)
sentence (226865, 232584)
verse (232585, 240527)
clause (138067, 154190)
phrase (154191, 226864)
word (1, 137779)


## 3.8 - Count the objects per type <a class="anchor" id="bullet3x8"></a>
##### [Back to TOC](#TOC)

Using the same list of node types, we can also count the number of nodes of each type.

In [9]:
for otype in F.otype.all:
    i = 0
    for n in F.otype.s(otype):
        i += 1
    print ("{:>7} {}s".format(i, otype))

     27 books
    260 chapters
   5720 sentences
   7943 verses
  16124 clauses
  72674 phrases
 137779 words
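
Since the intervals printed in section 3.7 give a first and a last node number per type, the same counts can apparently be derived without iterating, as last node minus first node plus one. The sketch below checks this on values copied by hand from that output; the dictionary `intervals` is a stand-in for the `F.otype.sInterval` results, and `node_count` is a hypothetical helper.

```python
# (first node, last node) per type, copied from the output in section 3.7
intervals = {
    "book": (137780, 137806),
    "chapter": (137807, 138066),
    "sentence": (226865, 232584),
    "verse": (232585, 240527),
    "clause": (138067, 154190),
    "phrase": (154191, 226864),
    "word": (1, 137779),
}


def node_count(interval):
    """Number of nodes in an inclusive (first, last) node range."""
    first, last = interval
    return last - first + 1


counts = {node_type: node_count(interval) for node_type, interval in intervals.items()}

for node_type, count in counts.items():
    print("{:>7} {}s".format(count, node_type))
```

The results agree with the counts obtained by iteration above, which suggests that every node type occupies one unbroken range of node numbers in this corpus.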


In [8]:
N1904.showProvenance(...)

# 4 - Required libraries <a class="anchor" id="bullet4"></a>
##### [Back to TOC](#TOC)

The scripts in this notebook require (besides `text-fabric`) the following Python libraries to be installed in the environment:

    tabulate

You can install any missing library from within Jupyter Notebook using either `pip` or `pip3`.