# Various text formats (N1904-TF)

## Table of contents (TOC) <a class="anchor" id="TOC"></a>
* <a href="#bullet1">1 - Introduction</a>
    * <a href="#bullet1x1">1.1 - Naming schema for text formatting</a>
* <a href="#bullet2">2 - Load Text-Fabric app and data</a>
* <a href="#bullet3">3 - Examining the text formats</a>
    * <a href="#bullet3x1">3.1 - Display the text formatting options available for this corpus</a>
    * <a href="#bullet3x2">3.2 - Showcasing the various formats</a>
    * <a href="#bullet3x3">3.3 - Transliterated text</a>
    * <a href="#bullet3x4">3.4 - Text with text critical markers</a>
    * <a href="#bullet3x5">3.5 - Nestle version 1904 and version 1913 (Mark 1:1)</a>
* <a href="#bullet4">4 - Notebook version</a>

# 1 - Introduction <a class="anchor" id="bullet1"></a>
##### [Back to TOC](#TOC)

This Jupyter Notebook is designed to demonstrate the predefined text formats available in this Text-Fabric dataset, specifically focusing on displaying the Greek surface text of the New Testament.

Text-Fabric's data design allows for flexible representation of the corpus text but requires at least one text format to be specified as its default (in this dataset: text-orig-full). During the creation of the dataset, additional formats relevant to this corpus were defined. These are largely built from a subset of the following features related to the surface text:

   * [after](https://centerblc.github.io/N1904/features/after.html#start): All material found after a word (including text-critical signs).
   * [before](https://centerblc.github.io/N1904/features/before.html#start): All material found before a word.
   * [criticalsign](https://centerblc.github.io/N1904/features/criticalsign.html#start): Text-critical signs.
   * [normalized](https://centerblc.github.io/N1904/features/normalized.html#start): Normalized Greek text.
   * [punctuation](https://centerblc.github.io/N1904/features/punctuation.html#start): Punctuation found after a word.
   * [text](https://centerblc.github.io/N1904/features/text.html#start): Word without punctuation and text-critical signs.
   * [trailer](https://centerblc.github.io/N1904/features/trailer.html#start): All material found after a word (excluding text-critical signs).
   * [translit](https://centerblc.github.io/N1904/features/translit.html#start): Transliteration of the word surface texts.
   * [unaccent](https://centerblc.github.io/N1904/features/unaccent.html#start): Word without accents and diacritical markers.
   * [unicode](https://centerblc.github.io/N1904/features/unicode.html#start): Unicode presentation including all material before and after word.

The following image shows how these features relate to the surface text.

<img src="https://github.com/CenterBLC/N1904/raw/main/docs/features/images/details_surface_features.png" width="400" >

## 1.1 - Naming schema for text formatting<a class="anchor" id="bullet1x1"></a>

The text formats in this Text-Fabric database are identified by unique names that reflect their actual formats. These names follow a structured naming schema, consisting of a string of keywords separated by hyphens (-).

```
what-how-fullness
```

In our database the following keywords are used:

<style>
  table.custom-table { float: left; border-collapse: collapse; width: 500px;}
  table.custom-table th, table.custom-table td { border: 1px solid black; padding: 8px; }
</style>

<table class="custom-table">
  <tr><th>Keyword</th> <th>Value</th> <th>Meaning</th></tr>
  <tr><td>what</td><td>text</td><td>words as they belong to the text</td></tr>
  <tr><td>what</td><td>lex</td><td>lexemes of the words</td></tr>
  <tr><td>how</td><td>orig</td><td>the original Greek script (all Unicode)</td></tr>
  <tr><td>how</td><td>unaccent</td><td>the original Greek script without accents</td></tr>
  <tr><td>how</td><td>translit</td><td>transliteration into Latin alphabet</td></tr>
  <tr><td>fullness</td><td>full</td><td>complete text with text-critical markers</td></tr>
  <tr><td>fullness</td><td>plain</td><td>complete text without text-critical markers</td></tr>
</table>


Not all possible combinations are defined or relevant. The following text-formatting options are defined:

<table class="custom-table">
  <tr><th>Format</th><th>Usage</th><th>Template</th></tr>
  <tr><td>lex-orig-plain</td><td>Lexemes of the Greek surface text</td>
    <td><a href="https://centerblc.github.io/N1904/features/lemma.html#start" target="_blank">{lemma}</a><a href="https://centerblc.github.io/N1904/features/trailer.html#start" target="_blank">{trailer}</a></td>
  </tr>
  <tr><td>lex-translit-plain</td><td>Transliteration of the lexemes of the Greek surface text</td>
    <td><a href="https://centerblc.github.io/N1904/features/lemmatranslit.html#start" target="_blank">{lemmatranslit}</a>
        <a href="https://centerblc.github.io/N1904/features/trailer.html#start" target="_blank">{trailer}</a></td>
  </tr>
  <tr><td>text-orig-full (default)</td><td>The Greek surface text in Unicode including text-critical markers</td>
    <td><a href="https://centerblc.github.io/N1904/features/before.html#start" target="_blank">{before}</a>
        <a href="https://centerblc.github.io/N1904/features/text.html#start" target="_blank">{text}</a>
        <a href="https://centerblc.github.io/N1904/features/after.html#start" target="_blank">{after}</a></td>
  </tr>
  <tr><td>text-orig-plain</td><td>The Greek surface text in Unicode</td>
    <td><a href="https://centerblc.github.io/N1904/features/text.html#start" target="_blank">{text}</a>
        <a href="https://centerblc.github.io/N1904/features/trailer.html#start" target="_blank">{trailer}</a></td>
  </tr>
  <tr><td>text-translit-plain</td><td>Transliteration of the Greek surface text</td>
    <td><a href="https://centerblc.github.io/N1904/features/translit.html#start" target="_blank">{translit}</a>
        <a href="https://centerblc.github.io/N1904/features/trailer.html#start" target="_blank">{trailer}</a></td>
  </tr>
  <tr><td>text-unaccent-plain</td><td>The Greek surface text in Unicode without accents</td>
    <td><a href="https://centerblc.github.io/N1904/features/unaccent.html#start" target="_blank">{unaccent}</a>
        <a href="https://centerblc.github.io/N1904/features/trailer.html#start" target="_blank">{trailer}</a></td>
  </tr>
</table>
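
As a minimal sketch of the naming schema, the snippet below splits a format name into its three keywords and checks each one against the keyword table above. The helper `parse_format_name` and the `KEYWORDS` dictionary are illustrative names introduced here; they are not part of Text-Fabric.

```python
# Allowed values per keyword, copied from the keyword table above.
KEYWORDS = {
    "what": {"text", "lex"},
    "how": {"orig", "unaccent", "translit"},
    "fullness": {"full", "plain"},
}


def parse_format_name(name):
    """Split a format name such as 'text-orig-full' into its keywords.

    Hypothetical helper for illustration; not part of Text-Fabric.
    """
    parts = name.split("-")
    if len(parts) != 3:
        raise ValueError(f"expected what-how-fullness, got {name!r}")
    parsed = dict(zip(("what", "how", "fullness"), parts))
    for keyword, value in parsed.items():
        if value not in KEYWORDS[keyword]:
            raise ValueError(f"unknown {keyword} value {value!r} in {name!r}")
    return parsed


for name in ("text-orig-full", "lex-translit-plain", "text-unaccent-plain"):
    print(name, "->", parse_format_name(name))
```

A name that passes this check is only syntactically valid; whether the format actually exists in the dataset still depends on the list of defined formats.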


# 2 - Load Text-Fabric app and data <a class="anchor" id="bullet2"></a>
##### [Back to TOC](#TOC)

In [1]:
%load_ext autoreload
%autoreload 2

In [3]:
# Loading the Text-Fabric code
# Note: it is assumed Text-Fabric is installed in your environment
from tf.fabric import Fabric
from tf.app import use

In [5]:
# load the N1904 app and data
N1904 = use("CenterBLC/N1904", version="1.0.0", hoist=globals())

**Locating corpus resources ...**

Name,# of nodes,# slots / node,% coverage
book,27,5102.93,100
chapter,260,529.92,100
verse,7944,17.34,100
sentence,8011,17.2,100
group,8945,7.01,46
clause,42506,8.36,258
wg,106868,6.88,533
phrase,69007,1.9,95
subphrase,116178,1.6,135
word,137779,1.0,100


Display is setup for viewtype [syntax-view](https://github.com/CenterBLC/N1904/blob/main/docs/syntax-view.md#start)

See [here](https://github.com/CenterBLC/N1904/blob/main/docs/viewtypes.md#start) for more information on viewtypes

In [7]:
# The following will push the Text-Fabric stylesheet to this notebook (to facilitate proper display with notebook viewer)
N1904.dh(N1904.getCss())

# 3 - Examining the text formats<a class="anchor" id="bullet3"></a>
##### [Back to TOC](#TOC)

## 3.1 - Display the text formatting options available for this corpus<a class="anchor" id="bullet3x1"></a>

The output of the following command provides details on available formats to present the text of the corpus. 

See also <a href="https://annotation.github.io/text-fabric/tf/advanced/options.html" target="_blank">module tf.advanced.options
Display Settings</a>.

In [9]:
N1904.showFormats()

format | level | template
--- | --- | ---
`lex-orig-plain` | **word** | `{lemma}{trailer}`
`lex-translit-plain` | **word** | `{lemmatranslit}{trailer}`
`text-orig-full` | **word** | `{before}{text}{after}`
`text-orig-plain` | **word** | `{text}{trailer}`
`text-translit-plain` | **word** | `{translit}{trailer}`
`text-unaccent-plain` | **word** | `{unaccent}{trailer}`


Note 1: This data originates from the file [`otext.tf`](https://github.com/CenterBLC/N1904/blob/main/tf/1.0.0/otext.tf):

> 
```
@config
...
@fmt:text-orig-full={before}{text}{after}
...
```
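
The `@fmt:` lines have a simple `name=template` shape, so they can be read with a few lines of plain Python. The sketch below parses a small sample string. The sample is assembled from the format table shown above rather than copied from the actual `otext.tf` file, and `parse_formats` is a hypothetical helper, not a Text-Fabric function.

```python
# Sample configuration text (assumption: built from the format table
# above, not a verbatim copy of the real otext.tf file).
SAMPLE_OTEXT = """\
@config
@fmt:text-orig-full={before}{text}{after}
@fmt:text-orig-plain={text}{trailer}
"""


def parse_formats(config_text):
    """Return {format name: template} for every '@fmt:' line."""
    prefix = "@fmt:"
    formats = {}
    for line in config_text.splitlines():
        if line.startswith(prefix):
            # Split on the first '=' only, so a template may contain '='.
            name, template = line[len(prefix):].split("=", 1)
            formats[name] = template
    return formats


print(parse_formats(SAMPLE_OTEXT))
```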


Note 2: The names of the available formats can also be obtained with the following call. However, this does not show which features are included in each format. `T.formats` is a dictionary that maps each format name to the node type it applies to, so it can easily be post-processed:

In [11]:
T.formats

{'lex-orig-plain': 'word',
 'lex-translit-plain': 'word',
 'text-orig-full': 'word',
 'text-orig-plain': 'word',
 'text-translit-plain': 'word',
 'text-unaccent-plain': 'word'}
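
The output above is a plain dictionary, so it can be post-processed with ordinary Python. The sketch below works on a literal copy of that output (so it runs without Text-Fabric) and groups the format names by their first keyword. In a live session `T.formats` could be passed instead. `group_by_what` is an illustrative name introduced here.

```python
from collections import defaultdict

# Literal copy of the T.formats output shown above.
formats = {
    'lex-orig-plain': 'word',
    'lex-translit-plain': 'word',
    'text-orig-full': 'word',
    'text-orig-plain': 'word',
    'text-translit-plain': 'word',
    'text-unaccent-plain': 'word',
}


def group_by_what(formats):
    """Group format names by their first keyword ('lex' or 'text')."""
    groups = defaultdict(list)
    for name in formats:
        what = name.split("-", 1)[0]
        groups[what].append(name)
    return dict(groups)


grouped = group_by_what(formats)
for what, names in grouped.items():
    print(f"{what}: {', '.join(names)}")
```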

## 3.2 - Showcasing the various formats<a class="anchor" id="bullet3x2"></a>

This section will demonstrate the differences in how various text formats are displayed, using the verse Mark 1:1 as an example. To locate the corresponding verse node for Mark 1:1 in this dataset, the following command can be executed.

In [13]:
T.nodeFromSection(['Mark', 1, 1])

383782

The returned integer represents the numeric value of the verse node for Mark 1:1. This value can now be used in the following Python snippet to iterate through the defined text formats.

In [15]:
for fmt in T.formats:
    print(f'fmt={fmt}\t: {T.text(383782, fmt=fmt)}')

fmt=lex-orig-plain	: ἀρχή ὁ εὐαγγέλιον Ἰησοῦς Χριστός υἱός θεός. 
fmt=lex-translit-plain	: arkhe o euaggelion Iesous Khristos uios theos. 
fmt=text-orig-full	: Ἀρχὴ τοῦ εὐαγγελίου Ἰησοῦ Χριστοῦ (Υἱοῦ Θεοῦ). 
fmt=text-orig-plain	: Ἀρχὴ τοῦ εὐαγγελίου Ἰησοῦ Χριστοῦ Υἱοῦ Θεοῦ. 
fmt=text-translit-plain	: Arkhe tou euaggeliou Iesou Khristou Uiou Theou. 
fmt=text-unaccent-plain	: Αρχη του ευαγγελιου Ιησου Χριστου Υιου Θεου. 
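
To make the role of the templates concrete, the sketch below fills two of them for Mark 1:1 using plain `str.format`. The per-word feature values are assumptions reconstructed from the printed output (in particular how the parentheses are split over `before` and `after`); they were not read from the dataset, and Text-Fabric's own template handling may differ in detail.

```python
# Assumed per-word feature values for Mark 1:1 (illustration only;
# reconstructed from the printed output, not read from the dataset).
words = [
    {"before": "",  "text": "Ἀρχὴ",       "after": " ",   "trailer": " "},
    {"before": "",  "text": "τοῦ",        "after": " ",   "trailer": " "},
    {"before": "",  "text": "εὐαγγελίου", "after": " ",   "trailer": " "},
    {"before": "",  "text": "Ἰησοῦ",      "after": " ",   "trailer": " "},
    {"before": "",  "text": "Χριστοῦ",    "after": " ",   "trailer": " "},
    {"before": "(", "text": "Υἱοῦ",       "after": " ",   "trailer": " "},
    {"before": "",  "text": "Θεοῦ",       "after": "). ", "trailer": ". "},
]

# Templates as listed by N1904.showFormats().
TEMPLATES = {
    "text-orig-full": "{before}{text}{after}",
    "text-orig-plain": "{text}{trailer}",
}


def render(words, template):
    """Fill the template once per word and concatenate the results."""
    return "".join(template.format(**word) for word in words)


full = render(words, TEMPLATES["text-orig-full"])
plain = render(words, TEMPLATES["text-orig-plain"])
print(repr(full))
print(repr(plain))
```

Under these assumed values the two renderings differ only in the parentheses around the last two words, which mirrors the difference between the `text-orig-full` and `text-orig-plain` lines in the output above.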


## 3.3 - Transliterated text<a class="anchor" id="bullet3x3"></a>

Using transliterated text can be convenient for crafting queries, as it allows you to use your regular keyboard without needing to input Greek characters. The following example query is intended to retrieve all occurrences of the Greek conjunction 'δὲ'.

In [17]:
from collections import Counter

# Query on the transliterated form, so no Greek keyboard input is needed
LatinQuery = '''
word translit=de
'''
Result = N1904.search(LatinQuery)

# Count how often each Greek surface form occurs in the results
word_counts = Counter()
for result in Result:
    # Each result is a tuple of nodes; the first item is the word node
    word = F.text.v(result[0])
    word_counts[word] += 1

# Print the word frequency table, most frequent form first
print(f"{'Word':<20}{'Frequency'}")
print("-" * 30)
for word, freq in word_counts.most_common():
    print(f"{word:<20}{freq}")

  0.09s 2769 results
Word                Frequency
------------------------------
δὲ                  2620
δέ                  144
δὴ                  4
δή                  1


This example highlights the importance of careful use of transliteration. While the vast majority of the results match the expected word, 5 of the 2769 results (approximately 0.18% of the total) correspond to a different word, the emphatic particle δὴ. It has the same transliteration as δὲ because both ε and η are rendered as 'e'.
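
The share quoted above can be recomputed from the frequency table. The sketch below copies the counts from the output, strips accents with the standard `unicodedata` module, and groups the forms by their unaccented spelling. It is a stand-alone check that runs without Text-Fabric; `strip_accents` is an illustrative helper and not necessarily how the `unaccent` feature was produced.

```python
import unicodedata
from collections import Counter

# Counts copied from the frequency table above.
counts = {"δὲ": 2620, "δέ": 144, "δὴ": 4, "δή": 1}


def strip_accents(word):
    """Remove combining marks (accents, breathings) after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Sum the frequencies of all forms that share an unaccented spelling.
by_base = Counter()
for form, freq in counts.items():
    by_base[strip_accents(form)] += freq

total = sum(counts.values())
for base, freq in by_base.most_common():
    print(f"{base}: {freq} ({freq / total:.2%})")
```

The two groups add up to the 2769 results reported by the search, with the δη group accounting for the 5 unexpected hits.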

## 3.4 - Text with text critical markers<a class="anchor" id="bullet3x4"></a>

The base text of this Text-Fabric dataset is based upon the Nestle version of 1913, as explained on <a href="https://sites.google.com/site/nestle1904/faq" target="_blank">sites.google.com/site/nestle1904/faq</a>:

> *What are your sources?*
> For the text, I used the scanned books available at the Internet Archive (The first edition of 1904, and a reprinting from 1913 – the latter one has a better quality).

This version has a limited number of text-critical markers embedded in the base text. These are preserved in the text format 'text-orig-full', which can be printed using the following command.

In [19]:
T.text(383782,fmt='text-orig-full')

'Ἀρχὴ τοῦ εὐαγγελίου Ἰησοῦ Χριστοῦ (Υἱοῦ Θεοῦ). '

## 3.5 - Nestle version 1904 and version 1913 (Mark 1:1)<a class="anchor" id="bullet3x5"></a>

The previous result can be verified by examining the scans of the following printed versions:

* Nestle version 1904: <a href="https://archive.org/details/the-greek-new-testament-nestle-1904-us-edition/page/84/mode/2up" target="_blank">@ archive.org</a>
* Nestle version 1913: <a href="https://archive.org/details/hkainediathekete00lond/page/88/mode/1up" target="_blank">@ archive.org</a>

Or, in an image, placed side by side:

<img src="https://github.com/CenterBLC/N1904/blob/main/docs/tutorial/images/mark1v1_critical_marks.png?raw=true">

# 4 - Notebook version<a class="anchor" id="bullet4"></a>
##### [Back to TOC](#TOC)

<div style="float: left;">
  <table>
    <tr>
      <td><strong>Author</strong></td>
      <td>Tony Jurg</td>
    </tr>
    <tr>
      <td><strong>Version</strong></td>
      <td>1.0</td>
    </tr>
    <tr>
      <td><strong>Date</strong></td>
      <td>9 October 2024</td>
    </tr>
  </table>
</div>