# Identify punctuation marks (Nestle1904LFT)

## Table of contents <a class="anchor" id="TOC"></a>
* <a href="#bullet1">1 - Introduction</a>
* <a href="#bullet2">2 - Load Text-Fabric app and data</a>
* <a href="#bullet3">3 - Performing the queries</a>
    * <a href="#bullet3x1">3.1 - Frequency of punctuation marks in corpus</a>
    * <a href="#bullet3x2">3.2 - Explanation of the Regular Expression</a>

# 1 - Introduction <a class="anchor" id="bullet1"></a>

This Jupyter Notebook analyses the punctuation marks used in the corpus: it finds every word that ends in a punctuation mark and counts how often each mark occurs.

# 2 - Load Text-Fabric app and data <a class="anchor" id="bullet2"></a>
##### [Back to TOC](#TOC)

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
# Loading the Text-Fabric code
# Note: it is assumed Text-Fabric is installed in your environment.
from tf.fabric import Fabric
from tf.app import use

In [3]:
# load the app and data
N1904 = use("tonyjurg/Nestle1904LFT:latest", hoist=globals())

**Locating corpus resources ...**

   |     0.21s T otype                from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5
   |     2.46s T oslots               from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5
   |     0.61s T unicode              from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5
   |     0.48s T verse                from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5
   |     0.50s T chapter              from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5
   |     0.57s T wordtranslit         from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5
   |     0.60s T word                 from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5
   |     0.59s T normalized           from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5
   |     0.58s T wordunacc            from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5
   |     0.50s T book                 from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5
   |     0

Name,# of nodes,# slots / node,% coverage
book,27,5102.93,100
chapter,260,529.92,100
verse,7943,17.35,100
sentence,8011,17.2,100
wg,113447,7.58,624
word,137779,1.0,100


App config error(s) in wg:
	label: feature rule not loaded


# 3 - Performing the queries <a class="anchor" id="bullet3"></a>

## 3.1 - Frequency of punctuation marks in corpus <a class="anchor" id="bullet3x1"></a>
##### [Back to TOC](#TOC)

This code generates a table showing how often each punctuation mark follows a word in the Text-Fabric corpus. The search template selects every word node whose `word` feature ends in one of the punctuation marks `.`, `·`, `—`, `,` or `;`. The subsequent code takes the last character of each matching word, tallies the occurrences in a Python dictionary, and converts that dictionary into the table. Because the query works on the `word` feature, whose values have no space after the word, the punctuation mark is always the last character of the value.

In [5]:
# Library to format table
from tabulate import tabulate

# The actual query (section 3.2 explains the regular expression used here)
SearchPunctuations = '''
word word~([\.·—,;])$
'''
PunctuationList = N1904.search(SearchPunctuations)

ResultDict = {}
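for result in PunctuationList:  # each result is a tuple of nodes; avoid shadowing the built-in `tuple`
    node=result[0]
    # The punctuation mark is the last character of the word feature
    Punctuation=F.word.v(node)[-1]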
    # Check if this Punctuation already exists in ResultDict
    if Punctuation in ResultDict:
        # If it exists, add the count to the existing value
        ResultDict[Punctuation]+=1
    else:
        # If it doesn't exist, initialize the count as the value
        ResultDict[Punctuation]=1

# Convert the dictionary into a list of key-value pairs
TableData = [[key, value] for key, value in ResultDict.items()]

# Produce the table
headers = ["Punctuation","Frequency"]
print(tabulate(TableData, headers=headers, tablefmt='fancy_grid'))


  0.12s 18507 results
╒═══════════════╤═════════════╕
│ Punctuation   │   Frequency │
╞═══════════════╪═════════════╡
│ .             │        5712 │
├───────────────┼─────────────┤
│ ,             │        9441 │
├───────────────┼─────────────┤
│ ·             │        2355 │
├───────────────┼─────────────┤
│ ;             │         969 │
├───────────────┼─────────────┤
│ —             │          30 │
╘═══════════════╧═════════════╛
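
The tally in the loop above can also be sketched with `collections.Counter` from the Python standard library. The example below is a minimal illustration that runs without Text-Fabric. The helper `count_trailing_punctuation` and the sample words are hypothetical: they stand in for the values that `F.word.v(node)` returns in the notebook and are not taken from the query results.

```python
from collections import Counter

# The same five punctuation marks that the search template looks for
PUNCTUATION = ".·—,;"


def count_trailing_punctuation(words):
    """Count the last character of every word that ends in a punctuation mark."""
    return Counter(w[-1] for w in words if w and w[-1] in PUNCTUATION)


# Illustrative sample only (an assumption, not corpus output);
# in the notebook these strings would come from F.word.v(node)
sample_words = ["λόγος,", "θεόν.", "ἦν", "ἀρχῇ·", "τίς;", "λόγος,"]

counts = count_trailing_punctuation(sample_words)

# most_common() lists the marks from most to least frequent
for mark, frequency in counts.most_common():
    print(mark, frequency)
```

`Counter` removes the explicit "already in the dictionary?" check, and `most_common()` gives a frequency-sorted list that could be passed to `tabulate` in place of `TableData`.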


## 3.2 - Explanation of the Regular Expression <a class="anchor" id="bullet3x2"></a>
##### [Back to TOC](#TOC)

The regular expression `[\.·—,;]$` matches a single character from the set `.`, `·`, `—`, `,` and `;`. The `$` anchor requires that character to be the last one in the string, so the condition is satisfied only when a word node ends in one of these marks. The parentheses around the character set in the query form a group and do not change what is matched. Without the `$` anchor the pattern could match anywhere in the value, which would produce false positives: 16 word nodes start with the character `—`.
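
The effect of the anchor can be checked with Python's `re` module. This is a small sketch that assumes the pattern is applied with `re.search`, so an unanchored pattern may match anywhere in the string. The sample words are illustrative and not taken from the corpus.

```python
import re

# Pattern with the end-of-string anchor, and the same pattern without it
anchored = re.compile(r"[.·—,;]$")
unanchored = re.compile(r"[.·—,;]")

# Illustrative words (assumptions): a leading dash, a trailing full stop, no punctuation
sample_words = ["—καὶ", "λόγος.", "θεός"]

for word in sample_words:
    ends_with_mark = anchored.search(word) is not None
    contains_mark = unanchored.search(word) is not None
    print(f"{word!r}: anchored={ends_with_mark}, unanchored={contains_mark}")
```

Only the word with a trailing mark satisfies the anchored pattern, while the word that starts with `—` is matched by the unanchored pattern alone. That is the kind of false positive the `$` anchor prevents.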