# Identifying 'odd' characters for feature 'after' (N1904LFT)

## Table of contents <a class="anchor" id="TOC"></a>
* <a href="#bullet1">1 - Introduction</a>
* <a href="#bullet2">2 - Load Text-Fabric app and data</a>
* <a href="#bullet3">3 - Performing the queries</a>
    * <a href="#bullet3x1">3.1 - Showing the issue</a>
    * <a href="#bullet3x2">3.2 - Setting up a query to find them</a>
    * <a href="#bullet3x3">3.3 - Explanation of the regular expression</a>
    * <a href="#bullet3x4">3.4 - Bug</a>

# 1 - Introduction <a class="anchor" id="bullet1"></a>
##### [Back to TOC](#TOC)

This Jupyter Notebook investigates the presence of 'odd' values for the feature 'after'.

# 2 - Load Text-Fabric app and data <a class="anchor" id="bullet2"></a>
##### [Back to TOC](#TOC)

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
# Load the Text-Fabric code for the New Testament corpus
# Note: it is assumed Text-Fabric is installed in your environment.

from tf.fabric import Fabric
from tf.app import use

In [4]:
# load the app and data
N1904 = use("tonyjurg/Nestle1904LFT:latest", hoist=globals())

**Locating corpus resources ...**

The requested data is not available offline
	~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3 not found


   |     0.30s T otype                from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
   |     3.07s T oslots               from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
   |     0.01s T book                 from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
   |     0.58s T chapter              from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
   |     0.70s T word                 from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
   |     0.57s T after                from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
   |     0.57s T verse                from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
   |      |     0.08s C __levels__           from otype, oslots, otext
   |      |     1.79s C __order__            from otype, oslots, __levels__
   |      |     0.08s C __rank__             from otype, __order__
   |      |     4.63s C __levUp__            from otype, oslots, __rank__
   |      |     2.7

| Name | # of nodes | # slots/node | % coverage |
|---|---|---|---|
| book | 27 | 5102.93 | 100 |
| chapter | 260 | 529.92 | 100 |
| verse | 7943 | 17.35 | 100 |
| sentence | 12160 | 11.33 | 100 |
| wg | 132460 | 6.59 | 633 |
| word | 137779 | 1.0 | 100 |


# 3 - Performing the queries <a class="anchor" id="bullet3"></a>
##### [Back to TOC](#TOC)

## 3.1 - Showing the issue <a class="anchor" id="bullet3x1"></a>
##### [Back to TOC](#TOC)

The frequency list below shows a few 'odd' values for the feature 'after': besides a space and punctuation marks, it also contains Greek letters.

In [31]:
result = F.after.freqList()
print ('frequency: {0}'.format(result))

frequency: ((' ', 119271), (',', 9443), ('.', 5717), ('·', 2355), (';', 970), ('—', 7), ('ε', 3), ('ς', 3), ('ὶ', 2), ('ί', 1), ('α', 1), ('ι', 1), ('χ', 1), ('ἱ', 1), ('ὁ', 1), ('ὰ', 1), ('ὸ', 1))
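The odd values can also be separated from the regular ones in plain Python. The sketch below works on a copy of the frequency list printed above, so it runs without Text-Fabric. The set `expected` is an assumption: it treats the space and the five punctuation marks in that list as the only legitimate values.

```python
# Frequency list copied from the output above: (value of 'after', count)
frequency = ((' ', 119271), (',', 9443), ('.', 5717), ('·', 2355), (';', 970),
             ('—', 7), ('ε', 3), ('ς', 3), ('ὶ', 2), ('ί', 1), ('α', 1),
             ('ι', 1), ('χ', 1), ('ἱ', 1), ('ὁ', 1), ('ὰ', 1), ('ὸ', 1))

# Assumption: only these values are legitimate for feature 'after'
expected = {' ', ',', '.', '·', ';', '—'}

# Keep every entry whose value is not one of the expected ones
odd = [(value, count) for value, count in frequency if value not in expected]

# Number of word nodes affected by an odd value
odd_total = sum(count for _, count in odd)

print(odd)
print(odd_total)
```

The total should equal the number of results returned by the query in the next section.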


## 3.2 - Setting up a query to find them <a class="anchor" id="bullet3x2"></a>
##### [Back to TOC](#TOC)

In [51]:
# Library to format table
from tabulate import tabulate

# The actual query (raw string, so the backslashes reach the regex engine unchanged)
SearchOddAfters = r'''
word after~^(?!([\s\.·—,;]))
'''
OddAfterList = N1904.search(SearchOddAfters)

# Postprocess the query results
Results = []
for result_tuple in OddAfterList:  # avoid shadowing the built-in name 'tuple'
    node = result_tuple[0]
    location = "{} {}:{}".format(F.book.v(node), F.chapter.v(node), F.verse.v(node))
    result = (location, F.word.v(node), F.after.v(node))
    Results.append(result)
      
# Produce the table
headers = ["location","word","after"]
print(tabulate(Results, headers=headers, tablefmt='fancy_grid'))

  0.11s 16 results
╒═════════════════════╤══════════════╤═════════╕
│ location            │ word         │ after   │
╞═════════════════════╪══════════════╪═════════╡
│ Luke 23:51          │ —οὗτο        │ ς       │
├─────────────────────┼──────────────┼─────────┤
│ Luke 2:35           │ —κα          │ ὶ       │
├─────────────────────┼──────────────┼─────────┤
│ John 4:2            │ —καίτοιγ     │ ε       │
├─────────────────────┼──────────────┼─────────┤
│ John 7:22           │ —οὐ          │ χ       │
├─────────────────────┼──────────────┼─────────┤
│ Acts 22:2           │ —ἀκούσαντε   │ ς       │
├─────────────────────┼──────────────┼─────────┤
│ Romans 15:25        │ —νυν         │ ὶ       │
├─────────────────────┼──────────────┼─────────┤
│ I_Corinthians 9:15  │ —τ           │ ὸ       │
├─────────────────────┼──────────────┼─────────┤
│ II_Corinthians 12:2 │ —ἁρπαγέντ    │ α       │
├─────────────────────┼──────────────┼─────────┤
│ II_Corinthians 12:2 │ —εἴτ         │ ε       │
├

## 3.3 - Explanation of the regular expression <a class="anchor" id="bullet3x3"></a>
##### [Back to TOC](#TOC)

The regular expression broken down in its components:

`^`: This symbol is called a caret and represents the start of a string. It ensures that the following pattern is applied at the beginning of the string.

`(?!...)`: This is a negative lookahead assertion. It checks if the pattern inside the parentheses does not match at the current position.

`[…]`: This denotes a character class, which matches any single character that is within the brackets.

`[\s\.·—,;]`: This is the character class used in the query. It contains the following characters:

* `\s`: This is a shorthand character class that matches any whitespace character, including spaces, tabs, and newlines.
* `\.`: This matches a literal period (dot).
* `·`: This matches a specific Unicode character, which is a middle dot.
* `—`: This matches an em dash character.
* `,`: This matches a comma.
* `;`: This matches a semicolon.

In summary, the character class `[\s\.·—,;]` matches any single character that is either a whitespace character, a period, a middle dot, an em dash, a comma, or a semicolon. The extra pair of parentheses around the character class in the query only groups it and does not change what is matched.

The regular expression therefore selects every value of 'after' that does not start with a whitespace character, period, middle dot, em dash, comma, or semicolon.

The following site can be used to build and verify a regular expression: [regex101.com](https://regex101.com/) (choose the 'Python' flavor).
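The pattern can also be tried locally with Python's `re` module. The sketch below assumes that the `~` condition in a Text-Fabric query behaves like `re.search` applied to the feature value. The list `samples` is illustrative and uses values taken from the frequency list in section 3.1.

```python
import re

# Same pattern as in the query, written as a raw string
pattern = re.compile(r'^(?!([\s\.·—,;]))')

# Six regular values followed by three odd ones
samples = [' ', ',', '.', '·', ';', '—', 'ς', 'ὶ', 'ε']

# A value is 'odd' when the negative lookahead succeeds at the start of the string
odd = [value for value in samples if pattern.search(value)]

print(odd)
```

Only the three Greek letters should be reported. Note that an empty string would match as well, because the lookahead only rules out the listed first characters.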

## 3.4 - Bug <a class="anchor" id="bullet3x4"></a>
##### [Back to TOC](#TOC)

The observed behaviour is due to a bug in the transformation to XML LowFat Tree data: when the text of a node starts with punctuation, the `@after` attribute contains the last character of the word. The bug was reported as [issue #76](https://github.com/Clear-Bible/macula-greek/issues/76) in the macula-greek issue tracker.
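Until the source data is corrected, the affected words could be pieced together again on the Python side. The following is only a sketch under two assumptions: the leading punctuation does not belong to the word itself, and the stray letter in 'after' is the word's final character. The helper `reconstruct` is hypothetical and is not part of Text-Fabric or the dataset. The rows are copied from the result table in section 3.2.

```python
import unicodedata

# Rows copied from the result table: (location, word, after)
rows = [
    ("Luke 23:51", "—οὗτο", "ς"),
    ("Luke 2:35", "—κα", "ὶ"),
    ("John 7:22", "—οὐ", "χ"),
]

def starts_with_punctuation(text):
    """True if the first character is in a Unicode punctuation category (P*)."""
    return bool(text) and unicodedata.category(text[0]).startswith("P")

def reconstruct(word, after):
    """Hypothetical repair: drop leading punctuation, re-attach the stray letter."""
    stripped = word
    while starts_with_punctuation(stripped):
        stripped = stripped[1:]
    return stripped + after

# Only repair rows that show the symptom described above
repaired = [
    (location, reconstruct(word, after))
    for location, word, after in rows
    if starts_with_punctuation(word) and after.isalpha()
]

for location, word in repaired:
    print(location, word)
```

The sketch does not attempt to recover the correct value of 'after' for these nodes, since that information is not present in the affected features.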