# Preparing participant-tracking data set for analysis

*Christian Højgaard Jensen, chj@dbi.edu*

This notebook contains the scripts necessary for preparing [Eep Talstra's dataset](https://github.com/ch-jensen/Talstra-participant-tracking/blob/master/lev17to26.PredFrCSV) for analysis. The file resembles a semicolon-separated CSV file but needs to be stripped of superfluous white space.

The major part of the notebook maps the dataset to the clause atom nodes and word nodes of the [ETCBC database of the Hebrew Bible](https://github.com/ETCBC/bhsa). The mapping makes it possible to validate the quality of the dataset and to combine it with the data of the ETCBC database.

In [2]:
import os, sys
import csv, re
import pandas as pd
import copy
import pprint as pp

## 1. Importing the dataset

In [3]:
filename = 'lev17to26.PredFrCSV'

new_dict = {}

n = 0
with open(filename) as f:
    next(f)
    reader = csv.reader(f, delimiter=';')
    for r in reader:
        if r[0].strip(' ') != 'PTC':
            ref = r[0].strip(' ')
            surface_text = r[1].lstrip(' ').rstrip(' ')
            book = r[2].strip(' ')
            chapter = r[3].strip(' ')
            verse = r[4].strip(' ')
            line = r[5].strip(' ')
            pred = r[6].strip(' ')
            VPhr = r[7].lstrip(' ').rstrip(' ')
            ptc_lex = r[8].lstrip(' ').rstrip(' ')
            ptc_actor = r[9].lstrip(' ').rstrip(' ')
            first_lex = r[10].strip(' ')
            last_lex = r[11].strip(' ')
            const_parsing = r[12].lstrip(' ').rstrip(' ') #Constituent parsing
            n+=1

            new_dict[n] = [ref, surface_text, book, chapter, verse, line, pred, VPhr,
                           ptc_lex, ptc_actor, first_lex, last_lex, const_parsing]

print(f'Length of dataset: {len(new_dict)}')

Length of dataset: 4092
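
The import cell above strips every field one by one. As a compact illustration of the same idea, the sketch below strips all fields of a row in a single list comprehension. The helper name `read_stripped` and the three-column sample are hypothetical and only stand in for the real file; the row filter mirrors the `'PTC'` check of the cell above.

```python
import csv
import io

# Hypothetical three-column sample in the style of the input file.
SAMPLE = (
    "ref ; text ; book\n"
    "  1 ; JDBR ; leviticus \n"
    "PTC ; text ; book\n"
    "  2 ; JHWH ; leviticus \n"
)


def read_stripped(handle, delimiter=';', skip_value='PTC'):
    """Return {row number: list of fields without surrounding spaces}."""
    reader = csv.reader(handle, delimiter=delimiter)
    next(reader)  # skip the first line, as the cell above does
    rows = {}
    for raw in reader:
        fields = [field.strip(' ') for field in raw]
        if fields[0] == skip_value:
            continue  # rows whose first field equals skip_value are dropped
        rows[len(rows) + 1] = fields
    return rows


rows = read_stripped(io.StringIO(SAMPLE))
print(rows)
```

Keeping the rows as lists preserves the positional indexing used throughout the rest of the notebook.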


Sample of dataset:

In [4]:
data = pd.DataFrame.from_dict(new_dict).T
data.columns = ['ref','surface text', 'book','chapter','verse','line','pred','ref lex', 'part. set', 'actor', 'first slot',
                'last slot', 'func']
data[:10]

| | ref | surface text | book | chapter | verse | line | pred | ref lex | part. set | actor | first slot | last slot | func |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | JDBR | leviticus | 17 | 1 | 1 | DBR | DBR | 3sm=JHWH | JHWH | 2 | 2 | VbPred |
| 2 | 2 | JHWH | leviticus | 17 | 1 | 1 | DBR | JHWH | 3sm=JHWH | JHWH | 3 | 3 | Subj |
| 3 | 3 | >L MCH | leviticus | 17 | 1 | 1 | DBR | >L MCH | 0sm=MCH | MCH | 4 | 5 | Compl1 |
| 4 | 4 | L->MR | leviticus | 17 | 1 | 2 | >MR | L >MR | 3sm=JHWH | JHWH | 1 | 2 | VbPred |
| 5 | 5 | DBR | leviticus | 17 | 2 | 3 | DBR | DBR | 2sm= | MCH | 1 | 1 | VbPred |
| 6 | 6 | >L >HRN W->L BNJW W->L KL BNJ JFR>L | leviticus | 17 | 2 | 3 | DBR | >L >HRN W >L BN+S W >L KL BN JFR>L | 3pm=>HRN BN+>HRN FR>L | >HRN BN >HRN | 2 | 11 | Compl1 |
| 7 | 7 | >L >HRN W->L BNJW | leviticus | 17 | 2 | 3 | DBR | >L >HRN W >L BN+S | ... | ... | 2 | 6 | -paral |
| 8 | 8 | >L >HRN | leviticus | 17 | 2 | 3 | DBR | >L >HRN | 3sm=>HRN | >HRN | 2 | 3 | -paral |
| 9 | 9 | >L BNJW | leviticus | 17 | 2 | 3 | DBR | >L BN+312 | ... | ... | 5 | 6 | -paral |
| 10 | 10 | sfx:W | leviticus | 17 | 2 | 3 | DBR | sfx | 3sm=>HRN | >HRN | 6 | 6 | -gentf |


## 2. Verifying chapters

Does the dataset contain all chapters (chapters 17-26 in Leviticus)?

In [5]:
chapters = set()
for k in new_dict:
    chapters.add(new_dict[k][3])
    
print(sorted(chapters))

['17', '18', '19', '20', '21', '22', '23', '24', '25', '26']


## 3. Clause and word mapping

Having successfully imported the dataset and converted it to a dictionary, the columns "line", "first slot" and "last slot" need to be mapped to the absolute clause atom nodes and word nodes of the ETCBC database.

The Python package [text-fabric](https://dans-labs.github.io/text-fabric/) is imported. Text-fabric is a representation of the database and comes with neat functions for navigating the database and extracting data.

In [6]:
from tf.fabric import Fabric
from tf.extra.bhsa import Bhsa

BHSA = 'etcbc/bhsa/tf/c'
TF = Fabric(modules = BHSA)

This is Text-Fabric 6.2.1
Api reference : https://dans-labs.github.io/text-fabric/Api/General/
Tutorial      : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb
Example data  : https://github.com/Dans-labs/text-fabric-data

114 features found and 0 ignored


In [7]:
api = TF.load('''
    typ function 
    sp lex
''', silent=True)
api.loadLog()
api.makeAvailableIn(globals())

   |     0.08s B otype                from C:/Users/Ejer/text-fabric-data/etcbc/bhsa/tf/c
   |     0.68s B oslots               from C:/Users/Ejer/text-fabric-data/etcbc/bhsa/tf/c
   |     0.03s B book                 from C:/Users/Ejer/text-fabric-data/etcbc/bhsa/tf/c
   |     0.02s B chapter              from C:/Users/Ejer/text-fabric-data/etcbc/bhsa/tf/c
   |     0.02s B verse                from C:/Users/Ejer/text-fabric-data/etcbc/bhsa/tf/c
   |     0.17s B g_cons               from C:/Users/Ejer/text-fabric-data/etcbc/bhsa/tf/c
   |     0.24s B g_cons_utf8          from C:/Users/Ejer/text-fabric-data/etcbc/bhsa/tf/c
   |     0.17s B g_lex                from C:/Users/Ejer/text-fabric-data/etcbc/bhsa/tf/c
   |     0.24s B g_lex_utf8           from C:/Users/Ejer/text-fabric-data/etcbc/bhsa/tf/c
   |     0.17s B g_word               from C:/Users/Ejer/text-fabric-data/etcbc/bhsa/tf/c
   |     0.26s B g_word_utf8          from C:/Users/Ejer/text-fabric-data/etcbc/bhsa/tf/c
   |     0

[('computed-data', ('C Computed', 'Call AllComputeds', 'Cs ComputedString')),
 ('edge-features', ('E Edge', 'Eall AllEdges', 'Es EdgeString')),
 ('loading', ('TF', 'ensureLoaded', 'ignored', 'loadLog')),
 ('locality', ('L Locality',)),
 ('messaging', ('cache', 'error', 'indent', 'info', 'reset')),
 ('navigating-nodes', ('N Nodes', 'sortKey', 'otypeRank', 'sortNodes')),
 ('node-features', ('F Feature', 'Fall AllFeatures', 'Fs FeatureString')),
 ('searching', ('S Search',)),
 ('text', ('T Text',))]

In [8]:
B = Bhsa(api, 'search', version="c")

**Documentation:** <a target="_blank" href="https://etcbc.github.io/bhsa" title="provenance of this corpus">BHSA</a> <a target="_blank" href="https://etcbc.github.io/bhsa/features/hebrew/c/0_home.html" title="BHSA feature documentation">Feature docs</a> <a target="_blank" href="https://dans-labs.github.io/text-fabric/Api/Bhsa/" title="BHSA API documentation">BHSA API</a> <a target="_blank" href="https://dans-labs.github.io/text-fabric/Api/General/" title="text-fabric-api">Text-Fabric API 6.2.1</a> <a target="_blank" href="https://dans-labs.github.io/text-fabric/Api/General/#search-templates" title="Search Templates Introduction and Reference">Search Reference</a>

### 3.1 Clause mapping

The line numbers in the data set correspond to clause atoms but are relative to the chapter they appear in. To map these relative line numbers to the absolute clause atom nodes of the ETCBC database, we first extract the first clause atom of the chapter in which the line appears. This first clause atom is then used as the basis for calculating the clause atom node of the line itself.
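
The offset calculation described above can be tried out without Text-Fabric. The sketch below assumes, as the notebook does, that clause atom nodes are numbered consecutively within a chapter. `absolute_node` is a hypothetical helper, and the node number 528163 is read off the mapped output shown later in this notebook, where line 1 of Leviticus 17 receives that clause atom.

```python
def absolute_node(first_node, relative_position):
    """Convert a 1-based relative position into an absolute node number."""
    if relative_position < 1:
        raise ValueError('relative positions start at 1')
    return first_node + relative_position - 1


# First clause atom of Leviticus 17, as it appears in the mapped output.
FIRST_CLAUSE_ATOM_LEV_17 = 528163

line_1 = absolute_node(FIRST_CLAUSE_ATOM_LEV_17, 1)
line_3 = absolute_node(FIRST_CLAUSE_ATOM_LEV_17, 3)
print(line_1, line_3)
```

The `- 1` is needed because the line numbers in the data set start at 1, so line 1 must map onto the first clause atom itself.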

In [9]:
clause_map_dict = copy.deepcopy(new_dict)

for l in clause_map_dict:
    book = clause_map_dict[l][2].capitalize()
    chapter = int(clause_map_dict[l][3])
    first_clause_atom = L.d(T.nodeFromSection((book, chapter,)), 'clause_atom')[0]
    
    line = int(clause_map_dict[l][5])
    clause_atom = first_clause_atom + line - 1
    
    verse = T.sectionFromNode(clause_atom)[2] #The clause atom is used to extract the precise verse
    
    clause_map_dict[l][2] = book #The book column is updated with capitalized book names
    clause_map_dict[l][4] = str(verse)
    clause_map_dict[l][5] = str(clause_atom)

In [10]:
# clause_map_dict

### 3.2. Verifying the clause mapping

To verify the clause mapping, each clause is matched against its predicate (if any). It is assumed that if the predicate lexeme deduced from the clause number matches the predicate lexeme in the dictionary, the clause number has been assigned correctly.

If the clause is a nominal clause, no predicate will be found, but also in these cases there will be a match, since the predicate lexeme in the dictionary would also be empty.
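
The comparison relies on removing trailing marker characters from the ETCBC lexeme before it is compared with the data set. A minimal sketch of that normalisation follows. The helper names are hypothetical, and the sample strings `'DBR['` and `'XYZ=['` are assumptions about what such lexeme values can look like, chosen only to exercise both `rstrip` calls.

```python
def normalise_lexeme(lex):
    """Strip a trailing '[' and then any trailing '=' from a lexeme string."""
    return lex.rstrip('[').rstrip('=')


def predicate_matches(etcbc_lex, dataset_pred):
    """Compare a normalised ETCBC lexeme with the predicate of the data set."""
    return normalise_lexeme(etcbc_lex) == dataset_pred


plain = normalise_lexeme('DBR[')      # assumed verbal lexeme with marker
marked = normalise_lexeme('XYZ=[')    # made-up lexeme with an extra '='
print(plain, marked)
print(predicate_matches('DBR[', 'DBR'))
```

The order of the two `rstrip` calls matters: the `'='` can only be removed once the final `'['` is gone.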

In [11]:
error_count = 0
clause_errors = []

for l in clause_map_dict:
    clause = int(clause_map_dict[l][5])
    pred = clause_map_dict[l][6]
    
    for ph in L.d(clause, 'phrase'):
        if F.function.v(ph) == "Pred":
            for w in L.d(ph, 'word'):
                if F.sp.v(w) == 'verb':
                    lex = F.lex.v(w).rstrip('[').rstrip('=')
                    if lex != pred:
                        error_count += 1
                        clause_errors.append(l)
                        
print(f'Error count: {error_count} mismatch(es) have been identified')

Error count: 28 mismatch(es) have been identified


The 28 mismatches can be explained by the fact that we only compare the verbal lexeme extracted from the predicate phrase of the ETCBC database with the entire predicate phrase of the data set. In some cases, for instance with infinitive constructs, the predicate phrase of the data set is a compound of both verbs and prepositions. This phenomenon prompts a mismatch, as we see below.

In [12]:
def show_clause_errors(errors):
    for k in errors:
        print(f'line {k}: predicate lexeme in data set: {clause_map_dict[k][6]}')
        clause_atom = int(clause_map_dict[k][5])
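        phrases = L.d(clause_atom, 'phrase')
        VP = None  # reset per clause atom so that no stale phrase is highlighted
        for ph in phrases:
            if F.typ.v(ph) == 'VP':
                VP = ph
        # Highlight the verb phrase only if the clause atom contains one
        highlights = {VP: 'gold'} if VP is not None else {}
        B.pretty(clause_atom, withNodes=True, highlights=highlights)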
        
show_clause_errors(clause_errors[:10])

line 459: predicate lexeme in data set: L YRR


line 578: predicate lexeme in data set: L BLT <


line 579: predicate lexeme in data set: L BLT <


line 580: predicate lexeme in data set: L BLT <


line 675: predicate lexeme in data set: L QYR


line 1031: predicate lexeme in data set: LM<N VM


line 1032: predicate lexeme in data set: LM<N VM


line 1033: predicate lexeme in data set: LM<N VM


line 1049: predicate lexeme in data set: L BLT M


line 1050: predicate lexeme in data set: L BLT M
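
Several of the listed predicates consist of more than one item, such as `L YRR`. One possible way to tolerate such compounds is to test whether the verbal lexeme occurs among the space-separated items of the data set predicate. This is a sketch of an alternative check, not the comparison used in this notebook; `lexeme_in_predicate` is a hypothetical helper, and `'YRR'` as the corresponding ETCBC verbal lexeme is an assumption made for illustration.

```python
def lexeme_in_predicate(verb_lexeme, dataset_pred):
    """True if the verbal lexeme is one of the space-separated predicate items."""
    return verb_lexeme in dataset_pred.split()


compound = lexeme_in_predicate('YRR', 'L YRR')    # assumed lexeme, compound predicate
simple = lexeme_in_predicate('YRR', 'YRR')        # plain predicate
other = lexeme_in_predicate('YRR', 'L BLT <')     # no shared item
print(compound, simple, other)
```

Such a check would still report predicates in which the verb itself does not appear in full, so the remaining cases would need manual inspection.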


### 3.3. Word mapping

The data set has two columns that contain the positions of the first and the last word of the participant reference, respectively. These word positions are relative to the clause atom within which they occur and the absolute word node of these positions can therefore be calculated. The first step is to find the first word node of the clause atom node that we have previously determined (see above under clause mapping). This first word node is then used as the basis for calculating the word nodes of the first and the last word position:
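
The word calculation can be sketched in isolation, with an added sanity check that the relative positions stay inside the clause atom. `map_word_positions` is a hypothetical helper. The numbers are inferred from the mapped output shown further down: clause atom 528163 appears to start at word node 63008 and to contain five words, since the next clause atom starts at 63013. Within the notebook, the clause length could presumably be obtained with `len(L.d(clause_atom, 'word'))`.

```python
def map_word_positions(first_word_node, clause_length, first_pos, last_pos):
    """Map 1-based word positions within a clause atom to absolute word nodes."""
    if not 1 <= first_pos <= last_pos <= clause_length:
        raise ValueError(
            f'positions {first_pos}-{last_pos} fall outside a clause atom '
            f'of {clause_length} word(s)'
        )
    return first_word_node + first_pos - 1, first_word_node + last_pos - 1


# Row 3 of the data set: relative positions 4 and 5 in clause atom 528163.
mapped = map_word_positions(63008, 5, 4, 5)
print(mapped)
```

A failing range check would point to either a wrong clause atom or a wrong word position before any linguistic comparison is made.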

In [13]:
word_map_dict = copy.deepcopy(clause_map_dict)

for l in word_map_dict:
    clause_atom = int(word_map_dict[l][5])
    first_w = L.d(clause_atom, 'word')[0]
    
    first_lex = int(word_map_dict[l][10])
    last_lex = int(word_map_dict[l][11])
    
    first_word = first_w + first_lex - 1
    last_word = first_w + last_lex - 1
    
    word_map_dict[l][10] = str(first_word)
    word_map_dict[l][11] = str(last_word)

In [14]:
word_map_dict

{1: ['1',
  'JDBR',
  'Leviticus',
  '17',
  '1',
  '528163',
  'DBR',
  'DBR',
  '3sm=JHWH',
  'JHWH',
  '63009',
  '63009',
  'VbPred'],
 2: ['2',
  'JHWH',
  'Leviticus',
  '17',
  '1',
  '528163',
  'DBR',
  'JHWH',
  '3sm=JHWH',
  'JHWH',
  '63010',
  '63010',
  'Subj'],
 3: ['3',
  '>L MCH',
  'Leviticus',
  '17',
  '1',
  '528163',
  'DBR',
  '>L MCH',
  '0sm=MCH',
  'MCH',
  '63011',
  '63012',
  'Compl1'],
 4: ['4',
  'L->MR',
  'Leviticus',
  '17',
  '1',
  '528164',
  '>MR',
  'L >MR',
  '3sm=JHWH',
  'JHWH',
  '63013',
  '63014',
  'VbPred'],
 5: ['5',
  'DBR',
  'Leviticus',
  '17',
  '2',
  '528165',
  'DBR',
  'DBR',
  '2sm=',
  'MCH',
  '63015',
  '63015',
  'VbPred'],
 6: ['6',
  '>L >HRN W->L BNJW W->L KL BNJ JFR>L',
  'Leviticus',
  '17',
  '2',
  '528165',
  'DBR',
  '>L >HRN W >L BN+S W >L KL BN JFR>L',
  '3pm=>HRN BN+>HRN FR>L',
  '>HRN BN >HRN',
  '63016',
  '63025',
  'Compl1'],
 7: ['7',
  '>L >HRN W->L BNJW',
  'Leviticus',
  '17',
  '2',
  '528165',
  'DBR',


### 3.4. Verifying word mapping

One way to check the accuracy of the word mapping is to cross-check the word nodes assigned to the data set against the function of the phrases in which they occur, in the data set and in the ETCBC database respectively. Only the most frequent and important phrase functions have been selected for this test, namely predicate phrases (including predicates with object or subject suffix), subjects (including predicates with subject suffix), and predicate complements.
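
The correspondence between the function labels of the data set and those of the ETCBC database can also be written as a lookup table. The sketch below is an alternative formulation of the test, with a hypothetical helper name. Unlike the cell that follows, it reports a row once even if both its first and its last word disagree. The label `'Cmpl'` in the last call is used only as an illustrative string.

```python
# Data set function label -> ETCBC phrase functions accepted as a match.
ACCEPTED = {
    'VbPred': {'Pred', 'PreO', 'PreS'},
    'Subj': {'Subj', 'PreS'},
    'PrCompl': {'PreC'},
}


def function_mismatch(dataset_func, first_word_func, last_word_func):
    """True if either word lies in a phrase whose function is not accepted."""
    accepted = ACCEPTED.get(dataset_func)
    if accepted is None:
        return False  # functions outside the table are not tested
    return first_word_func not in accepted or last_word_func not in accepted


agree = function_mismatch('Subj', 'Subj', 'PreS')
disagree = function_mismatch('Subj', 'PreC', 'PreC')
untested = function_mismatch('Compl1', 'Cmpl', 'Cmpl')
print(agree, disagree, untested)
```

Adding a further function to the test then only requires a new entry in the table.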

In [16]:
error_count = 0
word_errors = []

for l in word_map_dict:
    first_word_func = F.function.v(L.u(int(word_map_dict[l][10]), 'phrase')[0])
    last_word_func = F.function.v(L.u(int(word_map_dict[l][11]), 'phrase')[0])
    func = word_map_dict[l][12]
    
    if func == 'VbPred':
        if first_word_func not in ['Pred','PreO','PreS']:
            error_count += 1
            word_errors.append(l)
        if last_word_func not in ['Pred','PreO','PreS']:
            error_count += 1
            word_errors.append(l)
            
    elif func == 'Subj':
        if first_word_func not in ['Subj', 'PreS']:
            error_count += 1
            word_errors.append(l)
        if last_word_func not in ['Subj', 'PreS']:
            error_count += 1
            word_errors.append(l)
            
    elif func == 'PrCompl':
        if first_word_func != 'PreC':
            error_count += 1
            word_errors.append(l)
        if last_word_func != 'PreC':
            error_count += 1
            word_errors.append(l)
    
print(f'Error count: {error_count} mismatch(es) have been identified')

Error count: 4 mismatch(es) have been identified


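The 4 mismatches stem from two rows (1439 and 1904), each counted once for its first word and once for its last word. They do not indicate that the mapping is inaccurate. Rather, as shown below, the functions of the phrases in question have been interpreted differently in the two sources, which causes the mismatch.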

In [18]:
def show_word_errors(errors):
    for k in errors:
        print(f'line {k}: Function in data set: {word_map_dict[k][12]}')
        clause_atom = int(word_map_dict[k][5])
        B.pretty(clause_atom, withNodes=True, highlights = {int(word_map_dict[k][10]):'hotpink',
                                                            int(word_map_dict[k][11]):'hotpink'})
        
show_word_errors(word_errors)

line 1439: Function in data set: Subj


line 1439: Function in data set: Subj


line 1904: Function in data set: PrCompl


line 1904: Function in data set: PrCompl


## 4. Verifying cell lengths

In earlier versions of the dataset, the individual cells had deliberately been capped at a maximum length. This posed a problem for further analysis of the dataset because values were missing.

Therefore, it is necessary to check whether the cells have the right lengths. The most important cells in this regard are those in the columns containing participant references.

The procedure is to sort the participant references according to length and manually check whether the longest references are intact:
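
The sorting step can be sketched on a small hypothetical sample. In addition to the sort, the sketch counts how many values share the maximum length; a large count might hint at a length cap, but this is only a heuristic added here for illustration and is not part of the original procedure. Both helper names and the sample rows are made up.

```python
from collections import Counter


def longest_values(rows, column, n=10):
    """Return (key, value) pairs for the n longest values of a column."""
    keys = sorted(rows, key=lambda k: len(rows[k][column]), reverse=True)
    return [(k, rows[k][column]) for k in keys[:n]]


def count_at_max_length(rows, column):
    """Return (maximum length, number of values having that length)."""
    lengths = Counter(len(row[column]) for row in rows.values())
    longest = max(lengths)
    return longest, lengths[longest]


# Hypothetical one-column sample.
sample = {1: ['JHWH'], 2: ['>HRN BN >HRN'], 3: ['MCH'], 4: ['BN JFR>L >JC']}
top_two = longest_values(sample, 0, n=2)
max_info = count_at_max_length(sample, 0)
print(top_two, max_info)
```

The manual inspection of the longest references remains the actual test; the count only helps to decide where to look.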

In [16]:
actor_lex_sorted = sorted(word_map_dict, key=lambda k: len(word_map_dict[k][9]), reverse=True)

for k in actor_lex_sorted[:10]:
    clause_atom = int(word_map_dict[k][5])
    print('''Clause atom: {}\nParticipant ref: {}'''.format(clause_atom, word_map_dict[k][9]))
    B.shbLink(clause_atom)
    B.pretty(clause_atom)

Clause atom: 528769
Participant ref: <WRT CBWR XRWY JBLT GRB JLPT


Clause atom: 528770
Participant ref: <WRT CBWR XRWY JBLT GRB JLPT


Clause atom: 528771
Participant ref: <WRT CBWR XRWY JBLT GRB JLPT


Clause atom: 528777
Participant ref: <WRT CBWR XRWY JBLT GRB JLPT


Clause atom: 528778
Participant ref: <WRT CBWR XRWY JBLT GRB JLPT


Clause atom: 528778
Participant ref: <WRT CBWR XRWY JBLT GRB JLPT


Clause atom: 528779
Participant ref: <WRT CBWR XRWY JBLT GRB JLPT


Clause atom: 528834
Participant ref: <MR R>CJT QYJR BN "YOUPlmas"


Clause atom: 528835
Participant ref: <MR R>CJT QYJR BN "YOUPlmas"


Clause atom: 528836
Participant ref: <MR R>CJT QYJR BN "YOUPlmas"


We see that the longest participant reference occurs in Lev 22:22 and is also referred to in Lev 22:25. The representation of the participant reference matches the nominal phrase perfectly and contains all substantives of the phrase.

In [17]:
B.shbLink(528769)
B.pretty(L.u(528769, 'verse')[0], highlights = {L.d(528769, 'phrase')[0]: 'gold', 65771: 'gold', 65778: 'gold'})

## 5. Export

Finally, the dataset can be exported, ready for further analysis.

### 5.1 Exporting original data in CSV-format:

The original data (not mapped with the ETCBC clause atom nodes and word nodes) is exported in a standard CSV-format:

In [18]:
header = ['line','text','book','chapter','verse','clause_atom','predicate',
          'reference','participant','actor','first_word','last_word','func']

In [19]:
with open('Datasets/Lev17toLev26.orig.csv', 'w') as f:
    f.write('''{}\n'''.format(','.join(header)))
    for l in new_dict:
        f.write('''{}\n'''.format(','.join(new_dict[l])))
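
Joining the fields with `','` works as long as no field contains a comma. As a hedged alternative, the sketch below uses `csv.writer` from the standard library, which quotes fields where needed. It writes to an in-memory buffer; the helper name `rows_to_csv` and the sample row are hypothetical, although values with double quotes such as `"YOUPlmas"` do occur in the data set.

```python
import csv
import io


def rows_to_csv(header, rows):
    """Serialise a header and rows to CSV text, quoting fields where needed."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


header = ['text', 'actor']
rows = [['a, b', 'BN "YOUPlmas"']]  # hypothetical row with a comma and quotes
csv_text = rows_to_csv(header, rows)
print(csv_text)

# Reading the text back restores the original fields.
restored = list(csv.reader(io.StringIO(csv_text)))
print(restored)
```

When writing to a real file with `csv.writer`, the file is usually opened with `newline=''` so that line endings are not translated twice.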

### 5.2 Exporting mapped data in CSV-format:

Before exporting the mapped data, the columns "first slot" and "last slot" are replaced by a single column containing all word nodes from the first slot to the last slot:

In [20]:
#Run only once!
for k in word_map_dict:
    info = word_map_dict[k]
    first_word = int(info[10])
    last_word = int(info[11])
    word_list = [str(w) for w in range(first_word, last_word + 1)]
    info.insert(12, ' '.join(word_list))  # list.insert modifies the row in place and returns None
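
The cell above changes the rows in place, which is why it must be run only once. A non-mutating variant, sketched below with a hypothetical helper name, builds a new row instead and can therefore be repeated safely. The sample row is row 3 of the mapped output shown earlier.

```python
def with_slot_range(info):
    """Return a new row in which the two slot columns are replaced by a range."""
    first_word, last_word = int(info[10]), int(info[11])
    slots = ' '.join(str(w) for w in range(first_word, last_word + 1))
    # Columns 0-9 are kept, followed by the slot range and the function.
    return info[:10] + [slots, info[12]]


row = ['3', '>L MCH', 'Leviticus', '17', '1', '528163', 'DBR',
       '>L MCH', '0sm=MCH', 'MCH', '63011', '63012', 'Compl1']
export_row = with_slot_range(row)
print(export_row[10:])
print(len(row))  # the original row is left untouched
```

The resulting row has the same twelve columns as the header used for the mapped export.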

In [21]:
#word_map_dict

In [22]:
header = ['line','text','book','chapter','verse','clause_atom','predicate',
          'reference','participant','actor','slots','func']

In [23]:
with open('Datasets/Lev17toLev26.mapped.csv', 'w') as f:
    f.write('''{}\n'''.format(','.join(header)))
    for l in word_map_dict:
        f.write('''{},{}\n'''.format(','.join(word_map_dict[l][0:10]),','.join(word_map_dict[l][12:14])))