<a href="http://laf-fabric.readthedocs.org/en/latest/" target="_blank"><img align="left" src="images/laf-fabric-xsmall.png"/></a>
<a href="http://www.godgeleerdheid.vu.nl/etcbc" target="_blank"><img align="left" src="images/VU-ETCBC-xsmall.png"/></a>
<a href="http://www.persistent-identifier.nl/?identifier=urn%3Anbn%3Anl%3Aui%3A13-048i-71" target="_blank"><img align="left" src="images/etcbc4easy-small.png"/></a>
<a href="http://tla.mpi.nl" target="_blank"><img align="right" src="images/TLA-xsmall.png"/></a>
<a href="http://www.dans.knaw.nl" target="_blank"><img align="right" src="images/DANS-xsmall.png"/></a>

# Participles in the Hebrew Bible

## About

This task gives an inventory of participles and their context.
It is based on a request by Janet Dyk for data with which the verbal/nominal roles of participles can be studied.

This is work in progress. When we started, we had not yet identified the exact set of features in the database that would give us clues.
So in this notebook you will find a number of attempts.

## Firing up LAF-Fabric

We fire up the engine, collect data, format the data and write it to a tab-delimited file.

Then the LAF-Fabric task is completed.

After that we play around a bit with the data and see how we can visualize it with the Python module *pandas*.

### Get a LAF processor

In [1]:
import sys
import collections
from laf.fabric import LafFabric
from etcbc.preprocess import prepare
fabric = LafFabric()

  0.00s This is LAF-Fabric 4.3.3
http://laf-fabric.readthedocs.org/en/latest/texts/API-reference.html


### Load data for this task

In [2]:
fabric.load('etcbc4', '--', 'participle', {
    "xmlids": {"node": False, "edge": False},
    "features": ('''
        otype
        book chapter verse label
        g_word_utf8 g_cons_utf8
        vt
        sp pdp
        rela typ
        prs vbe
        g_prs
    ''','''
        functional_parent
    '''),
    "primary": False,
})
exec(fabric.localnames.format(var='fabric'))

  0.00s LOADING API: please wait ... 
  0.00s INFO: USING DATA COMPILED AT: 2014-07-14T16-45-08
  5.69s LOGFILE=/Users/dirk/laf-fabric-output/etcbc4/participle/__log__participle.txt
  5.69s INFO: DATA LOADED FROM SOURCE etcbc4 AND ANNOX -- FOR TASK participle AT 2014-07-15T16-40-04


# Data exploration

In this section we investigate the values that some features take in the database.

## Part of speech tags

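The following part-of-speech tags occur in the features ``sp`` and ``pdp``.

For most word occurrences, these values coincide.
The cell below collects the values of both features in a single set.
In the task output further on, a word for which the two differ gets both tags, separated by a ``~``.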

In [3]:
poss = set()
for i in NN(test=F.otype.v, values=['word']):
    pos = F.sp.v(i)
    pdpos = F.pdp.v(i)
    poss.add(pos)
    poss.add(pdpos)
poss

{'adjv',
 'advb',
 'art',
 'conj',
 'inrg',
 'intj',
 'nega',
 'nmpr',
 'prde',
 'prep',
 'prin',
 'prps',
 'subs',
 'verb'}
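
The ``~`` convention used for the task output can be sketched as a small helper. This is a minimal illustration, not part of the LAF-Fabric API: the function name `combine_tags` and the constant `POS_ABBREV` are hypothetical, and the abbreviations are a subset of the `pos_table` defined later in this notebook.

```python
# Subset of the abbreviations in pos_table (defined later in this notebook).
POS_ABBREV = {'subs': 'n', 'prep': 'pp', 'verb': 'vb', 'art': 'dt', 'conj': 'cj'}

def combine_tags(sp, pdp, table=POS_ABBREV):
    """Return one tag if sp and pdp agree, otherwise both joined by '~'.

    Hypothetical helper, for illustration only.
    """
    pos = table[sp]
    pdpos = table[pdp]
    return pos if pos == pdpos else "{}~{}".format(pos, pdpos)

print(combine_tags('verb', 'verb'))   # both features agree
print(combine_tags('subs', 'prep'))   # lexical noun used as a preposition
```

A combined tag such as `n~pp` therefore reads as: ``sp`` says noun, ``pdp`` says preposition.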

# Trees

We construct trees from the ``functional_parent`` edges, with the sentences as top nodes.
While walking the trees we build indexes for

* finding the sentence above each clause
* finding the words in each clause
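
The idea behind the two indexes can be tried out on a toy tree first. The sketch below uses plain dictionaries instead of LAF-Fabric features: the names `otype`, `children` and `walk` and the node numbers are hypothetical stand-ins for what the real cells get from ``F.otype`` and the ``functional_parent`` edges.

```python
import collections

# Hypothetical toy data: node -> object type, and node -> child nodes.
otype = {1: 'sentence', 2: 'clause', 3: 'clause', 4: 'word', 5: 'word', 6: 'word'}
children = {1: [2, 3], 2: [4, 5], 3: [6]}

to_sentence = {}                              # node -> sentence above it
clause_words = collections.defaultdict(set)   # clause -> words below it

def walk(node, sentence, clause):
    """Visit node and everything below it, filling both indexes."""
    to_sentence[node] = sentence
    if otype[node] == 'word':
        clause_words[clause].add(node)
    # A clause node becomes the enclosing clause for everything below it.
    below = node if otype[node] == 'clause' else clause
    for child in children.get(node, []):
        walk(child, sentence, below)

walk(1, 1, None)
print(to_sentence[3])            # the sentence above clause 3
print(sorted(clause_words[2]))   # the words in clause 2
```

Passing the enclosing clause down as an argument avoids a second pass over the tree to find out where each word belongs.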

In [4]:
top_nodes = set(NN(test=F.otype.v, value='sentence'))
msg("Top nodes found: {}".format(len(top_nodes)))

  6.93s Top nodes found: 66045


## Clause constituent relations

In [5]:
nodes_seen = set()
to_sentence = {}
clause_words = collections.defaultdict(set)
sentence_words = collections.defaultdict(set)
sentence_verse = {}
verse_label = None

def walk_tree(node, sentence, clause):
    if node in nodes_seen:
        return
    
    nodes_seen.add(node)
    to_sentence[node] = sentence
    new_clause = clause

    otype = F.otype.v(node)
    if otype == 'clause': new_clause = node
    if otype == 'word':
        clause_words[clause].add(node)
        sentence_words[sentence].add(node)
    
    children = Ci.functional_parent.v(node)
    for child in children:
        walk_tree(child, sentence, new_clause)

s = 0
sc = 0
chunk = 10000

for node in NN(top_nodes | set(NN(test=F.otype.v, value='verse'))):
    if F.otype.v(node) == 'verse':
        verse_label = F.label.v(node)
        continue
    sentence_verse[node] = verse_label
    nodes_seen = set()
    walk_tree(node, node, None)
    s += 1
    sc += 1
    if sc == chunk:
        msg("{} trees visited".format(s))
        sc = 0
    
msg("{} trees visited".format(s))

    11s 10000 trees visited
    12s 20000 trees visited
    13s 30000 trees visited
    14s 40000 trees visited
    14s 50000 trees visited
    15s 60000 trees visited
    16s 66045 trees visited


In [6]:
rels = collections.defaultdict(int)
verse_label = None
nr_of_examples = 3
examples = collections.defaultdict(set)
for i in NN(test=F.otype.v, values=['clause', 'verse']):
    if F.otype.v(i) == 'verse':
        verse_label = F.label.v(i)
    else:
        ccr = F.rela.v(i)
        rels[ccr] += 1
        if len(examples[ccr]) < nr_of_examples:
            examples[ccr].add(i)
for ccr in rels:
    print("{}: {:>6} x".format(ccr, rels[ccr]))

print("\n")

for ccr in sorted(examples):
    for clause in examples[ccr]:
        sentence = to_sentence[clause]
        cwords = sorted(clause_words[clause])
        swords = sorted(sentence_words[sentence])
        vlabel = sentence_verse[sentence]
#        print("{} in {}: {}".format(ccr, vlabel, " ".join([x[1] for x in P.data(i)])))
        print("{:<4} {:<10} {}\n{:<16}{}".format(
            ccr, 
            vlabel, 
            " ".join([F.g_word_utf8.v(word) for word in cwords]),
            '',
            " ".join([F.g_word_utf8.v(word) for word in swords]),
        ))
        

PrAd:      1 x
Subj:    436 x
Objc:   1347 x
Resu:   1193 x
Adju:   5872 x
RgRc:    198 x
Cmpl:    241 x
CoVo:    305 x
Spec:     41 x
Attr:   5930 x
NA:  69400 x
PreC:    156 x
Coor:   2858 x


Adju  GEN 01,16 לְ הָאִ֖יר עַל הָ אָֽרֶץ
                וַ יִּתֵּ֥ן אֹתָ֛ם אֱלֹהִ֖ים בִּ רְקִ֣יעַ הַ שָּׁמָ֑יִם לְ הָאִ֖יר עַל הָ אָֽרֶץ וְ לִ מְשֹׁל֙ בַּ  יֹּ֣ום וּ בַ  לַּ֔יְלָה וּֽ לֲ הַבְדִּ֔יל בֵּ֥ין הָ אֹ֖ור וּ בֵ֣ין הַ חֹ֑שֶׁךְ
Adju  GEN 01,15 לְ הָאִ֖יר עַל הָ אָ֑רֶץ
                וְ הָי֤וּ לִ מְאֹורֹת֙ בִּ רְקִ֣יעַ הַ שָּׁמַ֔יִם לְ הָאִ֖יר עַל הָ אָ֑רֶץ
Adju  GEN 01,14 לְ הַבְדִּ֕יל בֵּ֥ין הַ יֹּ֖ום וּ בֵ֣ין הַ לָּ֑יְלָה
                יְהִ֤י מְאֹרֹת֙ בִּ רְקִ֣יעַ הַ שָּׁמַ֔יִם לְ הַבְדִּ֕יל בֵּ֥ין הַ יֹּ֖ום וּ בֵ֣ין הַ לָּ֑יְלָה
Attr  GEN 01,07 אֲשֶׁ֖ר מֵ עַ֣ל לָ  רָקִ֑יעַ
                וַ יַּבְדֵּ֗ל בֵּ֤ין הַ מַּ֨יִם֙ אֲשֶׁר֙ מִ תַּ֣חַת לָ  רָקִ֔יעַ וּ בֵ֣ין הַ מַּ֔יִם אֲשֶׁ֖ר מֵ עַ֣ל לָ  רָקִ֑יעַ
Attr  GEN 01,11 מַזְרִ֣יעַ זֶ֔רַע
                תַּֽדְשֵׁ֤א הָ אָ֨רֶץ֙ דֶּ֔שֶׁא

In [7]:
ccrs = {
 'Adju': 'adjunct clause',
 'Attr': 'attributive clause',
 'Cmpl': 'complement clause, but not subject or object',
 'CoVo': 'continuation of the vocative',
 'Coor': 'coordination',
 'Objc': 'object clause',
 'PrAd': 'predicative adjunct clause',
 'PreC': 'predicative complement clause',
 'Resu': 'clause after resumptive extrapolated fronted element',
 'RgRc': 'Regens rectum (governing governed)',
 'Spec': 'Specification clause',
 'Subj': 'subject clause',
 'NA': 'not known/not marked',
}

### Subject
clause that has the function of subject

### Object
clause that has the function of object

### Complement
clause that has a function of a verb complement, but not subject or object 

### Attributive
clause that has an attributive function (often with a relative pronoun)

### Adjunct
clauses with additional information, usually without a finite verb

### Predicative clause
clause that has a predicative function

### Coordination
multiple dependent clauses coordinated (with and, or, etc.) with each other under the same head (a main clause or a phrase (asher))

### Continuation of the vocative
clause that follows a vocative: *Adam*, where are you?

### Resumptive
King David, Nathan the prophet spoke severely to *him* [here King David is the *casus pendens* or extrapolated element].

### Regens Rectum
You shall reign over the birds and the animals and **all** *that creeps on the face of the earth* [here **all** governs the clause about the *reptiles*].

### none
No clause constituent relation marked.
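
To read the counts and the descriptions side by side, the two can be joined in a small report. The sketch below is only an illustration: the numbers are copied from the output of the counting cell above, the descriptions from the `ccrs` dictionary, and the variable names `ranked` and `total` are introduced here.

```python
# Counts as reported in the output of the counting cell above.
rels = {'PrAd': 1, 'Subj': 436, 'Objc': 1347, 'Resu': 1193, 'Adju': 5872,
        'RgRc': 198, 'Cmpl': 241, 'CoVo': 305, 'Spec': 41, 'Attr': 5930,
        'NA': 69400, 'PreC': 156, 'Coor': 2858}

# Descriptions copied from the ccrs dictionary in this notebook.
ccrs = {
    'Adju': 'adjunct clause',
    'Attr': 'attributive clause',
    'Cmpl': 'complement clause, but not subject or object',
    'CoVo': 'continuation of the vocative',
    'Coor': 'coordination',
    'Objc': 'object clause',
    'PrAd': 'predicative adjunct clause',
    'PreC': 'predicative complement clause',
    'Resu': 'clause after resumptive extrapolated fronted element',
    'RgRc': 'Regens rectum (governing governed)',
    'Spec': 'Specification clause',
    'Subj': 'subject clause',
    'NA': 'not known/not marked',
}

total = sum(rels.values())
# Most frequent relation first.
ranked = sorted(rels.items(), key=lambda item: -item[1])
for code, count in ranked:
    share = 100 * count / total
    print("{:<4} {:>6} {:>5.1f}%  {}".format(code, count, share, ccrs[code]))
```

Sorting by frequency makes it easy to see that the bulk of the clauses carry no marked relation (``NA``), with attributive and adjunct clauses as the largest marked groups.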


## Phrase atom types

In [8]:
pats = set()
for i in NN(test=F.otype.v, values=['phrase_atom']):
    pat = F.typ.v(i)
    pats.add(pat)
pats

{'AdjP',
 'AdvP',
 'CP',
 'DPrP',
 'IPrP',
 'InjP',
 'InrP',
 'NP',
 'NegP',
 'PP',
 'PPrP',
 'PrNP',
 'VP'}

## Tense

In [9]:
tenses = set()
for i in NN(test=F.otype.v, values=['word']):
    tense = F.vt.v(i)
    tenses.add(tense)
tenses

{'NA', 'impf', 'impv', 'infa', 'infc', 'perf', 'ptca', 'ptcp', 'wayq'}

## Pronominal suffixes (paradigmatic, graphical, plain)

In [10]:
ppss = set()
gpss = set()
for i in NN(test=F.otype.v, values=['word']):
    pps = F.prs.v(i)
    gps = F.g_prs.v(i)
    ppss.add(pps)
    gpss.add(gps)
    
output = "paradigmatic pronoun suffix: "
output += ", ".join(sorted(ppss))
output += "\n" + "graphical pronoun suffix: "
output += ", ".join(sorted(gpss))
print(output)

paradigmatic pronoun suffix: H, H=, HJ, HM, HN, HW, HWN, J, K, K=, KM, KN, KWN, M, MW, N, N>, NJ, NW, W, absent, n/a
graphical pronoun suffix: , +, +:@K@, +:AHOM, +:AHOWN, +:AK@, +:AK@H, +:AKEM, +:AKOM, +:AKOWN, +:H@, +:HEM, +:HEN, +:HOM, +:HOWM, +:HOWN, +:HW., +:K@, +:K@H, +:KEM, +:KEN, +:KOM, +:KOWN, +:NIJ, +:NW., +;>, +;H., +;HW., +;K, +;K:, +;K;H, +;KIJ, +;M, +;MOW, +;N.@H, +;NIJ, +;NW., +>, +@>, +@H, +@H,, +@H., +@H:N@H, +@H;M, +@H;N, +@HAM, +@HEM, +@HEN, +@HW., +@K:, +@K@H, +@KEM, +@KEN@H, +@M, +@MOW, +@N, +@N@H, +@NIJ, +@NW., +A, +AH., +AJ, +AM, +AN, +AN.IJ, +AN@>, +ANIJ, +D, +EH@, +EK:, +EK@, +EK@H, +EM, +EN@H, +H, +H., +H.EM, +H;M@H, +H;N, +H>, +H@, +HEM, +HEN, +HEN@H, +HIJ, +HM, +HOM, +HOWN, +HW, +HW., +HWN, +IJ, +IK, +IK:, +J, +JNJ, +K, +K.@, +K.@H, +K.EM, +K:, +K@, +K@H, +KEM, +KEN, +KEN@H, +KIJ, +KJ, +KM, +KOWN, +M, +MOW, +MW., +N, +N>, +N@>, +NH, +NIJ, +NJ, +NW, +NW., +OH, +OW, +W, +W., +WMW, +WNJ, +WW


## Verbal ending (paradigmatic)

In [11]:
pves = set()
for i in NN(test=F.otype.v, value='word'):
    pve = F.vbe.v(i)
    pves.add(pve)
pves

{'',
 'H',
 'H=',
 'J',
 'JN',
 'N',
 'N>',
 'NH',
 'NW',
 'T',
 'T=',
 'T==',
 'TJ',
 'TM',
 'TN',
 'TWN',
 'W',
 'WN',
 'n/a'}
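
The exploration cells in this section all follow the same pattern: loop over the nodes of one type and collect the values of a feature in a set. A hedged generalisation is to count the values as well, so that rare values stand out. The helper name `feature_value_counts` and the toy mapping `toy_vt` are hypothetical; in the notebook the callable would be something like ``F.vt.v`` and the nodes would come from ``NN``.

```python
import collections

def feature_value_counts(nodes, feature):
    """Count how often each value of `feature` occurs over `nodes`.

    `feature` is any callable from node to value (hypothetical helper).
    """
    return collections.Counter(feature(node) for node in nodes)

# Hypothetical stand-in for a feature: word node -> tense value.
toy_vt = {1: 'wayq', 2: 'NA', 3: 'ptca', 4: 'NA', 5: 'perf', 6: 'ptca'}

counts = feature_value_counts(toy_vt, toy_vt.get)
print(sorted(counts))     # the distinct values, as the cells above report them
print(counts['ptca'])     # how often one particular value occurs
```

The keys of the counter give the same information as the sets above; the counts come for free.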

# Conventions for all tasks

In [12]:
pos_table = {
 'adjv': 'aj',
 'advb': 'av',
 'art': 'dt',
 'conj': 'cj',
 'intj': 'ij',
 'inrg': 'ir',
 'nega': 'ng',
 'subs': 'n',
 'nmpr': 'n-pr',
 'prep': 'pp',
 'prps': 'pr-ps',
 'prde': 'pr-dem',
 'prin': 'pr-int',
 'verb': 'vb',
}

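# Note: despite its name, this set holds the same values as the
# verbal endings (feature vbe) listed above, not the values of prs.
pron_suffix_table = {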
 '',
 'H',
 'H=',
 'J',
 'JN',
 'N',
 'N>',
 'NH',
 'NW',
 'T',
 'T=',
 'T==',
 'TJ',
 'TM',
 'TN',
 'TWN',
 'W',
 'WN',
 'n/a',
}

# Task: Participles in Clause Atoms

## Specification

We want to analyse participles in their clause-*atoms*, not in their full clauses.
We are particularly interested in verbal complements that these participles have in their clause-atom.

Let us start with pronominal suffixes attached to the participle.

We need to find all words whose tense feature ``vt`` has the value ``ptca`` or ``ptcp``.
From there, we need all surrounding words in the same clause-*atom*.
Of all words, we need the ``sp`` and the ``pdp`` features,
and of the participle we need the ``prs`` as well.

We output a tab-delimited file.

One row per participle, containing the following fields (column names in the file between brackets):

sequence number (``n``) | passage label (``passage``) | 

pos-tags of words before (``p_pre``) | pos-tag of ptc (``p_ptc``) | pronoun suffix, paradigmatic (``p_suff``) | pos-tags of words after (``p_post``) | 

plain text of words after (``t_post``) | plain text of ptc (``t_ptc``) | plain text of words before (``t_pre``)

Every participle is shown within its clause-atom.

If there are several participles in the same clause-atom, we put every participle in a separate row.
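
The "one row per participle" rule can be illustrated with a toy clause atom that contains two participles. This is a sketch under assumptions: the word tuples mimic the `(is_participle, text, pos tag, suffix)` layout used in the data collection cell, and the texts `w1` to `w4` are placeholders, not real data.

```python
# Hypothetical clause atom: (is_participle, text, pos tag, pronominal suffix)
clause = [
    (False, 'w1', 'cj', None),
    (True,  'w2', 'vb', 'absent'),
    (False, 'w3', 'n',  None),
    (True,  'w4', 'vb', 'W'),
]

rows = []
for position, word in enumerate(clause):
    if not word[0]:
        continue                       # only participles start a row
    pre = clause[:position]            # words before this participle
    post = clause[position + 1:]       # words after this participle
    rows.append((
        "|".join(w[2] for w in pre),   # pos tags before
        word[2],                       # pos tag of the participle
        word[3],                       # paradigmatic suffix
        "|".join(w[2] for w in post),  # pos tags after
    ))

for row in rows:
    print("\t".join(row))
```

The second participle appears in the "after" context of the first row and gets a row of its own as well, so a clause atom with several participles is never collapsed into one line.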

## Execute the task: data collection

In [14]:
msg("Get the participles...")

book = None
chapter = None
verse = None
label = None

found_total = 0
found_in_book = 0
clause_atoms = []
current_clause = []
has_participle = None

for i in NN(test=F.otype.v, values=['book', 'chapter', 'verse', 'clause_atom', 'word']):
    otype = F.otype.v(i)
    if otype == 'word':
        tense = F.vt.v(i)
        is_participle = tense in ('ptca', 'ptcp')
        pron_suff_para = None
        if is_participle:
            has_participle = True
            pron_suff_para = F.prs.v(i)
            found_total += 1
        pos = pos_table[F.sp.v(i)]
        pdpos = pos_table[F.pdp.v(i)]
        rpos = pos if pos == pdpos else "{}~{}".format(pos, pdpos)
        current_clause.append((
            is_participle,
            F.g_cons_utf8.v(i),
            rpos,
            pron_suff_para,
        ))
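    elif otype == 'clause_atom':
        # A clause atom is stored only when the next clause_atom node arrives.
        # NOTE: if a verse node has been seen in between, `label` already
        # names the following verse, so the last clause atom of a verse
        # is stored with the label of the next verse.
        if has_participle:
            clause_atoms.append((label, current_clause))
            found_in_book += 1
        current_clause = []
        has_participle = False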
    elif otype == 'book':
        if book is not None:
            msg("{} ({})".format(book, found_in_book), withtime=False)
            found_in_book = 0
        book = F.book.v(i)
    elif otype == 'chapter':
        chapter = F.chapter.v(i)
    elif otype == 'verse':
        verse = F.verse.v(i)
        label = "{} {}:{}".format(book, chapter, verse)
if has_participle:
    clause_atoms.append((label, current_clause))
msg("{} ({})".format(book, found_in_book), withtime=False)

msg("Found {} participles in {} clause atoms".format(found_total, len(clause_atoms)))

 1m 10s Get the participles...
Genesis (354)
Exodus (340)
Leviticus (232)
Numeri (383)
Deuteronomium (435)
Josua (187)
Judices (227)
Samuel_I (325)
Samuel_II (251)
Reges_I (287)
Reges_II (290)
Jesaia (807)
Jeremia (743)
Ezechiel (505)
Hosea (71)
Joel (28)
Amos (82)
Obadia (7)
Jona (15)
Micha (72)
Nahum (48)
Habakuk (28)
Zephania (44)
Haggai (10)
Sacharia (132)
Maleachi (51)
Psalmi (907)
Iob (217)
Proverbia (490)
Ruth (34)
Canticum (63)
Ecclesiastes (126)
Threni (58)
Esther (108)
Daniel (335)
Esra (110)
Nehemia (204)
Chronica_I (193)
Chronica_II (355)
 1m 14s Found 9664 participles in 9154 clause atoms
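
One thing to be aware of in the collection loop: a clause atom is stored when the *next* clause atom starts, with whatever passage label is current at that moment. If a new verse has begun in between, the stored label is one verse too late. The first row of the sample output further on is an example: the participle מרחפת is labelled Genesis 1:3, although it occurs in Genesis 1:2. A minimal sketch of one possible remedy is to freeze the label when the clause atom opens. The event list and the names `atom_label` and `labelled` are hypothetical; the sketch assumes, as the loop above does, that a verse node is delivered before the clause atoms it contains.

```python
# Hypothetical event stream, in the order in which the loop sees the nodes.
events = [
    ('verse', 'Genesis 1:2'), ('clause_atom', None), ('word', 'ptca'),
    ('verse', 'Genesis 1:3'), ('clause_atom', None), ('word', 'wayq'),
]

labelled = []            # labels of the clause atoms that contain a participle
label = None             # label of the most recent verse node
atom_label = None        # label of the verse in which the open atom started
has_participle = False

for kind, value in events:
    if kind == 'verse':
        label = value
    elif kind == 'clause_atom':
        if has_participle:
            labelled.append(atom_label)   # use the frozen label, not `label`
        atom_label = label                # freeze the label as the atom opens
        has_participle = False
    elif kind == 'word' and value in ('ptca', 'ptcp'):
        has_participle = True

if has_participle:                        # do not forget the last clause atom
    labelled.append(atom_label)

print(labelled)
```

With the frozen label the clause atom is reported under the verse in which it starts, whatever comes after it.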


## Execute the task: formatting output

In [15]:
split_clause_atoms = []
for (label, clause) in clause_atoms:
    ptcs = [n for (n, w) in enumerate(clause) if w[0]]
    for ptc in ptcs:
        split_clause_atoms.append((
            label,
            clause[0:ptc],
            clause[ptc],
            clause[ptc+1:],
        ))

In [16]:
ptc_cl_atoms = outfile("ptc_cl_atoms.csv")
ptc_cl_atoms.write("n\tpassage\tp_pre\tp_ptc\tp_suff\tp_post\tt_post\tt_ptc\tt_pre\n")
for (n, (label, pre, ptc, post)) in enumerate(split_clause_atoms):
    fields = [str(n+1), label]
    fields.append("|".join([w[2] for w in pre]))
    fields.append(ptc[2])
    fields.append(ptc[3])
    fields.append("|".join([w[2] for w in post]))
    fields.append(" ".join([w[1] for w in post]))
    fields.append(ptc[1])
    fields.append(" ".join([w[1] for w in pre]))
    ptc_cl_atoms.write("{}\n".format("\t".join(fields)))
close()

 1m 20s Results directory:
/Users/dirk/laf-fabric-output/etcbc4/participle

__log__participle.txt                  1084 Tue Jul 15 18:41:24 2014
ptc_cl_atoms.csv                     825100 Tue Jul 15 18:41:24 2014
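
The cell above joins the fields by hand, which works as long as no field contains a tab or a newline. As a hedged alternative, the `csv` module from the Python standard library can take care of delimiters and quoting. In this sketch an `io.StringIO` buffer stands in for the file handle, and the single data row is taken from the sample output of this notebook.

```python
import csv
import io

header = ['n', 'passage', 'p_pre', 'p_ptc', 'p_suff',
          'p_post', 't_post', 't_ptc', 't_pre']
# One row taken from the sample output of this notebook.
rows = [['3', 'Genesis 1:11', '', 'vb', 'absent', 'n', 'זרע', 'מזריע', '']]

buffer = io.StringIO()   # stands in for the handle of the output file
writer = csv.writer(buffer, delimiter='\t', lineterminator='\n')
writer.writerow(header)
writer.writerows(rows)

lines = buffer.getvalue().splitlines()
print(len(lines))                   # header line plus one data line
print(len(lines[1].split('\t')))    # number of fields in the data line
```

Empty fields, such as the missing context before the participle, are written as empty strings, so every line keeps the same number of columns.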


## Playing with the output

First of all: I opened the ``ptc_tab.csv`` (a tab delimited file) in OpenOffice, and there I formatted some rows and columns, defined a region, and sorted the rows. The result I saved in ``ptc_tab.ods`` (also on GitHub, same directory as this notebook).

Let's get an impression of what we've got in our tab delimited file.

In [17]:
%matplotlib inline
import pandas
from IPython.display import display
pandas.set_option('display.notebook_repr_html', True)

In [18]:
table_file = my_file('ptc_cl_atoms.csv')
df = pandas.read_csv(table_file, sep="\t", keep_default_na=False, na_values=[])
df.head(10)

Unnamed: 0,n,passage,p_pre,p_ptc,p_suff,p_post,t_post,t_ptc,t_pre
0,1,Genesis 1:3,cj|n|n,vb,absent,pp|n|dt|n,על פני ה מים,מרחפת,ו רוח אלהים
1,2,Genesis 1:7,cj|vb,vb,absent,n~pp|n|pp|n,בין מים ל מים,מבדיל,ו יהי
2,3,Genesis 1:11,,vb,absent,n,זרע,מזריע,
3,4,Genesis 1:11,,vb,absent,n|pp|n,פרי ל מינו,עשׂה,
4,5,Genesis 1:12,,vb,absent,n|pp|n,זרע ל מינהו,מזריע,
5,6,Genesis 1:12,,vb,absent,n,פרי,עשׂה,
6,7,Genesis 1:21,dt~cj,vb,absent,,,רמשׂת,ה
7,8,Genesis 1:27,dt~cj,vb,absent,pp|dt|n,על ה ארץ,רמשׂ,ה
8,9,Genesis 1:29,dt~cj,vb,absent,pp|dt|n,על ה ארץ,רמשׂת,ה
9,10,Genesis 1:29,,vb,absent,n,זרע,זרע,
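
A natural next step is to tally which part-of-speech patterns follow the participle. The sketch below is an illustration only: it reads four rows of the sample shown above from an in-memory string instead of the real file, and the variable names `sample` and `pattern_counts` are introduced here.

```python
import io
import pandas

# Four rows of the sample shown above, as tab-delimited text.
sample = (
    "n\tpassage\tp_pre\tp_ptc\tp_suff\tp_post\tt_post\tt_ptc\tt_pre\n"
    "3\tGenesis 1:11\t\tvb\tabsent\tn\tזרע\tמזריע\t\n"
    "4\tGenesis 1:11\t\tvb\tabsent\tn|pp|n\tפרי ל מינו\tעשׂה\t\n"
    "5\tGenesis 1:12\t\tvb\tabsent\tn|pp|n\tזרע ל מינהו\tמזריע\t\n"
    "6\tGenesis 1:12\t\tvb\tabsent\tn\tפרי\tעשׂה\t\n"
)

# Same reader settings as in the cell above: empty fields stay empty strings.
df = pandas.read_csv(io.StringIO(sample), sep="\t",
                     keep_default_na=False, na_values=[])

# How often does each pos-tag pattern occur after the participle?
pattern_counts = df['p_post'].value_counts()
print(pattern_counts)
```

On the full file the same call gives a frequency list of all contexts after the participle, which is a quick way to spot the dominant complement patterns.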
