# Occurrences

We want to list all occurrences of אֱלֹהִים, together with the function of the phrase each one occurs in.

We define a function that, given a lexeme, produces a tab-separated file of the occurrences of that lexeme throughout the Hebrew Bible. Each row contains the following columns:

1. passage label
1. phrase node
1. phrase text
1. phrase gloss
1. phrase function
1. lexeme
1. occurrence text
1. occurrence node

We apply that function to the lexeme אֱלֹהִים (in ETCBC encoding: `>LHJM/`) to generate two concrete output files:

1. with Hebrew text represented in ETCBC consonantal transcription, which is easy to import into Excel
2. with fully pointed Hebrew text (works best in OpenOffice or Numbers)

In [1]:
import sys, os
import collections

import laf
from laf.fabric import LafFabric
from etcbc.preprocess import prepare
fabric = LafFabric()

 0.00s This is LAF-Fabric 4.7.2
API reference: http://laf-fabric.readthedocs.org/en/latest/texts/API-reference.html
Feature doc: https://shebanq.ancient-data.org/static/docs/featuredoc/texts/welcome.html



# Loading the feature data

In [2]:
version = '4b'
API = fabric.load('etcbc{}'.format(version), 'lexicon', 'adjectives', {
 "xmlids": {"node": False, "edge": False},
 "features": ('''
 otype 
 function lex
 gloss
 ''',
 '''
 '''),
 "prepare": prepare,
 "primary": False,
}, verbose='DETAIL')
exec(fabric.localnames.format(var='fabric'))

 0.00s LOADING API: please wait ... 
 0.00s DETAIL: COMPILING m: etcbc4b: UP TO DATE
 0.00s USING main: etcbc4b DATA COMPILED AT: 2015-11-02T15-08-56
 0.01s DETAIL: COMPILING a: lexicon: UP TO DATE
 0.01s USING annox: lexicon DATA COMPILED AT: 2016-07-08T14-32-54
 0.01s DETAIL: load main: G.node_anchor_min
 0.14s DETAIL: load main: G.node_anchor_max
 0.24s DETAIL: load main: G.node_sort
 0.34s DETAIL: load main: G.node_sort_inv
 0.83s DETAIL: load main: G.edges_from
 0.95s DETAIL: load main: G.edges_to
 1.14s DETAIL: load main: F.etcbc4_db_otype [node] 
 2.20s DETAIL: load main: F.etcbc4_ft_function [node] 
 2.33s DETAIL: load main: F.etcbc4_ft_lex [node] 
 2.54s DETAIL: load annox lexicon: F.etcbc4_lex_gloss [node] 
 2.80s LOGFILE=/Users/dirk/laf/laf-fabric-output/etcbc4b/adjectives/__log__adjectives.txt
 2.83s INFO: LOADING PREPARED data: please wait ... 
 2.83s prep prep: G.node_sort
 2.92s prep prep: G.node_sort_inv
 3.43s prep prep: L.node_up
 7.27s prep prep: L.node_down
 13s pre

# Collect data

We build an index from each lexeme to all of its occurrences. The index is a dictionary with the lexemes as keys and the set of occurrences of each lexeme as values. Lexemes are represented in the ETCBC transcription (we take the value of the `lex` feature), and occurrences are represented by their nodes, which are plain integers.

In [4]:
occurrences = collections.defaultdict(set)
# a defaultdict is needed for the case where we see a lexeme for the first time.
# In that case occurrences[lexeme] does not yet exist.
# The defaultdict then inserts the key lexeme with value empty set into the dict.
inf('Making occurrence index ...')
for w in F.otype.s('word'):
    occurrences[F.lex.v(w)].add(w)
inf('{} lexemes'.format(len(occurrences)))

 6m 30s Making occurrence index ...
 6m 31s 8777 lexemes
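
To see what such an index looks like, here is a minimal sketch on a toy corpus. The pairs in `toy_words` are illustrative stand-ins for what `F.otype.s('word')` and `F.lex.v` deliver in the notebook; the names `toy_words` and `toy_index` are not part of the original code.

```python
import collections

# Illustrative (word node, lexeme) pairs standing in for the real corpus.
toy_words = [(0, 'B'), (1, 'R>CJT/'), (2, 'BR>['), (3, '>LHJM/'), (25, '>LHJM/')]

toy_index = collections.defaultdict(set)
for node, lex in toy_words:
    # The first time a lexeme is seen, the defaultdict creates an empty set for it.
    toy_index[lex].add(node)

print(sorted(toy_index['>LHJM/']))  # [3, 25]
print('XYZ/' in toy_index)          # False
```

Testing membership with `in` does not create a key, so checking whether a lexeme is present leaves the index unchanged.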


# Retrieve the relevant bits

Given the node of an occurrence, we gather all required information and assemble it into a tuple. Because we want two output formats, the function takes a format parameter: `ec` (ETCBC consonantal) or `ha` (fully pointed Hebrew). See the ETCBC reference (follow the API reference link printed near the top of this notebook, and look for the `T` function).

In [42]:
def bits(fmt, w):
    p = L.u('phrase', w)
    pw = list(L.d('word', p))
    return (
        T.passage(w),
        p,
        T.words(pw, fmt=fmt).replace('\n', ' '),
        ' '.join(F.gloss.v(x) for x in pw),
        F.function.v(p),
        F.lex.v(w),
        T.words([w], fmt=fmt).replace('\n', ' '),
        w,
    )
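
The up-then-down navigation in `bits` can be illustrated without the corpus. The sketch below uses plain dictionaries as hypothetical stand-ins for `L.u`, `L.d` and `F.gloss.v`; the names `phrase_of`, `words_of`, `gloss_of` and `phrase_gloss` are not part of LAF-Fabric. The phrase node, the occurrence node 25 and the glosses follow the Genesis 1:2 row of the sample output; node 24 for the preceding word is an assumption.

```python
# Hypothetical stand-ins for the LAF-Fabric lookups used in bits().
phrase_of = {24: 605145, 25: 605145}   # word node -> phrase node (role of L.u)
words_of = {605145: [24, 25]}          # phrase node -> word nodes (role of L.d)
gloss_of = {24: 'wind', 25: 'god(s)'}  # word node -> gloss (role of F.gloss.v)

def phrase_gloss(w):
    """Go up from a word to its phrase, then down to all words of that phrase."""
    p = phrase_of[w]
    return p, ' '.join(gloss_of[x] for x in words_of[p])

print(phrase_gloss(25))
```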

# Generating output

Let us now assemble all data into the final output.
We also produce a row of column headers and some statistics.

In [43]:
fields = '''
 passage
 phrase_node
 phrase_text
 phrase_gloss
 phrase_function
 lexeme
 occ_text
 occ_node
'''.strip().split()
nfields = len(fields)
row_template = ('{}\t' * (nfields - 1))+'{}\n'
of_path_template = 'occurrences_{}.{}.csv'
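
As a quick check of the row template, the sketch below rebuilds it and formats a single row. The sample values are copied from the Genesis 1:1 row of the generated file; the names `sample` and `line` are introduced here for illustration only.

```python
fields = ['passage', 'phrase_node', 'phrase_text', 'phrase_gloss',
          'phrase_function', 'lexeme', 'occ_text', 'occ_node']
# One '{}' per field, separated by tabs and terminated by a newline.
row_template = ('{}\t' * (len(fields) - 1)) + '{}\n'

sample = ('Genesis 1:1', 605135, '>LHJM ', 'god(s)', 'Subj', '>LHJM/', '>LHJM ', 3)
line = row_template.format(*sample)

print(repr(line))
print(line.count('\t'))  # 7
```

An equivalent formulation is `'\t'.join(str(x) for x in sample) + '\n'`, which does not depend on the number of fields.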

We now define the function that writes the file, given a lexeme and a format, and a function that produces statistics, given a lexeme.

In [46]:
def lex_file_name(lexeme):
    # in order to use the lexeme in a file name, we replace < > / [ = by harmless characters
    return lexeme.\
        replace('/', 's').\
        replace('[', 'v').\
        replace('=', 'x').\
        replace('<', 'o').\
        replace('>', 'a')

def lex_info(lexeme, fmt):
    file_lex = lex_file_name(lexeme)
    file_name = of_path_template.format(file_lex, fmt)
    if lexeme not in occurrences:
        msg('There is no lexeme "{}"'.format(lexeme))
        occs = []
    else:
        occs = sorted(occurrences[lexeme], key=NK)
        # sorted turns a set into a list. The order is given by the key parameter.
        # This is the function NK (see the ETCBC reference). It orders nodes
        # according to where their associated text occurs in the Bible.
    # utf-8 is requested explicitly, so that the pointed Hebrew of format 'ha'
    # is written correctly regardless of the platform default encoding
    with open(file_name, 'w', encoding='utf-8') as of:
        of.write('{}\n'.format('\t'.join(fields)))
        for w in occs:
            of.write(row_template.format(*bits(fmt, w)))
            # bits yields a tuple of values. The * unpacks this tuple into separate arguments.
    inf('Written {} lines to {}'.format(len(occs) + 1, file_name))

def show_stats(lexeme):
    # we produce an overview of the distribution of the occurrences over the books
    # book names in Swahili
    book_dist = collections.Counter()
    if lexeme not in occurrences:
        msg('There is no lexeme "{}"'.format(lexeme))
        occs = []
    else:
        occs = sorted(occurrences[lexeme], key=NK)
    for w in occs:
        book_node = L.u('book', w)
        book_name_sw = T.book_name(book_node, lang='sw')
        book_name = T.book_name(book_node)
        book_dist['{:<30} = {}'.format(book_name_sw, book_name)] += 1
    # we sort the results by frequency
    total = 0
    for (b, n) in sorted(book_dist.items(), key=lambda x: (-x[1], x[0])):
        print('{:<10} has {:>5} occurrences in {}'.format(lexeme, n, b))
        total += n
    print('{:<10} has {:>5} occurrences in {}'.format(lexeme, total, 'the whole Bible'))
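
The chain of `replace` calls in `lex_file_name` can also be written with a translation table. This is an alternative sketch, not the notebook's own code; `SAFE_CHARS` and `lex_file_name_alt` are hypothetical names. Because none of the replacement letters is itself one of the replaced characters, translating all characters in one pass gives the same result as replacing them one after the other.

```python
# Same character mapping as lex_file_name, expressed as a translation table.
SAFE_CHARS = str.maketrans({'/': 's', '[': 'v', '=': 'x', '<': 'o', '>': 'a'})

def lex_file_name_alt(lexeme):
    """Replace the characters / [ = < > by letters that are safe in file names."""
    return lexeme.translate(SAFE_CHARS)

print(lex_file_name_alt('>LHJM/'))  # aLHJMs
print('occurrences_{}.{}.csv'.format(lex_file_name_alt('>LHJM/'), 'ec'))
```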

Here we produce results for lexeme `>LHJM/` and formats `ec` and `ha`.

In [47]:
lexeme = '>LHJM/'
show_stats(lexeme)
lex_info(lexeme, 'ec')
lex_info(lexeme, 'ha')

>LHJM/ has 374 occurrences in Kumbukumbu_la_Torati = Deuteronomy
>LHJM/ has 365 occurrences in Zaburi = Psalms
>LHJM/ has 219 occurrences in Mwanzo = Genesis
>LHJM/ has 203 occurrences in 2_Mambo_ya_Nyakati = 2_Chronicles
>LHJM/ has 145 occurrences in Yeremia = Jeremiah
>LHJM/ has 139 occurrences in Kutoka = Exodus
>LHJM/ has 118 occurrences in 1_Mambo_ya_Nyakati = 1_Chronicles
>LHJM/ has 107 occurrences in 1_Wafalme = 1_Kings
>LHJM/ has 100 occurrences in 1_Samweli = 1_Samuel
>LHJM/ has 98 occurrences in 2_Wafalme = 2_Kings
>LHJM/ has 94 occurrences in Isaya = Isaiah
>LHJM/ has 76 occurrences in Yoshua = Joshua
>LHJM/ has 73 occurrences in Waamuzi = Judges
>LHJM/ has 70 occurrences in Nehemia = Nehemiah
>LHJM/ has 55 occurrences in Ezra = Ezra
>LHJM/ has 54 occurrences in 2_Samweli = 2_Samuel
>LHJM/ has 53 occurrences in Mambo_ya_Walawi = Leviticus
>LHJM/ has 40 occurrences in Mhubiri = Ecclesiastes
>LHJM/ has 36 occurrences in Ezekieli = Ezekiel
>LHJM/ has 27 occurrences in Hesabu = 
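
The order of the listing above comes from the sort key `(-count, label)` in `show_stats`: highest counts first, ties broken alphabetically by label. A minimal sketch with made-up counts (the labels and numbers in `toy_dist` are placeholders, not corpus results):

```python
import collections

# Placeholder counts, chosen to include a tie between book_b and book_c.
toy_dist = collections.Counter({'book_a': 4, 'book_c': 16, 'book_b': 16, 'book_d': 1})

# Negating the count sorts descending by frequency; the label breaks ties ascending.
ordered = sorted(toy_dist.items(), key=lambda x: (-x[1], x[0]))

for label, n in ordered:
    print('{:<10} {:>5}'.format(label, n))
```

`Counter.most_common()` also orders by count, but it leaves tied entries in the order they were first encountered, so the explicit key is what makes the alphabetical tie-break deterministic.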

# Results
[ETCBC consonantal](occurrences_aLHJMs.ec.csv)
and
[fully pointed Hebrew](occurrences_aLHJMs.ha.csv).

Screenshot made in the Numbers program:



In [48]:
print(open(of_path_template.format(lex_file_name(lexeme), 'ec')).read()[0:1000])

passage	phrase_node	phrase_text	phrase_gloss	phrase_function	lexeme	occ_text	occ_node
Genesis 1:1	605135	>LHJM 	god(s)	Subj	>LHJM/	>LHJM 	3
Genesis 1:2	605145	RWX >LHJM 	wind god(s)	Subj	>LHJM/	>LHJM 	25
Genesis 1:3	605150	>LHJM 	god(s)	Subj	>LHJM/	>LHJM 	33
Genesis 1:4	605158	>LHJM 	god(s)	Subj	>LHJM/	>LHJM 	41
Genesis 1:4	605164	>LHJM 	god(s)	Subj	>LHJM/	>LHJM 	49
Genesis 1:5	605168	>LHJM 	god(s)	Subj	>LHJM/	>LHJM 	59
Genesis 1:6	605184	>LHJM 	god(s)	Subj	>LHJM/	>LHJM 	80
Genesis 1:7	605194	>LHJM 	god(s)	Subj	>LHJM/	>LHJM 	96
Genesis 1:8	605208	>LHJM 	god(s)	Subj	>LHJM/	>LHJM 	126
Genesis 1:9	605220	>LHJM 	god(s)	Subj	>LHJM/	>LHJM 	141
Genesis 1:10	605232	>LHJM 	god(s)	Subj	>LHJM/	>LHJM 	161
Genesis 1:10	605241	>LHJM 	god(s)	Subj	>LHJM/	>LHJM 	175
Genesis 1:11	605246	>LHJM 	god(s)	Subj	>LHJM/	>LHJM 	180
Genesis 1:12	605277	>LHJM 	god(s)	Subj	>LHJM/	>LHJM 	224
Genesis 1:14	605289	>LHJM 	god(s)	Subj	>LHJM/	>LHJM 	237
Genesis 1:16	605309	>LHJM 	god(s)	Subj	>LHJM/	>LHJM 	283
Genesis 1:17
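
To work with the result in Python rather than in a spreadsheet, the file can be read back with the standard `csv` module. The sketch below parses an in-memory copy of the header and the first two rows shown above, so it runs without the corpus; for the real file one would pass an open file handle instead of `io.StringIO`. The names `sample`, `reader` and `rows` are introduced here for illustration.

```python
import csv
import io

sample = (
    'passage\tphrase_node\tphrase_text\tphrase_gloss\t'
    'phrase_function\tlexeme\tocc_text\tocc_node\n'
    'Genesis 1:1\t605135\t>LHJM \tgod(s)\tSubj\t>LHJM/\t>LHJM \t3\n'
    'Genesis 1:2\t605145\tRWX >LHJM \twind god(s)\tSubj\t>LHJM/\t>LHJM \t25\n'
)

# QUOTE_NONE treats any quote characters in the text as literal data.
reader = csv.DictReader(io.StringIO(sample), delimiter='\t', quoting=csv.QUOTE_NONE)
rows = list(reader)

for row in rows:
    # Every value is read as a string; node numbers need an explicit int().
    print(row['passage'], row['phrase_function'], int(row['occ_node']))
```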